
Chaining is dead

Message ID Z-itc_Qd5LLn19pH@gondor.apana.org.au
State New
Series Chaining is dead

Commit Message

Herbert Xu March 30, 2025, 2:33 a.m. UTC
On Wed, Mar 26, 2025 at 11:52:05AM +0800, Herbert Xu wrote:
>
> they don't need it.  Take ext4 as an example:
> 
> 	ext4 calls verity
> 	schedule_work(verity_work);
> 	return asynchronously!
> 
> verity_work:
> 	do the crypto work
> 	__read_end_io(bio);

I went ahead and removed the work queue for fsverity and fscrypt
(except for the reading of the Merkle tree, which is still done in
a work queue because I'm too lazy to make that async), and doing so
actually turned out to be slower than using a work queue.

I was testing with an encrypted 8GB file on ext4 mounted over a
loopback device in tmpfs.  The encryption is xts-vaes.  It turns
out that not using a work queue made reading the entire file
go from 2.4s to 2.5s.

I then tried passing the whole bio (256KB per crypto request in my
test, as opposed to the data unit size of 4KB per crypto request)
to skcipher using chaining, with xts-vaes doing the requests
one by one.  Against my expectations, this didn't speed things up at
all (but at least it didn't slow things down either).  All the
benefits of aggregating the data were offset by the extra setup cost
of creating the chained requests.

So chaining is clearly not the way to go, because it involves cutting
the data up into units at the start of the process rather than at the end.

Finally I hacked up a patch (this goes on top of the skcipher branch
in cryptodev) to pass the whole bio through the Crypto API all the
way to xts-vaes which then unbundled it.  This turned out to be a
winner, taking the read time for 8GB from 2.4s down to 2.1s.

In view of this result, I'm going to throw away chaining, and instead
work on an interface that can take a whole bio (or folio), then cut
it up into the specified data unit size before processing.

The bottom-end of the interface should be able to feed two (or whatever
number you fancy) data units to the actual algorithm.

This should work just as well for compression, since its batching
input is simply an order-N folio.  The compression output is a bit
harder because the data unit size is not constant, but I think I
have a way of making it work by adding a bit to the scatterlist data
structure to indicate the end of each data unit.
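
Roughly along these lines (an entirely hypothetical sketch: the flag
name and the choice of spare low bit in page_link are my assumptions,
mirroring how the existing SG_CHAIN/SG_END markers work):

	#include <linux/scatterlist.h>

	/* Hypothetical end-of-data-unit marker in sg->page_link. */
	#define SG_UNIT_END	0x04UL

	static inline void sg_mark_unit_end(struct scatterlist *sg)
	{
		sg->page_link |= SG_UNIT_END;
	}

	static inline bool sg_is_unit_end(struct scatterlist *sg)
	{
		return sg->page_link & SG_UNIT_END;
	}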

PS For fsverity a 256KB bio size equates to 64 units of hash input.
My strategy is to allocate the whole thing if we can (2KB or 4KB
depending on your digest size), and if that fails, fall back to
a stack buffer of 512 bytes (or whatever number keeps the
compiler quiet regarding stack usage).  Even if we're on the stack,
it should still give more than enough data to satiate your
multibuffer hash code.
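
In code the strategy would look something like this (a sketch only;
bio_size, du_bits and digest_size are the assumed inputs, and the
allocation is atomic since we may be in bio completion context):

	u8 stack_buf[512];	/* keeps the compiler quiet on stack usage */
	unsigned int nr_units = bio_size >> du_bits;	/* e.g. 256KB >> 12 = 64 */
	unsigned int batch = nr_units;
	u8 *digests;

	digests = kmalloc_array(nr_units, digest_size, GFP_ATOMIC);
	if (!digests) {
		/* Fall back to the stack and hash in smaller batches. */
		digests = stack_buf;
		batch = sizeof(stack_buf) / digest_size;
	}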

Cheers,

Comments

Eric Biggers March 31, 2025, 4:56 p.m. UTC | #1
On Sun, Mar 30, 2025 at 10:33:23AM +0800, Herbert Xu wrote:
> On Wed, Mar 26, 2025 at 11:52:05AM +0800, Herbert Xu wrote:
> >
> > they don't need it.  Take ext4 as an example:
> > 
> > 	ext4 calls verity
> > 	schedule_work(verity_work);
> > 	return asynchronously!
> > 
> > verity_work:
> > 	do the crypto work
> > 	__read_end_io(bio);
> 
> I went ahead and removed the work queue for fsverity and fscrypt
> (except for the reading of the Merkle tree, which is still done in
> a work queue because I'm too lazy to make that async), and doing so
> actually turned out to be slower than using a work queue.
> 
> I was testing with an encrypted 8GB file on ext4 mounted over a
> loopback device in tmpfs.  The encryption is xts-vaes.  It turns
> out that not using a work queue made reading the entire file
> go from 2.4s to 2.5s.
> 
> I then tried passing the whole bio (256KB per crypto request in my
> test, as opposed to the data unit size of 4KB per crypto request)
> to skcipher using chaining, with xts-vaes doing the requests
> one by one.  Against my expectations, this didn't speed things up at
> all (but at least it didn't slow things down either).  All the
> benefits of aggregating the data were offset by the extra setup cost
> of creating the chained requests.

Yes, your chaining API has poor performance and is difficult to test, as I've
been saying all along.

> So chaining is clearly not the way to go, because it involves cutting
> the data up into units at the start of the process rather than at the end.

Certainly agreed that chaining is not the way to go, but I think you're
overlooking that Linus's suggestion to use the libraries directly would also
solve this, while also not being restricted to bios and folios (note that not
all filesystems are block-based, for example...).  That would avoid the
per-request overhead from the generic crypto infrastructure, which is the real
source of the problem.
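
For example, with a library-style interface the whole thing could be
as simple as the following sketch (aes_xts_setkey() and
aes_xts_decrypt() are hypothetical names; no such lib/crypto XTS
helper exists yet, this only illustrates the shape of it):

	/* Hypothetical lib/crypto-style XTS API: expand the key once,
	 * then call directly per data unit with no request object. */
	struct aes_xts_ctx ctx;

	aes_xts_setkey(&ctx, raw_key, 64);		/* once, at key setup */
	aes_xts_decrypt(&ctx, dst, src, du_size, iv);	/* per data unit */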

> Finally I hacked up a patch (this goes on top of the skcipher branch
> in cryptodev) to pass the whole bio through the Crypto API all the
> way to xts-vaes which then unbundled it.  This turned out to be a
> winner, taking the read time for 8GB from 2.4s down to 2.1s.
> 
> In view of this result, I'm going to throw away chaining, and instead
> work on an interface that can take a whole bio (or folio), then cut
> it up into the specified data unit size before processing.
> 
> The bottom-end of the interface should be able to feed two (or whatever
> number you fancy) data units to the actual algorithm.
> 
> This should work just as well for compression, since its batching
> input is simply an order-N folio.  The compression output is a bit
> harder because the data unit size is not constant, but I think I
> have a way of making it work by adding a bit to the scatterlist data
> structure to indicate the end of each data unit.
> 
> PS For fsverity a 256KB bio size equates to 64 units of hash input.
> My strategy is to allocate the whole thing if we can (2KB or 4KB
> depending on your digest size), and if that fails, fall back to
> a stack buffer of 512 bytes (or whatever number keeps the
> compiler quiet regarding stack usage).  Even if we're on the stack,
> it should still give more than enough data to satiate your
> multibuffer hash code.

Extending the generic crypto infrastructure to support bios and folios is an
interesting idea.

But TBH I think it's worse than Linus's suggestion of just extending lib/crypto/
to support the needed functionality and using that directly.  Your proposal is
again solving a problem created by the generic crypto infrastructure being too
complex, by making the generic crypto infrastructure even more complex.

With the bio and folio support in the generic crypto infrastructure, there would
be lots of work to do with adding support in all the underlying algorithms, and
adding tests for all the new APIs.

For hashing, users would need to allocate an array to hold the digest for every
block in the bio or folio.  That would add an additional memory allocation to
every I/O.  You said you'd like to fall back to a smaller buffer if the memory
allocation fails.  But that's silly; if we have to support that anyway, we might
as well do it that way only, in which case the bio interface is pointless.

Also note that the kernel *already* has an abstraction layer that allows
doing en/decryption on bios.  It's called blk-crypto, and it makes it possible
to do the en/decryption using either inline encryption hardware (i.e., the newer
style of crypto accelerator that is actually commonly used and doesn't use the
Crypto API at all) or the Crypto API.  I have plans to remove the fs-layer bio
en/decryption code from fscrypt and always use blk-crypto instead.
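
For context, handing a bio off to blk-crypto looks like this (a
sketch, assuming 'key' was initialized earlier with
blk_crypto_init_key() and first_dun is the first data unit number):

	#include <linux/blk-crypto.h>

	u64 dun[BLK_CRYPTO_DUN_ARRAY_SIZE] = { first_dun };

	/* The block layer then en/decrypts the bio, either in inline
	 * encryption hardware or via blk-crypto-fallback. */
	bio_crypt_set_ctx(bio, &key, dun, GFP_NOIO);
	submit_bio(bio);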

Adding bio support to the Crypto API feels duplicative of blk-crypto, and we'd
end up with too many abstraction layers.  I think my preferred approach is that
blk-crypto-fallback would directly call the library functions.  The legacy
Crypto API really has no useful role to play anymore.

FWIW, there are also people thinking about developing inline hashing hardware,
in which case something similar would apply to blk-integrity.

- Eric
Herbert Xu April 1, 2025, 2:44 a.m. UTC | #2
On Mon, Mar 31, 2025 at 04:56:30PM +0000, Eric Biggers wrote:
>
> With the bio and folio support in the generic crypto infrastructure, there would
> be lots of work to do with adding support in all the underlying algorithms, and
> adding tests for all the new APIs.

It doesn't need to be all that complex.  My plan is to add a fallback
template at the top, which could then be implemented by strategic
algorithms at the bottom through an extension of the skcipher walk
mechanism.

So you'd allocate "fscrypt(xts(aes))" instead of "xts(aes)", and
the xts-vaes code could simply use the normal skcipher walker with
zero changes.  The only change needed in the aesni module is to
hook up the xts code to the new "fscrypt(xts(aes))" algorithm.
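
From the user's point of view the only visible change would be the
algorithm name, along these lines (a sketch; the "fscrypt" template
is what I'm proposing here, it doesn't exist yet):

	#include <crypto/skcipher.h>

	struct crypto_skcipher *tfm;

	tfm = crypto_alloc_skcipher("fscrypt(xts(aes))", 0, 0);
	if (IS_ERR(tfm))
		return PTR_ERR(tfm);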

The real reason why I think this is the way to go is that the same
speed-up would apply everywhere.  So just as I could gain a 15%
speed-up with xts in fscrypt without any actual multibuffer code,
a similar speed-up is expected with gcm in networking.  In fact
if anything the speed-up could be greater there because the data
unit size is smaller at under 1500 bytes.

You see the Crypto API as the problem here, but for me the problem
is the legacy data unit size.  It's 2025 and there is no reason why
anyone should be dealing with units of 1500 or 4096 bytes.

> For hashing, users would need to allocate an array to hold the digest for every
> block in the bio or folio.  That would add an additional memory allocation to
> every I/O.  You said you'd like to fall back to a smaller buffer if the memory
> allocation fails.  But that's silly; if we have to support that anyway, we might
> as well do it that way only, in which case the bio interface is pointless.

Sure, if you decide to go down the lib/crypto route then there is
no gain.  All it means is that you can't support hardware offload,
but neither of us really cares about that.

> Also note that the kernel *already* has an abstraction layer that allows
> doing en/decryption on bios.  It's called blk-crypto, and it makes it possible
> to do the en/decryption using either inline encryption hardware (i.e., the newer
> style of crypto accelerator that is actually commonly used and doesn't use the
> Crypto API at all) or the Crypto API.  I have plans to remove the fs-layer bio
> en/decryption code from fscrypt and always use blk-crypto instead.

Thanks for the pointer, I wasn't aware of its existence.  Yes
there should definitely be only one code path for this.  So
what's stopping you from removing fscrypt right now? IOW what's
missing from blk-crypto?

> Adding bio support to the Crypto API feels duplicative of blk-crypto, and we'd
> end up with too many abstraction layers.  I think my preferred approach is that
> blk-crypto-fallback would directly call the library functions.  The legacy
> Crypto API really has no useful role to play anymore.

I'd certainly like to see that :)

> FWIW, there are also people thinking about developing inline hashing hardware,
> in which case something similar would apply to blk-integrity.

Offloading a whole bio rather than a page or two is definitely
the way to go.

Cheers,
Eric Biggers April 1, 2025, 3:33 a.m. UTC | #3
On Tue, Apr 01, 2025 at 10:44:34AM +0800, Herbert Xu wrote:
> Thanks for the pointer, I wasn't aware of its existence.  Yes
> there should definitely be only one code path for this.  So
> what's stopping you from removing fscrypt right now? IOW what's
> missing from blk-crypto?

Well, fscrypt (ext4/f2fs/ubifs/ceph encryption) wouldn't be removed; its
implementation would just change on ext4 and f2fs.  Instead of providing a
choice of whether to use blk-crypto or fs-layer crypto for file contents
en/decryption, ext4 and f2fs would always use blk-crypto.  Supporting code such
as fscrypt_decrypt_bio() that would become unused by that would be removed.

A few reasons I've waited so long:

- The fs-layer file contents en/decryption code was there first, and there
  hasn't been a strong need to remove it yet
- Much of the file contents en/decryption code in fs/crypto/ would still be
  needed, since ubifs and ceph still use it as they are not block-based
- It would make CONFIG_BLK_INLINE_ENCRYPTION, which adds a field to struct bio,
  required on more systems
- It would add the overhead of keyslot management to software crypto
- blk-crypto currently always uses inline encryption hardware when it's
  available; but, I'd like to preserve ext4's and f2fs's existing behavior where
  the use of inline encryption hardware is opt-in via a mount option.

But I'm thinking it's finally time, especially with the ongoing
conversion of filesystems to operate on folios.

dm-crypt could of course use blk-crypto too, but the dm people haven't been
super comfortable so far with delegating en/decryption to the block layer.

- Eric
Herbert Xu April 1, 2025, 3:55 a.m. UTC | #4
On Mon, Mar 31, 2025 at 08:33:03PM -0700, Eric Biggers wrote:
>
> - It would add the overhead of keyslot management to software crypto

That seems to be a design error in blk-crypto.  Why should we
model the inadequacies of hardware in software?

If we're going through the software crypto path in blk-crypto,
it should be done as a first-class citizen, and not as a poor
man's version of hardware crypto.

Cheers,
Herbert Xu April 1, 2025, 4:14 a.m. UTC | #5
On Mon, Mar 31, 2025 at 09:08:52PM -0700, Eric Biggers wrote:
>
> Interesting seeing this argument coming from you when the whole Crypto API is
> built around forcing software crypto to use interfaces designed for hardware.

Perhaps you should take your rose-coloured glasses off? :)

> aes_expandkey() if we switch to that) for every I/O request.  The blk-crypto
> interface could be reworked to support pre-expansion of the key, but that would
> differ from what actual inline encryption hardware needs.  So this is just
> another case where the needs of hardware vs. software diverge...

If we're going to converge on one interface, then it better put
the needs of software crypto first and foremost.  Now that doesn't
mean throwing out support for hardware altogether, but hardware
does need to take a backseat every now and then.
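
For software, the key schedule should simply be expanded once at
key-setup time, e.g. with the existing library helper:

	#include <crypto/aes.h>

	struct crypto_aes_ctx ctx;
	int err;

	/* Expand once at setup; the context can then be reused for
	 * every I/O with no per-request setup cost. */
	err = aes_expandkey(&ctx, raw_key, AES_KEYSIZE_256);
	if (err)
		return err;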

Cheers,
Milan Broz April 1, 2025, 7:20 a.m. UTC | #6
On 4/1/25 5:33 AM, Eric Biggers wrote:
> On Tue, Apr 01, 2025 at 10:44:34AM +0800, Herbert Xu wrote:
>> Thanks for the pointer, I wasn't aware of its existence.  Yes
>> there should definitely be only one code path for this.  So
>> what's stopping you from removing fscrypt right now? IOW what's
>> missing from blk-crypto?
> 
> Well, fscrypt (ext4/f2fs/ubifs/ceph encryption) wouldn't be removed; its
> implementation would just change on ext4 and f2fs.  Instead of providing a
> choice of whether to use blk-crypto or fs-layer crypto for file contents
> en/decryption, ext4 and f2fs would always use blk-crypto.  Supporting code such
> as fscrypt_decrypt_bio() that would become unused by that would be removed.
> 
> A few reasons I've waited so long:
> 
> - The fs-layer file contents en/decryption code was there first, and there
>    hasn't been a strong need to remove it yet
> - Much of the file contents en/decryption code in fs/crypto/ would still be
>    needed, since ubifs and ceph still use it as they are not block-based
> - It would make CONFIG_BLK_INLINE_ENCRYPTION, which adds a field to struct bio,
>    required on more systems
> - It would add the overhead of keyslot management to software crypto
> - blk-crypto currently always uses inline encryption hardware when it's
>    available; but, I'd like to preserve ext4's and f2fs's existing behavior where
>    the use of inline encryption hardware is opt-in via a mount option.
> 
> But I'm thinking it's finally time, especially with the ongoing
> conversion of filesystems to operate on folios.
> 
> dm-crypt could of course use blk-crypto too, but the dm people haven't been
> super comfortable so far with delegating en/decryption to the block layer.

Hi,

I cannot speak for the device-mapper maintainers, but as I was the one who complained
about the introduction of block-layer inline crypto in dm-crypt, perhaps some clarification here:

I have no problem if there is a different block-layer/crypto API that guarantees
that the content of the bio is encrypted/decrypted; it could simplify dm-crypt a lot.
But it must not send plaintext to a random hardware device underneath by default,
as that changes the dm-crypt threat model (and I see you mention the opt-in hw option
for fs mounts as well).

However, dm-crypt also needs AEAD (authenticated encryption) support.
This is becoming important for devices that natively support additional per-sector metadata.
If we can access all these features through ublk in userspace one day, even better :)

Milan

Patch

diff --git a/arch/x86/crypto/aesni-intel_glue.c b/arch/x86/crypto/aesni-intel_glue.c
index 4f721760ebf1..57d149c223bd 100644
--- a/arch/x86/crypto/aesni-intel_glue.c
+++ b/arch/x86/crypto/aesni-intel_glue.c
@@ -17,6 +17,7 @@ 
  * Copyright 2024 Google LLC
  */
 
+#include <linux/bio.h>
 #include <linux/hardirq.h>
 #include <linux/types.h>
 #include <linux/module.h>
@@ -480,7 +481,7 @@  xts_crypt_slowpath(struct skcipher_request *req, xts_crypt_func crypt_func)
 
 /* __always_inline to avoid indirect call in fastpath */
 static __always_inline int
-xts_crypt(struct skcipher_request *req, xts_encrypt_iv_func encrypt_iv,
+xts_crypt_one(struct skcipher_request *req, xts_encrypt_iv_func encrypt_iv,
 	  xts_crypt_func crypt_func)
 {
 	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
@@ -511,6 +512,42 @@  xts_crypt(struct skcipher_request *req, xts_encrypt_iv_func encrypt_iv,
 	return xts_crypt_slowpath(req, crypt_func);
 }
 
+static __always_inline int
+xts_crypt(struct skcipher_request *req, xts_encrypt_iv_func encrypt_iv,
+	  xts_crypt_func crypt_func)
+{
+	unsigned int du_bits = req->cryptlen;
+	unsigned int du_size = 1U << du_bits;
+	__le64 *iv = (void *)req->iv;
+	struct folio_iter fi;
+	struct bio *bio;
+	int err = 0;	/* stays 0 if the bio has no folios */
+
+	if (!(req->base.flags & CRYPTO_SKCIPHER_REQ_BIO))
+		return xts_crypt_one(req, encrypt_iv, crypt_func);
+
+	bio = (void *)req->src;
+
+	for (bio_first_folio(&fi, bio, 0); fi.folio; bio_next_folio(&fi, bio)) {
+		size_t i = fi.offset;
+
+		for (; i < fi.offset + fi.length; i += du_size) {
+			skcipher_request_set_folio(req, fi.folio, i, fi.folio, i, du_size, iv);
+			err = xts_crypt_one(req, encrypt_iv, crypt_func);
+			if (err)
+				goto out;
+
+			*iv = cpu_to_le64(le64_to_cpu(*iv) + 1);
+		}
+	}
+
+out:
+	req->src = (void *)bio;
+	req->dst = (void *)bio;
+	req->cryptlen = du_bits;
+	return err;
+}
+
 static void aesni_xts_encrypt_iv(const struct crypto_aes_ctx *tweak_key,
 				 u8 iv[AES_BLOCK_SIZE])
 {
diff --git a/fs/crypto/bio.c b/fs/crypto/bio.c
index 0ad8c30b8fa5..9f52dc7f7889 100644
--- a/fs/crypto/bio.c
+++ b/fs/crypto/bio.c
@@ -7,6 +7,7 @@ 
  * Copyright (C) 2015, Motorola Mobility
  */
 
+#include <crypto/skcipher.h>
 #include <linux/pagemap.h>
 #include <linux/module.h>
 #include <linux/bio.h>
@@ -30,16 +31,49 @@ 
  */
 bool fscrypt_decrypt_bio(struct bio *bio)
 {
+	struct folio *folio = bio_first_folio_all(bio);
+	const struct inode *inode = folio->mapping->host;
+	const struct fscrypt_inode_info *ci = inode->i_crypt_info;
+	const unsigned int du_bits = ci->ci_data_unit_bits;
+	struct crypto_skcipher *tfm = ci->ci_enc_key.tfm;
+	SKCIPHER_REQUEST_ON_STACK(req, tfm, sizeof(bio));
+	struct bio **ctx = skcipher_request_extra(req);
+	DECLARE_CRYPTO_WAIT(wait);
 	struct folio_iter fi;
+	union fscrypt_iv iv;
+	u64 index;
+	int err;
 
-	bio_for_each_folio_all(fi, bio) {
-		int err = fscrypt_decrypt_pagecache_blocks(fi.folio, fi.length,
-							   fi.offset);
+	*ctx = bio;
 
-		if (err) {
-			bio->bi_status = errno_to_blk_status(err);
-			return false;
-		}
+	bio_first_folio(&fi, bio, 0);
+	if (!fi.folio)
+		return true;
+
+	index = fi.offset;
+	index = ((u64)fi.folio->index << (PAGE_SHIFT - du_bits)) +
+		(index >> du_bits);
+	fscrypt_generate_iv(&iv, index, ci);
+
+	skcipher_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG |
+					   CRYPTO_SKCIPHER_REQ_BIO,
+				      NULL, NULL);
+	skcipher_request_set_crypt(req, (struct scatterlist *)bio,
+				   (struct scatterlist *)bio, du_bits, &iv);
+
+	err = crypto_skcipher_decrypt(req);
+	if (err == -EAGAIN) {
+		req = SKCIPHER_REQUEST_CLONE(req, GFP_ATOMIC);
+		skcipher_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG |
+						   CRYPTO_SKCIPHER_REQ_BIO,
+					      crypto_req_done, &wait);
+		err = crypto_skcipher_decrypt(req);
+	}
+	err = crypto_wait_req(err, &wait);
+	skcipher_request_free(req);
+	if (err) {
+		bio->bi_status = errno_to_blk_status(err);
+		return false;
 	}
 	return true;
 }
diff --git a/include/crypto/skcipher.h b/include/crypto/skcipher.h
index e159ea68124e..931585f864d1 100644
--- a/include/crypto/skcipher.h
+++ b/include/crypto/skcipher.h
@@ -26,6 +26,8 @@ 
 #define CRYPTO_SKCIPHER_REQ_CONT	0x00000001
 /* Set this bit if the skcipher operation is not final. */
 #define CRYPTO_SKCIPHER_REQ_NOTFINAL	0x00000002
+/* Set this bit if the skcipher is made of bio. */
+#define CRYPTO_SKCIPHER_REQ_BIO		0x00000004
 
 /**
  *	struct skcipher_request - Symmetric key cipher request