Chaining is dead

Message ID	Z-itc_Qd5LLn19pH@gondor.apana.org.au
State	New
Headers	show Received: from abb.hmeau.com (abb.hmeau.com [144.6.53.87]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 81F83111A8; Sun, 30 Mar 2025 02:33:34 +0000 (UTC) Date: Sun, 30 Mar 2025 10:33:23 +0800 From: Herbert Xu <herbert@gondor.apana.org.au> To: Eric Biggers <ebiggers@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org>, "David S. Miller" <davem@davemloft.net>, Linux Kernel Mailing List <linux-kernel@vger.kernel.org>, Linux Crypto Mailing List <linux-crypto@vger.kernel.org> Subject: Chaining is dead Message-ID: <Z-itc_Qd5LLn19pH@gondor.apana.org.au> References: <ZpkdZopjF9/9/Njx@gondor.apana.org.au> <ZuetBbpfq5X8BAwn@gondor.apana.org.au> <ZzqyAW2HKeIjGnKa@gondor.apana.org.au> <Z5Ijqi4uSDU9noZm@gondor.apana.org.au> <Z-JE2HNY-Tj8qwQw@gondor.apana.org.au> <20250325152541.GA1661@sol.localdomain> <Z-NdGvErMGS5OT7X@gondor.apana.org.au> <Z-NjarWmWSmQ97K0@gondor.apana.org.au> <20250326033404.GD1661@sol.localdomain> <Z-N55Yjve6wTnPqm@gondor.apana.org.au> Precedence: bulk MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <Z-N55Yjve6wTnPqm@gondor.apana.org.au>
Series	Chaining is dead \| expand Chaining is dead

Herbert Xu March 30, 2025, 2:33 a.m. UTC

On Wed, Mar 26, 2025 at 11:52:05AM +0800, Herbert Xu wrote:
>
> they don't need it.  Take ext4 as an example:
> 
> 	ext4 calls verity
> 	schedule_work(verity_work);
> 	return asynchronously!
> 
> verity_work:
> 	do the crypto work
> 	__read_end_io(bio);

I went ahead and removed the work queue for fsverity and fscrypt
(except for the reading of the Merkle tree which is still done in
a work queue because I'm too lazy to make that async), and it
actually turned out to be slower than using a work queue.

I was testing with an encrypted 8GB file over ext4 mounted over a
loopback device in tmpfs.  The encryption is with xts-vaes.  It turns
out that not using a work queue actually made reading the entire file
go from 2.4s to 2.5s.

I then tried passing the whole bio (256KB per crypto request in my
test as opposed to the data unit size of 4KB per crypto request)
through using chaining to skcipher, with xts-vaes doing the requests
one-by-one.  Against my expectations, this didn't speed things up at
all (but at least it didn't slow things down either).  All the
benefits of aggregating the data were offset by the extra setup cost
of creating the chained requests.

So chaining is clearly not the way to go because it involves cutting
up into data units at the start of the process, rather than the end.

Finally I hacked up a patch (this goes on top of the skcipher branch
in cryptodev) to pass the whole bio through the Crypto API all the
way to xts-vaes which then unbundled it.  This turned out to be a
winner, taking the read time for 8GB from 2.4s down to 2.1s.

In view of this result, I'm going to throw away chaining, and instead
work on an interface that can take a whole bio (or folio), then cut
it up into the specified data unit size before processing.

The bottom-end of the interface should be able to feed two (or whatever
number you fancy) data units to the actual algorithm.

This should work just as well for compression, since their batching
input is simply a order-N folio.  The compression output is a bit
harder because the data unit size is not constant, but I think I
have a way of making it work by adding a bit to the scatterlist data
structure to indicate the end of each data unit.

PS For fsverity a 256KB bio size equates to 64 units of hash input.
My strategy is to allocate the whole thing if we can (2KB or 4KB
depending on your digest size), and if that fails, fall back to
a stack buffer of 512 bytes (or whatever number that keeps the
compiler quiet regarding stack usage).  Even if we're on the stack,
it should still give more than enough to data to satiate your
multibuffer hash code.

Cheers,

Eric Biggers March 31, 2025, 4:56 p.m. UTC | #1

On Sun, Mar 30, 2025 at 10:33:23AM +0800, Herbert Xu wrote:
> On Wed, Mar 26, 2025 at 11:52:05AM +0800, Herbert Xu wrote:
> >
> > they don't need it.  Take ext4 as an example:
> > 
> > 	ext4 calls verity
> > 	schedule_work(verity_work);
> > 	return asynchronously!
> > 
> > verity_work:
> > 	do the crypto work
> > 	__read_end_io(bio);
> 
> I went ahead and removed the work queue for fsverity and fscrypt
> (except for the reading of the Merkle tree which is still done in
> a work queue because I'm too lazy to make that async), and it
> actually turned out to be slower than using a work queue.
> 
> I was testing with an encrypted 8GB file over ext4 mounted over a
> loopback device in tmpfs.  The encryption is with xts-vaes.  It turns
> out that not using a work queue actually made reading the entire file
> go from 2.4s to 2.5s.
> 
> I then tried passing the whole bio (256KB per crypto request in my
> test as opposed to the data unit size of 4KB per crypto request)
> through using chaining to skcipher, with xts-vaes doing the requests
> one-by-one.  Against my expectations, this didn't speed things up at
> all (but at least it didn't slow things down either).  All the
> benefits of aggregating the data were offset by the extra setup cost
> of creating the chained requests.

Yes, your chaining API has poor performance and is difficult to test, as I've
been saying all along.

> So chaining is clearly not the way to go because it involves cutting
> up into data units at the start of the process, rather than the end.

Certainly agreed that chaining is not the way to go, but I think you're
overlooking that Linus's suggestion to use the libraries directly would also
solve this, while also not being restricted to bios and folios (note that not
all filesystems are block-based, for example...).  That would avoid the
per-request overhead from the generic crypto infrastructure, which is the real
source of the problem.

> Finally I hacked up a patch (this goes on top of the skcipher branch
> in cryptodev) to pass the whole bio through the Crypto API all the
> way to xts-vaes which then unbundled it.  This turned out to be a
> winner, taking the read time for 8GB from 2.4s down to 2.1s.
> 
> In view of this result, I'm going to throw away chaining, and instead
> work on an interface that can take a whole bio (or folio), then cut
> it up into the specified data unit size before processing.
> 
> The bottom-end of the interface should be able to feed two (or whatever
> number you fancy) data units to the actual algorithm.
> 
> This should work just as well for compression, since their batching
> input is simply a order-N folio.  The compression output is a bit
> harder because the data unit size is not constant, but I think I
> have a way of making it work by adding a bit to the scatterlist data
> structure to indicate the end of each data unit.
> 
> PS For fsverity a 256KB bio size equates to 64 units of hash input.
> My strategy is to allocate the whole thing if we can (2KB or 4KB
> depending on your digest size), and if that fails, fall back to
> a stack buffer of 512 bytes (or whatever number that keeps the
> compiler quiet regarding stack usage).  Even if we're on the stack,
> it should still give more than enough to data to satiate your
> multibuffer hash code.

Extending the generic crypto infrastructure to support bios and folios is an
interesting idea.

But TBH I think it's worse than Linus's suggestion of just extending lib/crypto/
to support the needed functionality and using that directly.  Your proposal is
again solving a problem created by the generic crypto infrastructure being too
complex, by making the generic crypto infrastructure even more complex.

With the bio and folio support in the generic crypto infrastructure, there would
be lots of work to do with adding support in all the underlying algorithms, and
adding tests for all the new APIs.

For hashing, users would need to allocate an array to hold the digest for every
block in the bio or folio.  That would add an additional memory allocation to
every I/O.  You said you'd like to fall back to a smaller buffer if the memory
allocation fails.  But that's silly; if we have to support that anyway, we might
as well do it that way only.  In which case the bio interface is pointless.

Also note that the kernel also *already* has an abstraction layer that allows
doing en/decryption on bios.  It's called blk-crypto, and it makes it possible
to do the en/decryption using either inline encryption hardware (i.e., the newer
style of crypto accelerator that is actually commonly used and doesn't use the
Crypto API at all) or the Crypto API.  I have plans to remove the fs-layer bio
en/decryption code from fscrypt and always use blk-crypto instead.

Adding bio support to the Crypto API feels duplicative of blk-crypto, and we'd
end up with too many abstraction layers.  I think my preferred approach is that
blk-crypto-fallback would directly call the library functions.  The legacy
Crypto API really has no useful role to play anymore.

FWIW, there are also people thinking about developing inline hashing hardware,
in which case something similar would apply to blk-integrity.

- Eric

Herbert Xu April 1, 2025, 2:44 a.m. UTC | #2

On Mon, Mar 31, 2025 at 04:56:30PM +0000, Eric Biggers wrote:
>
> With the bio and folio support in the generic crypto infrastructure, there would
> be lots of work to do with adding support in all the underlying algorithms, and
> adding tests for all the new APIs.

It doesn't need to be all that complex.  My plan is to add a fallback
template at the top, which could then be implemented by strategic
algorithms at the bottom through an extension of the skcipher walk
mechanism.

So you'd allocate "fscrypt(xts(aes))" instead of "xts(aes)", and
the xts-vaes code could simply use the normal skcipher walker with
zero changes.  The only changes needed in the aesni module is to
hook up the xts code to the new "fscrypt(xts(aes))" algorithm.

The real reason why I think this is the way to go is that the same
speed-up would apply everywhere.  So just as I could gain a 15%
speed-up with xts in fscrypt without any actual multibuffer code,
a similar speed-up is expected with gcm in networking.  In fact
if anything the speed-up could be greater there because the data
unit size is smaller at under 1500 bytes.

You see the Crypto API as the problem here, but for me the problem
is the legacy data unit size.  It's 2025 and there is no reason why
anyone should be dealing with units of 1500 or 4096 bytes.

> For hashing, users would need to allocate an array to hold the digest for every
> block in the bio or folio.  That would add an additional memory allocation to
> every I/O.  You said you'd like to fall back to a smaller buffer if the memory
> allocation fails.  But that's silly; if we have to support that anyway, we might
> as well do it that way only.  In which case the bio interface is pointless.

Sure if you decide to go down the lib/crypto route then there is
no gain.  All it means is that you can't support hardware offload,
but neither of us really care about that.

> Also note that the kernel also *already* has an abstraction layer that allows
> doing en/decryption on bios.  It's called blk-crypto, and it makes it possible
> to do the en/decryption using either inline encryption hardware (i.e., the newer
> style of crypto accelerator that is actually commonly used and doesn't use the
> Crypto API at all) or the Crypto API.  I have plans to remove the fs-layer bio
> en/decryption code from fscrypt and always use blk-crypto instead.

Thanks for the pointer, I wasn't aware of its existence.  Yes
there should definitely be only one code path for this.  So
what's stopping you from removing fscrypt right now? IOW what's
missing from blk-crypto?

> Adding bio support to the Crypto API feels duplicative of blk-crypto, and we'd
> end up with too many abstraction layers.  I think my preferred approach is that
> blk-crypto-fallback would directly call the library functions.  The legacy
> Crypto API really has no useful role to play anymore.

I'd certainly like to see that :)

> FWIW, there are also people thinking about developing inline hashing hardware,
> in which case something similar would apply to blk-integrity.

Offloading a whole bio rather than a page or two is definitely
the way to go.

Cheers,

Eric Biggers April 1, 2025, 3:33 a.m. UTC | #3

On Tue, Apr 01, 2025 at 10:44:34AM +0800, Herbert Xu wrote:
> Thanks for the pointer, I wasn't aware of its existence.  Yes
> there should definitely be only one code path for this.  So
> what's stopping you from removing fscrypt right now? IOW what's
> missing from blk-crypto?

Well, fscrypt (ext4/f2fs/ubifs/ceph encryption) wouldn't be removed; its
implementation would just change on ext4 and f2fs.  Instead of providing a
choice of whether to use blk-crypto or fs-layer crypto for file contents
en/decryption, ext4 and f2fs would always use blk-crypto.  Supporting code such
as fscrypt_decrypt_bio() that would become unused by that would be removed.

A few reasons I've waited so long:

- The fs-layer file contents en/decryption code was there first, and there
  hasn't been a strong need to remove it yet
- Much of the file contents en/decryption code in fs/crypto/ would still be
  needed, since ubifs and ceph still use it as they are not block-based
- It would make CONFIG_BLK_INLINE_ENCRYPTION, which adds a field to struct bio,
  required on more systems
- It would add the overhead of keyslot management to software crypto
- blk-crypto currently always uses inline encryption hardware when it's
  available; but, I'd like to preserve ext4's and f2fs's existing behavior where
  the use of inline encryption hardware is opt-in via a mount option.

But I'm thinking it's finally time, especially with the conversions of
filesystems to operate on folios that's going on.

dm-crypt could of course use blk-crypto too, but the dm people haven't been
super comfortable so far with delegating en/decryption to the block layer.

- Eric

Herbert Xu April 1, 2025, 3:55 a.m. UTC | #4

On Mon, Mar 31, 2025 at 08:33:03PM -0700, Eric Biggers wrote:
>
> - It would add the overhead of keyslot management to software crypto

That seems to be a design error in blk_crypto.  Why should we
model the inadequacies of hardware in software?

If we're going through the software crypto path in blk_crypto,
it should be done as a first-clsas citizen, and not as a poor
man's version of hardware crypto.

Cheers,

Herbert Xu April 1, 2025, 4:14 a.m. UTC | #5

On Mon, Mar 31, 2025 at 09:08:52PM -0700, Eric Biggers wrote:
>
> Interesting seeing this argument coming from you when the whole Crypto API is
> built around forcing software crypto to use interfaces designed for hardware.

Perhaps you should take your rose-coloured glasses off? :)

> aes_expandkey() if we switch to that) for every I/O request.  The blk-crypto
> interface could be reworked to support pre-expansion of the key, but that would
> differ from what actual inline encryption hardware needs.  So this is just
> another case where the needs of hardware vs. software diverge...

If we're going to converge on one interface, then it better put
the needs of software crypto first and foremost.  Now that doesn't
mean throwing out support for hardware altogether, but hardware
does need to take a backseat every now and then.

Cheers,

Milan Broz April 1, 2025, 7:20 a.m. UTC | #6

On 4/1/25 5:33 AM, Eric Biggers wrote:
> On Tue, Apr 01, 2025 at 10:44:34AM +0800, Herbert Xu wrote:
>> Thanks for the pointer, I wasn't aware of its existence.  Yes
>> there should definitely be only one code path for this.  So
>> what's stopping you from removing fscrypt right now? IOW what's
>> missing from blk-crypto?
> 
> Well, fscrypt (ext4/f2fs/ubifs/ceph encryption) wouldn't be removed; its
> implementation would just change on ext4 and f2fs.  Instead of providing a
> choice of whether to use blk-crypto or fs-layer crypto for file contents
> en/decryption, ext4 and f2fs would always use blk-crypto.  Supporting code such
> as fscrypt_decrypt_bio() that would become unused by that would be removed.
> 
> A few reasons I've waited so long:
> 
> - The fs-layer file contents en/decryption code was there first, and there
>    hasn't been a strong need to remove it yet
> - Much of the file contents en/decryption code in fs/crypto/ would still be
>    needed, since ubifs and ceph still use it as they are not block-based
> - It would make CONFIG_BLK_INLINE_ENCRYPTION, which adds a field to struct bio,
>    required on more systems
> - It would add the overhead of keyslot management to software crypto
> - blk-crypto currently always uses inline encryption hardware when it's
>    available; but, I'd like to preserve ext4's and f2fs's existing behavior where
>    the use of inline encryption hardware is opt-in via a mount option.
> 
> But I'm thinking it's finally time, especially with the conversions of
> filesystems to operate on folios that's going on.
> 
> dm-crypt could of course use blk-crypto too, but the dm people haven't been
> super comfortable so far with delegating en/decryption to the block layer.

Hi,

I cannot speak for device-mapper maintainers, but as it was me who complained about
block layer inline crypto introduction in dm-crypt, perhaps some clarification here:

I have no problem if there is a different block-layer/crypto API that guarantees
that the content of the bio is encrypted/decrypted, it could simplify dm-crypt a lot.
But it must not send plaintext to a random hardware device underneath by default
as it changes the dm-crypt threat model (and I see you mention the opt-in hw option
for fs mount as well).

However, dm-crypt also needs AEAD (authenticated encryption) support.
This is becoming important for devices that natively support additional per-sector metadata.
If we can access all these features through ublk in userspace one day, even better :)

Milan

Christoph Hellwig April 4, 2025, 8:46 a.m. UTC | #7

On Mon, Mar 31, 2025 at 04:56:30PM +0000, Eric Biggers wrote:
> > My strategy is to allocate the whole thing if we can (2KB or 4KB
> > depending on your digest size), and if that fails, fall back to
> > a stack buffer of 512 bytes (or whatever number that keeps the
> > compiler quiet regarding stack usage).  Even if we're on the stack,
> > it should still give more than enough to data to satiate your
> > multibuffer hash code.
> 
> Extending the generic crypto infrastructure to support bios and folios is an
> interesting idea.

The right way to do that is to make it work on an iov_iter.  David
Howells talked about that at LSF/MM and might even have done that work
at least at a POC level.  That way you can trivially pass in a bio
because it's just an ITER_BVEC iter, and for a folio you'd build a
single entry bvec.

The primary intent here is to avoid building the scatterlist for crypto
work in network file systems, but just about any other caller would
benefit as well, but that scatterlist isn't a good fit for any of the
callers I've looked at, it a really bad fit for software crypto and at
least once we get the new two step DMA API also a bad fit for hardware
offload.

Commit Message

Comments

Patch