[07/16] crypto: sha512 - replace sha512_generic with wrapper around SHA-512 library

Message ID 20250611020923.1482701-8-ebiggers@kernel.org
State Superseded
Series SHA-512 library functions

Commit Message

Eric Biggers June 11, 2025, 2:09 a.m. UTC
From: Eric Biggers <ebiggers@google.com>

Delete crypto/sha512_generic.c, which provided "generic" SHA-384 and
SHA-512 crypto_shash algorithms.  Replace it with crypto/sha512.c which
provides SHA-384, SHA-512, HMAC-SHA384, and HMAC-SHA512 crypto_shash
algorithms using the corresponding library functions.

This is a prerequisite for migrating all the arch-optimized SHA-512 code
(which is almost 3000 lines) to lib/crypto/ rather than duplicating it.

Since the replacement crypto_shash algorithms are implemented using the
(potentially arch-optimized) library functions, give them
cra_driver_names ending with "-lib" rather than "-generic".  Update
crypto/testmgr.c and one odd driver to take this change in driver name
into account.  Besides these now-handled cases, there are no known
cases where the cra_driver_name was being depended on.

This change does mean that the abstract partial block handling code in
crypto/shash.c, which got added in 6.16, no longer gets used.  But
that's fine; the library has to implement the partial block handling
anyway, and it's better to do it in the library since the block size and
other properties of the algorithm are all fixed at compile time there,
resulting in more streamlined code.
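
As a concrete illustration (a sketch for this commit message, not part
of the diff below; the demo_* function name is hypothetical), callers
keep allocating "sha512" through the usual crypto_shash API and now
simply resolve to the library-backed implementation:

#include <crypto/hash.h>

static int demo_sha512(const u8 *data, unsigned int len, u8 *digest)
{
	struct crypto_shash *tfm;
	int err;

	/* resolves to "sha512-lib" when no higher-priority driver wins */
	tfm = crypto_alloc_shash("sha512", 0, 0);
	if (IS_ERR(tfm))
		return PTR_ERR(tfm);

	{
		SHASH_DESC_ON_STACK(desc, tfm);

		desc->tfm = tfm;
		err = crypto_shash_digest(desc, data, len, digest);
	}

	crypto_free_shash(tfm);
	return err;
}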

Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 crypto/Kconfig                        |   4 +-
 crypto/Makefile                       |   2 +-
 crypto/sha512.c                       | 254 ++++++++++++++++++++++++++
 crypto/sha512_generic.c               | 217 ----------------------
 crypto/testmgr.c                      |  16 ++
 drivers/crypto/starfive/jh7110-hash.c |   8 +-
 include/crypto/sha512_base.h          |   3 -
 7 files changed, 278 insertions(+), 226 deletions(-)
 create mode 100644 crypto/sha512.c
 delete mode 100644 crypto/sha512_generic.c

Comments

Herbert Xu June 11, 2025, 2:24 a.m. UTC | #1
Eric Biggers <ebiggers@kernel.org> wrote:
>
> +       {
> +               .base.cra_name          = "sha512",
> +               .base.cra_driver_name   = "sha512-lib",
> +               .base.cra_priority      = 100,
> +               .base.cra_blocksize     = SHA512_BLOCK_SIZE,
> +               .base.cra_module        = THIS_MODULE,
> +               .digestsize             = SHA512_DIGEST_SIZE,
> +               .init                   = crypto_sha512_init,
> +               .update                 = crypto_sha512_update,
> +               .final                  = crypto_sha512_final,
> +               .digest                 = crypto_sha512_digest,
> +               .descsize               = sizeof(struct sha512_ctx),
> +       },

This changes the export format which breaks fallback support
for ahash drivers.

You need to retain the existing export format.

Cheers,
Eric Biggers June 11, 2025, 3:39 a.m. UTC | #2
On Wed, Jun 11, 2025 at 10:24:41AM +0800, Herbert Xu wrote:
> Eric Biggers <ebiggers@kernel.org> wrote:
> >
> > +       {
> > +               .base.cra_name          = "sha512",
> > +               .base.cra_driver_name   = "sha512-lib",
> > +               .base.cra_priority      = 100,
> > +               .base.cra_blocksize     = SHA512_BLOCK_SIZE,
> > +               .base.cra_module        = THIS_MODULE,
> > +               .digestsize             = SHA512_DIGEST_SIZE,
> > +               .init                   = crypto_sha512_init,
> > +               .update                 = crypto_sha512_update,
> > +               .final                  = crypto_sha512_final,
> > +               .digest                 = crypto_sha512_digest,
> > +               .descsize               = sizeof(struct sha512_ctx),
> > +       },
> 
> This changes the export format which breaks fallback support
> for ahash drivers.
> 
> You need to retain the existing export format.

Do you have a concrete example (meaning, a specific driver) where this actually
matters?  Historically, export and import have always had to be paired for the
same transformation object, i.e. import was called only with the output of
export.  There is not, and never has been, any test that tests otherwise.  This
seems like a brand new "requirement" that you've made up unnecessarily.

It also makes much more sense for the export format to simply be the struct used
by the library (e.g. sha512_ctx), not some undocumented struct generated by
pointer arithmetic.

And drivers should just use the library as their fallback, or else just do what
they did before when they must not have been depending on a particular format.

I'll add export and import functions if you insist, but it seems pointless.

Could you at least provide proper definitions for the legacy structs so that I
don't have to do pointer arithmetic to generate them?

- Eric
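
For reference, a minimal sketch of the pairing contract described above
(hypothetical demo_* helper; assumes only the standard crypto_shash
API): state exported from a desc is imported back into a desc for the
same algorithm, then hashing continues where the export left off.

#include <crypto/hash.h>

static int demo_export_import(struct crypto_shash *tfm,
			      const u8 *p1, unsigned int len1,
			      const u8 *p2, unsigned int len2, u8 *digest)
{
	u8 state[HASH_MAX_STATESIZE];
	SHASH_DESC_ON_STACK(desc, tfm);
	int err;

	desc->tfm = tfm;
	err = crypto_shash_init(desc) ?:
	      crypto_shash_update(desc, p1, len1) ?:
	      crypto_shash_export(desc, state);	/* export ... */
	if (err)
		return err;

	/* ... import into a desc for the *same* algorithm ... */
	err = crypto_shash_import(desc, state);
	if (err)
		return err;

	/* ... and continue hashing from the exported state */
	return crypto_shash_finup(desc, p2, len2, digest);
}
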
Eric Biggers June 11, 2025, 3:58 a.m. UTC | #3
On Wed, Jun 11, 2025 at 11:46:47AM +0800, Herbert Xu wrote:
> On Tue, Jun 10, 2025 at 08:39:57PM -0700, Eric Biggers wrote:
> >
> > Do you have a concrete example (meaning, a specific driver) where this actually
> > matters?  Historically, export and import have always had to be paired for the
> > same transformation object, i.e. import was called only with the output of
> > export.  There is not, and never has been, any test that tests otherwise.  This
> > seems like a brand new "requirement" that you've made up unnecessarily.
> 
> It's not just drivers that may be using fallbacks, the ahash API
> code itself now relies on this to provide fallbacks for cases that
> drivers can't handle, such as linear addresses.
> 
> I did add the testing for it, which revealed a few problems with
> s390 so it was reverted for 6.16.  But I will be adding it back
> after the s390 issues have been resolved.

Okay, so it sounds like in practice this is specific to ahash_do_req_chain()
which you recently added.  I'm not sure what it's meant to be doing.

> > I'll add export and import functions if you insist, but it seems pointless.
> >
> > Could you at least provide proper definitions for the legacy structs so that I
> > don't have to do pointer arithmetic to generate them?
> 
> Just expose the sha512 block functions and use them as is.  There
> is no need to do the export/import dance.

We're not going to support direct access to the SHA-512 compression function as
part of the library API.  It's just unnecessary and error-prone.  crypto/ will
just use the same well-documented and well-tested public API as everyone else.

- Eric
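
For reference, the library interface in question, as used by the new
crypto/sha512.c in the patch below (the demo function name is
hypothetical):

#include <crypto/sha2.h>

static void demo_lib_sha512(const u8 *msg, size_t len,
			    u8 digest[SHA512_DIGEST_SIZE])
{
	struct sha512_ctx ctx;

	/* incremental interface */
	sha512_init(&ctx);
	sha512_update(&ctx, msg, len);
	sha512_final(&ctx, digest);

	/* or the equivalent one-shot interface */
	sha512(msg, len, digest);
}
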
Eric Biggers June 13, 2025, 5:36 a.m. UTC | #4
On Tue, Jun 10, 2025 at 08:58:42PM -0700, Eric Biggers wrote:
> On Wed, Jun 11, 2025 at 11:46:47AM +0800, Herbert Xu wrote:
> > On Tue, Jun 10, 2025 at 08:39:57PM -0700, Eric Biggers wrote:
> > >
> > > Do you have a concrete example (meaning, a specific driver) where this actually
> > > matters?  Historically, export and import have always had to be paired for the
> > > same transformation object, i.e. import was called only with the output of
> > > export.  There is not, and never has been, any test that tests otherwise.  This
> > > seems like a brand new "requirement" that you've made up unnecessarily.
> > 
> > It's not just drivers that may be using fallbacks, the ahash API
> > code itself now relies on this to provide fallbacks for cases that
> > drivers can't handle, such as linear addresses.
> > 
> > I did add the testing for it, which revealed a few problems with
> > s390 so it was reverted for 6.16.  But I will be adding it back
> > after the s390 issues have been resolved.
> 
> Okay, so it sounds like in practice this is specific to ahash_do_req_chain()
> which you recently added.  I'm not sure what it's meant to be doing.

You do know that most of the sha512 asynchronous hash drivers use custom state
formats and not your new one, right?  So your code in ahash_do_req_chain() is
broken for most asynchronous hash drivers anyway.

- Eric
Herbert Xu June 13, 2025, 5:38 a.m. UTC | #5
On Thu, Jun 12, 2025 at 10:36:24PM -0700, Eric Biggers wrote:
>
> You do know that most of the sha512 asynchronous hash drivers use custom state
> formats and not your new one, right?  So your code in ahash_do_req_chain() is
> broken for most asynchronous hash drivers anyway.

Every driver needs to be converted by hand.  Once a driver has
been converted it'll be marked as block-only which activates
the fallback path in ahash.

Cheers,
Eric Biggers June 13, 2025, 5:54 a.m. UTC | #6
On Fri, Jun 13, 2025 at 01:38:59PM +0800, Herbert Xu wrote:
> On Thu, Jun 12, 2025 at 10:36:24PM -0700, Eric Biggers wrote:
> >
> > You do know that most of the sha512 asynchronous hash drivers use custom state
> > formats and not your new one, right?  So your code in ahash_do_req_chain() is
> > broken for most asynchronous hash drivers anyway.
> 
> Every driver needs to be converted by hand.  Once a driver has
> been converted it'll be marked as block-only which activates
> the fallback path in ahash.

Actually, crypto_ahash::base::fb is initialized if CRYPTO_ALG_NEED_FALLBACK,
which many of the drivers already set.  Then crypto_ahash_update() calls
ahash_do_req_chain() if the algorithm does *not* have
CRYPTO_AHASH_ALG_BLOCK_ONLY set.  Which then exports the driver's custom state
and tries to import it into the fallback.

As far as I can tell, it's just broken for most of the existing drivers.

- Eric
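
A toy userspace model of the mismatch (not kernel code; both struct
layouts are made up for illustration): blindly treating a
driver-private exported state as if it were the generic layout
scrambles the fields, which is what happens when export and import
cross between incompatible formats.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct generic_state {		/* what the fallback expects */
	uint64_t h[8];
	uint64_t count;
	uint8_t buf[128];
};

struct driver_state {		/* what a legacy driver exports */
	uint32_t hw_flags;	/* driver-private bookkeeping first */
	uint64_t count;
	uint64_t h[8];
};

int main(void)
{
	struct driver_state drv = { .hw_flags = 1, .count = 128 };
	struct generic_state fb = { 0 };

	memset(drv.h, 0xab, sizeof(drv.h));

	/* a blind export->import between the two layouts */
	memcpy(&fb, &drv, sizeof(drv) < sizeof(fb) ? sizeof(drv) : sizeof(fb));

	/* fb.h[0] now contains flags/count bits, not hash state */
	printf("fb.h[0] = %016llx\n", (unsigned long long)fb.h[0]);
	return 0;
}
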
Ard Biesheuvel June 13, 2025, 7:38 a.m. UTC | #7
On Fri, 13 Jun 2025 at 07:55, Eric Biggers <ebiggers@kernel.org> wrote:
>
> On Fri, Jun 13, 2025 at 01:38:59PM +0800, Herbert Xu wrote:
> > On Thu, Jun 12, 2025 at 10:36:24PM -0700, Eric Biggers wrote:
> > >
> > > You do know that most of the sha512 asynchronous hash drivers use custom state
> > > formats and not your new one, right?  So your code in ahash_do_req_chain() is
> > > broken for most asynchronous hash drivers anyway.
> >
> > Every driver needs to be converted by hand.  Once a driver has
> > been converted it'll be marked as block-only which activates
> > the fallback path in ahash.
>

Perhaps I am just slow, but could you please explain again what the
point is of all these changes?

Where is h/w accelerated ahash being used to the extent that it
justifies changing all this existing code to accommodate it?
Herbert Xu June 13, 2025, 8:39 a.m. UTC | #8
On Fri, Jun 13, 2025 at 09:38:11AM +0200, Ard Biesheuvel wrote:
>
> Perhaps I am just slow, but could you please explain again what the
> point is of all these changes?
> 
> Where is h/w accelerated ahash being used to the extent that it
> justifies changing all this existing code to accommodate it?

There are two separate changes.

First of all the export format is being made consistent so that
any hardware hash can switch over to a software fallback after
it has started, e.g., in the event of a memory allocation failure.

The partial block API handling on the other hand is about simplifying
the drivers so that they are less error-prone.

Cheers,
Eric Biggers June 13, 2025, 2:51 p.m. UTC | #9
On Fri, Jun 13, 2025 at 04:39:10PM +0800, Herbert Xu wrote:
> On Fri, Jun 13, 2025 at 09:38:11AM +0200, Ard Biesheuvel wrote:
> >
> > Perhaps I am just slow, but could you please explain again what the
> > point is of all these changes?
> > 
> > Where is h/w accelerated ahash being used to the extent that it
> > justifies changing all this existing code to accommodate it?
> 
> There are two separate changes.
> 
> First of all the export format is being made consistent so that
> any hardware hash can switch over to a software fallback after
> it has started, e.g., in the event of a memory allocation failure.
> 
> The partial block API handling on the other hand is about simplifying
> the drivers so that they are less error-prone.

Is it perhaps time to reconsider your plan, given that it's causing problems
for the librarification effort (which is much more useful), and that most of
the legacy hardware offload drivers seem to be incompatible with it anyway?

- Eric
Linus Torvalds June 13, 2025, 4:35 p.m. UTC | #10
On Fri, 13 Jun 2025 at 01:39, Herbert Xu <herbert@gondor.apana.org.au> wrote:
>
> First of all the export format is being made consistent so that
> any hardware hash can switch over to a software fallback after
> it has started, e.g., in the event of a memory allocation failure.

Can we please instead aim to *simplify* the crypto thing?

Just say that hw accelerators that have this kind of issue shouldn't
be used. At all. And certainly not be catered to by generic code.

The whole hw acceleration is very dubious to begin with unless it's
directly tied to the source (or destination) of the data in the first
place, so that there isn't extra data movement.

And if there are any software fallbacks, that "dubious to begin with"
pretty much becomes "entirely pointless".

If the point is that there are existing stupid hw drivers that already
do that fallback internally, then please just *keep* that kind of
idiocy and workarounds in the drivers.

It's actually *better* to have a broken garbage hardware driver - one
that you can easily just disable on its own - than to have a broken
garbage generic crypto layer that people just don't want to use at all
because it's such a mess.

This whole "make the mess that is the crypto layer EVEN MORE OF A
MESS" model of development is completely broken in my opinion.

There's a reason people prefer to have just the sw library without any
of the indirection or complexity of the crypto layer.

           Linus
Eric Biggers June 15, 2025, 3:18 a.m. UTC | #11
On Fri, Jun 13, 2025 at 04:51:38PM +0800, Herbert Xu wrote:
> On Thu, Jun 12, 2025 at 10:54:39PM -0700, Eric Biggers wrote:
> >
> > Actually, crypto_ahash::base::fb is initialized if CRYPTO_ALG_NEED_FALLBACK,
> > which many of the drivers already set.  Then crypto_ahash_update() calls
> > ahash_do_req_chain() if the algorithm does *not* have
> > CRYPTO_AHASH_ALG_BLOCK_ONLY set.  Which then exports the driver's custom state
> > and tries to import it into the fallback.
> > 
> > As far as I can tell, it's just broken for most of the existing drivers.
> 
> This fallback path is only meant to be used for drivers that have
> been converted.  But you're right there is a check missing in there.
> 
> Thanks,
> 
> ---8<---
> Ensure that drivers that have not been converted to the ahash API
> do not use the ahash_request_set_virt fallback path as they cannot
> use the software fallback.
> 
> Reported-by: Eric Biggers <ebiggers@kernel.org>
> Fixes: 9d7a0ab1c753 ("crypto: ahash - Handle partial blocks in API")
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Okay.  Out of curiosity I decided to actually test the Qualcomm Crypto Engine
driver on a development board that has a Qualcomm SoC, using latest mainline.

Even with your patch applied, it overflows the stack when running the crypto
self-tests, apparently due to crypto/ahash.c calling into itself recursively:

    [    9.230887] Insufficient stack space to handle exception!
    [    9.230889] ESR: 0x0000000096000047 -- DABT (current EL)
    [    9.230891] FAR: 0xffff800084927fe0
    [    9.230891] Task stack:     [0xffff800084928000..0xffff80008492c000]
    [    9.230893] IRQ stack:      [0xffff800080030000..0xffff800080034000]
    [    9.230894] Overflow stack: [0xffff000a72dd2100..0xffff000a72dd3100]
    [    9.230896] CPU: 6 UID: 0 PID: 747 Comm: cryptomgr_test Tainted: G S                  6.16.0-rc1-00237-g84ffcd88616f #7 PREEMPT 
    [    9.230900] Tainted: [S]=CPU_OUT_OF_SPEC
    [    9.230901] Hardware name: Qualcomm Technologies, Inc. SM8650 HDK (DT)
    [    9.230901] pstate: 01400005 (nzcv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
    [    9.230903] pc : qce_ahash_update+0x4/0x1f4
    [    9.230910] lr : ahash_do_req_chain+0xb4/0x19c
    [    9.230915] sp : ffff800084928030
    [    9.230915] x29: ffff8000849281a0 x28: 0000000000000003 x27: 0000000000000001
    [    9.230918] x26: ffff0008022d8060 x25: ffff000800a33500 x24: ffff80008492b8d8
    [    9.230920] x23: ffff80008492b918 x22: 0000000000000400 x21: ffff000800a33510
    [    9.230922] x20: ffff000800b62030 x19: ffff00080122d400 x18: 00000000ffffffff
    [    9.230923] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
    [    9.230925] x14: 0000000000000001 x13: 0000000000000000 x12: 0000000000000000
    [    9.230927] x11: eee1c132902c61e2 x10: 0000000000000063 x9 : 0000000000000000
    [    9.230928] x8 : 0000000000000062 x7 : a54ff53a3c6ef372 x6 : 0000000000000400
    [    9.230930] x5 : fefefefefefefefe x4 : ffff000800a33510 x3 : 0000000000000000
    [    9.230931] x2 : ffff000805d76900 x1 : ffffcea2349738cc x0 : ffff00080122d400
    [    9.230933] Kernel panic - not syncing: kernel stack overflow
    [    9.230934] CPU: 6 UID: 0 PID: 747 Comm: cryptomgr_test Tainted: G S                  6.16.0-rc1-00237-g84ffcd88616f #7 PREEMPT 
    [    9.230936] Tainted: [S]=CPU_OUT_OF_SPEC
    [    9.230937] Hardware name: Qualcomm Technologies, Inc. SM8650 HDK (DT)
    [    9.230938] Call trace:
    [    9.230939]  show_stack+0x18/0x24 (C)
    [    9.230943]  dump_stack_lvl+0x60/0x80
    [    9.230947]  dump_stack+0x18/0x24
    [    9.230949]  panic+0x168/0x360
    [    9.230952]  add_taint+0x0/0xbc
    [    9.230955]  panic_bad_stack+0x108/0x120
    [    9.230958]  handle_bad_stack+0x34/0x40
    [    9.230962]  __bad_stack+0x80/0x84
    [    9.230963]  qce_ahash_update+0x4/0x1f4 (P)
    [    9.230965]  crypto_ahash_update+0x17c/0x18c
    [    9.230967]  crypto_ahash_finup+0x184/0x1e4
    [    9.230969]  crypto_ahash_finup+0x1ac/0x1e4
    [    9.230970]  crypto_ahash_finup+0x1ac/0x1e4
    [    9.230972]  crypto_ahash_finup+0x1ac/0x1e4
    [    9.230973]  crypto_ahash_finup+0x1ac/0x1e4
    [    9.230974]  crypto_ahash_finup+0x1ac/0x1e4
    [    9.230976]  crypto_ahash_finup+0x1ac/0x1e4
    [    9.230977]  crypto_ahash_finup+0x1ac/0x1e4
    [    9.230979]  crypto_ahash_finup+0x1ac/0x1e4
    [    9.230980]  crypto_ahash_finup+0x1ac/0x1e4
    [    9.230981]  crypto_ahash_finup+0x1ac/0x1e4
    [    9.230983]  crypto_ahash_finup+0x1ac/0x1e4
    [    9.230984]  crypto_ahash_finup+0x1ac/0x1e4
    [    9.230986]  crypto_ahash_finup+0x1ac/0x1e4
    [    9.230988]  crypto_ahash_finup+0x1ac/0x1e4
    [    9.230989]  crypto_ahash_finup+0x1ac/0x1e4
    [    9.230991]  crypto_ahash_finup+0x1ac/0x1e4
    [    9.230993]  crypto_ahash_finup+0x1ac/0x1e4
    [    9.230995]  crypto_ahash_finup+0x1ac/0x1e4
    [    9.230996]  crypto_ahash_finup+0x1ac/0x1e4
    [    9.230998]  crypto_ahash_finup+0x1ac/0x1e4
    [    9.230999]  crypto_ahash_finup+0x1ac/0x1e4
    [    9.231001]  crypto_ahash_finup+0x1ac/0x1e4
    [    9.231002]  crypto_ahash_finup+0x1ac/0x1e4
    [    9.231004]  crypto_ahash_finup+0x1ac/0x1e4
    [    9.231005]  crypto_ahash_finup+0x1ac/0x1e4

    [the above line repeated a few hundred times more...]

    [    9.231571]  test_ahash_vec_cfg+0x508/0x8f8
    [    9.231573]  test_hash_vec+0xb8/0x21c
    [    9.231575]  __alg_test_hash+0x144/0x2e0
    [    9.231577]  alg_test_hash+0xc0/0x178
    [    9.231578]  alg_test+0x148/0x5ec
    [    9.231579]  cryptomgr_test+0x24/0x40
    [    9.231581]  kthread+0x12c/0x204
    [    9.231583]  ret_from_fork+0x10/0x20
    [    9.231587] SMP: stopping secondary CPUs
    [    9.240072] Kernel Offset: 0x4ea1b2a80000 from 0xffff800080000000
    [    9.240073] PHYS_OFFSET: 0xfff1000080000000
    [    9.240074] CPU features: 0x6000,000001c0,62130cb1,357e7667
    [    9.240075] Memory Limit: none
    [   11.373410] ---[ end Kernel panic - not syncing: kernel stack overflow ]---

After disabling the crypto self-tests, I was then able to run a benchmark of
SHA-256 hashing 4096-byte messages, which fortunately didn't encounter the
recursion bug.  I got the following results:

    ARMv8 crypto extensions: 1864 MB/s
    Generic C code: 358 MB/s
    Qualcomm Crypto Engine: 55 MB/s

So just to clarify, you believe that asynchronous hash drivers like the Qualcomm
Crypto Engine one are useful, and the changes that you're requiring to the
CPU-based code are to support these drivers?

- Eric
Ard Biesheuvel June 15, 2025, 7:22 a.m. UTC | #12
On Sun, 15 Jun 2025 at 05:18, Eric Biggers <ebiggers@kernel.org> wrote:
>
...
> After disabling the crypto self-tests, I was then able to run a benchmark of
> SHA-256 hashing 4096-byte messages, which fortunately didn't encounter the
> recursion bug.  I got the following results:
>
>     ARMv8 crypto extensions: 1864 MB/s
>     Generic C code: 358 MB/s
>     Qualcomm Crypto Engine: 55 MB/s
>
> So just to clarify, you believe that asynchronous hash drivers like the Qualcomm
> Crypto Engine one are useful, and the changes that you're requiring to the
> CPU-based code are to support these drivers?
>

And this offload engine only has one internal queue, right? Whereas
the CPU results may be multiplied by the number of cores on the soc.
It would still be interesting to know how much of this is due to latency
rather than limited throughput, but it seems highly unlikely that there
are any message sizes large enough where QCE would catch up with the
CPUs. (AIUI, the only use case we have in the kernel today for message
sizes that are substantially larger than this is kTLS, but I'm not
sure how well it works with crypto_aead compared to offload at a more
suitable level in the networking stack, and this driver does not
implement GCM in the first place)

On ARM socs, these offload engines usually exist primarily for the
benefit of the verified boot implementation in mask ROM, which
obviously needs to be minimal but doesn't have to be very fast in
order to get past the first boot stages and hand over to software.
Then, since the IP block is there, it's listed as a feature in the
data sheet, even though it is not very useful when running under the
OS.
Eric Biggers June 15, 2025, 6:46 p.m. UTC | #13
On Sun, Jun 15, 2025 at 09:22:51AM +0200, Ard Biesheuvel wrote:
> On Sun, 15 Jun 2025 at 05:18, Eric Biggers <ebiggers@kernel.org> wrote:
> >
> ...
> > After disabling the crypto self-tests, I was then able to run a benchmark of
> > SHA-256 hashing 4096-byte messages, which fortunately didn't encounter the
> > recursion bug.  I got the following results:
> >
> >     ARMv8 crypto extensions: 1864 MB/s
> >     Generic C code: 358 MB/s
> >     Qualcomm Crypto Engine: 55 MB/s
> >
> > So just to clarify, you believe that asynchronous hash drivers like the Qualcomm
> > Crypto Engine one are useful, and the changes that you're requiring to the
> > CPU-based code are to support these drivers?
> >
> 
> And this offload engine only has one internal queue, right? Whereas
> the CPU results may be multiplied by the number of cores on the soc.
> It would still be interesting to know how much of this is due to latency
> rather than limited throughput, but it seems highly unlikely that there
> are any message sizes large enough where QCE would catch up with the
> CPUs. (AIUI, the only use case we have in the kernel today for message
> sizes that are substantially larger than this is kTLS, but I'm not
> sure how well it works with crypto_aead compared to offload at a more
> suitable level in the networking stack, and this driver does not
> implement GCM in the first place)
> 
> On ARM socs, these offload engines usually exist primarily for the
> benefit of the verified boot implementation in mask ROM, which
> obviously needs to be minimal but doesn't have to be very fast in
> order to get past the first boot stages and hand over to software.
> Then, since the IP block is there, it's listed as a feature in the
> data sheet, even though it is not very useful when running under the
> OS.

With 1 MiB messages, I get 1913 MB/s with ARMv8 CE and 142 MB/s with QCE.

(BTW, that's single-buffer ARMv8 CE.  My two-buffer code is over 3000 MB/s.)

I then changed my benchmark code to take full advantage of the async API and
submit as many requests as the hardware can handle.  (This would be a best-case
scenario for QCE; in many real use cases this is not possible.)  Result with QCE
was 58 MB/s with 4 KiB messages or 155 MB/s for 1 MiB messages.

So yes, QCE seems to have only one queue, and even that one queue is *much*
slower than just using the CPU.  It's even slower than the generic C code.

And until I fixed it recently, the Crypto API defaulted to using QCE instead of
ARMv8 CE.

But this seems to be a common pattern among the offload engines.
I noticed a similar issue with Intel QAT, which I elaborate on in this patch:
https://lore.kernel.org/r/20250615045145.224567-1-ebiggers@kernel.org

- Eric
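
For reference, a sketch of how such a benchmark drives the async ahash
API (hypothetical demo helper; a real benchmark would keep many
requests in flight rather than waiting on each one as shown here):

#include <crypto/hash.h>

static int demo_ahash_digest(struct crypto_ahash *tfm,
			     struct scatterlist *sg, unsigned int len,
			     u8 *digest)
{
	struct ahash_request *req;
	DECLARE_CRYPTO_WAIT(wait);
	int err;

	req = ahash_request_alloc(tfm, GFP_KERNEL);
	if (!req)
		return -ENOMEM;

	ahash_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
				   crypto_req_done, &wait);
	ahash_request_set_crypt(req, sg, digest, len);

	/* -EINPROGRESS/-EBUSY mean the engine queued the request */
	err = crypto_wait_req(crypto_ahash_digest(req), &wait);

	ahash_request_free(req);
	return err;
}
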
Linus Torvalds June 15, 2025, 7:37 p.m. UTC | #14
On Sun, 15 Jun 2025 at 11:47, Eric Biggers <ebiggers@kernel.org> wrote:
>
> So yes, QCE seems to have only one queue, and even that one queue is *much*
> slower than just using the CPU.  It's even slower than the generic C code.

Honestly, I have *NEVER* seen an external crypto accelerator that is
worth using unless it's integrated with the target IO.

Now, it's not my area of expertise either, so there may well be some
random case that I haven't heard about, but the only sensible use-case
I'm aware of is when the network card just does all the offloading and
just does the whole SSL thing (or IPsec or whatever, but if you care
about performance you'd be better off using wireguard and doing it all
on the CPU anyway)

And even then, people tend to not be happy with the results, because
the hardware is too inflexible or too rare.

(Replace "network card" with "disk controller" if that's your thing -
the basic idea is the same: it's worthwhile if it's done natively by
the IO target, not done by some third party accelerator - and while
I'm convinced encryption on the disk controller makes sense, I'm not
sure I'd actually *trust* it from a real cryptographic standpoint if
you really care about it, because some of those are most definitely
black boxes with the trust model seemingly being based on the "Trust
me, Bro" approach to security).

The other case is the "key is physically separate and isn't even under
kernel control at all", but then it's never about performance in the
first place (ie security keys etc).

Even if the hardware crypto engines were fast - and as you can see, they
aren't - any possible performance is absolutely killed by the lack of
caches and the IO overhead.

This seems to be pretty much true of async SMP crypto on the CPU as
well.  You can get better benchmarks by offloading the crypto to
other CPUs, but I'm not convinced it's actually a good trade-off in
reality. The cost of scheduling and just all the overhead of
synchronization is very very real, and the benchmarks where it looks
good tend to be the "we do nothing else, and we don't actually touch
the data anyway, it's just purely about pointless benchmarking".

Just the set-up costs for doing things asynchronously can be higher
than the cost of just doing the operation itself.

             Linus

Patch

diff --git a/crypto/Kconfig b/crypto/Kconfig
index e9fee7818e270..509641ed30ce1 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -983,12 +983,14 @@  config CRYPTO_SHA256
 	  Used by the btrfs filesystem, Ceph, NFS, and SMB.
 
 config CRYPTO_SHA512
 	tristate "SHA-384 and SHA-512"
 	select CRYPTO_HASH
+	select CRYPTO_LIB_SHA512
 	help
-	  SHA-384 and SHA-512 secure hash algorithms (FIPS 180, ISO/IEC 10118-3)
+	  SHA-384 and SHA-512 secure hash algorithms (FIPS 180, ISO/IEC
+	  10118-3), including HMAC support.
 
 config CRYPTO_SHA3
 	tristate "SHA-3"
 	select CRYPTO_HASH
 	help
diff --git a/crypto/Makefile b/crypto/Makefile
index 017df3a2e4bb3..271c77462cec9 100644
--- a/crypto/Makefile
+++ b/crypto/Makefile
@@ -76,11 +76,11 @@  obj-$(CONFIG_CRYPTO_MD4) += md4.o
 obj-$(CONFIG_CRYPTO_MD5) += md5.o
 obj-$(CONFIG_CRYPTO_RMD160) += rmd160.o
 obj-$(CONFIG_CRYPTO_SHA1) += sha1_generic.o
 obj-$(CONFIG_CRYPTO_SHA256) += sha256.o
 CFLAGS_sha256.o += -DARCH=$(ARCH)
-obj-$(CONFIG_CRYPTO_SHA512) += sha512_generic.o
+obj-$(CONFIG_CRYPTO_SHA512) += sha512.o
 obj-$(CONFIG_CRYPTO_SHA3) += sha3_generic.o
 obj-$(CONFIG_CRYPTO_SM3_GENERIC) += sm3_generic.o
 obj-$(CONFIG_CRYPTO_STREEBOG) += streebog_generic.o
 obj-$(CONFIG_CRYPTO_WP512) += wp512.o
 CFLAGS_wp512.o := $(call cc-option,-fno-schedule-insns)  # https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79149
diff --git a/crypto/sha512.c b/crypto/sha512.c
new file mode 100644
index 0000000000000..ad9c8b2ddb129
--- /dev/null
+++ b/crypto/sha512.c
@@ -0,0 +1,254 @@ 
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Crypto API support for SHA-384, SHA-512, HMAC-SHA384, and HMAC-SHA512
+ *
+ * Copyright (c) Jean-Luc Cooke <jlcooke@certainkey.com>
+ * Copyright (c) Andrew McDonald <andrew@mcdonald.org.uk>
+ * Copyright (c) 2003 Kyle McMartin <kyle@debian.org>
+ * Copyright 2025 Google LLC
+ */
+#include <crypto/internal/hash.h>
+#include <crypto/sha2.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+
+/* SHA-384 */
+
+const u8 sha384_zero_message_hash[SHA384_DIGEST_SIZE] = {
+	0x38, 0xb0, 0x60, 0xa7, 0x51, 0xac, 0x96, 0x38,
+	0x4c, 0xd9, 0x32, 0x7e, 0xb1, 0xb1, 0xe3, 0x6a,
+	0x21, 0xfd, 0xb7, 0x11, 0x14, 0xbe, 0x07, 0x43,
+	0x4c, 0x0c, 0xc7, 0xbf, 0x63, 0xf6, 0xe1, 0xda,
+	0x27, 0x4e, 0xde, 0xbf, 0xe7, 0x6f, 0x65, 0xfb,
+	0xd5, 0x1a, 0xd2, 0xf1, 0x48, 0x98, 0xb9, 0x5b
+};
+EXPORT_SYMBOL_GPL(sha384_zero_message_hash);
+
+#define SHA384_CTX(desc) ((struct sha384_ctx *)shash_desc_ctx(desc))
+
+static int crypto_sha384_init(struct shash_desc *desc)
+{
+	sha384_init(SHA384_CTX(desc));
+	return 0;
+}
+
+static int crypto_sha384_update(struct shash_desc *desc,
+				const u8 *data, unsigned int len)
+{
+	sha384_update(SHA384_CTX(desc), data, len);
+	return 0;
+}
+
+static int crypto_sha384_final(struct shash_desc *desc, u8 *out)
+{
+	sha384_final(SHA384_CTX(desc), out);
+	return 0;
+}
+
+static int crypto_sha384_digest(struct shash_desc *desc,
+				const u8 *data, unsigned int len, u8 *out)
+{
+	sha384(data, len, out);
+	return 0;
+}
+
+/* SHA-512 */
+
+const u8 sha512_zero_message_hash[SHA512_DIGEST_SIZE] = {
+	0xcf, 0x83, 0xe1, 0x35, 0x7e, 0xef, 0xb8, 0xbd,
+	0xf1, 0x54, 0x28, 0x50, 0xd6, 0x6d, 0x80, 0x07,
+	0xd6, 0x20, 0xe4, 0x05, 0x0b, 0x57, 0x15, 0xdc,
+	0x83, 0xf4, 0xa9, 0x21, 0xd3, 0x6c, 0xe9, 0xce,
+	0x47, 0xd0, 0xd1, 0x3c, 0x5d, 0x85, 0xf2, 0xb0,
+	0xff, 0x83, 0x18, 0xd2, 0x87, 0x7e, 0xec, 0x2f,
+	0x63, 0xb9, 0x31, 0xbd, 0x47, 0x41, 0x7a, 0x81,
+	0xa5, 0x38, 0x32, 0x7a, 0xf9, 0x27, 0xda, 0x3e
+};
+EXPORT_SYMBOL_GPL(sha512_zero_message_hash);
+
+#define SHA512_CTX(desc) ((struct sha512_ctx *)shash_desc_ctx(desc))
+
+static int crypto_sha512_init(struct shash_desc *desc)
+{
+	sha512_init(SHA512_CTX(desc));
+	return 0;
+}
+
+static int crypto_sha512_update(struct shash_desc *desc,
+				const u8 *data, unsigned int len)
+{
+	sha512_update(SHA512_CTX(desc), data, len);
+	return 0;
+}
+
+static int crypto_sha512_final(struct shash_desc *desc, u8 *out)
+{
+	sha512_final(SHA512_CTX(desc), out);
+	return 0;
+}
+
+static int crypto_sha512_digest(struct shash_desc *desc,
+				const u8 *data, unsigned int len, u8 *out)
+{
+	sha512(data, len, out);
+	return 0;
+}
+
+/* HMAC-SHA384 */
+
+#define HMAC_SHA384_KEY(tfm) ((struct hmac_sha384_key *)crypto_shash_ctx(tfm))
+#define HMAC_SHA384_CTX(desc) ((struct hmac_sha384_ctx *)shash_desc_ctx(desc))
+
+static int crypto_hmac_sha384_setkey(struct crypto_shash *tfm,
+				     const u8 *raw_key, unsigned int keylen)
+{
+	hmac_sha384_preparekey(HMAC_SHA384_KEY(tfm), raw_key, keylen);
+	return 0;
+}
+
+static int crypto_hmac_sha384_init(struct shash_desc *desc)
+{
+	hmac_sha384_init(HMAC_SHA384_CTX(desc), HMAC_SHA384_KEY(desc->tfm));
+	return 0;
+}
+
+static int crypto_hmac_sha384_update(struct shash_desc *desc,
+				     const u8 *data, unsigned int len)
+{
+	hmac_sha384_update(HMAC_SHA384_CTX(desc), data, len);
+	return 0;
+}
+
+static int crypto_hmac_sha384_final(struct shash_desc *desc, u8 *out)
+{
+	hmac_sha384_final(HMAC_SHA384_CTX(desc), out);
+	return 0;
+}
+
+static int crypto_hmac_sha384_digest(struct shash_desc *desc,
+				     const u8 *data, unsigned int len,
+				     u8 *out)
+{
+	hmac_sha384(HMAC_SHA384_KEY(desc->tfm), data, len, out);
+	return 0;
+}
+
+/* HMAC-SHA512 */
+
+#define HMAC_SHA512_KEY(tfm) ((struct hmac_sha512_key *)crypto_shash_ctx(tfm))
+#define HMAC_SHA512_CTX(desc) ((struct hmac_sha512_ctx *)shash_desc_ctx(desc))
+
+static int crypto_hmac_sha512_setkey(struct crypto_shash *tfm,
+				     const u8 *raw_key, unsigned int keylen)
+{
+	hmac_sha512_preparekey(HMAC_SHA512_KEY(tfm), raw_key, keylen);
+	return 0;
+}
+
+static int crypto_hmac_sha512_init(struct shash_desc *desc)
+{
+	hmac_sha512_init(HMAC_SHA512_CTX(desc), HMAC_SHA512_KEY(desc->tfm));
+	return 0;
+}
+
+static int crypto_hmac_sha512_update(struct shash_desc *desc,
+				     const u8 *data, unsigned int len)
+{
+	hmac_sha512_update(HMAC_SHA512_CTX(desc), data, len);
+	return 0;
+}
+
+static int crypto_hmac_sha512_final(struct shash_desc *desc, u8 *out)
+{
+	hmac_sha512_final(HMAC_SHA512_CTX(desc), out);
+	return 0;
+}
+
+static int crypto_hmac_sha512_digest(struct shash_desc *desc,
+				     const u8 *data, unsigned int len,
+				     u8 *out)
+{
+	hmac_sha512(HMAC_SHA512_KEY(desc->tfm), data, len, out);
+	return 0;
+}
+
+/* Algorithm definitions */
+
+static struct shash_alg algs[] = {
+	{
+		.base.cra_name		= "sha384",
+		.base.cra_driver_name	= "sha384-lib",
+		.base.cra_priority	= 100,
+		.base.cra_blocksize	= SHA384_BLOCK_SIZE,
+		.base.cra_module	= THIS_MODULE,
+		.digestsize		= SHA384_DIGEST_SIZE,
+		.init			= crypto_sha384_init,
+		.update			= crypto_sha384_update,
+		.final			= crypto_sha384_final,
+		.digest			= crypto_sha384_digest,
+		.descsize		= sizeof(struct sha384_ctx),
+	},
+	{
+		.base.cra_name		= "sha512",
+		.base.cra_driver_name	= "sha512-lib",
+		.base.cra_priority	= 100,
+		.base.cra_blocksize	= SHA512_BLOCK_SIZE,
+		.base.cra_module	= THIS_MODULE,
+		.digestsize		= SHA512_DIGEST_SIZE,
+		.init			= crypto_sha512_init,
+		.update			= crypto_sha512_update,
+		.final			= crypto_sha512_final,
+		.digest			= crypto_sha512_digest,
+		.descsize		= sizeof(struct sha512_ctx),
+	},
+	{
+		.base.cra_name		= "hmac(sha384)",
+		.base.cra_driver_name	= "hmac-sha384-lib",
+		.base.cra_priority	= 100,
+		.base.cra_blocksize	= SHA384_BLOCK_SIZE,
+		.base.cra_ctxsize	= sizeof(struct hmac_sha384_key),
+		.base.cra_module	= THIS_MODULE,
+		.digestsize		= SHA384_DIGEST_SIZE,
+		.setkey			= crypto_hmac_sha384_setkey,
+		.init			= crypto_hmac_sha384_init,
+		.update			= crypto_hmac_sha384_update,
+		.final			= crypto_hmac_sha384_final,
+		.digest			= crypto_hmac_sha384_digest,
+		.descsize		= sizeof(struct hmac_sha384_ctx),
+	},
+	{
+		.base.cra_name		= "hmac(sha512)",
+		.base.cra_driver_name	= "hmac-sha512-lib",
+		.base.cra_priority	= 100,
+		.base.cra_blocksize	= SHA512_BLOCK_SIZE,
+		.base.cra_ctxsize	= sizeof(struct hmac_sha512_key),
+		.base.cra_module	= THIS_MODULE,
+		.digestsize		= SHA512_DIGEST_SIZE,
+		.setkey			= crypto_hmac_sha512_setkey,
+		.init			= crypto_hmac_sha512_init,
+		.update			= crypto_hmac_sha512_update,
+		.final			= crypto_hmac_sha512_final,
+		.digest			= crypto_hmac_sha512_digest,
+		.descsize		= sizeof(struct hmac_sha512_ctx),
+	},
+};
+
+static int __init crypto_sha512_mod_init(void)
+{
+	return crypto_register_shashes(algs, ARRAY_SIZE(algs));
+}
+module_init(crypto_sha512_mod_init);
+
+static void __exit crypto_sha512_mod_exit(void)
+{
+	crypto_unregister_shashes(algs, ARRAY_SIZE(algs));
+}
+module_exit(crypto_sha512_mod_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("Crypto API support for SHA-384, SHA-512, HMAC-SHA384, and HMAC-SHA512");
+
+MODULE_ALIAS_CRYPTO("sha384");
+MODULE_ALIAS_CRYPTO("sha512");
+MODULE_ALIAS_CRYPTO("hmac(sha384)");
+MODULE_ALIAS_CRYPTO("hmac(sha512)");
diff --git a/crypto/sha512_generic.c b/crypto/sha512_generic.c
deleted file mode 100644
index 7368173f545eb..0000000000000
--- a/crypto/sha512_generic.c
+++ /dev/null
@@ -1,217 +0,0 @@ 
-// SPDX-License-Identifier: GPL-2.0-or-later
-/* SHA-512 code by Jean-Luc Cooke <jlcooke@certainkey.com>
- *
- * Copyright (c) Jean-Luc Cooke <jlcooke@certainkey.com>
- * Copyright (c) Andrew McDonald <andrew@mcdonald.org.uk>
- * Copyright (c) 2003 Kyle McMartin <kyle@debian.org>
- */
-#include <crypto/internal/hash.h>
-#include <crypto/sha2.h>
-#include <crypto/sha512_base.h>
-#include <linux/kernel.h>
-#include <linux/module.h>
-#include <linux/unaligned.h>
-
-const u8 sha384_zero_message_hash[SHA384_DIGEST_SIZE] = {
-	0x38, 0xb0, 0x60, 0xa7, 0x51, 0xac, 0x96, 0x38,
-	0x4c, 0xd9, 0x32, 0x7e, 0xb1, 0xb1, 0xe3, 0x6a,
-	0x21, 0xfd, 0xb7, 0x11, 0x14, 0xbe, 0x07, 0x43,
-	0x4c, 0x0c, 0xc7, 0xbf, 0x63, 0xf6, 0xe1, 0xda,
-	0x27, 0x4e, 0xde, 0xbf, 0xe7, 0x6f, 0x65, 0xfb,
-	0xd5, 0x1a, 0xd2, 0xf1, 0x48, 0x98, 0xb9, 0x5b
-};
-EXPORT_SYMBOL_GPL(sha384_zero_message_hash);
-
-const u8 sha512_zero_message_hash[SHA512_DIGEST_SIZE] = {
-	0xcf, 0x83, 0xe1, 0x35, 0x7e, 0xef, 0xb8, 0xbd,
-	0xf1, 0x54, 0x28, 0x50, 0xd6, 0x6d, 0x80, 0x07,
-	0xd6, 0x20, 0xe4, 0x05, 0x0b, 0x57, 0x15, 0xdc,
-	0x83, 0xf4, 0xa9, 0x21, 0xd3, 0x6c, 0xe9, 0xce,
-	0x47, 0xd0, 0xd1, 0x3c, 0x5d, 0x85, 0xf2, 0xb0,
-	0xff, 0x83, 0x18, 0xd2, 0x87, 0x7e, 0xec, 0x2f,
-	0x63, 0xb9, 0x31, 0xbd, 0x47, 0x41, 0x7a, 0x81,
-	0xa5, 0x38, 0x32, 0x7a, 0xf9, 0x27, 0xda, 0x3e
-};
-EXPORT_SYMBOL_GPL(sha512_zero_message_hash);
-
-static inline u64 Ch(u64 x, u64 y, u64 z)
-{
-        return z ^ (x & (y ^ z));
-}
-
-static inline u64 Maj(u64 x, u64 y, u64 z)
-{
-        return (x & y) | (z & (x | y));
-}
-
-static const u64 sha512_K[80] = {
-        0x428a2f98d728ae22ULL, 0x7137449123ef65cdULL, 0xb5c0fbcfec4d3b2fULL,
-        0xe9b5dba58189dbbcULL, 0x3956c25bf348b538ULL, 0x59f111f1b605d019ULL,
-        0x923f82a4af194f9bULL, 0xab1c5ed5da6d8118ULL, 0xd807aa98a3030242ULL,
-        0x12835b0145706fbeULL, 0x243185be4ee4b28cULL, 0x550c7dc3d5ffb4e2ULL,
-        0x72be5d74f27b896fULL, 0x80deb1fe3b1696b1ULL, 0x9bdc06a725c71235ULL,
-        0xc19bf174cf692694ULL, 0xe49b69c19ef14ad2ULL, 0xefbe4786384f25e3ULL,
-        0x0fc19dc68b8cd5b5ULL, 0x240ca1cc77ac9c65ULL, 0x2de92c6f592b0275ULL,
-        0x4a7484aa6ea6e483ULL, 0x5cb0a9dcbd41fbd4ULL, 0x76f988da831153b5ULL,
-        0x983e5152ee66dfabULL, 0xa831c66d2db43210ULL, 0xb00327c898fb213fULL,
-        0xbf597fc7beef0ee4ULL, 0xc6e00bf33da88fc2ULL, 0xd5a79147930aa725ULL,
-        0x06ca6351e003826fULL, 0x142929670a0e6e70ULL, 0x27b70a8546d22ffcULL,
-        0x2e1b21385c26c926ULL, 0x4d2c6dfc5ac42aedULL, 0x53380d139d95b3dfULL,
-        0x650a73548baf63deULL, 0x766a0abb3c77b2a8ULL, 0x81c2c92e47edaee6ULL,
-        0x92722c851482353bULL, 0xa2bfe8a14cf10364ULL, 0xa81a664bbc423001ULL,
-        0xc24b8b70d0f89791ULL, 0xc76c51a30654be30ULL, 0xd192e819d6ef5218ULL,
-        0xd69906245565a910ULL, 0xf40e35855771202aULL, 0x106aa07032bbd1b8ULL,
-        0x19a4c116b8d2d0c8ULL, 0x1e376c085141ab53ULL, 0x2748774cdf8eeb99ULL,
-        0x34b0bcb5e19b48a8ULL, 0x391c0cb3c5c95a63ULL, 0x4ed8aa4ae3418acbULL,
-        0x5b9cca4f7763e373ULL, 0x682e6ff3d6b2b8a3ULL, 0x748f82ee5defb2fcULL,
-        0x78a5636f43172f60ULL, 0x84c87814a1f0ab72ULL, 0x8cc702081a6439ecULL,
-        0x90befffa23631e28ULL, 0xa4506cebde82bde9ULL, 0xbef9a3f7b2c67915ULL,
-        0xc67178f2e372532bULL, 0xca273eceea26619cULL, 0xd186b8c721c0c207ULL,
-        0xeada7dd6cde0eb1eULL, 0xf57d4f7fee6ed178ULL, 0x06f067aa72176fbaULL,
-        0x0a637dc5a2c898a6ULL, 0x113f9804bef90daeULL, 0x1b710b35131c471bULL,
-        0x28db77f523047d84ULL, 0x32caab7b40c72493ULL, 0x3c9ebe0a15c9bebcULL,
-        0x431d67c49c100d4cULL, 0x4cc5d4becb3e42b6ULL, 0x597f299cfc657e2aULL,
-        0x5fcb6fab3ad6faecULL, 0x6c44198c4a475817ULL,
-};
-
-#define e0(x)       (ror64(x,28) ^ ror64(x,34) ^ ror64(x,39))
-#define e1(x)       (ror64(x,14) ^ ror64(x,18) ^ ror64(x,41))
-#define s0(x)       (ror64(x, 1) ^ ror64(x, 8) ^ (x >> 7))
-#define s1(x)       (ror64(x,19) ^ ror64(x,61) ^ (x >> 6))
-
-static inline void LOAD_OP(int I, u64 *W, const u8 *input)
-{
-	W[I] = get_unaligned_be64((__u64 *)input + I);
-}
-
-static inline void BLEND_OP(int I, u64 *W)
-{
-	W[I & 15] += s1(W[(I-2) & 15]) + W[(I-7) & 15] + s0(W[(I-15) & 15]);
-}
-
-static void
-sha512_transform(u64 *state, const u8 *input)
-{
-	u64 a, b, c, d, e, f, g, h, t1, t2;
-
-	int i;
-	u64 W[16];
-
-	/* load the state into our registers */
-	a=state[0];   b=state[1];   c=state[2];   d=state[3];
-	e=state[4];   f=state[5];   g=state[6];   h=state[7];
-
-	/* now iterate */
-	for (i=0; i<80; i+=8) {
-		if (!(i & 8)) {
-			int j;
-
-			if (i < 16) {
-				/* load the input */
-				for (j = 0; j < 16; j++)
-					LOAD_OP(i + j, W, input);
-			} else {
-				for (j = 0; j < 16; j++) {
-					BLEND_OP(i + j, W);
-				}
-			}
-		}
-
-		t1 = h + e1(e) + Ch(e,f,g) + sha512_K[i  ] + W[(i & 15)];
-		t2 = e0(a) + Maj(a,b,c);    d+=t1;    h=t1+t2;
-		t1 = g + e1(d) + Ch(d,e,f) + sha512_K[i+1] + W[(i & 15) + 1];
-		t2 = e0(h) + Maj(h,a,b);    c+=t1;    g=t1+t2;
-		t1 = f + e1(c) + Ch(c,d,e) + sha512_K[i+2] + W[(i & 15) + 2];
-		t2 = e0(g) + Maj(g,h,a);    b+=t1;    f=t1+t2;
-		t1 = e + e1(b) + Ch(b,c,d) + sha512_K[i+3] + W[(i & 15) + 3];
-		t2 = e0(f) + Maj(f,g,h);    a+=t1;    e=t1+t2;
-		t1 = d + e1(a) + Ch(a,b,c) + sha512_K[i+4] + W[(i & 15) + 4];
-		t2 = e0(e) + Maj(e,f,g);    h+=t1;    d=t1+t2;
-		t1 = c + e1(h) + Ch(h,a,b) + sha512_K[i+5] + W[(i & 15) + 5];
-		t2 = e0(d) + Maj(d,e,f);    g+=t1;    c=t1+t2;
-		t1 = b + e1(g) + Ch(g,h,a) + sha512_K[i+6] + W[(i & 15) + 6];
-		t2 = e0(c) + Maj(c,d,e);    f+=t1;    b=t1+t2;
-		t1 = a + e1(f) + Ch(f,g,h) + sha512_K[i+7] + W[(i & 15) + 7];
-		t2 = e0(b) + Maj(b,c,d);    e+=t1;    a=t1+t2;
-	}
-
-	state[0] += a; state[1] += b; state[2] += c; state[3] += d;
-	state[4] += e; state[5] += f; state[6] += g; state[7] += h;
-}
-
-void sha512_generic_block_fn(struct sha512_state *sst, u8 const *src,
-			     int blocks)
-{
-	do {
-		sha512_transform(sst->state, src);
-		src += SHA512_BLOCK_SIZE;
-	} while (--blocks);
-}
-EXPORT_SYMBOL_GPL(sha512_generic_block_fn);
-
-static int crypto_sha512_update(struct shash_desc *desc, const u8 *data,
-				unsigned int len)
-{
-	return sha512_base_do_update_blocks(desc, data, len,
-					    sha512_generic_block_fn);
-}
-
-static int crypto_sha512_finup(struct shash_desc *desc, const u8 *data,
-			       unsigned int len, u8 *hash)
-{
-	sha512_base_do_finup(desc, data, len, sha512_generic_block_fn);
-	return sha512_base_finish(desc, hash);
-}
-
-static struct shash_alg sha512_algs[2] = { {
-	.digestsize	=	SHA512_DIGEST_SIZE,
-	.init		=	sha512_base_init,
-	.update		=	crypto_sha512_update,
-	.finup		=	crypto_sha512_finup,
-	.descsize	=	SHA512_STATE_SIZE,
-	.base		=	{
-		.cra_name	=	"sha512",
-		.cra_driver_name =	"sha512-generic",
-		.cra_priority	=	100,
-		.cra_flags	=	CRYPTO_AHASH_ALG_BLOCK_ONLY |
-					CRYPTO_AHASH_ALG_FINUP_MAX,
-		.cra_blocksize	=	SHA512_BLOCK_SIZE,
-		.cra_module	=	THIS_MODULE,
-	}
-}, {
-	.digestsize	=	SHA384_DIGEST_SIZE,
-	.init		=	sha384_base_init,
-	.update		=	crypto_sha512_update,
-	.finup		=	crypto_sha512_finup,
-	.descsize	=	SHA512_STATE_SIZE,
-	.base		=	{
-		.cra_name	=	"sha384",
-		.cra_driver_name =	"sha384-generic",
-		.cra_priority	=	100,
-		.cra_flags	=	CRYPTO_AHASH_ALG_BLOCK_ONLY |
-					CRYPTO_AHASH_ALG_FINUP_MAX,
-		.cra_blocksize	=	SHA384_BLOCK_SIZE,
-		.cra_module	=	THIS_MODULE,
-	}
-} };
-
-static int __init sha512_generic_mod_init(void)
-{
-	return crypto_register_shashes(sha512_algs, ARRAY_SIZE(sha512_algs));
-}
-
-static void __exit sha512_generic_mod_fini(void)
-{
-	crypto_unregister_shashes(sha512_algs, ARRAY_SIZE(sha512_algs));
-}
-
-module_init(sha512_generic_mod_init);
-module_exit(sha512_generic_mod_fini);
-
-MODULE_LICENSE("GPL");
-MODULE_DESCRIPTION("SHA-512 and SHA-384 Secure Hash Algorithms");
-
-MODULE_ALIAS_CRYPTO("sha384");
-MODULE_ALIAS_CRYPTO("sha384-generic");
-MODULE_ALIAS_CRYPTO("sha512");
-MODULE_ALIAS_CRYPTO("sha512-generic");
diff --git a/crypto/testmgr.c b/crypto/testmgr.c
index 72005074a5c26..9b4235adcb036 100644
--- a/crypto/testmgr.c
+++ b/crypto/testmgr.c
@@ -4304,59 +4304,69 @@  static const struct alg_test_desc alg_test_descs[] = {
 		.alg = "authenc(hmac(sha256),rfc3686(ctr(aes)))",
 		.test = alg_test_null,
 		.fips_allowed = 1,
 	}, {
 		.alg = "authenc(hmac(sha384),cbc(des))",
+		.generic_driver = "authenc(hmac-sha384-lib,cbc(des-generic))",
 		.test = alg_test_aead,
 		.suite = {
 			.aead = __VECS(hmac_sha384_des_cbc_tv_temp)
 		}
 	}, {
 		.alg = "authenc(hmac(sha384),cbc(des3_ede))",
+		.generic_driver = "authenc(hmac-sha384-lib,cbc(des3_ede-generic))",
 		.test = alg_test_aead,
 		.suite = {
 			.aead = __VECS(hmac_sha384_des3_ede_cbc_tv_temp)
 		}
 	}, {
 		.alg = "authenc(hmac(sha384),ctr(aes))",
+		.generic_driver = "authenc(hmac-sha384-lib,ctr(aes-generic))",
 		.test = alg_test_null,
 		.fips_allowed = 1,
 	}, {
 		.alg = "authenc(hmac(sha384),cts(cbc(aes)))",
+		.generic_driver = "authenc(hmac-sha384-lib,cts(cbc(aes-generic)))",
 		.test = alg_test_aead,
 		.suite = {
 			.aead = __VECS(krb5_test_aes256_cts_hmac_sha384_192)
 		}
 	}, {
 		.alg = "authenc(hmac(sha384),rfc3686(ctr(aes)))",
+		.generic_driver = "authenc(hmac-sha384-lib,rfc3686(ctr(aes-generic)))",
 		.test = alg_test_null,
 		.fips_allowed = 1,
 	}, {
 		.alg = "authenc(hmac(sha512),cbc(aes))",
+		.generic_driver = "authenc(hmac-sha512-lib,cbc(aes-generic))",
 		.fips_allowed = 1,
 		.test = alg_test_aead,
 		.suite = {
 			.aead = __VECS(hmac_sha512_aes_cbc_tv_temp)
 		}
 	}, {
 		.alg = "authenc(hmac(sha512),cbc(des))",
+		.generic_driver = "authenc(hmac-sha512-lib,cbc(des-generic))",
 		.test = alg_test_aead,
 		.suite = {
 			.aead = __VECS(hmac_sha512_des_cbc_tv_temp)
 		}
 	}, {
 		.alg = "authenc(hmac(sha512),cbc(des3_ede))",
+		.generic_driver = "authenc(hmac-sha512-lib,cbc(des3_ede-generic))",
 		.test = alg_test_aead,
 		.suite = {
 			.aead = __VECS(hmac_sha512_des3_ede_cbc_tv_temp)
 		}
 	}, {
 		.alg = "authenc(hmac(sha512),ctr(aes))",
+		.generic_driver = "authenc(hmac-sha512-lib,ctr(aes-generic))",
 		.test = alg_test_null,
 		.fips_allowed = 1,
 	}, {
 		.alg = "authenc(hmac(sha512),rfc3686(ctr(aes)))",
+		.generic_driver = "authenc(hmac-sha512-lib,rfc3686(ctr(aes-generic)))",
 		.test = alg_test_null,
 		.fips_allowed = 1,
 	}, {
 		.alg = "blake2b-160",
 		.test = alg_test_hash,
@@ -5146,17 +5156,19 @@  static const struct alg_test_desc alg_test_descs[] = {
 		.suite = {
 			.hash = __VECS(hmac_sha3_512_tv_template)
 		}
 	}, {
 		.alg = "hmac(sha384)",
+		.generic_driver = "hmac-sha384-lib",
 		.test = alg_test_hash,
 		.fips_allowed = 1,
 		.suite = {
 			.hash = __VECS(hmac_sha384_tv_template)
 		}
 	}, {
 		.alg = "hmac(sha512)",
+		.generic_driver = "hmac-sha512-lib",
 		.test = alg_test_hash,
 		.fips_allowed = 1,
 		.suite = {
 			.hash = __VECS(hmac_sha512_tv_template)
 		}
@@ -5338,14 +5350,16 @@  static const struct alg_test_desc alg_test_descs[] = {
 		.alg = "pkcs1(rsa,sha3-512)",
 		.test = alg_test_null,
 		.fips_allowed = 1,
 	}, {
 		.alg = "pkcs1(rsa,sha384)",
+		.generic_driver = "pkcs1(rsa,sha384-lib)",
 		.test = alg_test_null,
 		.fips_allowed = 1,
 	}, {
 		.alg = "pkcs1(rsa,sha512)",
+		.generic_driver = "pkcs1(rsa,sha512-lib)",
 		.test = alg_test_null,
 		.fips_allowed = 1,
 	}, {
 		.alg = "pkcs1pad(rsa)",
 		.test = alg_test_null,
@@ -5482,17 +5496,19 @@  static const struct alg_test_desc alg_test_descs[] = {
 		.suite = {
 			.hash = __VECS(sha3_512_tv_template)
 		}
 	}, {
 		.alg = "sha384",
+		.generic_driver = "sha384-lib",
 		.test = alg_test_hash,
 		.fips_allowed = 1,
 		.suite = {
 			.hash = __VECS(sha384_tv_template)
 		}
 	}, {
 		.alg = "sha512",
+		.generic_driver = "sha512-lib",
 		.test = alg_test_hash,
 		.fips_allowed = 1,
 		.suite = {
 			.hash = __VECS(sha512_tv_template)
 		}
diff --git a/drivers/crypto/starfive/jh7110-hash.c b/drivers/crypto/starfive/jh7110-hash.c
index 2c60a1047bc39..4abbff07412ff 100644
--- a/drivers/crypto/starfive/jh7110-hash.c
+++ b/drivers/crypto/starfive/jh7110-hash.c
@@ -503,17 +503,17 @@  static int starfive_sha256_init_tfm(struct crypto_ahash *hash)
 				      STARFIVE_HASH_SHA256, 0);
 }
 
 static int starfive_sha384_init_tfm(struct crypto_ahash *hash)
 {
-	return starfive_hash_init_tfm(hash, "sha384-generic",
+	return starfive_hash_init_tfm(hash, "sha384-lib",
 				      STARFIVE_HASH_SHA384, 0);
 }
 
 static int starfive_sha512_init_tfm(struct crypto_ahash *hash)
 {
-	return starfive_hash_init_tfm(hash, "sha512-generic",
+	return starfive_hash_init_tfm(hash, "sha512-lib",
 				      STARFIVE_HASH_SHA512, 0);
 }
 
 static int starfive_sm3_init_tfm(struct crypto_ahash *hash)
 {
@@ -533,17 +533,17 @@  static int starfive_hmac_sha256_init_tfm(struct crypto_ahash *hash)
 				      STARFIVE_HASH_SHA256, 1);
 }
 
 static int starfive_hmac_sha384_init_tfm(struct crypto_ahash *hash)
 {
-	return starfive_hash_init_tfm(hash, "hmac(sha384-generic)",
+	return starfive_hash_init_tfm(hash, "hmac-sha384-lib",
 				      STARFIVE_HASH_SHA384, 1);
 }
 
 static int starfive_hmac_sha512_init_tfm(struct crypto_ahash *hash)
 {
-	return starfive_hash_init_tfm(hash, "hmac(sha512-generic)",
+	return starfive_hash_init_tfm(hash, "hmac-sha512-lib",
 				      STARFIVE_HASH_SHA512, 1);
 }
 
 static int starfive_hmac_sm3_init_tfm(struct crypto_ahash *hash)
 {
diff --git a/include/crypto/sha512_base.h b/include/crypto/sha512_base.h
index aa814bab442d4..d1361b3eb70b0 100644
--- a/include/crypto/sha512_base.h
+++ b/include/crypto/sha512_base.h
@@ -112,9 +112,6 @@  static inline int sha512_base_finish(struct shash_desc *desc, u8 *out)
 	for (i = 0; digest_size > 0; i++, digest_size -= sizeof(__be64))
 		put_unaligned_be64(sctx->state[i], digest++);
 	return 0;
 }
 
-void sha512_generic_block_fn(struct sha512_state *sst, u8 const *src,
-			     int blocks);
-
 #endif /* _CRYPTO_SHA512_BASE_H */