mbox series

[0/5] crypto: add NEON-optimized BLAKE2b

Message ID 20201215234708.105527-1-ebiggers@kernel.org
Headers show
Series crypto: add NEON-optimized BLAKE2b | expand

Message

Eric Biggers Dec. 15, 2020, 11:47 p.m. UTC
This patchset adds a NEON implementation of BLAKE2b for 32-bit ARM.
Patches 1-4 prepare for it by making some updates to the generic
implementation, while patch 5 adds the actual NEON implementation.

On Cortex-A7 (which these days is the most common ARM processor that
doesn't have the ARMv8 Crypto Extensions), this is over twice as fast as
SHA-256, and slightly faster than SHA-1.  It is also almost three times
as fast as the generic implementation of BLAKE2b:

	Algorithm            Cycles per byte (on 4096-byte messages)
	===================  =======================================
	blake2b-256-neon     14.1
	sha1-neon            16.4
	sha1-asm             20.8
	blake2s-256-generic  26.1
	sha256-neon	     28.9
	sha256-asm	     32.1
	blake2b-256-generic  39.9

This implementation isn't directly based on any other implementation,
but it borrows some ideas from previous NEON code I've written as well
as from chacha-neon-core.S.  At least on Cortex-A7, it is faster than
the other NEON implementations of BLAKE2b I'm aware of (the
implementation in the BLAKE2 official repository using intrinsics, and
Andrew Moon's implementation which can be found in SUPERCOP).

NEON-optimized BLAKE2b is useful because there is interest in using
BLAKE2b-256 for dm-verity on low-end Android devices (specifically,
devices that lack the ARMv8 Crypto Extensions) to replace SHA-1.  On
these devices, the performance cost of upgrading to SHA-256 may be
unacceptable, whereas BLAKE2b-256 would actually improve performance.

Although BLAKE2b is intended for 64-bit platforms (unlike BLAKE2s which
is intended for 32-bit platforms), on 32-bit ARM processors with NEON,
BLAKE2b is actually faster than BLAKE2s.  This is because NEON supports
64-bit operations, and because BLAKE2s's block size is too small for
NEON to be helpful for it.  The best I've been able to do with BLAKE2s
on Cortex-A7 is 19.0 cpb with an optimized scalar implementation.

(I didn't try BLAKE2sp and BLAKE3, which in theory would be faster, but
they're more complex as they require running multiple hashes at once.
Note that BLAKE2b already uses all the NEON bandwidth on the Cortex-A7,
so I expect that any speedup from BLAKE2sp or BLAKE3 would come only
from the smaller number of rounds, not from the extra parallelism.)

This patchset was tested on a Raspberry Pi 2, including with
CONFIG_CRYPTO_MANAGER_EXTRA_TESTS=y.

Eric Biggers (5):
  crypto: blake2b - rename constants for consistency with blake2s
  crypto: blake2b - define shash_alg structs using macros
  crypto: blake2b - export helpers for optimized implementations
  crypto: blake2b - update file comment
  crypto: arm/blake2b - add NEON-optimized BLAKE2b implementation

 arch/arm/crypto/Kconfig             |  10 +
 arch/arm/crypto/Makefile            |   2 +
 arch/arm/crypto/blake2b-neon-core.S | 357 ++++++++++++++++++++++++++++
 arch/arm/crypto/blake2b-neon-glue.c | 105 ++++++++
 crypto/blake2b_generic.c            | 205 +++++++---------
 include/crypto/blake2b.h            |  54 +++++
 6 files changed, 619 insertions(+), 114 deletions(-)
 create mode 100644 arch/arm/crypto/blake2b-neon-core.S
 create mode 100644 arch/arm/crypto/blake2b-neon-glue.c
 create mode 100644 include/crypto/blake2b.h


base-commit: 3db1a3fa98808aa90f95ec3e0fa2fc7abf28f5c9

Comments

Eric Biggers Dec. 16, 2020, 8:47 p.m. UTC | #1
On Tue, Dec 15, 2020 at 03:47:03PM -0800, Eric Biggers wrote:
> This patchset adds a NEON implementation of BLAKE2b for 32-bit ARM.

> Patches 1-4 prepare for it by making some updates to the generic

> implementation, while patch 5 adds the actual NEON implementation.

> 

> On Cortex-A7 (which these days is the most common ARM processor that

> doesn't have the ARMv8 Crypto Extensions), this is over twice as fast as

> SHA-256, and slightly faster than SHA-1.  It is also almost three times

> as fast as the generic implementation of BLAKE2b:

> 

> 	Algorithm            Cycles per byte (on 4096-byte messages)

> 	===================  =======================================

> 	blake2b-256-neon     14.1

> 	sha1-neon            16.4

> 	sha1-asm             20.8

> 	blake2s-256-generic  26.1

> 	sha256-neon	     28.9

> 	sha256-asm	     32.1

> 	blake2b-256-generic  39.9

> 

> This implementation isn't directly based on any other implementation,

> but it borrows some ideas from previous NEON code I've written as well

> as from chacha-neon-core.S.  At least on Cortex-A7, it is faster than

> the other NEON implementations of BLAKE2b I'm aware of (the

> implementation in the BLAKE2 official repository using intrinsics, and

> Andrew Moon's implementation which can be found in SUPERCOP).

> 

> NEON-optimized BLAKE2b is useful because there is interest in using

> BLAKE2b-256 for dm-verity on low-end Android devices (specifically,

> devices that lack the ARMv8 Crypto Extensions) to replace SHA-1.  On

> these devices, the performance cost of upgrading to SHA-256 may be

> unacceptable, whereas BLAKE2b-256 would actually improve performance.

> 

> Although BLAKE2b is intended for 64-bit platforms (unlike BLAKE2s which

> is intended for 32-bit platforms), on 32-bit ARM processors with NEON,

> BLAKE2b is actually faster than BLAKE2s.  This is because NEON supports

> 64-bit operations, and because BLAKE2s's block size is too small for

> NEON to be helpful for it.  The best I've been able to do with BLAKE2s

> on Cortex-A7 is 19.0 cpb with an optimized scalar implementation.


By the way, if people are interested in having my ARM scalar implementation of
BLAKE2s in the kernel too, I can send a patchset for that too.  It just ended up
being slower than BLAKE2b and SHA-1, so it wasn't as good for the use case
mentioned above.  If it were to be added as "blake2s-256-arm", we'd have:

	Algorithm            Cycles per byte (on 4096-byte messages)
	===================  =======================================
	blake2b-256-neon     14.1
	sha1-neon            16.4
	blake2s-256-arm      19.0
	sha1-asm             20.8
	blake2s-256-generic  26.1
	sha256-neon	     28.9
	sha256-asm	     32.1
	blake2b-256-generic  39.9
Jason A. Donenfeld Dec. 16, 2020, 10:32 p.m. UTC | #2
Hi Eric,

On Wed, Dec 16, 2020 at 9:48 PM Eric Biggers <ebiggers@kernel.org> wrote:
> By the way, if people are interested in having my ARM scalar implementation of

> BLAKE2s in the kernel too, I can send a patchset for that too.  It just ended up

> being slower than BLAKE2b and SHA-1, so it wasn't as good for the use case

> mentioned above.  If it were to be added as "blake2s-256-arm", we'd have:


I'd certainly be interested in this. Any rough idea how it performs
for pretty small messages compared to the generic implementation?
100-140 byte ranges? Is the speedup about the same as for longer
messages because this doesn't parallelize across multiple blocks?

Jason
Eric Biggers Dec. 17, 2020, 3:54 a.m. UTC | #3
On Wed, Dec 16, 2020 at 11:32:44PM +0100, Jason A. Donenfeld wrote:
> Hi Eric,

> 

> On Wed, Dec 16, 2020 at 9:48 PM Eric Biggers <ebiggers@kernel.org> wrote:

> > By the way, if people are interested in having my ARM scalar implementation of

> > BLAKE2s in the kernel too, I can send a patchset for that too.  It just ended up

> > being slower than BLAKE2b and SHA-1, so it wasn't as good for the use case

> > mentioned above.  If it were to be added as "blake2s-256-arm", we'd have:

> 

> I'd certainly be interested in this. Any rough idea how it performs

> for pretty small messages compared to the generic implementation?

> 100-140 byte ranges? Is the speedup about the same as for longer

> messages because this doesn't parallelize across multiple blocks?

> 


It does one block at a time, and there isn't much overhead, so yes the speedup
on short messages should be about the same as on long messages.

I did a couple quick userspace benchmarks and got (still on Cortex-A7):

	100-byte messages:
		BLAKE2s ARM:     28.9 cpb
		BLAKE2s generic: 42.4 cpb

	140-byte messages:
		BLAKE2s ARM:     29.5 cpb
		BLAKE2s generic: 44.0 cpb

The results in the kernel may differ a bit, but probably not by much.

- Eric
Jason A. Donenfeld Dec. 17, 2020, 2:01 p.m. UTC | #4
On Thu, Dec 17, 2020 at 4:54 AM Eric Biggers <ebiggers@kernel.org> wrote:
>

> On Wed, Dec 16, 2020 at 11:32:44PM +0100, Jason A. Donenfeld wrote:

> > Hi Eric,

> >

> > On Wed, Dec 16, 2020 at 9:48 PM Eric Biggers <ebiggers@kernel.org> wrote:

> > > By the way, if people are interested in having my ARM scalar implementation of

> > > BLAKE2s in the kernel too, I can send a patchset for that too.  It just ended up

> > > being slower than BLAKE2b and SHA-1, so it wasn't as good for the use case

> > > mentioned above.  If it were to be added as "blake2s-256-arm", we'd have:

> >

> > I'd certainly be interested in this. Any rough idea how it performs

> > for pretty small messages compared to the generic implementation?

> > 100-140 byte ranges? Is the speedup about the same as for longer

> > messages because this doesn't parallelize across multiple blocks?

> >

>

> It does one block at a time, and there isn't much overhead, so yes the speedup

> on short messages should be about the same as on long messages.

>

> I did a couple quick userspace benchmarks and got (still on Cortex-A7):

>

>         100-byte messages:

>                 BLAKE2s ARM:     28.9 cpb

>                 BLAKE2s generic: 42.4 cpb

>

>         140-byte messages:

>                 BLAKE2s ARM:     29.5 cpb

>                 BLAKE2s generic: 44.0 cpb

>

> The results in the kernel may differ a bit, but probably not by much.


That's certainly a nice improvement though, and I'd very much welcome
the faster implementation.

Jason
David Sterba Dec. 17, 2020, 5:13 p.m. UTC | #5
On Tue, Dec 15, 2020 at 03:47:04PM -0800, Eric Biggers wrote:
> From: Eric Biggers <ebiggers@google.com>

> 

> Rename some BLAKE2b-related constants to be consistent with the names

> used in the BLAKE2s implementation (see include/crypto/blake2s.h):

> 

> 	BLAKE2B_*_DIGEST_SIZE  => BLAKE2B_*_HASH_SIZE

> 	BLAKE2B_BLOCKBYTES     => BLAKE2B_BLOCK_SIZE

> 	BLAKE2B_KEYBYTES       => BLAKE2B_KEY_SIZE

> 

> Signed-off-by: Eric Biggers <ebiggers@google.com>


Reviewed-by: David Sterba <dsterba@suse.com>
David Sterba Dec. 17, 2020, 5:15 p.m. UTC | #6
On Tue, Dec 15, 2020 at 03:47:06PM -0800, Eric Biggers wrote:
> From: Eric Biggers <ebiggers@google.com>

> 

> In preparation for adding architecture-specific implementations of

> BLAKE2b, create a header <crypto/blake2b.h> that contains common

> constants, structs, and helper functions for BLAKE2b.

> 

> Furthermore, export the BLAKE2b generic setkey(), init(), update(), and

> final() functions, and add functions __crypto_blake2b_update() and

> __crypto_blake2b_final() which take a pointer to a

> blake2b_compress_blocks_t function.

> 

> This way, optimized implementations of BLAKE2b only have to provide an

> implementation of blake2b_compress_blocks_t.  (This is modeled on how

> the nhpoly1305 implementations work.  Also, the prototype of

> blake2b_compress_blocks_t is meant to be similar to that of

> blake2s_compress_arch().)

> 

> Signed-off-by: Eric Biggers <ebiggers@google.com>


Reviewed-by: David Sterba <dsterba@suse.com>
David Sterba Dec. 17, 2020, 5:17 p.m. UTC | #7
On Tue, Dec 15, 2020 at 03:47:07PM -0800, Eric Biggers wrote:
> From: Eric Biggers <ebiggers@google.com>

> 

> The file comment for blake2b_generic.c makes it sound like it's the

> reference implementation of BLAKE2b with only minor changes.  But it's

> actually been changed a lot.  Update the comment to make this clearer.


After your changes I agree it appropriate to reword it.

> Signed-off-by: Eric Biggers <ebiggers@google.com>


Reviewed-by: David Sterba <dsterba@suse.com>
Eric Biggers Dec. 17, 2020, 10:33 p.m. UTC | #8
On Thu, Dec 17, 2020 at 06:15:40PM +0100, David Sterba wrote:
> On Tue, Dec 15, 2020 at 03:47:06PM -0800, Eric Biggers wrote:

> > From: Eric Biggers <ebiggers@google.com>

> > 

> > In preparation for adding architecture-specific implementations of

> > BLAKE2b, create a header <crypto/blake2b.h> that contains common

> > constants, structs, and helper functions for BLAKE2b.

> > 

> > Furthermore, export the BLAKE2b generic setkey(), init(), update(), and

> > final() functions, and add functions __crypto_blake2b_update() and

> > __crypto_blake2b_final() which take a pointer to a

> > blake2b_compress_blocks_t function.

> > 

> > This way, optimized implementations of BLAKE2b only have to provide an

> > implementation of blake2b_compress_blocks_t.  (This is modeled on how

> > the nhpoly1305 implementations work.  Also, the prototype of

> > blake2b_compress_blocks_t is meant to be similar to that of

> > blake2s_compress_arch().)

> > 

> > Signed-off-by: Eric Biggers <ebiggers@google.com>

> 

> Reviewed-by: David Sterba <dsterba@suse.com>


I ended up making some changes to this patch in v2, to keep things consistent
with what I decided to do for BLAKE2s, so I didn't add your Reviewed-by to this
patch.  If you could review the new version too that would be great -- thanks!

- Eric