diff mbox series

[crypto,1/2] lib/crypto: blake2s-generic: reduce code size on small systems

Message ID 20220111134934.324663-2-Jason@zx2c4.com
State New
Headers show
Series [crypto,1/2] lib/crypto: blake2s-generic: reduce code size on small systems | expand

Commit Message

Jason A. Donenfeld Jan. 11, 2022, 1:49 p.m. UTC
Re-wind the loops entirely on kernels optimized for code size. This is
really not good at all performance-wise. But on m68k, it shaves off 4k
of code size, which is apparently important.

Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
---
 lib/crypto/blake2s-generic.c | 30 ++++++++++++++++++------------
 1 file changed, 18 insertions(+), 12 deletions(-)

Comments

Jason A. Donenfeld Jan. 12, 2022, 1:16 p.m. UTC | #1
Hi Geert,

Thanks for testing this. However, I've *abandoned* this patch, due to
unacceptable performance hits, and figuring out that we can accomplish
basically the same thing without as large of a hit by modifying the
obsolete sha1 implementation.

Herbert - please do not apply this patch. Instead, later versions of
this patchset (e.g. v3 [1] and potentially later if it comes to that)
are what should be applied.

Jason

[1] https://lore.kernel.org/linux-crypto/20220111220506.742067-1-Jason@zx2c4.com/
Eric Biggers Jan. 12, 2022, 6:31 p.m. UTC | #2
On Tue, Jan 11, 2022 at 02:49:33PM +0100, Jason A. Donenfeld wrote:
> Re-wind the loops entirely on kernels optimized for code size. This is
> really not good at all performance-wise. But on m68k, it shaves off 4k
> of code size, which is apparently important.
> 
> Cc: Geert Uytterhoeven <geert@linux-m68k.org>
> Cc: Herbert Xu <herbert@gondor.apana.org.au>
> Cc: Ard Biesheuvel <ardb@kernel.org>
> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
> ---
>  lib/crypto/blake2s-generic.c | 30 ++++++++++++++++++------------
>  1 file changed, 18 insertions(+), 12 deletions(-)
> 
> diff --git a/lib/crypto/blake2s-generic.c b/lib/crypto/blake2s-generic.c
> index 75ccb3e633e6..990f000e22ee 100644
> --- a/lib/crypto/blake2s-generic.c
> +++ b/lib/crypto/blake2s-generic.c
> @@ -46,7 +46,7 @@ void blake2s_compress_generic(struct blake2s_state *state, const u8 *block,
>  {
>  	u32 m[16];
>  	u32 v[16];
> -	int i;
> +	int i, j;
>  
>  	WARN_ON(IS_ENABLED(DEBUG) &&
>  		(nblocks > 1 && inc != BLAKE2S_BLOCK_SIZE));
> @@ -86,17 +86,23 @@ void blake2s_compress_generic(struct blake2s_state *state, const u8 *block,
>  	G(r, 6, v[2], v[ 7], v[ 8], v[13]); \
>  	G(r, 7, v[3], v[ 4], v[ 9], v[14]); \
>  } while (0)
> -		ROUND(0);
> -		ROUND(1);
> -		ROUND(2);
> -		ROUND(3);
> -		ROUND(4);
> -		ROUND(5);
> -		ROUND(6);
> -		ROUND(7);
> -		ROUND(8);
> -		ROUND(9);
> -
> +		if (IS_ENABLED(CONFIG_CC_OPTIMIZE_FOR_SIZE)) {
> +			for (i = 0; i < 10; ++i) {
> +				for (j = 0; j < 8; ++j)
> +					G(i, j, v[j % 4], v[((j + (j / 4)) % 4) + 4], v[((j + 2 * (j / 4)) % 4) + 8], v[((j + 3 * (j / 4)) % 4) + 12]);
> +			}

How about unrolling the inner loop but not the outer one?  Wouldn't that give
most of the benefit, without hurting performance as much?

If you stay with this approach and don't unroll either loop, can you use 'r' and
'i' instead of 'i' and 'j', to match the naming in G()?

Also, please wrap lines at 80 columns.

- Eric
Jason A. Donenfeld Jan. 12, 2022, 6:50 p.m. UTC | #3
On Wed, Jan 12, 2022 at 7:32 PM Eric Biggers <ebiggers@kernel.org> wrote:
> How about unrolling the inner loop but not the outer one?  Wouldn't that give
> most of the benefit, without hurting performance as much?
>
> If you stay with this approach and don't unroll either loop, can you use 'r' and
> 'i' instead of 'i' and 'j', to match the naming in G()?

All this might work, sure. But as mentioned earlier, I've abandoned
this entirely, as I don't think this patch is necessary. See the v3
patchset instead:

https://lore.kernel.org/linux-crypto/20220111220506.742067-1-Jason@zx2c4.com/
David Laight Jan. 12, 2022, 9:27 p.m. UTC | #4
From: Jason A. Donenfeld
> Sent: 12 January 2022 18:51
> 
> On Wed, Jan 12, 2022 at 7:32 PM Eric Biggers <ebiggers@kernel.org> wrote:
> > How about unrolling the inner loop but not the outer one?  Wouldn't that give
> > most of the benefit, without hurting performance as much?
> >
> > If you stay with this approach and don't unroll either loop, can you use 'r' and
> > 'i' instead of 'i' and 'j', to match the naming in G()?
> 
> All this might work, sure. But as mentioned earlier, I've abandoned
> this entirely, as I don't think this patch is necessary. See the v3
> patchset instead:
> 
> https://lore.kernel.org/linux-crypto/20220111220506.742067-1-Jason@zx2c4.com/

I think you mentioned in another thread that the buffers (eg for IPv6
addresses) are actually often quite short.

For short buffers the 'rolled-up' loop may be of similar performance
to the unrolled one because of the time taken to read all the instructions
into the I-cache and decode them.
If the loop ends up small enough it will fit into the 'decoded loop
buffer' of modern Intel x86 cpu and won't even need decoding on
each iteration.

I really suspect that the heavily unrolled loop is only really fast
for big buffers and/or when it is already in the I-cache.
In real life I wonder how often that actually happens?
Especially for the uses the kernel is making of the code.

You need to benchmark single executions of the function
(doable on x86 with the performance monitor cycle counter)
to get typical/best clocks/byte figures rather than a
big average for repeated operation on a long buffer.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Jason A. Donenfeld Jan. 12, 2022, 10 p.m. UTC | #5
Hi David,

On 1/12/22, David Laight <David.Laight@aculab.com> wrote:
> I think you mentioned in another thread that the buffers (eg for IPv6
> addresses) are actually often quite short.
>
> For short buffers the 'rolled-up' loop may be of similar performance
> to the unrolled one because of the time taken to read all the instructions
> into the I-cache and decode them.
> If the loop ends up small enough it will fit into the 'decoded loop
> buffer' of modern Intel x86 cpu and won't even need decoding on
> each iteration.
>
> I really suspect that the heavily unrolled loop is only really fast
> for big buffers and/or when it is already in the I-cache.
> In real life I wonder how often that actually happens?
> Especially for the uses the kernel is making of the code.
>
> You need to benchmark single executions of the function
> (doable on x86 with the performance monitor cycle counter)
> to get typical/best clocks/byte figures rather than a
> big average for repeated operation on a long buffer.
>
> 	David

This patch has been dropped entirely from future revisions. The latest
as of writing is at:

https://lore.kernel.org/linux-crypto/20220111220506.742067-1-Jason@zx2c4.com/

If you'd like to do something with blake2s, by all means submit a
patch and include various rationale and metrics and benchmarks. I do
not intend to do that myself and do not think my particular patch here
should be merged. But if you'd like to do something, feel free to CC
me for a review. However, as mentioned, I don't think much needs to be
done here.

Again, v3 is here:
https://lore.kernel.org/linux-crypto/20220111220506.742067-1-Jason@zx2c4.com/

Thanks,
Jason
diff mbox series

Patch

diff --git a/lib/crypto/blake2s-generic.c b/lib/crypto/blake2s-generic.c
index 75ccb3e633e6..990f000e22ee 100644
--- a/lib/crypto/blake2s-generic.c
+++ b/lib/crypto/blake2s-generic.c
@@ -46,7 +46,7 @@  void blake2s_compress_generic(struct blake2s_state *state, const u8 *block,
 {
 	u32 m[16];
 	u32 v[16];
-	int i;
+	int i, j;
 
 	WARN_ON(IS_ENABLED(DEBUG) &&
 		(nblocks > 1 && inc != BLAKE2S_BLOCK_SIZE));
@@ -86,17 +86,23 @@  void blake2s_compress_generic(struct blake2s_state *state, const u8 *block,
 	G(r, 6, v[2], v[ 7], v[ 8], v[13]); \
 	G(r, 7, v[3], v[ 4], v[ 9], v[14]); \
 } while (0)
-		ROUND(0);
-		ROUND(1);
-		ROUND(2);
-		ROUND(3);
-		ROUND(4);
-		ROUND(5);
-		ROUND(6);
-		ROUND(7);
-		ROUND(8);
-		ROUND(9);
-
+		if (IS_ENABLED(CONFIG_CC_OPTIMIZE_FOR_SIZE)) {
+			for (i = 0; i < 10; ++i) {
+				for (j = 0; j < 8; ++j)
+					G(i, j, v[j % 4], v[((j + (j / 4)) % 4) + 4], v[((j + 2 * (j / 4)) % 4) + 8], v[((j + 3 * (j / 4)) % 4) + 12]);
+			}
+		} else {
+			ROUND(0);
+			ROUND(1);
+			ROUND(2);
+			ROUND(3);
+			ROUND(4);
+			ROUND(5);
+			ROUND(6);
+			ROUND(7);
+			ROUND(8);
+			ROUND(9);
+		}
 #undef G
 #undef ROUND