Message ID | 20241028190207.1394367-8-ardb+git@google.com |
---|---|
Headers | show |
Series | Clean up and improve ARM/arm64 CRC-T10DIF code | expand |
On Mon, Oct 28, 2024 at 08:02:09PM +0100, Ard Biesheuvel wrote: > From: Ard Biesheuvel <ardb@kernel.org> > > This is a partial revert of commit fc754c024a343b, which moved the logic > into C code which ensures that kernel mode NEON code does not hog the > CPU for too long. > > This is no longer needed now that kernel mode NEON no longer disables > preemption, so we can drop this. > > Signed-off-by: Ard Biesheuvel <ardb@kernel.org> > --- > arch/arm64/crypto/crct10dif-ce-glue.c | 30 ++++---------------- > 1 file changed, 6 insertions(+), 24 deletions(-) Reviewed-by: Eric Biggers <ebiggers@google.com> - Eric
On Mon, Oct 28, 2024 at 08:02:11PM +0100, Ard Biesheuvel wrote: > From: Ard Biesheuvel <ardb@kernel.org> > > The only remaining user of the fallback implementation of 64x64 > polynomial multiplication using 8x8 PMULL instructions is the final > reduction from a 16 byte vector to a 16-bit CRC. > > The fallback code is complicated and messy, and this reduction has very > little impact on the overall performance, so instead, let's calculate > the final CRC by passing the 16 byte vector to the generic CRC-T10DIF > implementation. > > Signed-off-by: Ard Biesheuvel <ardb@kernel.org> > --- > arch/arm64/crypto/crct10dif-ce-core.S | 237 +++++--------------- > arch/arm64/crypto/crct10dif-ce-glue.c | 15 +- > 2 files changed, 64 insertions(+), 188 deletions(-) For CRCs of short messages, doing a fast reduction from 128 bits can be quite important. But I agree that when only a 8x8 => 16 carryless multiplication is available, it can't really be optimized, and just falling back to the generic implementation is the right approach in that case. > diff --git a/arch/arm64/crypto/crct10dif-ce-core.S b/arch/arm64/crypto/crct10dif-ce-core.S > index 8d99ccf61f16..1db5d1d1e2b7 100644 [...] > ad .req v14 > - > - k00_16 .req v15 > - k32_48 .req v16 > + bd .req v15 > > t3 .req v17 > t4 .req v18 > @@ -91,117 +89,7 @@ > t8 .req v22 > t9 .req v23 ad, bd, and t9 are no longer used. > + // Use Barrett reduction to compute the final CRC value. > + pmull2 v1.1q, v1.2d, fold_consts.2d // high 32 bits * floor(x^48 / G(x)) v0.2d was accidentally replaced with v1.2d above, which is causing a self-test failure in crct10dif-arm64-ce. Otherwise this patch looks good; thanks! - Eric
On Mon, Oct 28, 2024 at 08:02:13PM +0100, Ard Biesheuvel wrote: > From: Ard Biesheuvel <ardb@kernel.org> > > To allow an alternative version to be created of the PMULL based > CRC-T10DIF algorithm, turn the bulk of it into a macro, except for the > final reduction, which will only be used by the existing version. > > Signed-off-by: Ard Biesheuvel <ardb@kernel.org> > --- > arch/arm/crypto/crct10dif-ce-core.S | 154 ++++++++++---------- > arch/arm/crypto/crct10dif-ce-glue.c | 10 +- > 2 files changed, 83 insertions(+), 81 deletions(-) Reviewed-by: Eric Biggers <ebiggers@google.com> - Eric
From: Ard Biesheuvel <ardb@kernel.org> I realized that the generic sequence implementing 64x64 polynomial multiply using 8x8 PMULL instructions, which is used in the CRC-T10DIF code to implement a fallback version for cores that lack the 64x64 PMULL instruction, is not very efficient. The folding coefficients that are used when processing the bulk of the data are only 16 bits wide, and so 3/4 of the partial results of all those 8x8->16 bit multiplications do not contribute anything to the end result. This means we can use a much faster implementation, producing a speedup of 3.3x on Cortex-A72 without Crypto Extensions (Raspberry Pi 4). The same logic can be ported to 32-bit ARM too, where it produces a speedup of 6.6x compared with the generic C implementation on the same platform. Ard Biesheuvel (6): crypto: arm64/crct10dif - Remove obsolete chunking logic crypto: arm64/crct10dif - Use faster 16x64 bit polynomial multiply crypto: arm64/crct10dif - Remove remaining 64x64 PMULL fallback code crypto: arm/crct10dif - Use existing mov_l macro instead of __adrl crypto: arm/crct10dif - Macroify PMULL asm code crypto: arm/crct10dif - Implement plain NEON variant arch/arm/crypto/crct10dif-ce-core.S | 201 ++++++++------ arch/arm/crypto/crct10dif-ce-glue.c | 54 +++- arch/arm64/crypto/crct10dif-ce-core.S | 282 +++++++------------- arch/arm64/crypto/crct10dif-ce-glue.c | 43 ++- 4 files changed, 274 insertions(+), 306 deletions(-)