[0/6] Clean up and improve ARM/arm64 CRC-T10DIF code

Message ID	20241028190207.1394367-8-ardb+git@google.com
Headers	show Received: from mail-yb1-f202.google.com (mail-yb1-f202.google.com [209.85.219.202]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id B886418C333 for <linux-crypto@vger.kernel.org>; Mon, 28 Oct 2024 19:02:30 +0000 (UTC) Date: Mon, 28 Oct 2024 20:02:08 +0100 Precedence: bulk Mime-Version: 1.0 Message-ID: <20241028190207.1394367-8-ardb+git@google.com> Subject: [PATCH 0/6] Clean up and improve ARM/arm64 CRC-T10DIF code From: Ard Biesheuvel <ardb+git@google.com> To: linux-crypto@vger.kernel.org Cc: linux-arm-kernel@lists.infradead.org, ebiggers@kernel.org, herbert@gondor.apana.org.au, keescook@chromium.org, Ard Biesheuvel <ardb@kernel.org> Content-Type: text/plain; charset="UTF-8"
Series	Clean up and improve ARM/arm64 CRC-T10DIF code \| expand [0/6] Clean up and improve ARM/arm64 CRC-T10DIF code [1/6] crypto: arm64/crct10dif - Remove obsolete chunking logic [2/6] crypto: arm64/crct10dif - Use faster 16x64 bit polynomial multiply [3/6] crypto: arm64/crct10dif - Remove remaining 64x64 PMULL fallback code [4/6] crypto: arm/crct10dif - Use existing mov_l macro instead of __adrl [5/6] crypto: arm/crct10dif - Macroify PMULL asm code [6/6] crypto: arm/crct10dif - Implement plain NEON variant

Message ID

20241028190207.1394367-8-ardb+git@google.com

Headers

Date: Mon, 28 Oct 2024 20:02:08 +0100
Precedence: bulk
Mime-Version: 1.0
Message-ID: <20241028190207.1394367-8-ardb+git@google.com>
Subject: [PATCH 0/6] Clean up and improve ARM/arm64 CRC-T10DIF code
From: Ard Biesheuvel <ardb+git@google.com>
To: linux-crypto@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org, ebiggers@kernel.org,
	herbert@gondor.apana.org.au, keescook@chromium.org,
	Ard Biesheuvel <ardb@kernel.org>
Content-Type: text/plain; charset="UTF-8"

Series

Clean up and improve ARM/arm64 CRC-T10DIF code | expand

Message

Ard Biesheuvel Oct. 28, 2024, 7:02 p.m. UTC

From: Ard Biesheuvel <ardb@kernel.org>

I realized that the generic sequence implementing 64x64 polynomial
multiply using 8x8 PMULL instructions, which is used in the CRC-T10DIF
code to implement a fallback version for cores that lack the 64x64 PMULL
instruction, is not very efficient.

The folding coefficients that are used when processing the bulk of the
data are only 16 bits wide, and so 3/4 of the partial results of all those
8x8->16 bit multiplications do not contribute anything to the end result.

This means we can use a much faster implementation, producing a speedup
of 3.3x on Cortex-A72 without Crypto Extensions (Raspberry Pi 4).

The same logic can be ported to 32-bit ARM too, where it produces a
speedup of 6.6x compared with the generic C implementation on the same
platform.

Ard Biesheuvel (6):
  crypto: arm64/crct10dif - Remove obsolete chunking logic
  crypto: arm64/crct10dif - Use faster 16x64 bit polynomial multiply
  crypto: arm64/crct10dif - Remove remaining 64x64 PMULL fallback code
  crypto: arm/crct10dif - Use existing mov_l macro instead of __adrl
  crypto: arm/crct10dif - Macroify PMULL asm code
  crypto: arm/crct10dif - Implement plain NEON variant

 arch/arm/crypto/crct10dif-ce-core.S   | 201 ++++++++------
 arch/arm/crypto/crct10dif-ce-glue.c   |  54 +++-
 arch/arm64/crypto/crct10dif-ce-core.S | 282 +++++++-------------
 arch/arm64/crypto/crct10dif-ce-glue.c |  43 ++-
 4 files changed, 274 insertions(+), 306 deletions(-)

Comments

Eric Biggers Oct. 30, 2024, 3:54 a.m. UTC | #1

On Mon, Oct 28, 2024 at 08:02:09PM +0100, Ard Biesheuvel wrote:
> From: Ard Biesheuvel <ardb@kernel.org>
> 
> This is a partial revert of commit fc754c024a343b, which moved the logic
> into C code which ensures that kernel mode NEON code does not hog the
> CPU for too long.
> 
> This is no longer needed now that kernel mode NEON no longer disables
> preemption, so we can drop this.
> 
> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> ---
>  arch/arm64/crypto/crct10dif-ce-glue.c | 30 ++++----------------
>  1 file changed, 6 insertions(+), 24 deletions(-)

Reviewed-by: Eric Biggers <ebiggers@google.com>

- Eric

Eric Biggers Oct. 30, 2024, 4:15 a.m. UTC | #2

On Mon, Oct 28, 2024 at 08:02:11PM +0100, Ard Biesheuvel wrote:
> From: Ard Biesheuvel <ardb@kernel.org>
> 
> The only remaining user of the fallback implementation of 64x64
> polynomial multiplication using 8x8 PMULL instructions is the final
> reduction from a 16 byte vector to a 16-bit CRC.
> 
> The fallback code is complicated and messy, and this reduction has very
> little impact on the overall performance, so instead, let's calculate
> the final CRC by passing the 16 byte vector to the generic CRC-T10DIF
> implementation.
> 
> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> ---
>  arch/arm64/crypto/crct10dif-ce-core.S | 237 +++++---------------
>  arch/arm64/crypto/crct10dif-ce-glue.c |  15 +-
>  2 files changed, 64 insertions(+), 188 deletions(-)

For CRCs of short messages, doing a fast reduction from 128 bits can be quite
important.  But I agree that when only a 8x8 => 16 carryless multiplication is
available, it can't really be optimized, and just falling back to the generic
implementation is the right approach in that case.

> diff --git a/arch/arm64/crypto/crct10dif-ce-core.S b/arch/arm64/crypto/crct10dif-ce-core.S
> index 8d99ccf61f16..1db5d1d1e2b7 100644
[...]
>  	ad		.req	v14
> -
> -	k00_16		.req	v15
> -	k32_48		.req	v16
> +	bd		.req	v15
>  
>  	t3		.req	v17
>  	t4		.req	v18
> @@ -91,117 +89,7 @@
>  	t8		.req	v22
>  	t9		.req	v23

ad, bd, and t9 are no longer used.

> +	// Use Barrett reduction to compute the final CRC value.
> +	pmull2		v1.1q, v1.2d, fold_consts.2d	// high 32 bits * floor(x^48 / G(x))

v0.2d was accidentally replaced with v1.2d above, which is causing a self-test
failure in crct10dif-arm64-ce.

Otherwise this patch looks good; thanks!

- Eric

Eric Biggers Oct. 30, 2024, 4:31 a.m. UTC | #3

On Mon, Oct 28, 2024 at 08:02:13PM +0100, Ard Biesheuvel wrote:
> From: Ard Biesheuvel <ardb@kernel.org>
> 
> To allow an alternative version to be created of the PMULL based
> CRC-T10DIF algorithm, turn the bulk of it into a macro, except for the
> final reduction, which will only be used by the existing version.
> 
> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> ---
>  arch/arm/crypto/crct10dif-ce-core.S | 154 ++++++++++----------
>  arch/arm/crypto/crct10dif-ce-glue.c |  10 +-
>  2 files changed, 83 insertions(+), 81 deletions(-)

Reviewed-by: Eric Biggers <ebiggers@google.com>

- Eric