[1/7] crypto: x86/aes-gcm - code size optimization

Message ID	20241210235834.40862-2-ebiggers@kernel.org
State	Superseded
Headers	show Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id B12CE22C36B for <linux-crypto@vger.kernel.org>; Tue, 10 Dec 2024 23:59:09 +0000 (UTC) From: Eric Biggers <ebiggers@kernel.org> To: linux-crypto@vger.kernel.org Cc: x86@kernel.org Subject: [PATCH 1/7] crypto: x86/aes-gcm - code size optimization Date: Tue, 10 Dec 2024 15:58:28 -0800 Message-ID: <20241210235834.40862-2-ebiggers@kernel.org> In-Reply-To: <20241210235834.40862-1-ebiggers@kernel.org> References: <20241210235834.40862-1-ebiggers@kernel.org> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	crypto: x86 - minor optimizations and cleanup to VAES code \| expand [0/7] crypto: x86 - minor optimizations and cleanup to VAES code [1/7] crypto: x86/aes-gcm - code size optimization [2/7] crypto: x86/aes-gcm - tune better for AMD CPUs [3/7] crypto: x86/aes-xts - use .irp when useful [4/7] crypto: x86/aes-xts - make the register aliases per-function [5/7] crypto: x86/aes-xts - improve some comments [6/7] crypto: x86/aes-xts - change len parameter to int [7/7] crypto: x86/aes-xts - more code size optimizations

Message ID

20241210235834.40862-2-ebiggers@kernel.org

State

Superseded

Headers

From: Eric Biggers <ebiggers@kernel.org>
To: linux-crypto@vger.kernel.org
Cc: x86@kernel.org
Subject: [PATCH 1/7] crypto: x86/aes-gcm - code size optimization
Date: Tue, 10 Dec 2024 15:58:28 -0800
Message-ID: <20241210235834.40862-2-ebiggers@kernel.org>
In-Reply-To: <20241210235834.40862-1-ebiggers@kernel.org>
References: <20241210235834.40862-1-ebiggers@kernel.org>
Precedence: bulk
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

Series

crypto: x86 - minor optimizations and cleanup to VAES code | expand

Commit Message

Eric Biggers Dec. 10, 2024, 11:58 p.m. UTC

From: Eric Biggers <ebiggers@google.com>

Prefer immediates of -128 to 128, since the former fits in a signed
byte, saving 3 bytes per instruction.  Also replace a vpand and vpxor
with a vpternlogd.

Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 arch/x86/crypto/aes-gcm-avx10-x86_64.S | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/arch/x86/crypto/aes-gcm-avx10-x86_64.S b/arch/x86/crypto/aes-gcm-avx10-x86_64.S
index 97e0ee515fc5f..8989bf9b8384d 100644
--- a/arch/x86/crypto/aes-gcm-avx10-x86_64.S
+++ b/arch/x86/crypto/aes-gcm-avx10-x86_64.S
@@ -382,12 +382,12 @@ 
 	// wide shift instruction, so instead double each of the two 64-bit
 	// halves and incorporate the internal carry bit into the value XOR'd.
 	vpshufd		$0xd3, H_CUR_XMM, %xmm0
 	vpsrad		$31, %xmm0, %xmm0
 	vpaddq		H_CUR_XMM, H_CUR_XMM, H_CUR_XMM
-	vpand		.Lgfpoly_and_internal_carrybit(%rip), %xmm0, %xmm0
-	vpxor		%xmm0, H_CUR_XMM, H_CUR_XMM
+	// H_CUR_XMM ^= xmm0 & gfpoly_and_internal_carrybit
+	vpternlogd	$0x78, .Lgfpoly_and_internal_carrybit(%rip), %xmm0, H_CUR_XMM
 
 	// Load the gfpoly constant.
 	vbroadcasti32x4	.Lgfpoly(%rip), GFPOLY
 
 	// Square H^1 to get H^2.
@@ -711,11 +711,11 @@ 
 	// that processes 4*VL bytes of data at a time.  Otherwise skip it.
 	//
 	// Pre-subtracting 4*VL from DATALEN saves an instruction from the main
 	// loop and also ensures that at least one write always occurs to
 	// DATALEN, zero-extending it and allowing DATALEN64 to be used later.
-	sub		$4*VL, DATALEN
+	add		$-4*VL, DATALEN  // shorter than 'sub 4*VL' when VL=32
 	jl		.Lcrypt_loop_4x_done\@
 
 	// Load powers of the hash key.
 	vmovdqu8	OFFSETOFEND_H_POWERS-4*VL(KEY), H_POW4
 	vmovdqu8	OFFSETOFEND_H_POWERS-3*VL(KEY), H_POW3
@@ -758,13 +758,13 @@ 
 	vaesenclast	RNDKEYLAST3, V3, GHASHDATA3
 	vmovdqu8	GHASHDATA0, 0*VL(DST)
 	vmovdqu8	GHASHDATA1, 1*VL(DST)
 	vmovdqu8	GHASHDATA2, 2*VL(DST)
 	vmovdqu8	GHASHDATA3, 3*VL(DST)
-	add		$4*VL, SRC
-	add		$4*VL, DST
-	sub		$4*VL, DATALEN
+	sub		$-4*VL, SRC  // shorter than 'add 4*VL' when VL=32
+	sub		$-4*VL, DST
+	add		$-4*VL, DATALEN
 	jl		.Lghash_last_ciphertext_4x\@
 .endif
 
 	// Cache as many additional AES round keys as possible.
 .irp i, 9,8,7,6,5
@@ -838,13 +838,13 @@ 
 	vmovdqu8	GHASHDATA0, 0*VL(DST)
 	vmovdqu8	GHASHDATA1, 1*VL(DST)
 	vmovdqu8	GHASHDATA2, 2*VL(DST)
 	vmovdqu8	GHASHDATA3, 3*VL(DST)
 
-	add		$4*VL, SRC
-	add		$4*VL, DST
-	sub		$4*VL, DATALEN
+	sub		$-4*VL, SRC  // shorter than 'add 4*VL' when VL=32
+	sub		$-4*VL, DST
+	add		$-4*VL, DATALEN
 	jge		.Lcrypt_loop_4x\@
 
 .if \enc
 .Lghash_last_ciphertext_4x\@:
 	// Update GHASH with the last set of ciphertext blocks.
@@ -854,11 +854,11 @@ 
 .endif
 
 .Lcrypt_loop_4x_done\@:
 
 	// Undo the extra subtraction by 4*VL and check whether data remains.
-	add		$4*VL, DATALEN
+	sub		$-4*VL, DATALEN  // shorter than 'add 4*VL' when VL=32
 	jz		.Ldone\@
 
 	// The data length isn't a multiple of 4*VL.  Process the remaining data
 	// of length 1 <= DATALEN < 4*VL, up to one vector (VL bytes) at a time.
 	// Going one vector at a time may seem inefficient compared to having

[1/7] crypto: x86/aes-gcm - code size optimization

Commit Message

Patch