[v2,6/6] crypto: x86/aes-xts - wire up VAES + AVX10/512 implementation

Message ID 20240329080355.2871-7-ebiggers@kernel.org
State New
Series Faster AES-XTS on modern x86_64 CPUs

Commit Message

Eric Biggers March 29, 2024, 8:03 a.m. UTC
From: Eric Biggers <ebiggers@google.com>

Add an AES-XTS implementation "xts-aes-vaes-avx10_512" for x86_64 CPUs
with the VAES, VPCLMULQDQ, and either AVX10/512 or AVX512BW + AVX512VL
extensions.  This implementation uses zmm registers to operate on four
AES blocks at a time.  The assembly code is instantiated using a macro
so that most of the source code is shared with other implementations.

To avoid downclocking on older Intel CPU models, an exclusion list is
used to prevent this 512-bit implementation from being used by default
on some CPU models.  They will use xts-aes-vaes-avx10_256 instead.  For
now, this exclusion list is simply coded into aesni-intel_glue.c.  It
may make sense to eventually move it into a more central location.

xts-aes-vaes-avx10_512 is slightly faster than xts-aes-vaes-avx10_256 on
some current CPUs.  E.g., on AMD Zen 4, AES-256-XTS decryption
throughput increases by 13% with 4096-byte inputs, or 14% with 512-byte
inputs.  On Intel Sapphire Rapids, AES-256-XTS decryption throughput
increases by 2% with 4096-byte inputs, or 3% with 512-byte inputs.

Future CPUs may provide stronger 512-bit support, in which case a larger
benefit should be seen.

Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 arch/x86/crypto/aes-xts-avx-x86_64.S |  9 ++++++++
 arch/x86/crypto/aesni-intel_glue.c   | 32 ++++++++++++++++++++++++++++
 2 files changed, 41 insertions(+)

Comments

Dave Hansen April 4, 2024, 8:34 p.m. UTC | #1
On 3/29/24 01:03, Eric Biggers wrote:
> +static const struct x86_cpu_id zmm_exclusion_list[] = {
> +	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_SKYLAKE_X },
> +	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_ICELAKE_X },
> +	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_ICELAKE_D },
> +	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_ICELAKE },
> +	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_ICELAKE_L },
> +	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_ICELAKE_NNPI },
> +	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_TIGERLAKE_L },
> +	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_TIGERLAKE },
> +	/* Allow Rocket Lake and later, and Sapphire Rapids and later. */
> +	/* Also allow AMD CPUs (starting with Zen 4, the first with AVX-512). */
> +	{},
> +};

A hard-coded model/family exclusion list is not great.

It'll break when running in guests on newer CPUs that fake any of these
models.  Some folks will also surely disagree with the kernel policy
implemented here.

Is there no way to implement this other than a hard-coded kernel policy?
Eric Biggers April 4, 2024, 11:36 p.m. UTC | #2
On Thu, Apr 04, 2024 at 01:34:04PM -0700, Dave Hansen wrote:
> On 3/29/24 01:03, Eric Biggers wrote:
> > +static const struct x86_cpu_id zmm_exclusion_list[] = {
> > +	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_SKYLAKE_X },
> > +	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_ICELAKE_X },
> > +	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_ICELAKE_D },
> > +	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_ICELAKE },
> > +	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_ICELAKE_L },
> > +	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_ICELAKE_NNPI },
> > +	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_TIGERLAKE_L },
> > +	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_TIGERLAKE },
> > +	/* Allow Rocket Lake and later, and Sapphire Rapids and later. */
> > +	/* Also allow AMD CPUs (starting with Zen 4, the first with AVX-512). */
> > +	{},
> > +};
> 
> A hard-coded model/family exclusion list is not great.
> 
> It'll break when running in guests on newer CPUs that fake any of these
> models.  Some folks will also surely disagree with the kernel policy
> implemented here.
> 
> Is there no way to implement this other than a hard-coded kernel policy?

Besides the hardcoded CPU exclusion list, the options are:

1. Never use zmm registers.

2. Ignore the issue and use zmm registers even on these CPU models.  Systemwide
   performance may suffer due to downclocking.

3. Do a runtime test to detect whether using zmm registers causes downclocking.
   This seems impractical.

4. Keep the proposed policy as the default behavior, but allow it to be
   overridden on the kernel command line.  This would be a bit more flexible;
   however, most people don't change defaults anyway.
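   For concreteness, option (4) could be as small as a module parameter gating
   the priority drop.  A rough, untested sketch against the code in this patch
   (the parameter name "prefer_zmm" is made up):

	/* Hypothetical knob for option (4); "prefer_zmm" is a made-up name. */
	static bool prefer_zmm;
	module_param(prefer_zmm, bool, 0444);
	MODULE_PARM_DESC(prefer_zmm,
			 "Use the zmm (512-bit) AES-XTS implementation even on excluded CPU models");

	static int __init register_xts_algs(void)
	{
		...
		/* Only demote the 512-bit algorithm if the user didn't opt in. */
		if (x86_match_cpu(zmm_exclusion_list) && !prefer_zmm)
			aes_xts_alg_vaes_avx10_512.base.cra_priority = 1;
		...
	}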

When you write "Some folks will also surely disagree with the kernel policy
implemented here", are there any specific concerns that you anticipate?  Note
that Intel has acknowledged the zmm downclocking issues on Ice Lake and
suggested that using ymm registers instead would be reasonable:
https://lore.kernel.org/linux-crypto/e8ce1146-3952-6977-1d0e-a22758e58914@intel.com/

If there is really a controversy, my vote is that for now we just go with option
(1), i.e. drop this patch from the series.  We can reconsider the issue when a
CPU is released with better 512-bit support.

- Eric
Dave Hansen April 4, 2024, 11:53 p.m. UTC | #3
On 4/4/24 16:36, Eric Biggers wrote:
> 1. Never use zmm registers.
...
> 4. Keep the proposed policy as the default behavior, but allow it to be
>    overridden on the kernel command line.  This would be a bit more flexible;
>    however, most people don't change defaults anyway.
> 
> When you write "Some folks will also surely disagree with the kernel policy
> implemented here", are there any specific concerns that you anticipate?

Some people care less about the frequency throttling and only care about
max performance _using_ AVX512.

> Note that Intel has acknowledged the zmm downclocking issues on Ice
> Lake and suggested that using ymm registers instead would be reasonable:
> https://lore.kernel.org/linux-crypto/e8ce1146-3952-6977-1d0e-a22758e58914@intel.com/
> 
> If there is really a controversy, my vote is that for now we just go with option
> (1), i.e. drop this patch from the series.  We can reconsider the issue when a
> CPU is released with better 512-bit support.

(1) is fine with me.

(4) would also be fine.  But I don't think it absolutely _has_ to be a
boot-time switch.  What prevents you from registering, say,
"xts-aes-vaes-avx10" and then doing:

	if (avx512_is_desired())
		xts-aes-vaes-avx10_512(...);
	else
		xts-aes-vaes-avx10_256(...);

at runtime?

Where avx512_is_desired() can be changed willy-nilly, either with a
command-line parameter or runtime knob.  Sure, the performance might
change versus what was measured, but I don't think that's a deal breaker.

Then if folks want to do fancy benchmarks or model/family checks or
whatever, they can do it in userspace at runtime.
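To make that concrete, here's a hypothetical sketch of a single registration
with a runtime knob.  This is not code from the patch; "use_zmm" and the two
helper functions are made-up names standing in for the real code paths:

	static bool use_zmm;
	module_param(use_zmm, bool, 0644);	/* writable at runtime via sysfs */

	static bool avx512_is_desired(void)
	{
		return READ_ONCE(use_zmm);
	}

	static int xts_encrypt_vaes_avx10(struct skcipher_request *req)
	{
		/* Choose the vector length per request based on the knob. */
		if (avx512_is_desired())
			return xts_encrypt_vaes_avx10_512(req);	/* zmm path */
		return xts_encrypt_vaes_avx10_256(req);		/* ymm path */
	}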
Eric Biggers April 5, 2024, 12:11 a.m. UTC | #4
On Thu, Apr 04, 2024 at 04:53:12PM -0700, Dave Hansen wrote:
> On 4/4/24 16:36, Eric Biggers wrote:
> > 1. Never use zmm registers.
> ...
> > 4. Keep the proposed policy as the default behavior, but allow it to be
> >    overridden on the kernel command line.  This would be a bit more flexible;
> >    however, most people don't change defaults anyway.
> > 
> > When you write "Some folks will also surely disagree with the kernel policy
> > implemented here", are there any specific concerns that you anticipate?
> 
> Some people care less about the frequency throttling and only care about
> max performance _using_ AVX512.
> 
> > Note that Intel has acknowledged the zmm downclocking issues on Ice
> > Lake and suggested that using ymm registers instead would be reasonable:
> > https://lore.kernel.org/linux-crypto/e8ce1146-3952-6977-1d0e-a22758e58914@intel.com/
> > 
> > If there is really a controversy, my vote is that for now we just go with option
> > (1), i.e. drop this patch from the series.  We can reconsider the issue when a
> > CPU is released with better 512-bit support.
> 
> (1) is fine with me.
> 
> (4) would also be fine.  But I don't think it absolutely _has_ to be a
> boot-time switch.  What prevents you from registering, say,
> "xts-aes-vaes-avx10" and then doing:
> 
> 	if (avx512_is_desired())
> 		xts-aes-vaes-avx10_512(...);
> 	else
> 		xts-aes-vaes-avx10_256(...);
> 
> at runtime?
> 
> Where avx512_is_desired() can be changed willy-nilly, either with a
> command-line parameter or runtime knob.  Sure, the performance might
> change versus what was measured, but I don't think that's a deal breaker.
> 
> Then if folks want to do fancy benchmarks or model/family checks or
> whatever, they can do it in userspace at runtime.

It's certainly possible for a single crypto algorithm (using "algorithm" in the
crypto API sense of the word) to have multiple alternative code paths, and there
are examples of this in arch/x86/crypto/.  However, I think this is a poor
practice, at least as the crypto API is currently designed, because it makes it
difficult to test the different code paths.  Alternatives are best handled by
registering them as separate algorithms with different cra_priority values.
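To spell out the priority mechanism: a kernel user that allocates "xts(aes)"
gets whichever registered implementation currently has the highest
cra_priority, so demoting the zmm variant is enough to steer default users to
the ymm one.  Illustration only, not part of the patch:

	struct crypto_skcipher *tfm = crypto_alloc_skcipher("xts(aes)", 0, 0);

	if (!IS_ERR(tfm)) {
		/* Reports e.g. "xts-aes-vaes-avx10_256" on an excluded CPU model. */
		pr_info("xts(aes) resolved to %s\n",
			crypto_tfm_alg_driver_name(crypto_skcipher_tfm(tfm)));
		crypto_free_skcipher(tfm);
	}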

Also, I forgot one property of my patch: because I made the zmm_exclusion_list
just decrease the priority of xts-aes-vaes-avx10_512 rather than skip
registering it, the change can actually be undone at runtime by increasing the
priority of xts-aes-vaes-avx10_512 back to its original value.
Userspace can do it using the "crypto user configuration API"
(include/uapi/linux/cryptouser.h), specifically CRYPTO_MSG_UPDATEALG.
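For example, here is a rough and untested userspace sketch (error handling
omitted) that restores the priority of 800 used by this patch, going through
the NETLINK_CRYPTO interface declared in include/uapi/linux/cryptouser.h:

	#include <linux/cryptouser.h>
	#include <linux/netlink.h>
	#include <string.h>
	#include <sys/socket.h>
	#include <unistd.h>

	int main(void)
	{
		struct {
			struct nlmsghdr nlh;
			struct crypto_user_alg alg;	/* identifies the algorithm */
			struct nlattr attr;		/* CRYPTOCFGA_PRIORITY_VAL */
			__u32 prio;
		} req = {};
		struct sockaddr_nl sa = { .nl_family = AF_NETLINK };
		int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_CRYPTO);

		req.nlh.nlmsg_len = sizeof(req);
		req.nlh.nlmsg_type = CRYPTO_MSG_UPDATEALG;
		req.nlh.nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
		strcpy(req.alg.cru_driver_name, "xts-aes-vaes-avx10_512");
		req.attr.nla_type = CRYPTOCFGA_PRIORITY_VAL;
		req.attr.nla_len = NLA_HDRLEN + sizeof(__u32);
		req.prio = 800;		/* original priority registered by this patch */

		/* Needs CAP_NET_ADMIN; the kernel replies with an ACK or an error. */
		sendto(fd, &req, sizeof(req), 0, (struct sockaddr *)&sa, sizeof(sa));
		close(fd);
		return 0;
	}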

Maybe that is enough configurability already?

- Eric
Herbert Xu April 5, 2024, 7:20 a.m. UTC | #5
Eric Biggers <ebiggers@kernel.org> wrote:
>
> Also, I forgot one property of my patch: because I made the zmm_exclusion_list
> just decrease the priority of xts-aes-vaes-avx10_512 rather than skip
> registering it, the change can actually be undone at runtime by increasing the
> priority of xts-aes-vaes-avx10_512 back to its original value.
> Userspace can do it using the "crypto user configuration API"
> (include/uapi/linux/cryptouser.h), specifically CRYPTO_MSG_UPDATEALG.
> 
> Maybe that is enough configurability already?

Yes I think that's more than sufficient.

Thanks,

Patch

diff --git a/arch/x86/crypto/aes-xts-avx-x86_64.S b/arch/x86/crypto/aes-xts-avx-x86_64.S
index 71be474b22da..b8005d0205f8 100644
--- a/arch/x86/crypto/aes-xts-avx-x86_64.S
+++ b/arch/x86/crypto/aes-xts-avx-x86_64.S
@@ -824,6 +824,15 @@  SYM_TYPED_FUNC_START(aes_xts_encrypt_vaes_avx10_256)
 	_aes_xts_crypt	1
 SYM_FUNC_END(aes_xts_encrypt_vaes_avx10_256)
 SYM_TYPED_FUNC_START(aes_xts_decrypt_vaes_avx10_256)
 	_aes_xts_crypt	0
 SYM_FUNC_END(aes_xts_decrypt_vaes_avx10_256)
+
+.set	VL, 64
+.set	USE_AVX10, 1
+SYM_TYPED_FUNC_START(aes_xts_encrypt_vaes_avx10_512)
+	_aes_xts_crypt	1
+SYM_FUNC_END(aes_xts_encrypt_vaes_avx10_512)
+SYM_TYPED_FUNC_START(aes_xts_decrypt_vaes_avx10_512)
+	_aes_xts_crypt	0
+SYM_FUNC_END(aes_xts_decrypt_vaes_avx10_512)
 #endif /* CONFIG_AS_VAES && CONFIG_AS_VPCLMULQDQ */
diff --git a/arch/x86/crypto/aesni-intel_glue.c b/arch/x86/crypto/aesni-intel_glue.c
index 914cbf5d1f5c..0855ace8659c 100644
--- a/arch/x86/crypto/aesni-intel_glue.c
+++ b/arch/x86/crypto/aesni-intel_glue.c
@@ -1298,12 +1298,33 @@  static struct simd_skcipher_alg *aes_xts_simdalg_##suffix
 
 DEFINE_XTS_ALG(aesni_avx, "xts-aes-aesni-avx", 500);
 #if defined(CONFIG_AS_VAES) && defined(CONFIG_AS_VPCLMULQDQ)
 DEFINE_XTS_ALG(vaes_avx2, "xts-aes-vaes-avx2", 600);
 DEFINE_XTS_ALG(vaes_avx10_256, "xts-aes-vaes-avx10_256", 700);
+DEFINE_XTS_ALG(vaes_avx10_512, "xts-aes-vaes-avx10_512", 800);
 #endif
 
+/*
+ * This is a list of CPU models that are known to suffer from downclocking when
+ * zmm registers (512-bit vectors) are used.  On these CPUs, the AES-XTS
+ * implementation with zmm registers won't be used by default.  An
+ * implementation with ymm registers (256-bit vectors) will be used instead.
+ */
+static const struct x86_cpu_id zmm_exclusion_list[] = {
+	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_SKYLAKE_X },
+	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_ICELAKE_X },
+	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_ICELAKE_D },
+	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_ICELAKE },
+	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_ICELAKE_L },
+	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_ICELAKE_NNPI },
+	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_TIGERLAKE_L },
+	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_TIGERLAKE },
+	/* Allow Rocket Lake and later, and Sapphire Rapids and later. */
+	/* Also allow AMD CPUs (starting with Zen 4, the first with AVX-512). */
+	{},
+};
+
 static int __init register_xts_algs(void)
 {
 	int err;
 
 	if (!boot_cpu_has(X86_FEATURE_AVX))
@@ -1333,10 +1354,18 @@  static int __init register_xts_algs(void)
 
 	err = simd_register_skciphers_compat(&aes_xts_alg_vaes_avx10_256, 1,
 					     &aes_xts_simdalg_vaes_avx10_256);
 	if (err)
 		return err;
+
+	if (x86_match_cpu(zmm_exclusion_list))
+		aes_xts_alg_vaes_avx10_512.base.cra_priority = 1;
+
+	err = simd_register_skciphers_compat(&aes_xts_alg_vaes_avx10_512, 1,
+					     &aes_xts_simdalg_vaes_avx10_512);
+	if (err)
+		return err;
 #endif /* CONFIG_AS_VAES && CONFIG_AS_VPCLMULQDQ */
 	return 0;
 }
 
 static void unregister_xts_algs(void)
@@ -1349,10 +1378,13 @@  static void unregister_xts_algs(void)
 		simd_unregister_skciphers(&aes_xts_alg_vaes_avx2, 1,
 					  &aes_xts_simdalg_vaes_avx2);
 	if (aes_xts_simdalg_vaes_avx10_256)
 		simd_unregister_skciphers(&aes_xts_alg_vaes_avx10_256, 1,
 					  &aes_xts_simdalg_vaes_avx10_256);
+	if (aes_xts_simdalg_vaes_avx10_512)
+		simd_unregister_skciphers(&aes_xts_alg_vaes_avx10_512, 1,
+					  &aes_xts_simdalg_vaes_avx10_512);
 #endif
 }
 #else /* CONFIG_X86_64 */
 static int __init register_xts_algs(void)
 {