Message ID: 20240329080355.2871-7-ebiggers@kernel.org
State: New
Series: Faster AES-XTS on modern x86_64 CPUs
On 3/29/24 01:03, Eric Biggers wrote:
> +static const struct x86_cpu_id zmm_exclusion_list[] = {
> +	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_SKYLAKE_X },
> +	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_ICELAKE_X },
> +	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_ICELAKE_D },
> +	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_ICELAKE },
> +	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_ICELAKE_L },
> +	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_ICELAKE_NNPI },
> +	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_TIGERLAKE_L },
> +	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_TIGERLAKE },
> +	/* Allow Rocket Lake and later, and Sapphire Rapids and later. */
> +	/* Also allow AMD CPUs (starting with Zen 4, the first with AVX-512). */
> +	{},
> +};

A hard-coded model/family exclusion list is not great.

It'll break when running in guests on newer CPUs that fake any of these
models. Some folks will also surely disagree with the kernel policy
implemented here.

Is there no way to implement this other than a hard-coded kernel policy?
On Thu, Apr 04, 2024 at 01:34:04PM -0700, Dave Hansen wrote:
> On 3/29/24 01:03, Eric Biggers wrote:
> > +static const struct x86_cpu_id zmm_exclusion_list[] = {
> > +	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_SKYLAKE_X },
> > +	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_ICELAKE_X },
> > +	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_ICELAKE_D },
> > +	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_ICELAKE },
> > +	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_ICELAKE_L },
> > +	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_ICELAKE_NNPI },
> > +	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_TIGERLAKE_L },
> > +	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_TIGERLAKE },
> > +	/* Allow Rocket Lake and later, and Sapphire Rapids and later. */
> > +	/* Also allow AMD CPUs (starting with Zen 4, the first with AVX-512). */
> > +	{},
> > +};
>
> A hard-coded model/family exclusion list is not great.
>
> It'll break when running in guests on newer CPUs that fake any of these
> models. Some folks will also surely disagree with the kernel policy
> implemented here.
>
> Is there no way to implement this other than a hard-coded kernel policy?

Besides the hardcoded CPU exclusion list, the options are:

1. Never use zmm registers.

2. Ignore the issue and use zmm registers even on these CPU models.
   Systemwide performance may suffer due to downclocking.

3. Do a runtime test to detect whether using zmm registers causes
   downclocking. This seems impractical.

4. Keep the proposed policy as the default behavior, but allow it to be
   overridden on the kernel command line. This would be a bit more flexible;
   however, most people don't change defaults anyway.

When you write "Some folks will also surely disagree with the kernel policy
implemented here", are there any specific concerns that you anticipate?

Note that Intel has acknowledged the zmm downclocking issues on Ice Lake and
suggested that using ymm registers instead would be reasonable:
https://lore.kernel.org/linux-crypto/e8ce1146-3952-6977-1d0e-a22758e58914@intel.com/

If there is really a controversy, my vote is that for now we just go with
option (1), i.e. drop this patch from the series. We can reconsider the
issue when a CPU is released with better 512-bit support.

- Eric
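For concreteness, option (4) could look roughly like the sketch below in the
aesni glue code. This is only an illustration: the "prefer_zmm" parameter and
its tristate semantics are invented here and are not part of the posted
series; it assumes the zmm_exclusion_list and aes_xts_alg_vaes_avx10_512
definitions from the patch.

/*
 * Hypothetical module parameter overriding the default zmm policy:
 * -1 = auto (use the exclusion list), 0 = never use zmm, 1 = always use zmm.
 * Would be settable at boot via aesni_intel.prefer_zmm=<n>.
 */
static int prefer_zmm = -1;
module_param(prefer_zmm, int, 0444);
MODULE_PARM_DESC(prefer_zmm, "Prefer 512-bit AES-XTS code: -1=auto, 0=no, 1=yes");

static bool aes_xts_want_zmm(void)
{
	if (prefer_zmm >= 0)
		return prefer_zmm;
	/* Default policy: avoid zmm on CPUs known to downclock. */
	return !x86_match_cpu(zmm_exclusion_list);
}

register_xts_algs() would then check aes_xts_want_zmm() instead of calling
x86_match_cpu() directly when deciding whether to demote the priority of
xts-aes-vaes-avx10_512.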
On 4/4/24 16:36, Eric Biggers wrote:
> 1. Never use zmm registers.
...
> 4. Keep the proposed policy as the default behavior, but allow it to be
>    overridden on the kernel command line. This would be a bit more
>    flexible; however, most people don't change defaults anyway.
>
> When you write "Some folks will also surely disagree with the kernel policy
> implemented here", are there any specific concerns that you anticipate?

Some people care less about the frequency throttling and only care about max
performance _using_ AVX512.

> Note that Intel has acknowledged the zmm downclocking issues on Ice
> Lake and suggested that using ymm registers instead would be
> reasonable:
> https://lore.kernel.org/linux-crypto/e8ce1146-3952-6977-1d0e-a22758e58914@intel.com/
>
> If there is really a controversy, my vote is that for now we just go with
> option (1), i.e. drop this patch from the series. We can reconsider the
> issue when a CPU is released with better 512-bit support.

(1) is fine with me.

(4) would also be fine. But I don't think it absolutely _has_ to be a
boot-time switch. What prevents you from registering, say,
"xts-aes-vaes-avx10" and then doing:

	if (avx512_is_desired())
		xts-aes-vaes-avx10_512(...);
	else
		xts-aes-vaes-avx10_256(...);

at runtime?

Where avx512_is_desired() can be changed willy-nilly, either with a
command-line parameter or runtime knob. Sure, the performance might change
versus what was measured, but I don't think that's a deal breaker.

Then if folks want to do fancy benchmarks or model/family checks or whatever,
they can do it in userspace at runtime.
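Dave's single-algorithm idea, sketched a bit more concretely below. The
helper names are placeholders (the real glue code wraps the asm entry points
differently), and the knob could just as well be a static key or a sysfs
attribute; this is not code from the series.

/* Runtime-changeable knob; true means use the 512-bit (zmm) code. */
static bool aes_xts_use_zmm __read_mostly = true;

static bool avx512_is_desired(void)
{
	return READ_ONCE(aes_xts_use_zmm);
}

/* One registered algorithm, "xts-aes-vaes-avx10", dispatching per request. */
static int xts_encrypt_vaes_avx10(struct skcipher_request *req)
{
	if (avx512_is_desired())
		return xts_crypt_avx10_512(req, true);	/* zmm path (placeholder name) */
	return xts_crypt_avx10_256(req, true);		/* ymm path (placeholder name) */
}

As Eric notes in the reply below, the drawback of this approach is that only
one of the two code paths gets exercised by the crypto self-tests at a time.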
On Thu, Apr 04, 2024 at 04:53:12PM -0700, Dave Hansen wrote:
> On 4/4/24 16:36, Eric Biggers wrote:
> > 1. Never use zmm registers.
> ...
> > 4. Keep the proposed policy as the default behavior, but allow it to be
> >    overridden on the kernel command line. This would be a bit more
> >    flexible; however, most people don't change defaults anyway.
> >
> > When you write "Some folks will also surely disagree with the kernel policy
> > implemented here", are there any specific concerns that you anticipate?
>
> Some people care less about the frequency throttling and only care about
> max performance _using_ AVX512.
>
> > Note that Intel has acknowledged the zmm downclocking issues on Ice
> > Lake and suggested that using ymm registers instead would be
> > reasonable:
> > https://lore.kernel.org/linux-crypto/e8ce1146-3952-6977-1d0e-a22758e58914@intel.com/
> >
> > If there is really a controversy, my vote is that for now we just go with
> > option (1), i.e. drop this patch from the series. We can reconsider the
> > issue when a CPU is released with better 512-bit support.
>
> (1) is fine with me.
>
> (4) would also be fine. But I don't think it absolutely _has_ to be a
> boot-time switch. What prevents you from registering, say,
> "xts-aes-vaes-avx10" and then doing:
>
> 	if (avx512_is_desired())
> 		xts-aes-vaes-avx10_512(...);
> 	else
> 		xts-aes-vaes-avx10_256(...);
>
> at runtime?
>
> Where avx512_is_desired() can be changed willy-nilly, either with a
> command-line parameter or runtime knob. Sure, the performance might
> change versus what was measured, but I don't think that's a deal breaker.
>
> Then if folks want to do fancy benchmarks or model/family checks or
> whatever, they can do it in userspace at runtime.

It's certainly possible for a single crypto algorithm (using "algorithm" in
the crypto API sense of the word) to have multiple alternative code paths,
and there are examples of this in arch/x86/crypto/. However, I think this is
a poor practice, at least as the crypto API is currently designed, because it
makes it difficult to test the different code paths. Alternatives are best
handled by registering them as separate algorithms with different
cra_priority values.

Also, I forgot one property of my patch, which is that because I made the
zmm_exclusion_list just decrease the priority of xts-aes-vaes-avx10_512
rather than skipping registering it, the change actually can be undone at
runtime by increasing the priority of xts-aes-vaes-avx10_512 back to its
original value. Userspace can do it using the "crypto user configuration
API" (include/uapi/linux/cryptouser.h), specifically CRYPTO_MSG_UPDATEALG.

Maybe that is enough configurability already?

- Eric
Eric Biggers <ebiggers@kernel.org> wrote:
>
> Also, I forgot one property of my patch, which is that because I made the
> zmm_exclusion_list just decrease the priority of xts-aes-vaes-avx10_512
> rather than skipping registering it, the change actually can be undone at
> runtime by increasing the priority of xts-aes-vaes-avx10_512 back to its
> original value. Userspace can do it using the "crypto user configuration
> API" (include/uapi/linux/cryptouser.h), specifically CRYPTO_MSG_UPDATEALG.
>
> Maybe that is enough configurability already?

Yes I think that's more than sufficient.

Thanks,
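For reference, an untested userspace sketch of the CRYPTO_MSG_UPDATEALG
approach described above: it sends one netlink message on a NETLINK_CRYPTO
socket to raise xts-aes-vaes-avx10_512 back to priority 800. It needs
CAP_NET_ADMIN, error handling is omitted, and the exact message layout
should be checked against include/uapi/linux/cryptouser.h.

#include <linux/cryptouser.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
	struct {
		struct nlmsghdr nlh;
		struct crypto_user_alg alg;	/* fixed payload: selects the algorithm */
		struct rtattr attr;		/* attribute header for the new priority */
		__u32 prio;
	} req = {};
	int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_CRYPTO);

	req.nlh.nlmsg_len = sizeof(req);
	req.nlh.nlmsg_type = CRYPTO_MSG_UPDATEALG;
	req.nlh.nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
	strcpy(req.alg.cru_driver_name, "xts-aes-vaes-avx10_512");
	req.attr.rta_type = CRYPTOCFGA_PRIORITY_VAL;
	req.attr.rta_len = RTA_LENGTH(sizeof(__u32));
	req.prio = 800;		/* original priority from the patch */

	send(fd, &req, sizeof(req), 0);
	close(fd);
	return 0;
}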
diff --git a/arch/x86/crypto/aes-xts-avx-x86_64.S b/arch/x86/crypto/aes-xts-avx-x86_64.S
index 71be474b22da..b8005d0205f8 100644
--- a/arch/x86/crypto/aes-xts-avx-x86_64.S
+++ b/arch/x86/crypto/aes-xts-avx-x86_64.S
@@ -824,6 +824,15 @@ SYM_TYPED_FUNC_START(aes_xts_encrypt_vaes_avx10_256)
 	_aes_xts_crypt	1
 SYM_FUNC_END(aes_xts_encrypt_vaes_avx10_256)
 SYM_TYPED_FUNC_START(aes_xts_decrypt_vaes_avx10_256)
 	_aes_xts_crypt	0
 SYM_FUNC_END(aes_xts_decrypt_vaes_avx10_256)
+
+.set	VL, 64
+.set	USE_AVX10, 1
+SYM_TYPED_FUNC_START(aes_xts_encrypt_vaes_avx10_512)
+	_aes_xts_crypt	1
+SYM_FUNC_END(aes_xts_encrypt_vaes_avx10_512)
+SYM_TYPED_FUNC_START(aes_xts_decrypt_vaes_avx10_512)
+	_aes_xts_crypt	0
+SYM_FUNC_END(aes_xts_decrypt_vaes_avx10_512)
 #endif /* CONFIG_AS_VAES && CONFIG_AS_VPCLMULQDQ */
diff --git a/arch/x86/crypto/aesni-intel_glue.c b/arch/x86/crypto/aesni-intel_glue.c
index 914cbf5d1f5c..0855ace8659c 100644
--- a/arch/x86/crypto/aesni-intel_glue.c
+++ b/arch/x86/crypto/aesni-intel_glue.c
@@ -1298,12 +1298,33 @@ static struct simd_skcipher_alg *aes_xts_simdalg_##suffix
 DEFINE_XTS_ALG(aesni_avx, "xts-aes-aesni-avx", 500);
 #if defined(CONFIG_AS_VAES) && defined(CONFIG_AS_VPCLMULQDQ)
 DEFINE_XTS_ALG(vaes_avx2, "xts-aes-vaes-avx2", 600);
 DEFINE_XTS_ALG(vaes_avx10_256, "xts-aes-vaes-avx10_256", 700);
+DEFINE_XTS_ALG(vaes_avx10_512, "xts-aes-vaes-avx10_512", 800);
 #endif
 
+/*
+ * This is a list of CPU models that are known to suffer from downclocking when
+ * zmm registers (512-bit vectors) are used.  On these CPUs, the AES-XTS
+ * implementation with zmm registers won't be used by default.  An
+ * implementation with ymm registers (256-bit vectors) will be used instead.
+ */
+static const struct x86_cpu_id zmm_exclusion_list[] = {
+	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_SKYLAKE_X },
+	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_ICELAKE_X },
+	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_ICELAKE_D },
+	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_ICELAKE },
+	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_ICELAKE_L },
+	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_ICELAKE_NNPI },
+	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_TIGERLAKE_L },
+	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_TIGERLAKE },
+	/* Allow Rocket Lake and later, and Sapphire Rapids and later. */
+	/* Also allow AMD CPUs (starting with Zen 4, the first with AVX-512). */
+	{},
+};
+
 static int __init register_xts_algs(void)
 {
 	int err;
 
 	if (!boot_cpu_has(X86_FEATURE_AVX))
@@ -1333,10 +1354,18 @@ static int __init register_xts_algs(void)
 	err = simd_register_skciphers_compat(&aes_xts_alg_vaes_avx10_256, 1,
 					     &aes_xts_simdalg_vaes_avx10_256);
 	if (err)
 		return err;
+
+	if (x86_match_cpu(zmm_exclusion_list))
+		aes_xts_alg_vaes_avx10_512.base.cra_priority = 1;
+
+	err = simd_register_skciphers_compat(&aes_xts_alg_vaes_avx10_512, 1,
+					     &aes_xts_simdalg_vaes_avx10_512);
+	if (err)
+		return err;
 #endif /* CONFIG_AS_VAES && CONFIG_AS_VPCLMULQDQ */
 	return 0;
 }
 
 static void unregister_xts_algs(void)
@@ -1349,10 +1378,13 @@ static void unregister_xts_algs(void)
 		simd_unregister_skciphers(&aes_xts_alg_vaes_avx2, 1,
 					  &aes_xts_simdalg_vaes_avx2);
 	if (aes_xts_simdalg_vaes_avx10_256)
 		simd_unregister_skciphers(&aes_xts_alg_vaes_avx10_256, 1,
 					  &aes_xts_simdalg_vaes_avx10_256);
+	if (aes_xts_simdalg_vaes_avx10_512)
+		simd_unregister_skciphers(&aes_xts_alg_vaes_avx10_512, 1,
+					  &aes_xts_simdalg_vaes_avx10_512);
 #endif
 }
 #else /* CONFIG_X86_64 */
 static int __init register_xts_algs(void)
 {
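The lowered cra_priority only changes which implementation backs a generic
"xts(aes)" request; the 512-bit variant stays registered and remains
selectable by driver name. A hypothetical kernel-side snippet (not part of
the patch) showing that effect, equivalent to reading /proc/crypto:

/*
 * Hypothetical snippet: the highest-priority registered implementation backs
 * a generic "xts(aes)" allocation.  On a CPU in zmm_exclusion_list this
 * should report xts-aes-vaes-avx10_256 (priority 700 beats the demoted
 * 512-bit variant); on other CPUs where the 512-bit code is usable it should
 * report xts-aes-vaes-avx10_512 (priority 800).
 */
#include <crypto/skcipher.h>
#include <linux/err.h>
#include <linux/printk.h>

static void report_xts_driver(void)
{
	struct crypto_skcipher *tfm = crypto_alloc_skcipher("xts(aes)", 0, 0);

	if (IS_ERR(tfm))
		return;
	pr_info("xts(aes) is backed by %s\n",
		crypto_tfm_alg_driver_name(crypto_skcipher_tfm(tfm)));
	crypto_free_skcipher(tfm);
}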