Message ID: 20221012215931.3896-6-elliott@hpe.com
State: New
Series: crypto: x86 - fix RCU stalls
> -----Original Message-----
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Sent: Wednesday, October 12, 2022 9:00 PM
> To: Elliott, Robert (Servers) <elliott@hpe.com>
> Cc: davem@davemloft.net; tim.c.chen@linux.intel.com; ap420073@gmail.com;
> ardb@kernel.org; linux-crypto@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH v2 05/19] crypto: x86/crc - limit FPU preemption
>
> On Wed, Oct 12, 2022 at 04:59:17PM -0500, Robert Elliott wrote:
> >
> > @@ -170,9 +179,17 @@ static int __crc32c_pcl_intel_finup(u32 *crcp, const u8 *data, unsigned int len,
> >                                      u8 *out)
> >  {
> >      if (len >= CRC32C_PCL_BREAKEVEN && crypto_simd_usable()) {
> > -            kernel_fpu_begin();
> > -            *(__le32 *)out = ~cpu_to_le32(crc_pcl(data, len, *crcp));
> > -            kernel_fpu_end();
> > +            do {
> > +                    unsigned int chunk = min(len, FPU_BYTES);
> > +
> > +                    kernel_fpu_begin();
> > +                    *crcp = crc_pcl(data, chunk, *crcp);
>
> How about storing the intermediate result in a local variable
> instead of overwriting *crcp?
>
> Thanks,

The _update function does so, and it's not marked const here, so it seemed
prudent to keep it up to date. Do the callers understand it's no longer valid
after finup, or is there any case in which they might treat finup like an
update and try again?
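Herbert's suggestion can be sketched in user-space C. This is an illustrative sketch, not kernel code: `crc_pcl_stub()` is a plain bitwise CRC32C standing in for the SIMD `crc_pcl()` routine, the `kernel_fpu_begin()/end()` calls are shown as comments, and `finup_chunked()` is a hypothetical name. Accumulating into a local lets the state pointer become `const` and leaves the caller's state untouched after finup.

```c
#include <assert.h>
#include <stdint.h>

#define FPU_BYTES 4096u /* max bytes per kernel_fpu_begin/end section */

/* Stand-in for the SIMD crc_pcl() routine: plain bitwise CRC32C. */
static uint32_t crc_pcl_stub(const uint8_t *data, unsigned int len,
			     uint32_t crc)
{
	while (len--) {
		crc ^= *data++;
		for (int i = 0; i < 8; i++)
			crc = (crc >> 1) ^ (0x82f63b78u & -(crc & 1));
	}
	return crc;
}

/* finup-style loop that accumulates into a local variable, so *crcp
 * can be const and is never observed in a half-updated state. */
static uint32_t finup_chunked(const uint32_t *crcp, const uint8_t *data,
			      unsigned int len)
{
	uint32_t crc = *crcp; /* local copy, per the suggestion */

	do {
		unsigned int chunk = len < FPU_BYTES ? len : FPU_BYTES;

		/* kernel_fpu_begin() would go here */
		crc = crc_pcl_stub(data, chunk, crc);
		/* kernel_fpu_end() would go here */

		len -= chunk;
		data += chunk;
	} while (len);

	return ~crc; /* caller stores the final inverted value into out */
}
```

Because CRC computation is byte-serial state carried forward, the chunked loop produces the same value as a single pass, while the caller's `*crcp` is left exactly as it was.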
From: Robert Elliott
> Sent: 12 October 2022 22:59
>
> As done by the ECB and CBC helpers in arch/x86/crypt/ecb_cbc_helpers.h,
> limit the number of bytes processed between kernel_fpu_begin() and
> kernel_fpu_end() calls.
>
> Those functions call preempt_disable() and preempt_enable(), so
> the CPU core is unavailable for scheduling while running, leading to:
>     rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: ...

How long were the buffers being processed when the rcu stall was reported?

It looks like you are adding kernel_fpu_end(); kernel_fpu_begin() pairs
every 4096 bytes. I'd guess the crc instruction runs at 4 bytes/clock
(or at least gets somewhere near that), so you are talking of a few
thousand clocks at most. A PCI read from a device can easily take much
longer than that, so I'm surprised you need such small buffers to avoid
rcu stalls.

The kernel_fpu_end(); kernel_fpu_begin() pair will also cost something
(maybe not as much as the first kernel_fpu_begin()?).

Some performance figures might be enlightening.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Greeting,

FYI, we noticed ltp.fsopen01.fail due to commit (built with gcc-11):

commit: 0c664cbc906012f02c5bf128cf2dff854cca65c7 ("[PATCH v2 05/19] crypto: x86/crc - limit FPU preemption")
url: https://github.com/intel-lab-lkp/linux/commits/Robert-Elliott/crypto-tcrypt-test-crc32/20221013-065919
base: https://git.kernel.org/cgit/linux/kernel/git/herbert/cryptodev-2.6.git master
patch link: https://lore.kernel.org/linux-crypto/20221012215931.3896-6-elliott@hpe.com
patch subject: [PATCH v2 05/19] crypto: x86/crc - limit FPU preemption

in testcase: ltp
version: ltp-x86_64-14c1f76-1_20221009
with following parameters:

	disk: 1HDD
	fs: ext4
	test: syscalls-07

test-description: The LTP testsuite contains a collection of tools for testing the Linux kernel and related features.
test-url: http://linux-test-project.github.io/

on test machine: 4 threads Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz (Skylake) with 32G memory

caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):

<<<test_start>>>
tag=fsopen01 stime=1666383665
cmdline="fsopen01"
contacts=""
analysis=exit
<<<test_output>>>
...
tst_test.c:1599: TINFO: === Testing on btrfs ===
tst_test.c:1064: TINFO: Formatting /dev/loop0 with btrfs opts='' extra opts=''
fsopen01.c:42: TFAIL: fsconfig(FSCONFIG_CMD_CREATE) failed: EINVAL (22)
fsopen01.c:42: TFAIL: fsconfig(FSCONFIG_CMD_CREATE) failed: EINVAL (22)
...

Summary:
passed    12
failed    2
broken    0
skipped   0
warnings  0
<<<execution_status>>>
initiation_status="ok"
duration=3 termination_type=exited termination_id=1 corefile=no
cutime=3 cstime=47
<<<test_end>>>

[  152.413919][ T4912] BTRFS: device fsid 05e51863-81c3-4c32-9e24-3d49d849f724 devid 1 transid 6 /dev/loop0 scanned by mkfs.btrfs (4912)
[  152.429076][ T4851] BTRFS info (device loop0): using crc32c (crc32c-intel) checksum algorithm
[  152.438743][ T4851] BTRFS info (device loop0): using free space tree
[  152.449103][    T8] BTRFS warning (device loop0): checksum verify failed on logical 22036480 mirror 1 wanted 0xc4a1f4f3 found 0x76f09a51 level 0
[  152.463363][   T35] BTRFS warning (device loop0): checksum verify failed on logical 22036480 mirror 2 wanted 0xc4a1f4f3 found 0x76f09a51 level 0
[  152.477446][ T4851] BTRFS error (device loop0): failed to read chunk root
[  152.486164][ T4851] BTRFS error (device loop0): open_ctree failed

If you fix the issue, kindly add following tag
| Reported-by: kernel test robot <yujie.liu@intel.com>
| Link: https://lore.kernel.org/r/202210240920.a0dfb6a3-yujie.liu@intel.com

To reproduce:

	git clone https://github.com/intel/lkp-tests.git
	cd lkp-tests
	sudo bin/lkp install job.yaml           # job file is attached in this email
	bin/lkp split-job --compatible job.yaml # generate the yaml file for lkp run
	sudo bin/lkp run generated-yaml-file

	# if come across any failure that blocks the test,
	# please remove ~/.lkp and /lkp dir to run from a clean state.
diff --git a/arch/x86/crypto/crc32-pclmul_asm.S b/arch/x86/crypto/crc32-pclmul_asm.S
index ca53e96996ac..9abd861636c3 100644
--- a/arch/x86/crypto/crc32-pclmul_asm.S
+++ b/arch/x86/crypto/crc32-pclmul_asm.S
@@ -72,15 +72,15 @@
 .text
 /**
  *      Calculate crc32
- *      BUF - buffer (16 bytes aligned)
- *      LEN - sizeof buffer (16 bytes aligned), LEN should be grater than 63
+ *      BUF - buffer - must be 16 bytes aligned
+ *      LEN - sizeof buffer - must be multiple of 16 bytes and greater than 63
  *      CRC - initial crc32
  *      return %eax crc32
  *      uint crc32_pclmul_le_16(unsigned char const *buffer,
  *                              size_t len, uint crc32)
  */
-SYM_FUNC_START(crc32_pclmul_le_16) /* buffer and buffer size are 16 bytes aligned */
+SYM_FUNC_START(crc32_pclmul_le_16)
 	movdqa  (BUF), %xmm1
 	movdqa  0x10(BUF), %xmm2
 	movdqa  0x20(BUF), %xmm3
diff --git a/arch/x86/crypto/crc32-pclmul_glue.c b/arch/x86/crypto/crc32-pclmul_glue.c
index 98cf3b4e4c9f..38539c6edfe5 100644
--- a/arch/x86/crypto/crc32-pclmul_glue.c
+++ b/arch/x86/crypto/crc32-pclmul_glue.c
@@ -46,6 +46,8 @@
 #define SCALE_F			16L	/* size of xmm register */
 #define SCALE_F_MASK		(SCALE_F - 1)
 
+#define FPU_BYTES		4096U	/* avoid kernel_fpu_begin/end scheduler/rcu stalls */
+
 u32 crc32_pclmul_le_16(unsigned char const *buffer, size_t len, u32 crc32);
 
 static u32 __attribute__((pure))
@@ -70,12 +72,19 @@ static u32 __attribute__((pure))
 	iquotient = len & (~SCALE_F_MASK);
 	iremainder = len & SCALE_F_MASK;
 
-	kernel_fpu_begin();
-	crc = crc32_pclmul_le_16(p, iquotient, crc);
-	kernel_fpu_end();
+	do {
+		unsigned int chunk = min(iquotient, FPU_BYTES);
+
+		kernel_fpu_begin();
+		crc = crc32_pclmul_le_16(p, chunk, crc);
+		kernel_fpu_end();
+
+		iquotient -= chunk;
+		p += chunk;
+	} while (iquotient >= PCLMUL_MIN_LEN);
 
-	if (iremainder)
-		crc = crc32_le(crc, p + iquotient, iremainder);
+	if (iquotient || iremainder)
+		crc = crc32_le(crc, p, iquotient + iremainder);
 
 	return crc;
 }
diff --git a/arch/x86/crypto/crc32c-intel_glue.c b/arch/x86/crypto/crc32c-intel_glue.c
index feccb5254c7e..ece620227057 100644
--- a/arch/x86/crypto/crc32c-intel_glue.c
+++ b/arch/x86/crypto/crc32c-intel_glue.c
@@ -41,6 +41,8 @@
  */
 #define CRC32C_PCL_BREAKEVEN	512
 
+#define FPU_BYTES		4096U	/* avoid kernel_fpu_begin/end scheduler/rcu stalls */
+
 asmlinkage unsigned int crc_pcl(const u8 *buffer, int len,
 				unsigned int crc_init);
 #endif /* CONFIG_X86_64 */
@@ -158,9 +160,16 @@ static int crc32c_pcl_intel_update(struct shash_desc *desc, const u8 *data,
 	 * overcome kernel fpu state save/restore overhead
 	 */
 	if (len >= CRC32C_PCL_BREAKEVEN && crypto_simd_usable()) {
-		kernel_fpu_begin();
-		*crcp = crc_pcl(data, len, *crcp);
-		kernel_fpu_end();
+		do {
+			unsigned int chunk = min(len, FPU_BYTES);
+
+			kernel_fpu_begin();
+			*crcp = crc_pcl(data, chunk, *crcp);
+			kernel_fpu_end();
+
+			len -= chunk;
+			data += chunk;
+		} while (len);
 	} else
 		*crcp = crc32c_intel_le_hw(*crcp, data, len);
 	return 0;
@@ -170,9 +179,17 @@ static int __crc32c_pcl_intel_finup(u32 *crcp, const u8 *data, unsigned int len,
 				u8 *out)
 {
 	if (len >= CRC32C_PCL_BREAKEVEN && crypto_simd_usable()) {
-		kernel_fpu_begin();
-		*(__le32 *)out = ~cpu_to_le32(crc_pcl(data, len, *crcp));
-		kernel_fpu_end();
+		do {
+			unsigned int chunk = min(len, FPU_BYTES);
+
+			kernel_fpu_begin();
+			*crcp = crc_pcl(data, chunk, *crcp);
+			kernel_fpu_end();
+
+			len -= chunk;
+			data += chunk;
+		} while (len);
+		*(__le32 *)out = ~cpu_to_le32(*crcp);
 	} else
 		*(__le32 *)out = ~cpu_to_le32(crc32c_intel_le_hw(*crcp, data,
 								 len));
diff --git a/arch/x86/crypto/crct10dif-pclmul_glue.c b/arch/x86/crypto/crct10dif-pclmul_glue.c
index 71291d5af9f4..54a537fc88ee 100644
--- a/arch/x86/crypto/crct10dif-pclmul_glue.c
+++ b/arch/x86/crypto/crct10dif-pclmul_glue.c
@@ -34,6 +34,10 @@
 #include <asm/cpu_device_id.h>
 #include <asm/simd.h>
 
+#define PCLMUL_MIN_LEN		16U	/* minimum size of buffer for crc_t10dif_pcl */
+
+#define FPU_BYTES		4096U	/* avoid kernel_fpu_begin/end scheduler/rcu stalls */
+
 asmlinkage u16 crc_t10dif_pcl(u16 init_crc, const u8 *buf, size_t len);
 
 struct chksum_desc_ctx {
@@ -54,10 +58,19 @@ static int chksum_update(struct shash_desc *desc, const u8 *data,
 {
 	struct chksum_desc_ctx *ctx = shash_desc_ctx(desc);
 
-	if (length >= 16 && crypto_simd_usable()) {
-		kernel_fpu_begin();
-		ctx->crc = crc_t10dif_pcl(ctx->crc, data, length);
-		kernel_fpu_end();
+	if (length >= PCLMUL_MIN_LEN && crypto_simd_usable()) {
+		do {
+			unsigned int chunk = min(length, FPU_BYTES);
+
+			kernel_fpu_begin();
+			ctx->crc = crc_t10dif_pcl(ctx->crc, data, chunk);
+			kernel_fpu_end();
+
+			length -= chunk;
+			data += chunk;
+		} while (length >= PCLMUL_MIN_LEN);
+		if (length)
+			ctx->crc = crc_t10dif_generic(ctx->crc, data, length);
 	} else
 		ctx->crc = crc_t10dif_generic(ctx->crc, data, length);
 	return 0;
@@ -73,10 +86,20 @@ static int chksum_final(struct shash_desc *desc, u8 *out)
 
 static int __chksum_finup(__u16 crc, const u8 *data, unsigned int len, u8 *out)
 {
-	if (len >= 16 && crypto_simd_usable()) {
-		kernel_fpu_begin();
-		*(__u16 *)out = crc_t10dif_pcl(crc, data, len);
-		kernel_fpu_end();
+	if (len >= PCLMUL_MIN_LEN && crypto_simd_usable()) {
+		do {
+			unsigned int chunk = min(len, FPU_BYTES);
+
+			kernel_fpu_begin();
+			crc = crc_t10dif_pcl(crc, data, chunk);
+			kernel_fpu_end();
+
+			len -= chunk;
+			data += chunk;
+		} while (len >= PCLMUL_MIN_LEN);
+		if (len)
+			crc = crc_t10dif_generic(crc, data, len);
+		*(__u16 *)out = crc;
 	} else
 		*(__u16 *)out = crc_t10dif_generic(crc, data, len);
 	return 0;
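The crct10dif hunks above use a slightly different pattern from the crc32c ones: the SIMD routine is called in FPU_BYTES chunks only while at least PCLMUL_MIN_LEN bytes remain, then the generic routine handles the tail. That structure can be sketched in user-space C. This is illustrative only: `crc_t10dif_bitwise()` is a plain bitwise CRC16/T10-DIF standing in for both `crc_t10dif_pcl()` and `crc_t10dif_generic()`, and `chksum_finup_chunked()` is a hypothetical name mirroring `__chksum_finup()`.

```c
#include <assert.h>
#include <stdint.h>

#define PCLMUL_MIN_LEN 16u  /* minimum buffer size for the SIMD routine */
#define FPU_BYTES 4096u     /* max bytes per kernel_fpu_begin/end section */

/* Bitwise CRC16/T10-DIF (poly 0x8bb7), standing in for both the
 * PCLMULQDQ and generic routines in the patch. */
static uint16_t crc_t10dif_bitwise(uint16_t crc, const uint8_t *data,
				   unsigned int len)
{
	while (len--) {
		crc ^= (uint16_t)(*data++ << 8);
		for (int i = 0; i < 8; i++)
			crc = (uint16_t)((crc & 0x8000) ? (crc << 1) ^ 0x8bb7
							: crc << 1);
	}
	return crc;
}

/* Mirrors the __chksum_finup() control flow from the diff: SIMD-sized
 * chunks while at least PCLMUL_MIN_LEN bytes remain, generic fallback
 * for any sub-minimum tail. */
static uint16_t chksum_finup_chunked(uint16_t crc, const uint8_t *data,
				     unsigned int len)
{
	if (len >= PCLMUL_MIN_LEN) {
		do {
			unsigned int chunk = len < FPU_BYTES ? len : FPU_BYTES;

			/* kernel_fpu_begin() would go here */
			crc = crc_t10dif_bitwise(crc, data, chunk);
			/* kernel_fpu_end() would go here */

			len -= chunk;
			data += chunk;
		} while (len >= PCLMUL_MIN_LEN);
	}
	if (len) /* crc_t10dif_generic() in the patch */
		crc = crc_t10dif_bitwise(crc, data, len);
	return crc;
}
```

Since CRC state is carried forward byte by byte, the chunked path must produce the same value as a single pass regardless of how the buffer splits across the chunk and tail boundaries.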
As done by the ECB and CBC helpers in arch/x86/crypt/ecb_cbc_helpers.h,
limit the number of bytes processed between kernel_fpu_begin() and
kernel_fpu_end() calls.

Those functions call preempt_disable() and preempt_enable(), so
the CPU core is unavailable for scheduling while running, leading to:
    rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: ...

Fixes: 78c37d191dd6 ("crypto: crc32 - add crc32 pclmulqdq implementation and wrappers for table implementation")
Fixes: 6a8ce1ef3940 ("crypto: crc32c - Optimize CRC32C calculation with PCLMULQDQ instruction")
Fixes: 0b95a7f85718 ("crypto: crct10dif - Glue code to cast accelerated CRCT10DIF assembly as a crypto transform")
Suggested-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Robert Elliott <elliott@hpe.com>
---
 arch/x86/crypto/crc32-pclmul_asm.S      |  6 ++--
 arch/x86/crypto/crc32-pclmul_glue.c     | 19 ++++++++----
 arch/x86/crypto/crc32c-intel_glue.c     | 29 ++++++++++++++----
 arch/x86/crypto/crct10dif-pclmul_glue.c | 39 ++++++++++++++++++++-----
 4 files changed, 71 insertions(+), 22 deletions(-)