
[RFC,1/2] x86/fpu: make kernel-mode FPU reliably usable in softirqs

Message ID 20250220051325.340691-2-ebiggers@kernel.org
State New
Series [RFC,1/2] x86/fpu: make kernel-mode FPU reliably usable in softirqs

Commit Message

Eric Biggers Feb. 20, 2025, 5:13 a.m. UTC
From: Eric Biggers <ebiggers@google.com>

Currently kernel-mode FPU is not always usable in softirq context on
x86, since softirqs can nest inside a kernel-mode FPU section in task
context, and nested use of kernel-mode FPU is not supported.
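
Concretely, the problematic nesting looks roughly like this (an
illustrative timeline, not code from this patch):

	task context:     kernel_fpu_begin();
	                      /* SIMD code running */
	                  <hardirq arrives; pending softirqs run on irq exit>
	softirq context:      irq_fpu_usable() returns false here, because
	                      the task's kernel-mode FPU section is still
	                      active, so the softirq must use a fallback
	task context:     kernel_fpu_end();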

Therefore, x86 SIMD-optimized code that can be called in softirq context
has to sometimes fall back to non-SIMD code.  There are two options for
the fallback, both of which are pretty terrible:

  (a) Use a scalar fallback.  This can be 10-100x slower than vectorized
      code because it cannot use specialized instructions like AES, SHA,
      or carryless multiplication.  (A sketch of this pattern follows the
      list.)

  (b) Execute the request asynchronously using a kworker.  In other
      words, use the "crypto SIMD helper" in crypto/simd.c.
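
The scalar fallback in (a) typically takes the following shape in the
x86 glue code (a minimal sketch; only irq_fpu_usable(),
kernel_fpu_begin(), and kernel_fpu_end() are real, the other names are
made up for illustration):

	#include <asm/fpu/api.h>	/* irq_fpu_usable(), kernel_fpu_begin/end() */

	static void blk_hash_update(struct blk_hash_ctx *ctx, const u8 *data,
				    size_t len)
	{
		if (irq_fpu_usable()) {
			/* SIMD is usable: take the fast vectorized path. */
			kernel_fpu_begin();
			blk_hash_update_simd(ctx, data, len);
			kernel_fpu_end();
		} else {
			/*
			 * Not usable, e.g. this softirq interrupted a
			 * task-context kernel-mode FPU section: fall back
			 * to the much slower scalar implementation.
			 */
			blk_hash_update_generic(ctx, data, len);
		}
	}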

Currently most of the x86 en/decryption code (skcipher and aead
algorithms) uses option (b), since this avoids the slow scalar fallback
and it is easier to wire up.  But option (b) is still really bad for its
own reasons:

  - Punting the request to a kworker is bad for performance too.

  - It forces the algorithm to be marked as asynchronous
    (CRYPTO_ALG_ASYNC), preventing it from being used by crypto API
    users who request a synchronous algorithm (see the example after
    this list).  That's another huge performance problem, which is
    especially unfortunate for users who don't even do en/decryption in
    softirq context.

  - It makes all en/decryption operations take a detour through
    crypto/simd.c.  That involves additional checks and an additional
    indirect call, which slow down en/decryption for *everyone*.
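
For example (an illustrative sketch; "xts(aes)" is just an arbitrary
algorithm name), a crypto API user that requires a synchronous skcipher
passes CRYPTO_ALG_ASYNC in the mask, which excludes every
crypto/simd.c-wrapped implementation:

	#include <linux/crypto.h>	/* CRYPTO_ALG_ASYNC */
	#include <crypto/skcipher.h>

	static struct crypto_skcipher *alloc_sync_xts(void)
	{
		/*
		 * type = 0 with mask = CRYPTO_ALG_ASYNC means "the ASYNC
		 * flag must be clear", i.e. only synchronous
		 * implementations are eligible, so a simd.c-wrapped
		 * algorithm is never selected for this user.
		 */
		return crypto_alloc_skcipher("xts(aes)", 0, CRYPTO_ALG_ASYNC);
	}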

Fortunately, the skcipher and aead APIs are only usable in task and
softirq context in the first place, and calling them with hardirqs
disabled is not supported.  Thus, if kernel-mode FPU were to be reliably
usable in softirq context, no fallback would be needed.  Indeed, other
architectures such as arm, arm64, and riscv have already done this.

Therefore, this patch updates x86 accordingly to reliably support
kernel-mode FPU in softirqs (except when hardirqs are disabled).

This is done by just disabling softirq processing in kernel-mode FPU
sections, as that prevents the nesting that was problematic.

This will delay some softirqs slightly, but only ones that would have
otherwise been nested inside a task context kernel-mode FPU section.
Any such softirqs would have taken the slow fallback path before if they
tried to do any en/decryption.  Now these softirqs will just run at the
end of the task context kernel-mode FPU section (since local_bh_enable()
runs pending softirqs) and will no longer take the slow fallback path.
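
On !PREEMPT_RT kernels, the net effect on the begin/end pair is roughly
the following (a simplified sketch; the actual patch below reuses
fpregs_lock()/fpregs_unlock() instead of open-coding this):

	void kernel_fpu_begin(void)
	{
		local_bh_disable();	/* disables preemption and softirq processing */
		/* ... save the task's FPU register state if it was live ... */
	}

	void kernel_fpu_end(void)
	{
		/* ... bookkeeping as before ... */
		local_bh_enable();	/* immediately runs any pending softirqs */
	}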

To comply with the requirements of local_bh_disable() and local_bh_enable(),
this change also removes support for kernel-mode FPU in hardirq context
or with hardirqs disabled.  This should not be a problem, though.  There
does not appear to be any use case for kernel-mode FPU in such contexts,
and notably arm64 and riscv already have these same conditions.

Alternatives considered:

- Make kernel-mode FPU sections fully preemptible.  This would require
  growing task_struct by another struct fpstate which is more than 2K.

- Make softirqs save/restore the kernel-mode FPU state to a per-CPU
  struct fpstate when nested use is detected.  Somewhat interesting, but
  seems unnecessary when a simpler solution exists.

Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 arch/x86/include/asm/fpu/api.h | 17 +++++++---------
 arch/x86/kernel/fpu/core.c     | 37 +++++++++++-----------------------
 2 files changed, 19 insertions(+), 35 deletions(-)


base-commit: 0ad2507d5d93f39619fc42372c347d6006b64319
prerequisite-patch-id: ec1feea7e6f4d03e4e4c64c492197b89c957611a

Comments

Xiao Liang Feb. 21, 2025, 7:38 a.m. UTC | #1
On Thu, Feb 20, 2025 at 1:16 PM Eric Biggers <ebiggers@kernel.org> wrote:
>
> From: Eric Biggers <ebiggers@google.com>
>
> Currently kernel-mode FPU is not always usable in softirq context on
> x86, since softirqs can nest inside a kernel-mode FPU section in task
> context, and nested use of kernel-mode FPU is not supported.
>
> Therefore, x86 SIMD-optimized code that can be called in softirq context
> has to sometimes fall back to non-SIMD code.  There are two options for
> the fallback, both of which are pretty terrible:
>
>   (a) Use a scalar fallback.  This can be 10-100x slower than vectorized
>       code because it cannot use specialized instructions like AES, SHA,
>       or carryless multiplication.
>
>   (b) Execute the request asynchronously using a kworker.  In other
>       words, use the "crypto SIMD helper" in crypto/simd.c.
>
> Currently most of the x86 en/decryption code (skcipher and aead
> algorithms) uses option (b), since this avoids the slow scalar fallback
> and it is easier to wire up.  But option (b) is still really bad for its
> own reasons:
>
>   - Punting the request to a kworker is bad for performance too.
>
>   - It forces the algorithm to be marked as asynchronous
>     (CRYPTO_ALG_ASYNC), preventing it from being used by crypto API
>     users who request a synchronous algorithm.  That's another huge
>     performance problem, which is especially unfortunate for users who
>     don't even do en/decryption in softirq context.
>
>   - It makes all en/decryption operations take a detour through
>     crypto/simd.c.  That involves additional checks and an additional
>     indirect call, which slow down en/decryption for *everyone*.

Thank you for the detailed information.

> Fortunately, the skcipher and aead APIs are only usable in task and
> softirq context in the first place, nor is it supported to call them
> with hardirqs disabled.  Thus, if kernel-mode FPU were to be reliably
> usable in softirq context, no fallback would be needed.  Indeed, other
> architectures such as arm, arm64, and riscv have already done this.
>
> Therefore, this patch updates x86 accordingly to reliably support
> kernel-mode FPU in softirqs (except when hardirqs are disabled).
>
> This is done by just disabling softirq processing in kernel-mode FPU
> sections, as that prevents the nesting that was problematic.
>
> This will delay some softirqs slightly, but only ones that would have
> otherwise been nested inside a task context kernel-mode FPU section.
> Any such softirqs would have taken the slow fallback path before if they
> tried to do any en/decryption.  Now these softirqs will just run at the
> end of the task context kernel-mode FPU section (since local_bh_enable()
> runs pending softirqs) and will no longer take the slow fallback path.

I think this will delay all softirqs, including those that don't use
FPU. Will there be a performance impact?
(I guess you've noticed the patch I submitted last year. And this is
the main reason why it was implemented in the way you mentioned as
the second alternative.)
David Laight Feb. 25, 2025, 10:21 p.m. UTC | #2
On Fri, 21 Feb 2025 19:31:24 +0000
Eric Biggers <ebiggers@kernel.org> wrote:

> Hi Xiao,
> 
> On Fri, Feb 21, 2025 at 03:38:27PM +0800, Xiao Liang wrote:
> > > Therefore, this patch updates x86 accordingly to reliably support
> > > kernel-mode FPU in softirqs (except when hardirqs are disabled).
> > >
> > > This is done by just disabling softirq processing in kernel-mode FPU
> > > sections, as that prevents the nesting that was problematic.
> > >
> > > This will delay some softirqs slightly, but only ones that would have
> > > otherwise been nested inside a task context kernel-mode FPU section.
> > > Any such softirqs would have taken the slow fallback path before if they
> > > tried to do any en/decryption.  Now these softirqs will just run at the
> > > end of the task context kernel-mode FPU section (since local_bh_enable()
> > > runs pending softirqs) and will no longer take the slow fallback path.  
> > 
> > I think this will delay all softirqs, including those that don't use
> > FPU. Will there be a performance impact?
> > (I guess you've noticed the patch I submitted last year. And this is
> > the main reason why it was implemented in the way you mentioned as
> > the second alternative.)  
> 
> Thanks for taking a look at this patch!  It's true that this patch makes all
> softirqs on the same CPU be delayed until the end of the current kernel-mode FPU
> section.  But, I'm a bit skeptical that it actually matters enough on x86 to go
> with a more complex solution that would allow nested kernel-mode FPU.
> Kernel-mode FPU sections are generally short; the usual cases are en/decrypting
> disk sectors or network packets that are 4 KiB or less.
> 
....
> I think it's also important to note that when local_bh_enable() re-enables
> softirq processing (when called from kernel_fpu_end()), it also immediately
> runs any pending softirqs.  Thus there would be no additional delay; the CPU
> will *immediately* run any pending softirqs.

I'd also have thought that anything time-critical shouldn't rely on softirq.
The network stack will run a lot of code in softirq context, a bit of time
with softirq disabled isn't going to make any difference to real-world latency.

I do wonder though whether the network napi code should be running in softint
context at all.
With the amount of data it is trivial to get through a single 'consumer' ethernet
interface, it can easily cause the scheduler to misbehave.
I'd guess that google (etc) use threaded napi, multiple rx queues and RFS to get
the network processing spread out and not contending with process code.

> 
> As for supporting nested kernel-mode FPU if we wanted to go that way: yes, your
> patch from last year
> https://lore.kernel.org/lkml/20240403140138.393825-1-shaw.leon@gmail.com/
> ostensibly did that.  However, I found some bugs in it; e.g., it didn't take
> into account that struct fpu is variable-length.  So it didn't turn out as
> simple as that patch made it seem.  Just extending fpregs_{lock,unlock}() to
> kernel-mode FPU is a simpler solution with fewer edge cases, and it avoids
> increasing the memory usage of the kernel.  So I thought I'd propose that first.

Since many kernel users don't want the traditional fpu and just need to use
an instruction that requires an AVX register or two, is it possible for code
to specify a small save area for just two or four registers and then use just
those registers? (so treating them all as caller-saved).
I know that won't work with anything that affects the fpu status register,
but if you want a single wide register for a PCIe read (to generate a big TLP)
it is more than enough.
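
Purely as an illustration of the idea (no such interface exists today;
every name below is invented):

	/*
	 * Hypothetical: the caller supplies a save area big enough for the
	 * few vector registers it intends to clobber, and promises to touch
	 * only those registers (i.e. they are treated as caller-saved).
	 */
	struct kfpu_partial_save {
		u8 regs[4][64];		/* room for up to four 512-bit registers */
	};

	void kernel_fpu_begin_partial(struct kfpu_partial_save *save, unsigned int nregs);
	void kernel_fpu_end_partial(struct kfpu_partial_save *save, unsigned int nregs);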

I'm sure there are horrid pitfalls, especially if IPIs are still used for
deferred save of fpu state.

	David
Eric Biggers Feb. 25, 2025, 10:59 p.m. UTC | #3
On Tue, Feb 25, 2025 at 10:21:33PM +0000, David Laight wrote:
> > As for supporting nested kernel-mode FPU if we wanted to go that way: yes, your
> > patch from last year
> > https://lore.kernel.org/lkml/20240403140138.393825-1-shaw.leon@gmail.com/
> > ostensibly did that.  However, I found some bugs in it; e.g., it didn't take
> > into account that struct fpu is variable-length.  So it didn't turn out as
> > simple as that patch made it seem.  Just extending fpregs_{lock,unlock}() to
> > kernel-mode FPU is a simpler solution with fewer edge cases, and it avoids
> > increasing the memory usage of the kernel.  So I thought I'd propose that first.
> 
> Since many kernel users don't want the traditional fpu, they just need to use
> an instruction that requires an AVX register or two, is it possible for code
> to specify a small save area for just two or four registers and then use just
> those registers? (so treating then all as caller-saved).
> I know that won't work with anything that affects the fpu status register,
> but if you want a single wide register for a PCIe read (to generate a big TLP)
> it is more than enough.
> 
> I'm sure there are horrid pitfalls, especially if IPI are still used to for
> deferred save of fpu state.

I'm afraid that's not an accurate summary of what uses the vector registers in
kernel mode.  The main use case is crypto, and most of the crypto code uses a
lot of vector registers.  Some of the older crypto code uses at most 8 vector
registers (xmm0-xmm7) for 32-bit compatibility, but newer code uses 16 or even
up to 32 YMM or ZMM registers.  The new AES-GCM code for example uses all 32
vector registers, and the new AES-XTS code uses 30.

In general, taking full advantage of the vector register set improves
performance, and the trend has very much been towards using more registers --
not fewer.  (And the registers have been getting larger too!)  AES by itself
tends to need about 8 registers to take advantage of the CPU's full AES
throughput, but there are other computations like GHASH or tweak computation
that need to be interleaved with AES, using more registers.  And various
constants and round keys can be cached in registers to improve performance.

If we had to save/restore a large number of vector registers in every crypto
function call (not amortized to one save/restore per return to userspace), that
would be a big performance problem.

Most of the crypto code certainly could be written to use fewer registers.  But
it would reduce performance, especially if we tried to squeeze it down to use a
really small number of registers like 2-4.  Plus any such efforts would
complicate efforts to port crypto code between the kernel and userspace, as
userspace does not have such constraints on the number of registers.

- Eric
Dave Hansen Feb. 26, 2025, 5:09 p.m. UTC | #4
On 2/25/25 14:59, Eric Biggers wrote:
> If we had to save/restore a large number of vector registers in every crypto
> function call (not amortized to one save/restore per return to userspace), that
> would be a big performance problem.

I just did a quick trace on my laptop. Looks like I have two main
kernel_fpu_begin() users: LUKS and networking. They both very much seem
to do a bunch of kernel_fpu_begin() operations but very few actual XSAVEs:

     26 : save_fpregs_to_fpstate <-kernel_fpu_begin_mask
    818 : kernel_fpu_begin_mask <-crc32c_pcl_intel_update
   4192 : kernel_fpu_begin_mask <-xts_encrypt_vaes_avx10_256

This is at least _one_ data point very much in favor of Eric's argument
here. It appears that the cost of one XSAVE is amortized across a
bunch of kernel_fpu_begin()s.

Patch

diff --git a/arch/x86/include/asm/fpu/api.h b/arch/x86/include/asm/fpu/api.h
index f86ad3335529d..f42de5f05e7eb 100644
--- a/arch/x86/include/asm/fpu/api.h
+++ b/arch/x86/include/asm/fpu/api.h
@@ -14,14 +14,13 @@ 
 
 #include <asm/fpu/types.h>
 
 /*
  * Use kernel_fpu_begin/end() if you intend to use FPU in kernel context. It
- * disables preemption so be careful if you intend to use it for long periods
- * of time.
- * If you intend to use the FPU in irq/softirq you need to check first with
- * irq_fpu_usable() if it is possible.
+ * disables preemption and softirq processing, so be careful if you intend to
+ * use it for long periods of time.  Kernel-mode FPU cannot be used in all
+ * contexts -- see irq_fpu_usable() for details.
  */
 
 /* Kernel FPU states to initialize in kernel_fpu_begin_mask() */
 #define KFPU_387	_BITUL(0)	/* 387 state will be initialized */
 #define KFPU_MXCSR	_BITUL(1)	/* MXCSR will be initialized */
@@ -48,25 +47,23 @@  static inline void kernel_fpu_begin(void)
 	kernel_fpu_begin_mask(KFPU_387 | KFPU_MXCSR);
 #endif
 }
 
 /*
- * Use fpregs_lock() while editing CPU's FPU registers or fpu->fpstate.
- * A context switch will (and softirq might) save CPU's FPU registers to
- * fpu->fpstate.regs and set TIF_NEED_FPU_LOAD leaving CPU's FPU registers in
- * a random state.
+ * Use fpregs_lock() while editing CPU's FPU registers or fpu->fpstate, or while
+ * using the FPU in kernel mode.  A context switch will (and softirq might) save
+ * CPU's FPU registers to fpu->fpstate.regs and set TIF_NEED_FPU_LOAD leaving
+ * CPU's FPU registers in a random state.
  *
  * local_bh_disable() protects against both preemption and soft interrupts
  * on !RT kernels.
  *
  * On RT kernels local_bh_disable() is not sufficient because it only
  * serializes soft interrupt related sections via a local lock, but stays
  * preemptible. Disabling preemption is the right choice here as bottom
  * half processing is always in thread context on RT kernels so it
  * implicitly prevents bottom half processing as well.
- *
- * Disabling preemption also serializes against kernel_fpu_begin().
  */
 static inline void fpregs_lock(void)
 {
 	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
 		local_bh_disable();
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 1209c7aebb211..0f7268452bf20 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -55,35 +55,22 @@  DEFINE_PER_CPU(struct fpu *, fpu_fpregs_owner_ctx);
  * Can we use the FPU in kernel mode with the
  * whole "kernel_fpu_begin/end()" sequence?
  */
 bool irq_fpu_usable(void)
 {
-	if (WARN_ON_ONCE(in_nmi()))
-		return false;
-
-	/* In kernel FPU usage already active? */
-	if (this_cpu_read(in_kernel_fpu))
-		return false;
-
 	/*
-	 * When not in NMI or hard interrupt context, FPU can be used in:
-	 *
-	 * - Task context except from within fpregs_lock()'ed critical
-	 *   regions.
-	 *
-	 * - Soft interrupt processing context which cannot happen
-	 *   while in a fpregs_lock()'ed critical region.
+	 * kernel_fpu_begin() takes fpregs_lock(), which disables preemption and
+	 * softirq processing.  That prevents any other task or softirq from
+	 * trying to use the FPU.  Therefore, kernel-mode FPU can always be used
+	 * in task and softirq context, except when hardirqs are disabled which
+	 * is not compatible with disabling and enabling softirq processing, or
+	 * when kernel-mode FPU is explicitly nested (which should never
+	 * happen).  Disabling/enabling softirq processing is also not allowed
+	 * in hardirq context.  Thus, we get the following condition.
 	 */
-	if (!in_hardirq())
-		return true;
-
-	/*
-	 * In hard interrupt context it's safe when soft interrupts
-	 * are enabled, which means the interrupt did not hit in
-	 * a fpregs_lock()'ed critical region.
-	 */
-	return !softirq_count();
+	return !this_cpu_read(in_kernel_fpu) &&
+		!in_hardirq() && !irqs_disabled() && !in_nmi();
 }
 EXPORT_SYMBOL(irq_fpu_usable);
 
 /*
  * Track AVX512 state use because it is known to slow the max clock
@@ -418,11 +405,11 @@  int fpu_copy_uabi_to_guest_fpstate(struct fpu_guest *gfpu, const void *buf,
 EXPORT_SYMBOL_GPL(fpu_copy_uabi_to_guest_fpstate);
 #endif /* CONFIG_KVM */
 
 void kernel_fpu_begin_mask(unsigned int kfpu_mask)
 {
-	preempt_disable();
+	fpregs_lock();
 
 	WARN_ON_FPU(!irq_fpu_usable());
 	WARN_ON_FPU(this_cpu_read(in_kernel_fpu));
 
 	this_cpu_write(in_kernel_fpu, true);
@@ -446,11 +433,11 @@  EXPORT_SYMBOL_GPL(kernel_fpu_begin_mask);
 void kernel_fpu_end(void)
 {
 	WARN_ON_FPU(!this_cpu_read(in_kernel_fpu));
 
 	this_cpu_write(in_kernel_fpu, false);
-	preempt_enable();
+	fpregs_unlock();
 }
 EXPORT_SYMBOL_GPL(kernel_fpu_end);
 
 /*
  * Sync the FPU register state to current's memory register state when the