Message ID: 20250516231858.27899-4-ebiggers@kernel.org
State: New
Series: x86: Don't support kernel-mode FPU with hardirqs disabled
* Eric Biggers <ebiggers@kernel.org> wrote:

> From: Eric Biggers <ebiggers@google.com>
>
> Make irq_fpu_usable() return false when irqs_disabled().  That makes the
> irqs_disabled() checks in kernel_fpu_begin_mask() and kernel_fpu_end()
> unnecessary, so also remove those.
>
> Rationale:
>
> - There's no known use case for kernel-mode FPU when irqs_disabled().

Except EFI?

>   arm64 and riscv already disallow kernel-mode FPU when irqs_disabled().
>   __save_processor_state() previously did expect kernel_fpu_begin() and
>   kernel_fpu_end() to work when irqs_disabled(), but this was a
>   different use case and not actual kernel-mode FPU use.
>
> - This is more efficient, since one call to irqs_disabled() replaces two
>   irqs_disabled() and one in_hardirq().

This is noise compared to the overhead of saving/restoring vector CPU
context ...

> - This fixes irq_fpu_usable() to correctly return false during CPU
>   initialization.  Incorrectly returning true caused the SHA-256 library
>   code, which is called when loading AMD microcode, to take a
>   SIMD-optimized code path too early, causing a crash.  By correctly
>   returning false from irq_fpu_usable(), the generic SHA-256 code
>   correctly gets used instead.  (Note: SIMD-optimized SHA-256 doesn't
>   get enabled until subsys_initcall, but CPU hotplug can happen later.)

Alternatively we could set in_kernel_fpu during CPU bootstrap, and
clear it once we know the FPU is usable?  This is only a relatively
short early boot period, with no scheduling, right?

Thanks,

	Ingo
On Sat, May 17, 2025 at 09:09:01AM +0200, Ingo Molnar wrote:
>
> * Eric Biggers <ebiggers@kernel.org> wrote:
>
> > From: Eric Biggers <ebiggers@google.com>
> >
> > Make irq_fpu_usable() return false when irqs_disabled().  That makes the
> > irqs_disabled() checks in kernel_fpu_begin_mask() and kernel_fpu_end()
> > unnecessary, so also remove those.
> >
> > Rationale:
> >
> > - There's no known use case for kernel-mode FPU when irqs_disabled().
>
> Except EFI?

Yes, I remembered that just after sending this...  And EFI does want the
ldmxcsr and fninit, which makes it like actual kernel-mode FPU.

That implies we at least need to disable BH (and preemption) if it wasn't
already disabled.  But if hardirqs may or may not be disabled already, that
means we either need to conditionally use local_bh_disable()/enable() (or
preempt_disable()/enable() on PREEMPT_RT) as the current code does, or use
local_irq_save()/restore().

If we did the latter, then all EFI calls would run with hardirqs disabled.
It looks like hardirqs are currently intentionally disabled before some of
the EFI calls, but not all of them.  I'm not sure what the logic is there,
and whether it would be okay to just always disable them.

> >   arm64 and riscv already disallow kernel-mode FPU when irqs_disabled().
> >   __save_processor_state() previously did expect kernel_fpu_begin() and
> >   kernel_fpu_end() to work when irqs_disabled(), but this was a
> >   different use case and not actual kernel-mode FPU use.
> >
> > - This is more efficient, since one call to irqs_disabled() replaces two
> >   irqs_disabled() and one in_hardirq().
>
> This is noise compared to the overhead of saving/restoring vector CPU
> context ...

In practice most calls to kernel_fpu_begin() don't actually do the
save_fpregs_to_fpstate(), since either TIF_NEED_FPU_LOAD is already set or
it's a kthread.  So, the overhead from the other parts like the EFLAGS
checks and ldmxcsr is measurable, especially when processing small amounts
of data.
> > - This fixes irq_fpu_usable() to correctly return false during CPU
> >   initialization.  Incorrectly returning true caused the SHA-256 library
> >   code, which is called when loading AMD microcode, to take a
> >   SIMD-optimized code path too early, causing a crash.  By correctly
> >   returning false from irq_fpu_usable(), the generic SHA-256 code
> >   correctly gets used instead.  (Note: SIMD-optimized SHA-256 doesn't
> >   get enabled until subsys_initcall, but CPU hotplug can happen later.)
>
> Alternatively we could set in_kernel_fpu during CPU bootstrap, and
> clear it once we know the FPU is usable?  This is only a relatively
> short early boot period, with no scheduling, right?

Yes, if there isn't agreement on this approach we can do that instead.
Say:

- Replace in_kernel_fpu with kernel_fpu_supported, with the opposite
  meaning (so that the initial value of false means "unsupported")

- fpu__init_cpu() sets it to true

- cpu_disable_common() sets it to false
* Eric Biggers <ebiggers@kernel.org> wrote:

> > Alternatively we could set in_kernel_fpu during CPU bootstrap, and
> > clear it once we know the FPU is usable?  This is only a relatively
> > short early boot period, with no scheduling, right?
>
> Yes, if there isn't agreement on this approach we can do that
> instead.  Say:
>
> - Replace in_kernel_fpu with kernel_fpu_supported, with the opposite
>   meaning (so that the initial value of false means "unsupported")

I'm not against simplifying the x86 FPU model to exclude IRQs-off
context (especially if it also micro-optimizes some of the key runtime
kernel-FPU primitives), but it has to be a full solution and we'll have
to see how complicated the EFI changes get.

Ie. without seeing the full cost-benefit balance it's hard to call this
in advance.  Mind sending a full series that addresses the EFI case too?

Thanks,

	Ingo
On Sun, May 18, 2025 at 03:18:58PM +0200, Ard Biesheuvel wrote:
> On Sun, 18 May 2025 at 08:34, Ingo Molnar <mingo@kernel.org> wrote:
> >
> >
> > * Eric Biggers <ebiggers@kernel.org> wrote:
> >
> > > > Alternatively we could set in_kernel_fpu during CPU bootstrap, and
> > > > clear it once we know the FPU is usable?  This is only a relatively
> > > > short early boot period, with no scheduling, right?
> > >
> > > Yes, if there isn't agreement on this approach we can do that
> > > instead.  Say:
> > >
> > > - Replace in_kernel_fpu with kernel_fpu_supported, with the opposite
> > >   meaning (so that the initial value of false means "unsupported")
> >
> > I'm not against simplifying the x86 FPU model to exclude IRQs-off
> > context (especially if it also micro-optimizes some of the key runtime
> > kernel-FPU primitives), but it has to be a full solution and we'll have
> > to see how complicated the EFI changes get.
> >
> > Ie. without seeing the full cost-benefit balance it's hard to call this
> > in advance.  Mind sending a full series that addresses the EFI case too?
> >
>
> EFI services are only called with IRQs disabled in exceptional cases,
> so it would be unfortunate if that prevented us from making meaningful
> improvements here.  In ordinary cases, they are called from a
> workqueue, and I'd prefer it if we can address this without calling
> all EFI services with interrupts disabled either.
>
> AIUI, the reason we cannot tolerate IRQs being disabled is because
> re-enabling softirqs will complain if IRQs are disabled, due to the
> fact that handling softirqs should not be attempted at that point?
>
> I don't know the history here, but I wonder if that isn't overly
> pedantic?  Disabling softirqs could be avoided entirely when IRQs are
> off, given that they are disabled implicitly already.  But why then is
> it not permitted to disable and re-enable softirqs under this
> condition, given that it makes no difference?  Or perhaps I'm missing
> something here.
>
> A good way to trigger such an exceptional case is running a kernel
> with efi-pstore and lkdtm built-in under QEMU with OVMF, and do
>
> # echo PANIC > /sys/kernel/debug/provoke-crash/DIRECT
>
> Another case that likely executes with IRQs disabled (but I haven't
> double checked) is reset_system(), which may return with an error, or
> reboot/poweroff the machine and never return.

That makes sense to me.  preempt_disable() and preempt_enable() are already
allowed when IRQs are disabled, and I'm not sure why local_bh_disable() and
local_bh_enable() are different.  local_bh_enable() already uses
local_irq_save(flags) instead of local_irq_disable(), so it seems it's sort
of intended to work when IRQs are disabled, despite the
lockdep_assert_irqs_enabled().

Anyway, that would point to continuing to support kernel-mode FPU when IRQs
are disabled.  But also EFI needs it anyway, unless we refactor it to use
kernel_fpu_begin() and kernel_fpu_end() only when irq_fpu_usable() and
otherwise use different code, analogous to what arm64 does.

So for now I've sent
https://lore.kernel.org/lkml/20250518193212.1822-1-ebiggers@kernel.org
which implements the other possible fix, where we just start keeping track
of whether the FPU has been initialized or not.

- Eric
* Eric Biggers <ebiggers@kernel.org> wrote:

> > # echo PANIC > /sys/kernel/debug/provoke-crash/DIRECT
> >
> > Another case that likely executes with IRQs disabled (but I haven't
> > double checked) is reset_system(), which may return with an error, or
> > reboot/poweroff the machine and never return.
>
> That makes sense to me.  preempt_disable() and preempt_enable() are already
> allowed when IRQs are disabled, and I'm not sure why local_bh_disable() and
> local_bh_enable() are different.

Because local_bh_enable() may run softirq handlers immediately if
there's pending softirqs, which shouldn't be done in hardirq context.

This is a key optimization of the Linux networking code, which uses
BH-off/BH-on sections instead of IRQS-off/IRQS-on critical sections,
for performance reasons.

Thanks,

	Ingo
On Mon, 19 May 2025 at 10:06, Ingo Molnar <mingo@kernel.org> wrote:
>
>
> * Eric Biggers <ebiggers@kernel.org> wrote:
>
> > > # echo PANIC > /sys/kernel/debug/provoke-crash/DIRECT
> > >
> > > Another case that likely executes with IRQs disabled (but I haven't
> > > double checked) is reset_system(), which may return with an error, or
> > > reboot/poweroff the machine and never return.
> >
> > That makes sense to me.  preempt_disable() and preempt_enable() are already
> > allowed when IRQs are disabled, and I'm not sure why local_bh_disable() and
> > local_bh_enable() are different.
>
> Because local_bh_enable() may run softirq handlers immediately if
> there's pending softirqs, which shouldn't be done in hardirq context.
>

Sure, but why is that mandatory?

preempt_disable() has preempt_enable() and preempt_enable_no_resched()
counterparts.  Could we have a local_bh_enable_no_xxx() version that
re-enables async softirq processing on the current CPU but does not kick
off a synchronous processing run?
* Ard Biesheuvel <ardb@kernel.org> wrote:

> On Mon, 19 May 2025 at 14:57, Ingo Molnar <mingo@kernel.org> wrote:
> >
> >
> > * Ard Biesheuvel <ardb@kernel.org> wrote:
> >
> > > On Mon, 19 May 2025 at 10:06, Ingo Molnar <mingo@kernel.org> wrote:
> > > >
> > > >
> > > > * Eric Biggers <ebiggers@kernel.org> wrote:
> > > >
> > > > > > # echo PANIC > /sys/kernel/debug/provoke-crash/DIRECT
> > > > > >
> > > > > > Another case that likely executes with IRQs disabled (but I
> > > > > > haven't double checked) is reset_system(), which may return
> > > > > > with an error, or reboot/poweroff the machine and never
> > > > > > return.
> > > > >
> > > > > That makes sense to me.  preempt_disable() and
> > > > > preempt_enable() are already allowed when IRQs are disabled,
> > > > > and I'm not sure why local_bh_disable() and local_bh_enable()
> > > > > are different.
> > > >
> > > > Because local_bh_enable() may run softirq handlers immediately
> > > > if there's pending softirqs, which shouldn't be done in hardirq
> > > > context.
> > >
> > > Sure, but why is that mandatory?
> > >
> > > preempt_disable() has preempt_enable() and
> > > preempt_enable_no_resched() counterparts.
> > >
> > > [...] Could we have a local_bh_enable_no_xxx() version that
> > > re-enables async softirq processing on the current CPU but does not
> > > kick off a synchronous processing run?
> >
> > Yes, that's what __local_bh_enable() does, but if we used it for
> > kernel_fpu_end() we'd be introducing random softirq processing
> > latencies.  The softirq execution model is for softirqs to be
> > immediately executed after local_bh_enable(), and various networking
> > code is tuned to that behavior.
> >
>
> All of that only applies when re-enabling softirqs with IRQs enabled.

Yes, of course.  BHs in the networking code are typically
disabled/enabled when IRQs are on.
It's the whole *point* of the facility: it was written as a
'lightweight' IRQs-on/off facility originally, back in the days when
local_irq_save() was very expensive, especially on non-x86 platforms.

> I don't see why we'd need all of that.
>
> Conceptually, kernel_fpu_end() would do
>
>     if (irqs_disabled())
>         local_bh_enable_no_xxx();
>     else
>         local_bh_enable();

In normal kernel code local_bh_enable() obviously cannot be done with
IRQs off, and local_bh_disable()/__local_bh_enable() is highly frowned
upon because it's generally pointless: if IRQs are off to begin with,
why disable any BHs?

What you probably mean is to only disable BHs if IRQs are not off
(because hardirqs-off state already disables BHs):

  kernel_fpu_begin()
	if (!irqs_disabled())
		local_bh_disable();

  kernel_fpu_end()
	if (!irqs_disabled())
		local_bh_enable();

... which is basically what the current code does:

	if (!irqs_disabled())
		fpregs_lock();
	...
	if (!irqs_disabled())
		fpregs_unlock();

BTW., maybe we should add a lockdep check to make sure we never enable
hardirqs while kernel FPU handling is ongoing.  It should be relatively
straightforward and cheap.

But that brings us far away from the original question:

> > > > > preempt_disable() and preempt_enable() are already allowed
> > > > > when IRQs are disabled, and I'm not sure why
> > > > > local_bh_disable() and local_bh_enable() are different.
> > > >
> > > > Because local_bh_enable() may run softirq handlers immediately
> > > > if there's pending softirqs, which shouldn't be done in hardirq
> > > > context.

To rephrase my answer: local_bh_disable()/enable() are part of the
softirq execution mechanism whose primary historical purpose was to be a
lighter-weight replacement for hardirq disable/enable critical sections,
combined with a relaxation of how long a softirq could run versus a
hardirq.  It makes little sense to nest BH handling primitives within
hardirq critical sections.

Thanks,

	Ingo
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 476393b1d5e8f..9b3c5e17f86cd 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -66,42 +66,31 @@ struct fpu *x86_task_fpu(struct task_struct *task)
  * Can we use the FPU in kernel mode with the
  * whole "kernel_fpu_begin/end()" sequence?
  */
 bool irq_fpu_usable(void)
 {
-	if (WARN_ON_ONCE(in_nmi()))
-		return false;
-
 	/*
-	 * In kernel FPU usage already active?  This detects any explicitly
-	 * nested usage in task or softirq context, which is unsupported.  It
-	 * also detects attempted usage in a hardirq that has interrupted a
-	 * kernel-mode FPU section.
+	 * To ensure that (non-explicitly-nested) kernel-mode FPU is always
+	 * usable in task and softirq contexts, kernel_fpu_begin() disables
+	 * preemption and softirqs, and kernel_fpu_end() re-enables them.  That
+	 * is not compatible with hardirqs being disabled (including hardirq
+	 * context), or with NMI context.  Support for kernel-mode FPU is not
+	 * needed in those contexts anyway.  Return false in those contexts.
+	 *
+	 * Returning false when irqs_disabled() also eliminates the need to
+	 * explicitly check whether the FPU has been initialized yet during CPU
+	 * initialization.  Before then, hardirqs are still disabled.
 	 */
+	if (irqs_disabled() || WARN_ON_ONCE(in_nmi()))
+		return false;
+
+	/* Catch any explicitly nested usage, which should never happen. */
 	if (this_cpu_read(in_kernel_fpu)) {
-		WARN_ON_FPU(!in_hardirq());
+		WARN_ON_FPU(1);
 		return false;
 	}
-
-	/*
-	 * When not in NMI or hard interrupt context, FPU can be used in:
-	 *
-	 * - Task context except from within fpregs_lock()'ed critical
-	 *   regions.
-	 *
-	 * - Soft interrupt processing context which cannot happen
-	 *   while in a fpregs_lock()'ed critical region.
-	 */
-	if (!in_hardirq())
-		return true;
-
-	/*
-	 * In hard interrupt context it's safe when soft interrupts
-	 * are enabled, which means the interrupt did not hit in
-	 * a fpregs_lock()'ed critical region.
-	 */
-	return !softirq_count();
+	return true;
 }
 EXPORT_SYMBOL(irq_fpu_usable);
 
 /*
  * Track AVX512 state use because it is known to slow the max clock
@@ -443,12 +432,11 @@ static __always_inline void __fpu_save_state(void)
 	__cpu_invalidate_fpregs_state();
 }
 
 void kernel_fpu_begin_mask(unsigned int kfpu_mask)
 {
-	if (!irqs_disabled())
-		fpregs_lock();
+	fpregs_lock();
 
 	WARN_ON_FPU(!irq_fpu_usable());
 	WARN_ON_FPU(this_cpu_read(in_kernel_fpu));
 
 	this_cpu_write(in_kernel_fpu, true);
@@ -467,12 +455,11 @@ EXPORT_SYMBOL_GPL(kernel_fpu_begin_mask);
 void kernel_fpu_end(void)
 {
 	WARN_ON_FPU(!this_cpu_read(in_kernel_fpu));
 
 	this_cpu_write(in_kernel_fpu, false);
-	if (!irqs_disabled())
-		fpregs_unlock();
+	fpregs_unlock();
 }
 EXPORT_SYMBOL_GPL(kernel_fpu_end);
 
 #ifdef CONFIG_PM_SLEEP
 /*