Message ID | 20190222185026.10973-2-will.deacon@arm.com
---|---
State | New
Series | Remove Mysterious Macro Intended to Obscure Weird Behaviours (mmiowb())
I love removing mmiowb(), but..

On Fri, Feb 22, 2019 at 10:50 AM Will Deacon <will.deacon@arm.com> wrote:
>
> +#ifndef mmiowb_set_pending
> +static inline void mmiowb_set_pending(void)
> +{
> +	__this_cpu_write(__mmiowb_state.mmiowb_pending, 1);
> +}
> +#endif
> +
> +#ifndef mmiowb_spin_lock
> +static inline void mmiowb_spin_lock(void)
> +{
> +	if (__this_cpu_inc_return(__mmiowb_state.nesting_count) == 1)
> +		__this_cpu_write(__mmiowb_state.mmiowb_pending, 0);
> +}
> +#endif

The case we want to go fast is the spin-lock and unlock case, not the
"set pending" case.

And the way you implemented this, it's exactly the wrong way around.

So I'd suggest instead doing

    static inline void mmiowb_set_pending(void)
    {
            __this_cpu_write(__mmiowb_state.mmiowb_pending,
                             __mmiowb_state.nesting_count);
    }

and

    static inline void mmiowb_spin_lock(void)
    {
            __this_cpu_inc(__mmiowb_state.nesting_count);
    }

which makes that spin-lock code much simpler and avoids the conditional there.

Then the unlock case could be something like

    static inline void mmiowb_spin_unlock(void)
    {
            if (unlikely(__this_cpu_read(__mmiowb_state.mmiowb_pending))) {
                    __this_cpu_write(__mmiowb_state.mmiowb_pending, 0);
                    mmiowb();
            }
            __this_cpu_dec(__mmiowb_state.nesting_count);
    }

or something (xchg is generally much more expensive than read, and the
common case for spinlocks is that nobody did IO inside of it).

And we don't need atomicity, since even if there's an interrupt that does
more IO while we're in the spinlock, _those_ IO's don't need to be
serialized by that spinlock.

Hmm? Entirely untested, and maybe I have a thinko somewhere, but the
above would seem to be better for the common case that really matters. No?

               Linus
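Pulled together, the suggested rework amounts to something like the sketch
below. It is untested and only illustrative; unlike the mail above, it reads
the nesting count through __this_cpu_read() rather than the bare per-CPU
variable.

    static inline void mmiowb_spin_lock(void)
    {
            /* Lock fast path: a single per-CPU increment, no conditional. */
            __this_cpu_inc(__mmiowb_state.nesting_count);
    }

    static inline void mmiowb_set_pending(void)
    {
            /* Record pending I/O only while inside a spinlocked section. */
            __this_cpu_write(__mmiowb_state.mmiowb_pending,
                             __this_cpu_read(__mmiowb_state.nesting_count));
    }

    static inline void mmiowb_spin_unlock(void)
    {
            /*
             * Common case: no I/O happened inside the critical section, so
             * a plain read suffices and the barrier is skipped entirely.
             */
            if (unlikely(__this_cpu_read(__mmiowb_state.mmiowb_pending))) {
                    __this_cpu_write(__mmiowb_state.mmiowb_pending, 0);
                    mmiowb();
            }
            __this_cpu_dec(__mmiowb_state.nesting_count);
    }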
On Fri, Feb 22, 2019 at 1:49 PM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> The case we want to go fast is the spin-lock and unlock case, not the
> "set pending" case.
>
> And the way you implemented this, it's exactly the wrong way around.

Oh, one more comment: couldn't we make that mmiowb flag be right next to
the preemption count?

Because that's the common case anyway, where a spinlock increments the
preemption count too. If we put the mmiowb state in the same cacheline,
we don't cause extra cache effects, which is what really matters, I guess.

I realize this is somewhat inconvenient, because some architectures put
preempt count in the thread structure, and others do it as a percpu
variable. But maybe the architecture could just declare where the mmiowb
state is?

               Linus
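One hypothetical way to let the architecture declare where the state lives
is an overridable accessor in the generic header, along these lines. The
arch_mmiowb_state() hook and the co-location suggestion are illustrative
only, not part of this series:

    /* Generic fallback: a dedicated per-CPU variable. */
    #ifndef arch_mmiowb_state
    DECLARE_PER_CPU(struct mmiowb_state, __mmiowb_state);
    #define __mmiowb_state()	raw_cpu_ptr(&__mmiowb_state)
    #else
    /*
     * Arch override: point at state the architecture keeps next to its
     * preempt count (e.g. a hypothetical field in thread_info or its
     * per-CPU area), so spin_lock()/unlock() only dirty one cacheline.
     */
    #define __mmiowb_state()	arch_mmiowb_state()
    #endif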
On Fri, Feb 22, 2019 at 01:55:20PM -0800, Linus Torvalds wrote:
> On Fri, Feb 22, 2019 at 1:49 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > The case we want to go fast is the spin-lock and unlock case, not the
> > "set pending" case.
> >
> > And the way you implemented this, it's exactly the wrong way around.
>
> Oh, one more comment: couldn't we make that mmiowb flag be right next
> to the preemption count?
>
> Because that's the common case anyway, where a spinlock increments the
> preemption count too. If we put the mmiowb state in the same
> cacheline, we don't cause extra cache effects, which is what really
> matters, I guess.
>
> I realize this is somewhat inconvenient, because some architectures
> put preempt count in the thread structure, and others do it as a
> percpu variable. But maybe the architecture could just declare where
> the mmiowb state is?

I think that should be doable... I'll have a play.

Will
Hi Linus,

Thanks for having a look.

On Fri, Feb 22, 2019 at 01:49:32PM -0800, Linus Torvalds wrote:
> On Fri, Feb 22, 2019 at 10:50 AM Will Deacon <will.deacon@arm.com> wrote:
> >
> > +#ifndef mmiowb_set_pending
> > +static inline void mmiowb_set_pending(void)
> > +{
> > +	__this_cpu_write(__mmiowb_state.mmiowb_pending, 1);
> > +}
> > +#endif
> > +
> > +#ifndef mmiowb_spin_lock
> > +static inline void mmiowb_spin_lock(void)
> > +{
> > +	if (__this_cpu_inc_return(__mmiowb_state.nesting_count) == 1)
> > +		__this_cpu_write(__mmiowb_state.mmiowb_pending, 0);
> > +}
> > +#endif
>
> The case we want to go fast is the spin-lock and unlock case, not the
> "set pending" case.
>
> And the way you implemented this, it's exactly the wrong way around.
>
> So I'd suggest instead doing
>
>     static inline void mmiowb_set_pending(void)
>     {
>             __this_cpu_write(__mmiowb_state.mmiowb_pending,
>                              __mmiowb_state.nesting_count);
>     }
>
> and
>
>     static inline void mmiowb_spin_lock(void)
>     {
>             __this_cpu_inc(__mmiowb_state.nesting_count);
>     }
>
> which makes that spin-lock code much simpler and avoids the conditional there.

Makes sense; I'll hook that up for the next version.

> Then the unlock case could be something like
>
>     static inline void mmiowb_spin_unlock(void)
>     {
>             if (unlikely(__this_cpu_read(__mmiowb_state.mmiowb_pending))) {
>                     __this_cpu_write(__mmiowb_state.mmiowb_pending, 0);
>                     mmiowb();
>             }
>             __this_cpu_dec(__mmiowb_state.nesting_count);
>     }
>
> or something (xchg is generally much more expensive than read, and the
> common case for spinlocks is that nobody did IO inside of it).

So I *am* using __this_cpu_xchg() here, which means the architecture can
get away with plain old loads and stores (which is what RISC-V does, for
example), but I see that's not the case on e.g. x86 so I'll rework using
read() and write() because it doesn't hurt.

Will
On Tue, Feb 26, 2019 at 06:26:24PM +0000, Will Deacon wrote:
> On Fri, Feb 22, 2019 at 01:49:32PM -0800, Linus Torvalds wrote:
> So I *am* using __this_cpu_xchg() here, which means the architecture can
> get away with plain old loads and stores (which is what RISC-V does, for
> example), but I see that's not the case on e.g. x86 so I'll rework using
> read() and write() because it doesn't hurt.

Right, so the problem on x86 is that XCHG has an implicit LOCK prefix, so
there is no !atomic variant. So even the cpu-local xchg gets the LOCK
prefix penalty, even though all we really wanted is a single instruction.

Arguably we could fix that for __this_cpu_xchg(), which isn't IRQ-safe.
On Tue, Feb 26, 2019 at 10:33 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> Arguably we could fix that for __this_cpu_xchg(), which isn't IRQ-safe.

Yeah, I guess x86 _should_ really do __this_cpu_xchg() as just a
read-write pair.

In general, a read-write pair is probably always the right thing to do,
and the only reason we can't just do it in an architecture-independent
way is that we'd want to avoid doing the address generation twice (for
architectures where that is an issue).

               Linus
On Tue, Feb 26, 2019 at 11:02:50AM -0800, Linus Torvalds wrote:
> On Tue, Feb 26, 2019 at 10:33 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > Arguably we could fix that for __this_cpu_xchg(), which isn't IRQ-safe.
>
> Yeah, I guess x86 _should_ really do __this_cpu_xchg() as just a
> read-write pair.

See the patches I just sent.

> In general, a read-write pair is probably always the right thing to
> do, and the only reason we can't just do it in an
> architecture-independent way is that we'd want to avoid doing the
> address generation twice (for architectures where that is an issue).

The generic code has this right, see
include/asm-generic/percpu.h:raw_cpu_generic_xchg(). That is used in the
majority of the architectures.

With the patch I just sent, x86 will use two gs-prefixed movs.
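The read-write pair being discussed has roughly the following shape. This is
a simplified standalone sketch with an invented helper name, not the actual
percpu macro, which is written generically over any scalar size:

    /*
     * A CPU-local, non-IRQ-safe exchange needs no atomic instruction at
     * all: a plain load followed by a plain store is enough, because only
     * the local CPU touches the variable and the caller isn't asking for
     * protection against its own interrupt handlers.
     */
    static inline u16 cpu_local_xchg_u16(u16 *ptr, u16 newval)
    {
            u16 oldval = *ptr;      /* plain load  */

            *ptr = newval;          /* plain store */
            return oldval;
    }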
diff --git a/include/asm-generic/mmiowb.h b/include/asm-generic/mmiowb.h
new file mode 100644
index 000000000000..1cec8907806f
--- /dev/null
+++ b/include/asm-generic/mmiowb.h
@@ -0,0 +1,60 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_GENERIC_MMIOWB_H
+#define __ASM_GENERIC_MMIOWB_H
+
+/*
+ * Generic implementation of mmiowb() tracking for spinlocks.
+ *
+ * If your architecture doesn't ensure that writes to an I/O peripheral
+ * within two spinlocked sections on two different CPUs are seen by the
+ * peripheral in the order corresponding to the lock handover, then you
+ * need to follow these FIVE easy steps:
+ *
+ *	1. Implement mmiowb() in asm/mmiowb.h and then #include this file
+ *	2. Ensure your I/O write accessors call mmiowb_set_pending()
+ *	3. Select ARCH_HAS_MMIOWB
+ *	4. Untangle the resulting mess of header files
+ *	5. Complain to your architects
+ */
+#if defined(CONFIG_ARCH_HAS_MMIOWB) && defined(CONFIG_SMP)
+
+#include <linux/types.h>
+#include <asm/percpu.h>
+#include <asm/smp.h>
+
+struct mmiowb_state {
+	u16	nesting_count;
+	u16	mmiowb_pending;
+};
+DECLARE_PER_CPU(struct mmiowb_state, __mmiowb_state);
+
+#ifndef mmiowb_set_pending
+static inline void mmiowb_set_pending(void)
+{
+	__this_cpu_write(__mmiowb_state.mmiowb_pending, 1);
+}
+#endif
+
+#ifndef mmiowb_spin_lock
+static inline void mmiowb_spin_lock(void)
+{
+	if (__this_cpu_inc_return(__mmiowb_state.nesting_count) == 1)
+		__this_cpu_write(__mmiowb_state.mmiowb_pending, 0);
+}
+#endif
+
+#ifndef mmiowb_spin_unlock
+static inline void mmiowb_spin_unlock(void)
+{
+	if (__this_cpu_xchg(__mmiowb_state.mmiowb_pending, 0))
+		mmiowb();
+	__this_cpu_dec_return(__mmiowb_state.nesting_count);
+}
+#endif
+
+#else
+#define mmiowb_set_pending()	do { } while (0)
+#define mmiowb_spin_lock()	do { } while (0)
+#define mmiowb_spin_unlock()	do { } while (0)
+#endif	/* CONFIG_ARCH_HAS_MMIOWB && CONFIG_SMP */
+#endif	/* __ASM_GENERIC_MMIOWB_H */
diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
index 84d882f3e299..04976ae41176 100644
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -248,3 +248,6 @@ config ARCH_USE_QUEUED_RWLOCKS
 config QUEUED_RWLOCKS
 	def_bool y if ARCH_USE_QUEUED_RWLOCKS
 	depends on SMP
+
+config ARCH_HAS_MMIOWB
+	bool
diff --git a/kernel/locking/spinlock.c b/kernel/locking/spinlock.c
index 936f3d14dd6b..cbae365d7dd1 100644
--- a/kernel/locking/spinlock.c
+++ b/kernel/locking/spinlock.c
@@ -22,6 +22,11 @@
 #include <linux/debug_locks.h>
 #include <linux/export.h>
 
+#ifdef CONFIG_ARCH_HAS_MMIOWB
+DEFINE_PER_CPU(struct mmiowb_state, __mmiowb_state);
+EXPORT_PER_CPU_SYMBOL(__mmiowb_state);
+#endif
+
 /*
  * If lockdep is enabled then we use the non-preemption spin-ops
  * even on CONFIG_PREEMPT, because lockdep assumes that interrupts are
In preparation for removing all explicit mmiowb() calls from driver
code, implement a tracking system in asm-generic based on the PowerPC
implementation. This allows architectures with a non-empty mmiowb()
definition to automatically have the barrier inserted in spin_unlock()
following a critical section containing an I/O write.

Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 include/asm-generic/mmiowb.h | 60 ++++++++++++++++++++++++++++++++++++++++++++
 kernel/Kconfig.locks         |  3 +++
 kernel/locking/spinlock.c    |  5 ++++
 3 files changed, 68 insertions(+)
 create mode 100644 include/asm-generic/mmiowb.h

-- 
2.11.0
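As a rough illustration of steps 1-3 from the header comment in the patch,
wiring this up for a hypothetical architecture "foo" might look like the
following; the barrier helper, accessor, and paths are placeholders rather
than part of this series:

    /* arch/foo/include/asm/mmiowb.h (hypothetical) */
    #ifndef _ASM_FOO_MMIOWB_H
    #define _ASM_FOO_MMIOWB_H

    /* Step 1: define the barrier, then pull in the generic tracking. */
    #define mmiowb()	foo_io_write_barrier()
    #include <asm-generic/mmiowb.h>

    #endif /* _ASM_FOO_MMIOWB_H */

    /* Step 2: I/O write accessors flag pending I/O, e.g. in asm/io.h: */
    #define writel(v, c)	({ __raw_writel(v, c); mmiowb_set_pending(); })

    /* Step 3: arch/foo/Kconfig selects the tracking code. */
    config FOO
    	select ARCH_HAS_MMIOWB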