[RFC,01/20] asm-generic/mmiowb: Add generic implementation of mmiowb() tracking

Message ID 20190222185026.10973-2-will.deacon@arm.com
State New
Series Remove Mysterious Macro Intended to Obscure Weird Behaviours (mmiowb())

Commit Message

Will Deacon Feb. 22, 2019, 6:50 p.m. UTC
In preparation for removing all explicit mmiowb() calls from driver
code, implement a tracking system in asm-generic based on the PowerPC
implementation. This allows architectures with a non-empty mmiowb()
definition to automatically have the barrier inserted in spin_unlock()
following a critical section containing an I/O write.
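
As a rough sketch of the intended wiring (the call sites are added by a
later patch in this series, so the snippet below is illustrative rather
than part of this patch), the generic spinlock code would bracket each
critical section with the new hooks:

  static inline void do_raw_spin_lock(raw_spinlock_t *lock)
  {
      arch_spin_lock(&lock->raw_lock);
      mmiowb_spin_lock();       /* start tracking I/O writes under this lock */
  }

  static inline void do_raw_spin_unlock(raw_spinlock_t *lock)
  {
      mmiowb_spin_unlock();     /* run mmiowb() if an I/O write was pending */
      arch_spin_unlock(&lock->raw_lock);
  }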

Signed-off-by: Will Deacon <will.deacon@arm.com>

---
 include/asm-generic/mmiowb.h | 60 ++++++++++++++++++++++++++++++++++++++++++++
 kernel/Kconfig.locks         |  3 +++
 kernel/locking/spinlock.c    |  5 ++++
 3 files changed, 68 insertions(+)
 create mode 100644 include/asm-generic/mmiowb.h

-- 
2.11.0

Comments

Linus Torvalds Feb. 22, 2019, 9:49 p.m. UTC | #1
I love removing mmiowb(), but..

On Fri, Feb 22, 2019 at 10:50 AM Will Deacon <will.deacon@arm.com> wrote:
>
> +#ifndef mmiowb_set_pending
> +static inline void mmiowb_set_pending(void)
> +{
> +       __this_cpu_write(__mmiowb_state.mmiowb_pending, 1);
> +}
> +#endif
> +
> +#ifndef mmiowb_spin_lock
> +static inline void mmiowb_spin_lock(void)
> +{
> +       if (__this_cpu_inc_return(__mmiowb_state.nesting_count) == 1)
> +               __this_cpu_write(__mmiowb_state.mmiowb_pending, 0);
> +}
> +#endif

The case we want to go fast is the spin-lock and unlock case, not the
"set pending" case.

And the way you implemented this, it's exactly the wrong way around.

So I'd suggest instead doing

  static inline void mmiowb_set_pending(void)
  {
      __this_cpu_write(__mmiowb_state.mmiowb_pending,
__mmiowb_state.nesting_count);
  }

and

  static inline void mmiowb_spin_lock(void)
  {
      __this_cpu_inc(__mmiowb_state.nesting_count);
  }

which makes that spin-lock code much simpler and avoids the conditional there.

Then the unlock case could be something like

  static inline void mmiowb_spin_unlock(void)
  {
      if (unlikely(__this_cpu_read(__mmiowb_state.mmiowb_pending))) {
          __this_cpu_write(__mmiowb_state.mmiowb_pending, 0);
          mmiowb();
      }
      __this_cpu_dec(__mmiowb_state.nesting_count);
  }

or something (xchg is generally much more expensive than read, and the
common case for spinlocks is that nobody did IO inside of it).

And we don't need atomicity, since even if there's an interrupt that
does more IO while we're in the spinlock, _those_ IO's don't need to
be serialized by that spinlock.

Hmm? Entirely untested, and maybe I have a thinko somewhere, but the
above would seem to be better for the common case that really matters.
No?
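
For concreteness, here is a stand-alone user-space model of the scheme
sketched above (plain variables stand in for the per-CPU fields and a
printf() stands in for mmiowb(); none of this is the kernel code):

  #include <stdio.h>

  static unsigned int nesting_count;   /* models __mmiowb_state.nesting_count */
  static unsigned int mmiowb_pending;  /* models __mmiowb_state.mmiowb_pending */

  static void mmiowb(void) { printf("mmiowb()\n"); }

  /* Called by the I/O write accessors. */
  static void mmiowb_set_pending(void) { mmiowb_pending = nesting_count; }

  /* Lock fast path: just bump the nesting count. */
  static void mmiowb_spin_lock(void) { nesting_count++; }

  static void mmiowb_spin_unlock(void)
  {
      if (mmiowb_pending) {    /* non-zero only if I/O happened under a lock */
          mmiowb_pending = 0;
          mmiowb();
      }
      nesting_count--;
  }

  int main(void)
  {
      mmiowb_spin_lock();
      mmiowb_set_pending();    /* an MMIO write inside the critical section */
      mmiowb_spin_unlock();    /* prints mmiowb() */

      mmiowb_spin_lock();
      mmiowb_spin_unlock();    /* no I/O inside: plain read, no barrier */
      return 0;
  }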

             Linus
Linus Torvalds Feb. 22, 2019, 9:55 p.m. UTC | #2
On Fri, Feb 22, 2019 at 1:49 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> The case we want to go fast is the spin-lock and unlock case, not the
> "set pending" case.
>
> And the way you implemented this, it's exactly the wrong way around.

Oh, one more comment: couldn't we make that mmiowb flag be right next
to the preemption count?

Because that's the common case anyway, where a spinlock increments the
preemption count too. If we put the mmiowb state in the same
cacheline, we don't cause extra cache effects, which is what really
matters, I guess.

I realize this is somewhat inconvenient, because some architectures
put preempt count in the thread structure, and others do it as a
percpu variable. But maybe the architecture could just declare where
the mmiowb state is?
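
One way an architecture could express that (a sketch only: the
__mmiowb_state() accessor below is an assumed name, not something in
this patch) is to let the arch declare where the state lives and have
the generic helpers go through an accessor:

  /* An arch that keeps its preempt count in thread_info could provide: */
  #define __mmiowb_state()	(&current_thread_info()->mmiowb_state)

  /* The generic helpers would then use the accessor, e.g.: */
  static inline void mmiowb_set_pending(void)
  {
      struct mmiowb_state *ms = __mmiowb_state();
      ms->mmiowb_pending = ms->nesting_count;
  }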

                 Linus
Will Deacon Feb. 26, 2019, 6:25 p.m. UTC | #3
On Fri, Feb 22, 2019 at 01:55:20PM -0800, Linus Torvalds wrote:
> On Fri, Feb 22, 2019 at 1:49 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > The case we want to go fast is the spin-lock and unlock case, not the
> > "set pending" case.
> >
> > And the way you implemented this, it's exactly the wrong way around.
>
> Oh, one more comment: couldn't we make that mmiowb flag be right next
> to the preemption count?
>
> Because that's the common case anyway, where a spinlock increments the
> preemption count too. If we put the mmiowb state in the same
> cacheline, we don't cause extra cache effects, which is what really
> matters, I guess.
>
> I realize this is somewhat inconvenient, because some architectures
> put preempt count in the thread structure, and others do it as a
> percpu variable. But maybe the architecture could just declare where
> the mmiowb state is?

I think that should be doable... I'll have a play.

Will
Will Deacon Feb. 26, 2019, 6:26 p.m. UTC | #4
Hi Linus,

Thanks for having a look.

On Fri, Feb 22, 2019 at 01:49:32PM -0800, Linus Torvalds wrote:
> On Fri, Feb 22, 2019 at 10:50 AM Will Deacon <will.deacon@arm.com> wrote:
> >
> > +#ifndef mmiowb_set_pending
> > +static inline void mmiowb_set_pending(void)
> > +{
> > +       __this_cpu_write(__mmiowb_state.mmiowb_pending, 1);
> > +}
> > +#endif
> > +
> > +#ifndef mmiowb_spin_lock
> > +static inline void mmiowb_spin_lock(void)
> > +{
> > +       if (__this_cpu_inc_return(__mmiowb_state.nesting_count) == 1)
> > +               __this_cpu_write(__mmiowb_state.mmiowb_pending, 0);
> > +}
> > +#endif
>
> The case we want to go fast is the spin-lock and unlock case, not the
> "set pending" case.
>
> And the way you implemented this, it's exactly the wrong way around.
>
> So I'd suggest instead doing
>
>   static inline void mmiowb_set_pending(void)
>   {
>       __this_cpu_write(__mmiowb_state.mmiowb_pending,
> __mmiowb_state.nesting_count);
>   }
>
> and
>
>   static inline void mmiowb_spin_lock(void)
>   {
>       __this_cpu_inc(__mmiowb_state.nesting_count);
>   }
>
> which makes that spin-lock code much simpler and avoids the conditional there.

Makes sense; I'll hook that up for the next version.

> Then the unlock case could be something like
>
>   static inline void mmiowb_spin_unlock(void)
>   {
>       if (unlikely(__this_cpu_read(__mmiowb_state.mmiowb_pending))) {
>           __this_cpu_write(__mmiowb_state.mmiowb_pending, 0);
>           mmiowb();
>       }
>       __this_cpu_dec(__mmiowb_state.nesting_count);
>   }
>
> or something (xchg is generally much more expensive than read, and the
> common case for spinlocks is that nobody did IO inside of it).

So I *am* using __this_cpu_xchg() here, which means the architecture can
get away with plain old loads and stores (which is what RISC-V does, for
example), but I see that's not the case on e.g. x86 so I'll rework using
read() and write() because it doesn't hurt.

Will
Peter Zijlstra Feb. 26, 2019, 6:33 p.m. UTC | #5
On Tue, Feb 26, 2019 at 06:26:24PM +0000, Will Deacon wrote:
> On Fri, Feb 22, 2019 at 01:49:32PM -0800, Linus Torvalds wrote:
>
> So I *am* using __this_cpu_xchg() here, which means the architecture can
> get away with plain old loads and stores (which is what RISC-V does, for
> example), but I see that's not the case on e.g. x86 so I'll rework using
> read() and write() because it doesn't hurt.

Right, so the problem on x86 is that XCHG has an implicit LOCK prefix,
so there is no !atomic variant. So even the cpu-local xchg gets the LOCK
prefix penalty, even though all we really want is a single
instruction.

Arguably we could fix that for __this_cpu_xchg(), which isn't IRQ-safe.
Linus Torvalds Feb. 26, 2019, 7:02 p.m. UTC | #6
On Tue, Feb 26, 2019 at 10:33 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> Arguably we could fix that for __this_cpu_xchg(), which isn't IRQ-safe.

Yeah, I guess x86 _should_ really do __this_cpu_xchg() as just a
read-write pair.

In general, a read-write pair is probably always the right thing to
do, and the only reason we can't just do it in an
architecture-independent way is that we'd want to avoid doing the
address generation twice (for architectures where that is an issue).

                  Linus
Peter Zijlstra Feb. 27, 2019, 10:19 a.m. UTC | #7
On Tue, Feb 26, 2019 at 11:02:50AM -0800, Linus Torvalds wrote:
> On Tue, Feb 26, 2019 at 10:33 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > Arguably we could fix that for __this_cpu_xchg(), which isn't IRQ-safe.
>
> Yeah, I guess x86 _should_ really do __this_cpu_xchg() as just a
> read-write pair.

See the patches I just sent.

> In general, a read-write pair is probably always the right thing to
> do, and the only reason we can't just do it in an
> architecture-independent way is that we'd want to avoid doing the
> address generation twice (for architectures where that is an issue).

The generic code has this right, see
include/asm-generic/percpu.h:raw_cpu_generic_xchg().

That is used in the majority of the architectures. With the patch I just
sent, x86 will use two gs-prefixed movs.
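
For reference, raw_cpu_generic_xchg() is roughly the following
read-then-write pair, with the address generated only once via
raw_cpu_ptr() (a sketch along the lines of the generic code, not the
exact kernel source):

  #define raw_cpu_generic_xchg(pcp, nval)			\
  ({								\
  	typeof(pcp) *__p = raw_cpu_ptr(&(pcp));			\
  	typeof(pcp) __ret = *__p;				\
  	*__p = nval;						\
  	__ret;							\
  })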

Patch

diff --git a/include/asm-generic/mmiowb.h b/include/asm-generic/mmiowb.h
new file mode 100644
index 000000000000..1cec8907806f
--- /dev/null
+++ b/include/asm-generic/mmiowb.h
@@ -0,0 +1,60 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_GENERIC_MMIOWB_H
+#define __ASM_GENERIC_MMIOWB_H
+
+/*
+ * Generic implementation of mmiowb() tracking for spinlocks.
+ *
+ * If your architecture doesn't ensure that writes to an I/O peripheral
+ * within two spinlocked sections on two different CPUs are seen by the
+ * peripheral in the order corresponding to the lock handover, then you
+ * need to follow these FIVE easy steps:
+ *
+ * 	1. Implement mmiowb() in asm/mmiowb.h and then #include this file
+ *	2. Ensure your I/O write accessors call mmiowb_set_pending()
+ *	3. Select ARCH_HAS_MMIOWB
+ *	4. Untangle the resulting mess of header files
+ *	5. Complain to your architects
+ */
+#if defined(CONFIG_ARCH_HAS_MMIOWB) && defined(CONFIG_SMP)
+
+#include <linux/types.h>
+#include <asm/percpu.h>
+#include <asm/smp.h>
+
+struct mmiowb_state {
+	u16	nesting_count;
+	u16	mmiowb_pending;
+};
+DECLARE_PER_CPU(struct mmiowb_state, __mmiowb_state);
+
+#ifndef mmiowb_set_pending
+static inline void mmiowb_set_pending(void)
+{
+	__this_cpu_write(__mmiowb_state.mmiowb_pending, 1);
+}
+#endif
+
+#ifndef mmiowb_spin_lock
+static inline void mmiowb_spin_lock(void)
+{
+	if (__this_cpu_inc_return(__mmiowb_state.nesting_count) == 1)
+		__this_cpu_write(__mmiowb_state.mmiowb_pending, 0);
+}
+#endif
+
+#ifndef mmiowb_spin_unlock
+static inline void mmiowb_spin_unlock(void)
+{
+	if (__this_cpu_xchg(__mmiowb_state.mmiowb_pending, 0))
+		mmiowb();
+	__this_cpu_dec_return(__mmiowb_state.nesting_count);
+}
+#endif
+
+#else
+#define mmiowb_set_pending()		do { } while (0)
+#define mmiowb_spin_lock()		do { } while (0)
+#define mmiowb_spin_unlock()		do { } while (0)
+#endif	/* CONFIG_ARCH_HAS_MMIOWB && CONFIG_SMP */
+#endif	/* __ASM_GENERIC_MMIOWB_H */
diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
index 84d882f3e299..04976ae41176 100644
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -248,3 +248,6 @@  config ARCH_USE_QUEUED_RWLOCKS
 config QUEUED_RWLOCKS
 	def_bool y if ARCH_USE_QUEUED_RWLOCKS
 	depends on SMP
+
+config ARCH_HAS_MMIOWB
+	bool
diff --git a/kernel/locking/spinlock.c b/kernel/locking/spinlock.c
index 936f3d14dd6b..cbae365d7dd1 100644
--- a/kernel/locking/spinlock.c
+++ b/kernel/locking/spinlock.c
@@ -22,6 +22,11 @@ 
 #include <linux/debug_locks.h>
 #include <linux/export.h>
 
+#ifdef CONFIG_ARCH_HAS_MMIOWB
+DEFINE_PER_CPU(struct mmiowb_state, __mmiowb_state);
+EXPORT_PER_CPU_SYMBOL(__mmiowb_state);
+#endif
+
 /*
  * If lockdep is enabled then we use the non-preemption spin-ops
  * even on CONFIG_PREEMPT, because lockdep assumes that interrupts are