[01/20] asm-generic/mmiowb: Add generic implementation of mmiowb() tracking

Message ID 20190301140348.25175-2-will.deacon@arm.com
State Superseded
Series Remove Mysterious Macro Intended to Obscure Weird Behaviours (mmiowb())

Commit Message

Will Deacon March 1, 2019, 2:03 p.m. UTC
In preparation for removing all explicit mmiowb() calls from driver
code, implement a tracking system in asm-generic based loosely on the
PowerPC implementation. This allows architectures with a non-empty
mmiowb() definition to have the barrier automatically inserted in
spin_unlock() following a critical section containing an I/O write.

Signed-off-by: Will Deacon <will.deacon@arm.com>

---
 include/asm-generic/mmiowb.h       | 63 ++++++++++++++++++++++++++++++++++++++
 include/asm-generic/mmiowb_types.h | 12 ++++++++
 kernel/Kconfig.locks               |  7 +++++
 kernel/locking/spinlock.c          |  7 +++++
 4 files changed, 89 insertions(+)
 create mode 100644 include/asm-generic/mmiowb.h
 create mode 100644 include/asm-generic/mmiowb_types.h

-- 
2.11.0

Comments

Nicholas Piggin March 3, 2019, 1:43 a.m. UTC | #1
Will Deacon's on March 2, 2019 12:03 am:
> In preparation for removing all explicit mmiowb() calls from driver
> code, implement a tracking system in asm-generic based loosely on the
> PowerPC implementation. This allows architectures with a non-empty
> mmiowb() definition to have the barrier automatically inserted in
> spin_unlock() following a critical section containing an I/O write.

Is there a reason to call this "mmiowb"? We already have wmb that
orders cacheable stores vs mmio stores don't we?

Yes ia64 "sn2" is broken in that case, but that can be fixed (if
anyone really cares about the platform any more). Maybe that's
orthogonal to what you're doing here, I just don't like seeing
"mmiowb" spread.

This series works for spin locks, but you would want a driver to
be able to use wmb() to order locks vs mmio when using a bit lock
or a mutex or whatever else. Calling your wmb-if-io-is-pending
version io_mb_before_unlock() would kind of match with existing
patterns.

> +static inline void mmiowb_set_pending(void)
> +{
> +	struct mmiowb_state *ms = __mmiowb_state();
> +	ms->mmiowb_pending = ms->nesting_count;
> +}
> +
> +static inline void mmiowb_spin_lock(void)
> +{
> +	struct mmiowb_state *ms = __mmiowb_state();
> +	ms->nesting_count++;
> +}
> +
> +static inline void mmiowb_spin_unlock(void)
> +{
> +	struct mmiowb_state *ms = __mmiowb_state();
> +
> +	if (unlikely(ms->mmiowb_pending)) {
> +		ms->mmiowb_pending = 0;
> +		mmiowb();
> +	}
> +
> +	ms->nesting_count--;
> +}

Humour me for a minute and tell me what this algorithm is doing, or
what was broken about the powerpc one, which is basically:

static inline void mmiowb_set_pending(void)
{
	struct mmiowb_state *ms = __mmiowb_state();
	ms->mmiowb_pending = 1;
}

static inline void mmiowb_spin_lock(void)
{
}

static inline void mmiowb_spin_unlock(void)
{
	struct mmiowb_state *ms = __mmiowb_state();

	if (unlikely(ms->mmiowb_pending)) {
		ms->mmiowb_pending = 0;
		mmiowb();
	}
}

> diff --git a/include/asm-generic/mmiowb_types.h b/include/asm-generic/mmiowb_types.h
> new file mode 100644
> index 000000000000..8eb0095655e7
> --- /dev/null
> +++ b/include/asm-generic/mmiowb_types.h
> @@ -0,0 +1,12 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef __ASM_GENERIC_MMIOWB_TYPES_H
> +#define __ASM_GENERIC_MMIOWB_TYPES_H
> +
> +#include <linux/types.h>
> +
> +struct mmiowb_state {
> +	u16	nesting_count;
> +	u16	mmiowb_pending;
> +};

Really need more than 255 nested spin locks? I had the idea that 16
bit operations were a bit more costly than 8 bit on some CPUs... may
not be true, but at least the smaller size packs a bit better on
powerpc.

Thanks,
Nick
Linus Torvalds March 3, 2019, 2:18 a.m. UTC | #2
On Sat, Mar 2, 2019 at 5:43 PM Nicholas Piggin <npiggin@gmail.com> wrote:
>
> Is there a reason to call this "mmiowb"? We already have wmb that
> orders cacheable stores vs mmio stores don't we?

Sadly no it doesn't. Not on ia64, and people tried to make that the
new rule because of the platform breakage on what some people thought
would be a major platform.

Plain wmb() was only guaranteed to order regular memory against each
other (mostly useful for dma) on some of these platforms, because they
had such broken IO synchronization.

So mmiowb() is not a new name. It's been around for a while, and the
people who wanted it have happily become irrelevant. Will is making it
go away, but the name remains for historical reasons, even if Will's
new acronym explanation for the name is much better ;)

                Linus
Nicholas Piggin March 3, 2019, 3:34 a.m. UTC | #3
Linus Torvalds's on March 3, 2019 12:18 pm:
> On Sat, Mar 2, 2019 at 5:43 PM Nicholas Piggin <npiggin@gmail.com> wrote:
>>
>> Is there a reason to call this "mmiowb"? We already have wmb that
>> orders cacheable stores vs mmio stores don't we?
>
> Sadly no it doesn't. Not on ia64, and people tried to make that the
> new rule because of the platform breakage on what some people thought
> would be a major platform.

Let me try this again, because I was babbling a train of thought 
continuing from my past mails on the subject.

  Kill mmiowb with fire.

It was added for a niche platform that hasn't been produced for 10
years for a CPU ISA that is no longer being developed. Let's make mb/wmb
great again (aka actually possible for normal people to understand).

If something comes along again that reorders mmios from different CPUs 
in the IO controller like the Altix did, they implement wmb the slow and 
correct way. They can add a new faster primitive for the few devices 
they care about in the couple of perf critical places that matter.

It doesn't have to be done all at once with this series, obviously this 
is a big improvement on its own. But why perpetuate the nomenclature
and concept for new code added now? 

Thanks,
Nick
Michael Ellerman March 3, 2019, 9:26 a.m. UTC | #4
Nicholas Piggin <npiggin@gmail.com> writes:
> Will Deacon's on March 2, 2019 12:03 am:
>> In preparation for removing all explicit mmiowb() calls from driver
>> code, implement a tracking system in asm-generic based loosely on the
>> PowerPC implementation. This allows architectures with a non-empty
>> mmiowb() definition to have the barrier automatically inserted in
>> spin_unlock() following a critical section containing an I/O write.
>
> Is there a reason to call this "mmiowb"? We already have wmb that
> orders cacheable stores vs mmio stores don't we?
>
> Yes ia64 "sn2" is broken in that case, but that can be fixed (if
> anyone really cares about the platform any more). Maybe that's
> orthogonal to what you're doing here, I just don't like seeing
> "mmiowb" spread.
>
> This series works for spin locks, but you would want a driver to
> be able to use wmb() to order locks vs mmio when using a bit lock
> or a mutex or whatever else. Calling your wmb-if-io-is-pending
> version io_mb_before_unlock() would kind of match with existing
> patterns.
>
>> +static inline void mmiowb_set_pending(void)
>> +{
>> +	struct mmiowb_state *ms = __mmiowb_state();
>> +	ms->mmiowb_pending = ms->nesting_count;
>> +}
>> +
>> +static inline void mmiowb_spin_lock(void)
>> +{
>> +	struct mmiowb_state *ms = __mmiowb_state();
>> +	ms->nesting_count++;
>> +}
>> +
>> +static inline void mmiowb_spin_unlock(void)
>> +{
>> +	struct mmiowb_state *ms = __mmiowb_state();
>> +
>> +	if (unlikely(ms->mmiowb_pending)) {
>> +		ms->mmiowb_pending = 0;
>> +		mmiowb();
>> +	}
>> +
>> +	ms->nesting_count--;
>> +}
>
> Humour me for a minute and tell me what this algorithm is doing, or
> what was broken about the powerpc one, which is basically:
>
> static inline void mmiowb_set_pending(void)
> {
> 	struct mmiowb_state *ms = __mmiowb_state();
> 	ms->mmiowb_pending = 1;
> }
>
> static inline void mmiowb_spin_lock(void)
> {
> }

The current powerpc code clears io_sync in spin_lock().

ie, it would be equivalent to:

static inline void mmiowb_spin_lock(void)
{
 	ms->mmiowb_pending = 0;
}

Which means that:

	spin_lock(a);
        writel(x, y);
	spin_lock(b);
        ...
	spin_unlock(b);
	spin_unlock(a);

Does no barrier.

cheers
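
A stand-alone model of the asm-generic tracking from the patch (see the diff at
the bottom of this page), run against the nested-lock sequence above; the lock
names and printed messages are illustrative, not part of the thread. Because
mmiowb_spin_lock() only bumps a nesting count and never clears the pending
flag, the inner unlock still issues the barrier:

#include <stdio.h>

struct mmiowb_state {
	unsigned short nesting_count;
	unsigned short mmiowb_pending;
};

static struct mmiowb_state ms;		/* stands in for the per-CPU state */

static void model_spin_lock(void)
{
	ms.nesting_count++;			/* mmiowb_spin_lock() */
}

static void model_writel(const char *what)
{
	printf("writel(%s)\n", what);
	ms.mmiowb_pending = ms.nesting_count;	/* mmiowb_set_pending() */
}

static void model_spin_unlock(const char *lock)
{
	if (ms.mmiowb_pending) {		/* mmiowb_spin_unlock() */
		ms.mmiowb_pending = 0;
		printf("spin_unlock(%s): mmiowb()\n", lock);
	} else {
		printf("spin_unlock(%s): no barrier\n", lock);
	}
	ms.nesting_count--;
}

int main(void)
{
	model_spin_lock();	/* spin_lock(a) */
	model_writel("x, y");
	model_spin_lock();	/* spin_lock(b): the pending flag is not cleared */
	model_spin_unlock("b");	/* pending set, so mmiowb() is issued here */
	model_spin_unlock("a");	/* already flushed, so no barrier */
	return 0;
}
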
Nicholas Piggin March 3, 2019, 10:05 a.m. UTC | #5
Linus Torvalds's on March 3, 2019 2:29 pm:
> On Sat, Mar 2, 2019, 19:34 Nicholas Piggin <npiggin@gmail.com> wrote:
>
>>
>> It doesn't have to be done all at once with this series, obviously this
>> is a big improvement on its own. But why perpetuate the nomenclature
>> and concept for new code added now?
>>
>
> What nomenclature?
>
> Nobody will be using mmiowb(). That's the whole point of the patch series.
>
> It's now an entirely internal name, and nobody cares.

Why even bother with it at all, "internal" or not?  Just get rid of 
mmiowb, the concept is obsolete.

> And none of this has anything to do with wmb(), since it's about IO being
> ordered across cpu's by spin locks, not by barriers.
>
> So I'm not seeing what you're arguing about.

Pretend ia64 doesn't exist for a minute. Now the regular mb/wmb barriers 
orders IO across CPUs with respect to their cacheable accesses.  
Regardless of whether that cacheable access is a spin lock, a bit lock, 
an atomic, a mutex... This is how it was before mmiowb came along.

Nothing wrong with this series to make spinlocks order mmio, but why 
call it mmiowb? Another patch could rename ia64's mmiowb and then the
name can be removed from the tree completely.

Thanks,
Nick
Nicholas Piggin March 3, 2019, 10:07 a.m. UTC | #6
Michael Ellerman's on March 3, 2019 7:26 pm:
> Nicholas Piggin <npiggin@gmail.com> writes:
>> Will Deacon's on March 2, 2019 12:03 am:
>>> In preparation for removing all explicit mmiowb() calls from driver
>>> code, implement a tracking system in asm-generic based loosely on the
>>> PowerPC implementation. This allows architectures with a non-empty
>>> mmiowb() definition to have the barrier automatically inserted in
>>> spin_unlock() following a critical section containing an I/O write.
>>
>> Is there a reason to call this "mmiowb"? We already have wmb that
>> orders cacheable stores vs mmio stores don't we?
>>
>> Yes ia64 "sn2" is broken in that case, but that can be fixed (if
>> anyone really cares about the platform any more). Maybe that's
>> orthogonal to what you're doing here, I just don't like seeing
>> "mmiowb" spread.
>>
>> This series works for spin locks, but you would want a driver to
>> be able to use wmb() to order locks vs mmio when using a bit lock
>> or a mutex or whatever else. Calling your wmb-if-io-is-pending
>> version io_mb_before_unlock() would kind of match with existing
>> patterns.
>>
>>> +static inline void mmiowb_set_pending(void)
>>> +{
>>> +	struct mmiowb_state *ms = __mmiowb_state();
>>> +	ms->mmiowb_pending = ms->nesting_count;
>>> +}
>>> +
>>> +static inline void mmiowb_spin_lock(void)
>>> +{
>>> +	struct mmiowb_state *ms = __mmiowb_state();
>>> +	ms->nesting_count++;
>>> +}
>>> +
>>> +static inline void mmiowb_spin_unlock(void)
>>> +{
>>> +	struct mmiowb_state *ms = __mmiowb_state();
>>> +
>>> +	if (unlikely(ms->mmiowb_pending)) {
>>> +		ms->mmiowb_pending = 0;
>>> +		mmiowb();
>>> +	}
>>> +
>>> +	ms->nesting_count--;
>>> +}
>>
>> Humour me for a minute and tell me what this algorithm is doing, or
>> what was broken about the powerpc one, which is basically:
>>
>> static inline void mmiowb_set_pending(void)
>> {
>> 	struct mmiowb_state *ms = __mmiowb_state();
>> 	ms->mmiowb_pending = 1;
>> }
>>
>> static inline void mmiowb_spin_lock(void)
>> {
>> }
>
> The current powerpc code clears io_sync in spin_lock().
>
> ie, it would be equivalent to:
>
> static inline void mmiowb_spin_lock(void)
> {
> 	ms->mmiowb_pending = 0;
> }

Ah okay that's what I missed. How about we just not do that?

Thanks,
Nick
Linus Torvalds March 3, 2019, 6:48 p.m. UTC | #7
On Sun, Mar 3, 2019 at 2:05 AM Nicholas Piggin <npiggin@gmail.com> wrote:
>
> Why even bother with it at all, "internal" or not?  Just get rid of
> mmiowb, the concept is obsolete.

It *is* gone, for chrissake!  Only the name remains as an internal
detail of "this is what we need to do".

> Pretend ia64 doesn't exist for a minute. Now the regular mb/wmb barriers
> orders IO across CPUs with respect to their cacheable accesses.

Stop with the total red herring already.

THIS HAS NOTHING TO DO WITH mb()/wmb().

As long as you keep bringing those up, you're only showing that you're
talking about the wrong thing.

> Regardless of whether that cacheable access is a spin lock, a bit lock,
> an atomic, a mutex... This is how it was before mmiowb came along.

No.

Before mmiowb() came along, there was one rule: do what x86 does.

And x86 orders mmio inside spinlocks.

Seriously.

Notice how there's not a single "barrier" mentioned here anywhere in
the above. No "mb()", no "wmb()", no nothing. Only "spinlocks order
IO".

That's the fundamental rule (that we broke for ia64), and all that
matters for this patch series.

Stop talking about wmb(). It's irrelevant. A spinlock does not
*contain* a wmb().

Nobody even _cares_ about wmb(). They are entirely irrelevant wrt IO,
because IO is ordered on any particular CPU anyway (which is what
wmb() enforces).

Only when you do special things like __raw_writel() etc does wmb()
matter, but at that point this whole series is entirely irrelevant,
and once again, that's still about just ordering on a single CPU.

So as long as you talk about wmb(), all you show is that you're
talking about something entirely different FROM THIS WHOLE SERIES.

And like it or not, ia64 still exists. We support it. It doesn't
_matter_ and we don't much care any more, but it still exists. Which
is why we have that concept of mmiowb().

On other platforms, mmiowb() might be a wmb(). Or it might not. It
might be some other barrier, or it might be a no-op entirely without a
barrier at all. It doesn't matter. But mmiowb() exists, and is now
happily entirely hidden inside the rule of "spinlocks order MMIO
across CPU's".

                 Linus
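
A hedged sketch of the driver-visible pattern this rule describes (the device
structure, register offset and function name are hypothetical, not from the
series): once spinlocks order MMIO across CPUs, a driver can rely on the lock
alone, with no explicit mmiowb() or wmb() around the write.

#include <linux/io.h>
#include <linux/spinlock.h>

#define FOO_CMD	0x10			/* hypothetical register offset */

struct foo_dev {
	spinlock_t lock;
	void __iomem *regs;
};

static void foo_send_cmd(struct foo_dev *fd, u32 cmd)
{
	unsigned long flags;

	spin_lock_irqsave(&fd->lock, flags);
	/* Ordered against the next lock holder's MMIO by the unlock below. */
	writel(cmd, fd->regs + FOO_CMD);
	spin_unlock_irqrestore(&fd->lock, flags);
}
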
Michael Ellerman March 4, 2019, 1:01 a.m. UTC | #8
Nicholas Piggin <npiggin@gmail.com> writes:
> Michael Ellerman's on March 3, 2019 7:26 pm:
>> Nicholas Piggin <npiggin@gmail.com> writes:
...
>>> what was broken about the powerpc one, which is basically:
>>>
>>> static inline void mmiowb_set_pending(void)
>>> {
>>> 	struct mmiowb_state *ms = __mmiowb_state();
>>> 	ms->mmiowb_pending = 1;
>>> }
>>>
>>> static inline void mmiowb_spin_lock(void)
>>> {
>>> }
>>
>> The current powerpc code clears io_sync in spin_lock().
>>
>> ie, it would be equivalent to:
>>
>> static inline void mmiowb_spin_lock(void)
>> {
>> 	ms->mmiowb_pending = 0;
>> }
>
> Ah okay that's what I missed. How about we just not do that?

Yeah I thought of that too but it's not great. We'd start semi-randomly
executing the sync in unlock depending on whether someone had done IO on
that CPU prior to the spinlock.

eg.

	writel(x, y);		// sets paca->io_sync
	...	

	<schedule>

	spin_lock(a);
        ...
        // No IO in here
        ...
        spin_unlock(a);		// sync() here because other task did writel().


Which wouldn't be *incorrect*, but would be kind of weird.

cheers
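
A stand-alone model of how the asm-generic code in the patch below handles
exactly this "writel() before the lock" sequence (illustrative names, not from
the thread): mmiowb_set_pending() records nesting_count, which is zero when no
lock is held, so the later unrelated spin_unlock() sees no pending flag and
issues no barrier.

#include <stdio.h>

static unsigned short nesting_count, mmiowb_pending;

static void model_writel(void)
{
	mmiowb_pending = nesting_count;	/* 0 here: no lock is held */
}

static void model_spin_lock(void)
{
	nesting_count++;
}

static void model_spin_unlock(void)
{
	if (mmiowb_pending) {
		mmiowb_pending = 0;
		puts("spin_unlock: mmiowb()");
	} else {
		puts("spin_unlock: no barrier");
	}
	nesting_count--;
}

int main(void)
{
	model_writel();		/* writel(x, y) outside any lock */
	/* <schedule> ... */
	model_spin_lock();	/* spin_lock(a) */
	/* no I/O in here */
	model_spin_unlock();	/* prints "no barrier" */
	return 0;
}
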
Michael Ellerman March 4, 2019, 10:24 a.m. UTC | #9
Nicholas Piggin <npiggin@gmail.com> writes:
> Will Deacon's on March 2, 2019 12:03 am:
>> In preparation for removing all explicit mmiowb() calls from driver
>> code, implement a tracking system in asm-generic based loosely on the
>> PowerPC implementation. This allows architectures with a non-empty
>> mmiowb() definition to have the barrier automatically inserted in
>> spin_unlock() following a critical section containing an I/O write.
>
> Is there a reason to call this "mmiowb"? We already have wmb that
> orders cacheable stores vs mmio stores don't we?
>
> Yes ia64 "sn2" is broken in that case, but that can be fixed (if
> anyone really cares about the platform any more). Maybe that's
> orthogonal to what you're doing here, I just don't like seeing
> "mmiowb" spread.
>
> This series works for spin locks, but you would want a driver to
> be able to use wmb() to order locks vs mmio when using a bit lock
> or a mutex or whatever else.

Without wading into the rest of the discussion, this does raise an
interesting point, ie. what about eg. rwlock's?

They're basically equivalent to spinlocks, and so could reasonably be
expected to have the same behaviour.

But we don't check the io_sync flag in arch_read/write_unlock() etc. and
both of those use lwsync.

Seems like we just forgot they existed? Commit f007cacffc88 ("[POWERPC]
Fix MMIO ops to provide expected barrier behaviour") that added the
io_sync stuff doesn't mention them at all.

Am I missing anything? AFAICS read/write locks were never built on top
of spin locks, so seems like we're just hoping drivers using rwlock do
the right barriers?

cheers
Linus Torvalds March 5, 2019, 12:19 a.m. UTC | #10
On Mon, Mar 4, 2019 at 2:24 AM Michael Ellerman <mpe@ellerman.id.au> wrote:
>
> Without wading into the rest of the discussion, this does raise an
> interesting point, ie. what about eg. rwlock's?
>
> They're basically equivalent to spinlocks, and so could reasonably be
> expected to have the same behaviour.
>
> But we don't check the io_sync flag in arch_read/write_unlock() etc. and
> both of those use lwsync.

I think technically rwlocks should do the same thing, at least when
they are used for exclusion.

Because of the exclusion argument, we can presumably limit it to just
write_unlock(), although at least in theory I guess you could have
some "one reader does IO, then a writer comes in" situation..

Perhaps more importantly, what about sleeping locks? When they
actually *block*, they get the barrier thanks to the scheduler, but
you can have a nice non-contended sequence that never does that.

I guess the fact that these cases have never even shown up as an issue
means that we could just continue to ignore it.

We could even give that approach some fancy name, and claim it as a
revolutionary new programming paradigm ("ostrich programming" to go
with "agile" and "pair programming").

                    Linus
Nicholas Piggin March 5, 2019, 12:21 a.m. UTC | #11
Linus Torvalds's on March 4, 2019 4:48 am:
> On Sun, Mar 3, 2019 at 2:05 AM Nicholas Piggin <npiggin@gmail.com> wrote:
>>
>> Why even bother with it at all, "internal" or not?  Just get rid of
>> mmiowb, the concept is obsolete.
>
> It *is* gone, for chrissake!  Only the name remains as an internal
> detail of "this is what we need to do".
>
>> Pretend ia64 doesn't exist for a minute. Now the regular mb/wmb barriers
>> orders IO across CPUs with respect to their cacheable accesses.
>
> Stop with the total red herring already.
>
> THIS HAS NOTHING TO DO WITH mb()/wmb().
>
> As long as you keep bringing those up, you're only showing that you're
> talking about the wrong thing.

Why? I'm talking about them because they are not taken care of by this 
part of mmiowb removal. Talking about spin locks is the wrong thing
because we're already past that and everybody agrees it's the right
approach.

>> Regardless of whether that cacheable access is a spin lock, a bit lock,
>> an atomic, a mutex... This is how it was before mmiowb came along.
>
> No.
>
> Before mmiowb() came along, there was one rule: do what x86 does.
>
> And x86 orders mmio inside spinlocks.
>
> Seriously.
>
> Notice how there's not a single "barrier" mentioned here anywhere in
> the above. No "mb()", no "wmb()", no nothing. Only "spinlocks order
> IO".
>
> That's the fundamental rule (that we broke for ia64), and all that
> matters for this patch series.
>
> Stop talking about wmb(). It's irrelevant. A spinlock does not
> *contain* a wmb().

Well you don't have to talk about it but why do you want me to stop?
I don't understand. It's an open topic still after this series. I
can post a new thread about it if that would upset you less, I just
thought it would kind of fit here because we're talking about mmiowb,
I'm not trying to derail this series.

> Nobody even _cares_ about wmb(). They are entirely irrelevant wrt IO,
> because IO is ordered on any particular CPU anyway (which is what
> wmb() enforces).
>
> Only when you do special things like __raw_writel() etc does wmb()
> matter, but at that point this whole series is entirely irrelevant,
> and once again, that's still about just ordering on a single CPU.
>
> So as long as you talk about wmb(), all you show is that you're
> talking about something entirely different FROM THIS WHOLE SERIES.
>
> And like it or not, ia64 still exists. We support it. It doesn't
> _matter_ and we don't much care any more, but it still exists. Which
> is why we have that concept of mmiowb().
>
> On other platforms, mmiowb() might be a wmb(). Or it might not. It
> might be some other barrier, or it might be a no-op entirely without a
> barrier at all. It doesn't matter. But mmiowb() exists, and is now
> happily entirely hidden inside the rule of "spinlocks order MMIO
> across CPU's".

The driver writer still has to know exactly as much about mmiowb
(the concept, if not the name) before this series as afterward. That
is, sequences of mmio stores to a device from different CPUs can only
be atomic if you (put mmiowb before spin unlock | protect them with
spin locks).

I just don't understand the reason to expose the driver writer to
that additional detail. Intuitively, mb() should order stores to
all kind of memory the same as smp_mb() orders stores to cacheable
(without the detail of stores being reordered at the interconnect
or controller -- driver writer doesn't care about store queues in
the CPU or whatever details, they want the device to see IOs in
some order).

Thanks,
Nick
Nicholas Piggin March 5, 2019, 12:21 a.m. UTC | #12
Michael Ellerman's on March 4, 2019 11:01 am:
> Nicholas Piggin <npiggin@gmail.com> writes:
>> Michael Ellerman's on March 3, 2019 7:26 pm:
>>> Nicholas Piggin <npiggin@gmail.com> writes:
> ...
>>>> what was broken about the powerpc one, which is basically:
>>>>
>>>> static inline void mmiowb_set_pending(void)
>>>> {
>>>> 	struct mmiowb_state *ms = __mmiowb_state();
>>>> 	ms->mmiowb_pending = 1;
>>>> }
>>>>
>>>> static inline void mmiowb_spin_lock(void)
>>>> {
>>>> }
>>>
>>> The current powerpc code clears io_sync in spin_lock().
>>>
>>> ie, it would be equivalent to:
>>>
>>> static inline void mmiowb_spin_lock(void)
>>> {
>>> 	ms->mmiowb_pending = 0;
>>> }
>>
>> Ah okay that's what I missed. How about we just not do that?
>
> Yeah I thought of that too but it's not great. We'd start semi-randomly
> executing the sync in unlock depending on whether someone had done IO on
> that CPU prior to the spinlock.
>
> eg.
>
> 	writel(x, y);		// sets paca->io_sync
> 	...
>
> 	<schedule>
>
> 	spin_lock(a);
>         ...
>         // No IO in here
>         ...
>         spin_unlock(a);		// sync() here because other task did writel().
>
>
> Which wouldn't be *incorrect*, but would be kind of weird.

schedule is probably okay, we could clear pending there. But you
possibly could get interrupts, or some lock free mmios that set the
flag. Does it matter that much? A random cache miss could have the
same effect.

It may matter slightly less for powerpc because we don't inline
spin locks, although I have been hoping to for a while, this might
put the nail in that.

We can always tinker with it later though so I won't insist.

Thanks,
Nick
Linus Torvalds March 5, 2019, 12:33 a.m. UTC | #13
On Mon, Mar 4, 2019 at 4:21 PM Nicholas Piggin <npiggin@gmail.com> wrote:
>
> Well you don't have to talk about it but why do you want me to stop?
> I don't understand. It's an open topic still after this series. I
> can post a new thread about it if that would upset you less, I just
> thought it would kind of fit here because we're talking about mmiowb,
> I'm not trying to derail this series.

Because if anybody is doing lockless programming with IO, they deserve
whatever they get.

In other words, the whole "wmb()" issue is basically not an issue.

We already have rules like:

 - mmio is ordered wrt itself

 - mmio is ordered wrt previous memory ops (because of dma)

and while it turned out that at least alpha had broken those rules at
some point, and we had a discussion about it, that was just a bug.

So there's basically no real reason to ever use "wmb()" with any of
the normal mmio stuff.

Now, we do have __raw_writel() etc, which are explicitly not ordered,
but they also haven't been really standardized. And in fact, people
who use them often seem to want to use them together with various weak
memory remappings.

And yes, "wmb()" has been the traditional way to order those, to the
point where "wmb()" on x86 is actually a "sfence" because once you do
IO on those kinds of unordered mappings, the usual SMP rules go out
the window (a normal "smp_wmb()" is just a compiler barrier on x86).

But notice how this is *entirely* independent of the spinlock issue,
and has absolutely *nothing* to do with the whole mmiowb() thing that
was always about "normal IO vs normal RAM" (due to the ia64 breakage).

                 Linus
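
A sketch of the distinction being drawn here, using a hypothetical doorbell
register and descriptor (none of these names are from the thread): with the
ordinary writel() the accessor already orders the prior cacheable store,
whereas a relaxed __raw_writel() gives that up, so an explicit wmb() comes back
into the picture; and that is a single-CPU ordering question, separate from the
cross-CPU spinlock/mmiowb() one.

#include <linux/io.h>

#define DESC_READY	1		/* hypothetical descriptor flag */

struct foo_desc {
	u32 flags;
};

static void foo_ring_doorbell(void __iomem *db, struct foo_desc *d, u32 val)
{
	d->flags = DESC_READY;		/* cacheable write the device will DMA */
	writel(val, db);		/* writel() orders the store above for us */
}

static void foo_ring_doorbell_raw(void __iomem *db, struct foo_desc *d, u32 val)
{
	d->flags = DESC_READY;
	wmb();				/* needed: the raw accessor has no ordering */
	__raw_writel(val, db);
}
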
Michael Ellerman March 7, 2019, 12:47 a.m. UTC | #14
Linus Torvalds <torvalds@linux-foundation.org> writes:
> On Mon, Mar 4, 2019 at 2:24 AM Michael Ellerman <mpe@ellerman.id.au> wrote:
>>
>> Without wading into the rest of the discussion, this does raise an
>> interesting point, ie. what about eg. rwlock's?
>>
>> They're basically equivalent to spinlocks, and so could reasonably be
>> expected to have the same behaviour.
>>
>> But we don't check the io_sync flag in arch_read/write_unlock() etc. and
>> both of those use lwsync.
>
> I think technically rwlocks should do the same thing, at least when
> they are used for exclusion.

OK.

> Because of the exclusion argument, we can presumably limit it to just
> write_unlock(), although at least in theory I guess you could have
> some "one reader does IO, then a writer comes in" situation..

It's a bit hard to grep for, but I did find one case:

static void netxen_nic_io_write_128M(struct netxen_adapter *adapter,
                void __iomem *addr, u32 data)
{
        read_lock(&adapter->ahw.crb_lock);
        writel(data, addr);
        read_unlock(&adapter->ahw.crb_lock);
}

It looks like that driver is using the rwlock to exclude cases that can
just do a readl()/writel() (readers) vs another case that has to reconfigure a
window or something, before doing readl()/writel() and then configuring
the window back. So that seems like a valid use for a rwlock.

Whether we want to penalise all read_unlock() usages with a mmiowb()
check just to support that one driver is another question.

> > Perhaps more importantly, what about sleeping locks? When they
> > actually *block*, they get the barrier thanks to the scheduler, but
> > you can have a nice non-contended sequence that never does that.

Yeah.

The mutex unlock fast path is just:

	if (atomic_long_cmpxchg_release(&lock->owner, curr, 0UL) == curr)
		return true;

And because it's the "release" variant we just use lwsync, which doesn't
order MMIO. If it was just atomic_long_cmpxchg() that would work because
we use sync for those.

__up_write() uses atomic_long_sub_return_release(), so same story.

> I guess the fact that these cases have never even shown up as an issue
> means that we could just continue to ignore it.
>
> We could even give that approach some fancy name, and claim it as a
> revolutionary new programming paradigm ("ostrich programming" to go
> with "agile" and "pair programming").

Maybe. On power we have the double whammy of weaker ordering than
other arches and infinitesimal market share, which makes me worry that
there are bugs lurking that we just haven't found, it's happened before.

cheers
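
To make the sleeping-lock gap concrete, a hedged sketch of the pattern being
discussed (device, register and function names are made up): only spin_unlock()
grows the mmiowb() hook in this series, so MMIO serialised by a mutex on an
architecture that needs the barrier would, in principle, still want it spelled
out before the unlock, unless the "ostrich" approach above is taken.

#include <linux/io.h>
#include <linux/mutex.h>

struct bar_dev {
	struct mutex lock;
	void __iomem *regs;
};

static void bar_write_reg(struct bar_dev *bd, u32 off, u32 val)
{
	mutex_lock(&bd->lock);
	writel(val, bd->regs + off);
	/*
	 * mmiowb() would still be wanted here on an affected platform:
	 * the uncontended mutex_unlock() fast path is only a release
	 * (lwsync on powerpc) and does not order the MMIO above.
	 */
	mutex_unlock(&bd->lock);
}
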
Linus Torvalds March 7, 2019, 1:13 a.m. UTC | #15
On Wed, Mar 6, 2019 at 4:48 PM Michael Ellerman <mpe@ellerman.id.au> wrote:
>
> It's a bit hard to grep for, but I did find one case:
>
> static void netxen_nic_io_write_128M(struct netxen_adapter *adapter,
>                 void __iomem *addr, u32 data)
> {
>         read_lock(&adapter->ahw.crb_lock);
>         writel(data, addr);
>         read_unlock(&adapter->ahw.crb_lock);
> }
>
> It looks like that driver is using the rwlock to exclude cases that can
> just do a readl()/writel() (readers) vs another case that has to reconfigure a
> window or something, before doing readl()/writel() and then configuring
> the window back. So that seems like a valid use for a rwlock.

Oh, it's actually fairly sane: the IO itself is apparently windowed on
that hardware, and the *common* case is that you'd access "window 1".

So if everybody accesses window 1, they can all work in parallel - the
read case.

But if somebody needs to access any of the other special IO windows,
they need to take the write lock, then change the window pointer to
the window they want to access, do the access, and then set it back to
the default "window 1".

So yes. That driver very much relies on exclusion of the IO through an rwlock.

I'm guessing nobody uses that hardware on Power? Or maybe the "window
1 is common" is *so* common that the other cases basically never
happen and don't really end up ever causing problems?

[ Time passes, I look at it ]

Actually, the driver probably works on Power, because *setting* the
window isn't just a write to the window register, it's always
serialized by a read _from_ the window register to verify that the
write "took". Apparently the hardware itself really needs that "don't
do accesses to the window before I've settled".

And that would basically serialize the only operation that really
needs serialization, so in the case of _that_ driver, it all looks
safe. Even if it's partly by luck.
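
A rough sketch of the "set the window, then read it back" idiom being described
(register offset and function name are invented, not taken from the netxen
driver): the read-back confirms the switch took effect and, as a side effect,
serialises the posted window write before any access through the new window.

#include <linux/io.h>

#define CRB_WINDOW_REG	0x0		/* hypothetical window-select register */

static void crb_set_window(void __iomem *base, u32 window)
{
	writel(window, base + CRB_WINDOW_REG);
	/* Spin until the hardware reports the new window has settled. */
	while (readl(base + CRB_WINDOW_REG) != window)
		;
}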

> > Perhaps more importantly, what about sleeping locks? When they
> > actually *block*, they get the barrier thanks to the scheduler, but
> > you can have a nice non-contended sequence that never does that.
>
> Yeah.
>
> The mutex unlock fast path is just:

Yup. Both lock/unlock have fast paths that should be just trivial
atomic sequences.

But the good news is that *usually* device IO is protected by a
spinlock, since you almost always have interrupt serialization needs
too whenever you have any sequence of MMIO that isn't just some "write
single word to start the engine".

So the "use mutex to serialize IO" may be fairly unusual.

                  Linus
Peter Zijlstra March 7, 2019, 9:13 a.m. UTC | #16
On Thu, Mar 07, 2019 at 11:47:53AM +1100, Michael Ellerman wrote:
> The mutex unlock fast path is just:
>
> 	if (atomic_long_cmpxchg_release(&lock->owner, curr, 0UL) == curr)
> 		return true;
>
> And because it's the "release" variant we just use lwsync, which doesn't
> order MMIO. If it was just atomic_long_cmpxchg() that would work because
> we use sync for those.
>
> __up_write() uses atomic_long_sub_return_release(), so same story.

As does spin_unlock() of course, which is a great segue into...

  my RCsc desires :-)

If all your unlocks were to have SYNC, your locks would, aside from
ordering MMIO, also be RCsc, Win-Win :-)

There is, of course, that pesky little performance detail that keeps
getting in the way.

Patch

diff --git a/include/asm-generic/mmiowb.h b/include/asm-generic/mmiowb.h
new file mode 100644
index 000000000000..9439ff037b2d
--- /dev/null
+++ b/include/asm-generic/mmiowb.h
@@ -0,0 +1,63 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_GENERIC_MMIOWB_H
+#define __ASM_GENERIC_MMIOWB_H
+
+/*
+ * Generic implementation of mmiowb() tracking for spinlocks.
+ *
+ * If your architecture doesn't ensure that writes to an I/O peripheral
+ * within two spinlocked sections on two different CPUs are seen by the
+ * peripheral in the order corresponding to the lock handover, then you
+ * need to follow these FIVE easy steps:
+ *
+ * 	1. Implement mmiowb() (and arch_mmiowb_state() if you're fancy)
+ *	   in asm/mmiowb.h, then #include this file
+ *	2. Ensure your I/O write accessors call mmiowb_set_pending()
+ *	3. Select ARCH_HAS_MMIOWB
+ *	4. Untangle the resulting mess of header files
+ *	5. Complain to your architects
+ */
+#ifdef CONFIG_MMIOWB
+
+#include <linux/compiler.h>
+#include <asm-generic/mmiowb_types.h>
+
+#ifndef arch_mmiowb_state
+#include <asm/percpu.h>
+#include <asm/smp.h>
+
+DECLARE_PER_CPU(struct mmiowb_state, __mmiowb_state);
+#define __mmiowb_state()	this_cpu_ptr(&__mmiowb_state)
+#else
+#define __mmiowb_state()	arch_mmiowb_state()
+#endif	/* arch_mmiowb_state */
+
+static inline void mmiowb_set_pending(void)
+{
+	struct mmiowb_state *ms = __mmiowb_state();
+	ms->mmiowb_pending = ms->nesting_count;
+}
+
+static inline void mmiowb_spin_lock(void)
+{
+	struct mmiowb_state *ms = __mmiowb_state();
+	ms->nesting_count++;
+}
+
+static inline void mmiowb_spin_unlock(void)
+{
+	struct mmiowb_state *ms = __mmiowb_state();
+
+	if (unlikely(ms->mmiowb_pending)) {
+		ms->mmiowb_pending = 0;
+		mmiowb();
+	}
+
+	ms->nesting_count--;
+}
+#else
+#define mmiowb_set_pending()		do { } while (0)
+#define mmiowb_spin_lock()		do { } while (0)
+#define mmiowb_spin_unlock()		do { } while (0)
+#endif	/* CONFIG_MMIOWB */
+#endif	/* __ASM_GENERIC_MMIOWB_H */
diff --git a/include/asm-generic/mmiowb_types.h b/include/asm-generic/mmiowb_types.h
new file mode 100644
index 000000000000..8eb0095655e7
--- /dev/null
+++ b/include/asm-generic/mmiowb_types.h
@@ -0,0 +1,12 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_GENERIC_MMIOWB_TYPES_H
+#define __ASM_GENERIC_MMIOWB_TYPES_H
+
+#include <linux/types.h>
+
+struct mmiowb_state {
+	u16	nesting_count;
+	u16	mmiowb_pending;
+};
+
+#endif	/* __ASM_GENERIC_MMIOWB_TYPES_H */
diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
index 84d882f3e299..82fa481ecb78 100644
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -248,3 +248,10 @@  config ARCH_USE_QUEUED_RWLOCKS
 config QUEUED_RWLOCKS
 	def_bool y if ARCH_USE_QUEUED_RWLOCKS
 	depends on SMP
+
+config ARCH_HAS_MMIOWB
+	bool
+
+config MMIOWB
+	def_bool y if ARCH_HAS_MMIOWB
+	depends on SMP
diff --git a/kernel/locking/spinlock.c b/kernel/locking/spinlock.c
index 936f3d14dd6b..0ff08380f531 100644
--- a/kernel/locking/spinlock.c
+++ b/kernel/locking/spinlock.c
@@ -22,6 +22,13 @@ 
 #include <linux/debug_locks.h>
 #include <linux/export.h>
 
+#ifdef CONFIG_MMIOWB
+#ifndef arch_mmiowb_state
+DEFINE_PER_CPU(struct mmiowb_state, __mmiowb_state);
+EXPORT_PER_CPU_SYMBOL(__mmiowb_state);
+#endif
+#endif
+
 /*
  * If lockdep is enabled then we use the non-preemption spin-ops
  * even on CONFIG_PREEMPT, because lockdep assumes that interrupts are