[v2] barriers: introduce smp_mb__release_acquire and update documentation

Message ID 1444215568-24732-1-git-send-email-will.deacon@arm.com
State New

Commit Message

Will Deacon Oct. 7, 2015, 10:59 a.m. UTC
As much as we'd like to live in a world where RELEASE -> ACQUIRE is
always cheaply ordered and can be used to construct UNLOCK -> LOCK
definitions with similar guarantees, the grim reality is that this isn't
even possible on x86 (thanks to Paul for bringing us crashing down to
Earth).

This patch handles the issue by introducing a new barrier macro,
smp_mb__release_acquire, that can be placed between a RELEASE and a
subsequent ACQUIRE operation in order to upgrade them to a full memory
barrier. At the moment, it doesn't have any users, so its existence
serves mainly as a documentation aid.
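
(Purely as an illustration -- the patch deliberately adds no users, and
the variable names below are made up:)

	smp_store_release(&x, 1);	/* RELEASE */
	smp_mb__release_acquire();	/* upgrade the pair to a full barrier */
	r1 = smp_load_acquire(&y);	/* subsequent ACQUIRE */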

Documentation/memory-barriers.txt is updated to describe more clearly
the ACQUIRE and RELEASE ordering in this area and to show an example of
the new barrier in action.

Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---

v1 -> v2: - Clarified that this barrier affects only accesses performed by
            the executing CPU
	  - Definitions for all TSO architectures

 Documentation/memory-barriers.txt   | 26 +++++++++++++++++++++++++-
 arch/ia64/include/asm/barrier.h     |  1 +
 arch/powerpc/include/asm/barrier.h  |  1 +
 arch/s390/include/asm/barrier.h     |  2 ++
 arch/sparc/include/asm/barrier_64.h |  5 +++--
 arch/x86/include/asm/barrier.h      |  2 ++
 include/asm-generic/barrier.h       |  4 ++++
 7 files changed, 38 insertions(+), 3 deletions(-)

Comments

Peter Zijlstra Oct. 7, 2015, 11:19 a.m. UTC | #1
On Wed, Oct 07, 2015 at 11:59:28AM +0100, Will Deacon wrote:
> As much as we'd like to live in a world where RELEASE -> ACQUIRE is
> always cheaply ordered and can be used to construct UNLOCK -> LOCK
> definitions with similar guarantees, the grim reality is that this isn't
> even possible on x86 (thanks to Paul for bringing us crashing down to
> Earth).
> 
> This patch handles the issue by introducing a new barrier macro,
> smp_mb__release_acquire, that can be placed between a RELEASE and a
> subsequent ACQUIRE operation in order to upgrade them to a full memory
> barrier. At the moment, it doesn't have any users, so its existence
> serves mainly as a documentation aid.

Do we want to go revert 12d560f4ea87 ("rcu,locking: Privatize
smp_mb__after_unlock_lock()") for that same reason?

> Documentation/memory-barriers.txt is updated to describe more clearly
> the ACQUIRE and RELEASE ordering in this area and to show an example of
> the new barrier in action.

The only nit I have is that if we revert the above it might make
sense to more clearly call out the distinction between the two.
Will Deacon Oct. 7, 2015, 1:23 p.m. UTC | #2
Hi Peter,

Thanks for the headache ;)

On Wed, Oct 07, 2015 at 01:19:15PM +0200, Peter Zijlstra wrote:
> On Wed, Oct 07, 2015 at 11:59:28AM +0100, Will Deacon wrote:
> > As much as we'd like to live in a world where RELEASE -> ACQUIRE is
> > always cheaply ordered and can be used to construct UNLOCK -> LOCK
> > definitions with similar guarantees, the grim reality is that this isn't
> > even possible on x86 (thanks to Paul for bringing us crashing down to
> > Earth).
> > 
> > This patch handles the issue by introducing a new barrier macro,
> > smp_mb__release_acquire, that can be placed between a RELEASE and a
> > subsequent ACQUIRE operation in order to upgrade them to a full memory
> > barrier. At the moment, it doesn't have any users, so its existence
> > serves mainly as a documentation aid.
> 
> Do we want to go revert 12d560f4ea87 ("rcu,locking: Privatize
> smp_mb__after_unlock_lock()") for that same reason?

I don't think we want a straight revert. smp_mb__after_unlock_lock could
largely die if PPC strengthened its locks, whereas smp_mb__release_acquire
is needed by quite a few architectures.

> > Documentation/memory-barriers.txt is updated to describe more clearly
> > the ACQUIRE and RELEASE ordering in this area and to show an example of
> > the new barrier in action.
> 
> The only nit I have is that if we revert the above it might make
> sense to more clearly call out the distinction between the two.

Right. Where I think we'd like to get to is:

 - RELEASE -> ACQUIRE acts as a full barrier if they operate on the same
   variable and the ACQUIRE reads from the RELEASE

 - RELEASE -> ACQUIRE acts as a full barrier if they execute on the same
   CPU and are interleaved with an smp_mb__release_acquire barrier.

 - RELEASE -> ACQUIRE ordering is transitive

[only the transitivity part is missing in this patch, because I lost
 track of that discussion]

We could then use these same guarantees for UNLOCK -> LOCK in RCU,
defining smp_mb__after_unlock_lock to be the same as
smp_mb__release_acquire, but only applying to UNLOCK -> LOCK. That's a
slight relaxation of how it's defined at the moment (and I guess would
need some work on PPC?), but it keeps things consistent which is
especially important as core locking primitives are ported over to the
ACQUIRE/RELEASE primitives.
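
In sketch form (hypothetical, nothing like this is defined today):

	/* Same strength as smp_mb__release_acquire(), but specified only
	 * for UNLOCK -> LOCK pairs: */
	#define smp_mb__after_unlock_lock()	smp_mb__release_acquire()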

Thoughts?

Will
Peter Zijlstra Oct. 7, 2015, 2:53 p.m. UTC | #3
On Wed, Oct 07, 2015 at 02:23:17PM +0100, Will Deacon wrote:
> Hi Peter,
> 
> Thanks for the headache ;)

Most welcome :-)

> > Do we want to go revert 12d560f4ea87 ("rcu,locking: Privatize
> > smp_mb__after_unlock_lock()") for that same reason?
> 
> I don't think we want a straight revert. smp_mb__after_unlock_lock could
> largely die if PPC strengthened its locks, whereas smp_mb__release_acquire
> is needed by quite a few architectures.

Fair enough, let's wait for the benchmark results from the PPC people
doing that.

> > > Documentation/memory-barriers.txt is updated to describe more clearly
> > > the ACQUIRE and RELEASE ordering in this area and to show an example of
> > > the new barrier in action.
> > 
> > The only nit I have is that if we revert the above it might make
> > sense to more clearly call out the distinction between the two.
> 
> Right. Where I think we'd like to get to is:
> 
>  - RELEASE -> ACQUIRE acts as a full barrier if they operate on the same
>    variable and the ACQUIRE reads from the RELEASE
> 
>  - RELEASE -> ACQUIRE acts as a full barrier if they execute on the same
>    CPU and are interleaved with an smp_mb__release_acquire barrier.
> 
>  - RELEASE -> ACQUIRE ordering is transitive
> 
> [only the transitivity part is missing in this patch, because I lost
>  track of that discussion]
> 
> We could then use these same guarantees for UNLOCK -> LOCK in RCU,
> defining smp_mb__after_unlock_lock to be the same as
> smp_mb__release_acquire, but only applying to UNLOCK -> LOCK. That's a
> slight relaxation of how it's defined at the moment (and I guess would
> need some work on PPC?), but it keeps things consistent which is
> especially important as core locking primitives are ported over to the
> ACQUIRE/RELEASE primitives.
> 
> Thoughts?

/me like, although I'm too tired to see how those 3 rules combine to
something weaker than the current after_unlock_lock thing for PPC.

Paul E. McKenney Oct. 7, 2015, 3:25 p.m. UTC | #4
On Wed, Oct 07, 2015 at 02:23:17PM +0100, Will Deacon wrote:
> Hi Peter,
> 
> Thanks for the headache ;)
> 
> On Wed, Oct 07, 2015 at 01:19:15PM +0200, Peter Zijlstra wrote:
> > On Wed, Oct 07, 2015 at 11:59:28AM +0100, Will Deacon wrote:
> > > As much as we'd like to live in a world where RELEASE -> ACQUIRE is
> > > always cheaply ordered and can be used to construct UNLOCK -> LOCK
> > > definitions with similar guarantees, the grim reality is that this isn't
> > > even possible on x86 (thanks to Paul for bringing us crashing down to
> > > Earth).
> > > 
> > > This patch handles the issue by introducing a new barrier macro,
> > > smp_mb__release_acquire, that can be placed between a RELEASE and a
> > > subsequent ACQUIRE operation in order to upgrade them to a full memory
> > > barrier. At the moment, it doesn't have any users, so its existence
> > > serves mainly as a documentation aid.
> > 
> > Do we want to go revert 12d560f4ea87 ("rcu,locking: Privatize
> > smp_mb__after_unlock_lock()") for that same reason?
> 
> I don't think we want a straight revert. smp_mb__after_unlock_lock could
> largely die if PPC strengthened its locks, whereas smp_mb__release_acquire
> is needed by quite a few architectures.
> 
> > > Documentation/memory-barriers.txt is updated to describe more clearly
> > > the ACQUIRE and RELEASE ordering in this area and to show an example of
> > > the new barrier in action.
> > 
> > The only nit I have is that if we revert the above it might make
> > sense to more clearly call out the distinction between the two.
> 
> Right. Where I think we'd like to get to is:
> 
>  - RELEASE -> ACQUIRE acts as a full barrier if they operate on the same
>    variable and the ACQUIRE reads from the RELEASE
> 
>  - RELEASE -> ACQUIRE acts as a full barrier if they execute on the same
>    CPU and are interleaved with an smp_mb__release_acquire barrier.
> 
>  - RELEASE -> ACQUIRE ordering is transitive

I believe that these three are good.

> [only the transitivity part is missing in this patch, because I lost
>  track of that discussion]
> 
> We could then use these same guarantees for UNLOCK -> LOCK in RCU,
> defining smp_mb__after_unlock_lock to be the same as
> smp_mb__release_acquire, but only applying to UNLOCK -> LOCK. That's a
> slight relaxation of how it's defined at the moment (and I guess would
> need some work on PPC?), but it keeps things consistent which is
> especially important as core locking primitives are ported over to the
> ACQUIRE/RELEASE primitives.

Currently, we do need smp_mb__after_unlock_lock() to be after the
acquisition on PPC -- putting it between the unlock and the lock
of course doesn't cut it for the cross-thread unlock/lock case.

I am with Peter -- we do need the benchmark results for PPC.

							Thanx, Paul

Michael Ellerman Oct. 8, 2015, 3:50 a.m. UTC | #5
On Wed, 2015-10-07 at 08:25 -0700, Paul E. McKenney wrote:
> On Wed, Oct 07, 2015 at 02:23:17PM +0100, Will Deacon wrote:
> > Hi Peter,
> > 
> > Thanks for the headache ;)
> > 
> > On Wed, Oct 07, 2015 at 01:19:15PM +0200, Peter Zijlstra wrote:
> > > On Wed, Oct 07, 2015 at 11:59:28AM +0100, Will Deacon wrote:
> > > > As much as we'd like to live in a world where RELEASE -> ACQUIRE is
> > > > always cheaply ordered and can be used to construct UNLOCK -> LOCK
> > > > definitions with similar guarantees, the grim reality is that this isn't
> > > > even possible on x86 (thanks to Paul for bringing us crashing down to
> > > > Earth).
> > > > 
> > > > This patch handles the issue by introducing a new barrier macro,
> > > > smp_mb__release_acquire, that can be placed between a RELEASE and a
> > > > subsequent ACQUIRE operation in order to upgrade them to a full memory
> > > > barrier. At the moment, it doesn't have any users, so its existence
> > > > serves mainly as a documentation aid.
> > > 
> > > Do we want to go revert 12d560f4ea87 ("rcu,locking: Privatize
> > > smp_mb__after_unlock_lock()") for that same reason?
> > 
> > I don't think we want a straight revert. smp_mb__after_unlock_lock could
> > largely die if PPC strengthened its locks, whereas smp_mb__release_acquire
> > is needed by quite a few architectures.
> 
> Currently, we do need smp_mb__after_unlock_lock() to be after the
> acquisition on PPC -- putting it between the unlock and the lock
> of course doesn't cut it for the cross-thread unlock/lock case.
> 
> I am with Peter -- we do need the benchmark results for PPC.

Urgh, sorry guys. I have been slowly doing some benchmarks, but time is not
plentiful at the moment.

If we do a straight lwsync -> sync conversion for unlock it looks like that
will cost us ~4.2% on Anton's standard context switch benchmark.

So that's not all that nice. But we also don't want to be the only arch that
has the weird lock semantics and has to find all the bugs.

cheers


Peter Zijlstra Oct. 8, 2015, 11:16 a.m. UTC | #6
On Thu, Oct 08, 2015 at 02:50:36PM +1100, Michael Ellerman wrote:
> On Wed, 2015-10-07 at 08:25 -0700, Paul E. McKenney wrote:

> > Currently, we do need smp_mb__after_unlock_lock() to be after the
> > acquisition on PPC -- putting it between the unlock and the lock
> > of course doesn't cut it for the cross-thread unlock/lock case.

This ^, that makes me think I don't understand
smp_mb__after_unlock_lock.

How is:

	UNLOCK x
	smp_mb__after_unlock_lock()
	LOCK y

a problem? That's still a full barrier.

> > I am with Peter -- we do need the benchmark results for PPC.
> 
> Urgh, sorry guys. I have been slowly doing some benchmarks, but time is not
> plentiful at the moment.
> 
> If we do a straight lwsync -> sync conversion for unlock it looks like that
> will cost us ~4.2% on Anton's standard context switch benchmark.

And that does not seem to agree with Paul's smp_mb__after_unlock_lock()
usage and would not be sufficient for the same (as of yet unexplained)
reason.

Why does it matter which of the LOCK or UNLOCK gets promoted to full
barrier on PPC in order to become RCsc?
Will Deacon Oct. 8, 2015, 12:59 p.m. UTC | #7
On Thu, Oct 08, 2015 at 01:16:38PM +0200, Peter Zijlstra wrote:
> On Thu, Oct 08, 2015 at 02:50:36PM +1100, Michael Ellerman wrote:
> > On Wed, 2015-10-07 at 08:25 -0700, Paul E. McKenney wrote:
> 
> > > Currently, we do need smp_mb__after_unlock_lock() to be after the
> > > acquisition on PPC -- putting it between the unlock and the lock
> > > of course doesn't cut it for the cross-thread unlock/lock case.
> 
> This ^, that makes me think I don't understand
> smp_mb__after_unlock_lock.
> 
> How is:
> 
> 	UNLOCK x
> 	smp_mb__after_unlock_lock()
> 	LOCK y
> 
> a problem? That's still a full barrier.

I thought Paul was talking about something like this case:


CPU A     CPU B              CPU C
foo = 1
UNLOCK x
          LOCK x
          (RELEASE) bar = 1
                             ACQUIRE bar = 1
                             READ_ONCE foo = 0


but this looks the same as ISA2+lwsyncs/ISA2+lwsync+ctrlisync+lwsync,
which are both forbidden on PPC, so now I'm also confused.

The different-lock, same thread case is more straight-forward, I think.

> > > I am with Peter -- we do need the benchmark results for PPC.
> > 
> > Urgh, sorry guys. I have been slowly doing some benchmarks, but time is not
> > plentiful at the moment.
> > 
> > If we do a straight lwsync -> sync conversion for unlock it looks like that
> > will cost us ~4.2% on Anton's standard context switch benchmark.

Thanks Michael!

> And that does not seem to agree with Paul's smp_mb__after_unlock_lock()
> usage and would not be sufficient for the same (as of yet unexplained)
> reason.
> 
> Why does it matter which of the LOCK or UNLOCK gets promoted to full
> barrier on PPC in order to become RCsc?

I think we need a PPC litmus test illustrating the inter-thread, same
lock failure case when smp_mb__after_unlock_lock is not present so that
we can reason about this properly. Paul?

Will
Paul E. McKenney Oct. 8, 2015, 9:44 p.m. UTC | #8
On Thu, Oct 08, 2015 at 01:16:38PM +0200, Peter Zijlstra wrote:
> On Thu, Oct 08, 2015 at 02:50:36PM +1100, Michael Ellerman wrote:
> > On Wed, 2015-10-07 at 08:25 -0700, Paul E. McKenney wrote:
> 
> > > Currently, we do need smp_mb__after_unlock_lock() to be after the
> > > acquisition on PPC -- putting it between the unlock and the lock
> > > of course doesn't cut it for the cross-thread unlock/lock case.
> 
> This ^, that makes me think I don't understand
> smp_mb__after_unlock_lock.
> 
> How is:
> 
> 	UNLOCK x
> 	smp_mb__after_unlock_lock()
> 	LOCK y
> 
> a problem? That's still a full barrier.

The problem is that I need smp_mb__after_unlock_lock() to give me
transitivity even if the UNLOCK happened on one CPU and the LOCK
on another.  For that to work, the smp_mb__after_unlock_lock() needs
to be either immediately after the acquire (the current choice) or
immediately before the release (which would also work from a purely
technical viewpoint, but I much prefer the current choice).
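
In pseudo-code, one of the two placements (the "before" macro name below
is made up):

	LOCK x
	smp_mb__after_unlock_lock()	/* current choice: just after the acquire */

or:

	smp_mb__before_unlock_lock()	/* hypothetical: just before the release */
	UNLOCK x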

Or am I missing your point?

> > > I am with Peter -- we do need the benchmark results for PPC.
> > 
> > Urgh, sorry guys. I have been slowly doing some benchmarks, but time is not
> > plentiful at the moment.
> > 
> > If we do a straight lwsync -> sync conversion for unlock it looks like that
> > will cost us ~4.2% on Anton's standard context switch benchmark.
> 
> And that does not seem to agree with Paul's smp_mb__after_unlock_lock()
> usage and would not be sufficient for the same (as of yet unexplained)
> reason.
> 
> Why does it matter which of the LOCK or UNLOCK gets promoted to full
> barrier on PPC in order to become RCsc?

You could do either.  However, as I understand it, there is hardware for
which bc;isync is faster than lwsync.  For such hardware, it is cheaper
to upgrade the unlock from lwsync to sync than to upgrade the lock from
bc;isync to sync.  If I recall correctly, the kernel rewrites itself at
boot to select whichever of lwsync or bc;isync is better for the hardware
at hand.

							Thanx, Paul

Paul E. McKenney Oct. 8, 2015, 10:17 p.m. UTC | #9
On Thu, Oct 08, 2015 at 01:59:38PM +0100, Will Deacon wrote:
> On Thu, Oct 08, 2015 at 01:16:38PM +0200, Peter Zijlstra wrote:
> > On Thu, Oct 08, 2015 at 02:50:36PM +1100, Michael Ellerman wrote:
> > > On Wed, 2015-10-07 at 08:25 -0700, Paul E. McKenney wrote:
> > 
> > > > Currently, we do need smp_mb__after_unlock_lock() to be after the
> > > > acquisition on PPC -- putting it between the unlock and the lock
> > > > of course doesn't cut it for the cross-thread unlock/lock case.
> > 
> > This ^, that makes me think I don't understand
> > smp_mb__after_unlock_lock.
> > 
> > How is:
> > 
> > 	UNLOCK x
> > 	smp_mb__after_unlock_lock()
> > 	LOCK y
> > 
> > a problem? That's still a full barrier.
> 
> I thought Paul was talking about something like this case:
> 
> CPU A     CPU B              CPU C
> foo = 1
> UNLOCK x
>           LOCK x
>           (RELEASE) bar = 1
>                              ACQUIRE bar = 1
>                              READ_ONCE foo = 0

More like this:

CPU A			CPU B			CPU C
WRITE_ONCE(foo, 1);
UNLOCK x
			LOCK x
			r1 = READ_ONCE(bar);
						WRITE_ONCE(bar, 1);
						smp_mb();
						r2 = READ_ONCE(foo);

This can result in r1==0 && r2==0.

> but this looks the same as ISA2+lwsyncs/ISA2+lwsync+ctrlisync+lwsync,
> which are both forbidden on PPC, so now I'm also confused.
> 
> The different-lock, same thread case is more straight-forward, I think.

Indeed it is:

CPU A			CPU B
WRITE_ONCE(foo, 1);
UNLOCK x
LOCK x
r1 = READ_ONCE(bar);
			WRITE_ONCE(bar, 1);
			smp_mb();
			r2 = READ_ONCE(foo);

This also can result in r1==0 && r2==0.

> > > > I am with Peter -- we do need the benchmark results for PPC.
> > > 
> > > Urgh, sorry guys. I have been slowly doing some benchmarks, but time is not
> > > plentiful at the moment.
> > > 
> > > If we do a straight lwsync -> sync conversion for unlock it looks like that
> > > will cost us ~4.2% on Anton's standard context switch benchmark.
> 
> Thanks Michael!
> 
> > And that does not seem to agree with Paul's smp_mb__after_unlock_lock()
> > usage and would not be sufficient for the same (as of yet unexplained)
> > reason.
> > 
> > Why does it matter which of the LOCK or UNLOCK gets promoted to full
> > barrier on PPC in order to become RCsc?
> 
> I think we need a PPC litmus test illustrating the inter-thread, same
> lock failure case when smp_mb__after_unlock_lock is not present so that
> we can reason about this properly. Paul?

Please see above.  ;-)

The corresponding litmus tests are below.

							Thanx, Paul

------------------------------------------------------------------------

PPC lock-2thread-WR-barrier.litmus
""
(*
 * Does 3.0 Linux-kernel Power lock-unlock provide local 
 * barrier that orders prior stores against subsequent loads,
 * if the unlock and lock happen on different threads?
 * This version uses lwsync instead of isync.
 *)
(* 23-July-2013: ppcmem says "Sometimes" *)
{
l=1;
0:r1=1;          0:r4=x;         0:r10=0;          0:r12=l;
1:r1=1; 1:r3=42; 1:r4=x; 1:r5=y; 1:r10=0; 1:r11=0; 1:r12=l;
2:r1=1;          2:r4=x; 2:r5=y;
}
 P0             | P1                 | P2;
 stw r1,0(r4)   | lwarx r11,r10,r12  | stw r1,0(r5) ;
 lwsync         | cmpwi r11,0        | lwsync       ;
 stw r10,0(r12) | bne Fail1          | lwz r7,0(r4) ;
                | stwcx. r1,r10,r12  | ;
                | bne Fail1          | ;
                | isync              | ;
                | lwz r3,0(r5)       | ;
                | Fail1:             | ;


exists
(1:r3=0 /\ 2:r7=0)

------------------------------------------------------------------------

PPC lock-1thread-WR-barrier.litmus
""
(*
 * Does 3.0 Linux-kernel Power lock-unlock provide local 
 * barrier that orders prior stores against subsequent loads,
 * if the unlock and lock happen in the same thread?
 * This version uses lwsync instead of isync.
 *)
(* 8-Oct-2015: ppcmem says "Sometimes" *)
{
l=1;
0:r1=1; 0:r3=42; 0:r4=x; 0:r5=y; 0:r10=0; 0:r11=0; 0:r12=l;
1:r1=1;          1:r4=x; 1:r5=y;
}
 P0                | P1           ;
 stw r1,0(r4)      | stw r1,0(r5) ;
 lwsync            | lwsync       ;
 stw r10,0(r12)    | lwz r7,0(r4) ;
 lwarx r11,r10,r12 |              ;
 cmpwi r11,0       |              ;
 bne Fail1         |              ;
 stwcx. r1,r10,r12 |              ;
 bne Fail1         |              ;
 isync             |              ;
 lwz r3,0(r5)      |              ;
 Fail1:            |              ;


exists
(0:r3=0 /\ 1:r7=0)

Peter Zijlstra Oct. 9, 2015, 7:29 a.m. UTC | #10
On Thu, Oct 08, 2015 at 02:44:39PM -0700, Paul E. McKenney wrote:

> > > > I am with Peter -- we do need the benchmark results for PPC.
> > > 
> > > Urgh, sorry guys. I have been slowly doing some benchmarks, but time is not
> > > plentiful at the moment.
> > > 
> > > If we do a straight lwsync -> sync conversion for unlock it looks like that
> > > will cost us ~4.2% on Anton's standard context switch benchmark.
> > 
> > And that does not seem to agree with Paul's smp_mb__after_unlock_lock()
> > usage and would not be sufficient for the same (as of yet unexplained)
> > reason.
> > 
> > Why does it matter which of the LOCK or UNLOCK gets promoted to full
> > barrier on PPC in order to become RCsc?
> 
> You could do either.  However, as I understand it, there is hardware for
> which bc;isync is faster than lwsync.  For such hardware, it is cheaper
> to upgrade the unlock from lwsync to sync than to upgrade the lock from
> bc;isync to sync.  If I recall correctly, the kernel rewrites itself at
> boot to select whichever of lwsync or bc;isync is better for the hardware
> at hand.

Fair enough. I'll go wake up and think about the other issue ;-)
Peter Zijlstra Oct. 9, 2015, 8:31 a.m. UTC | #11
On Thu, Oct 08, 2015 at 02:44:39PM -0700, Paul E. McKenney wrote:
> On Thu, Oct 08, 2015 at 01:16:38PM +0200, Peter Zijlstra wrote:
> > On Thu, Oct 08, 2015 at 02:50:36PM +1100, Michael Ellerman wrote:
> > > On Wed, 2015-10-07 at 08:25 -0700, Paul E. McKenney wrote:
> > 
> > > > Currently, we do need smp_mb__after_unlock_lock() to be after the
> > > > acquisition on PPC -- putting it between the unlock and the lock
> > > > of course doesn't cut it for the cross-thread unlock/lock case.
> > 
> > This ^, that makes me think I don't understand
> > smp_mb__after_unlock_lock.
> > 
> > How is:
> > 
> > 	UNLOCK x
> > 	smp_mb__after_unlock_lock()
> > 	LOCK y
> > 
> > a problem? That's still a full barrier.
> 
> The problem is that I need smp_mb__after_unlock_lock() to give me
> transitivity even if the UNLOCK happened on one CPU and the LOCK
> on another.  For that to work, the smp_mb__after_unlock_lock() needs
> to be either immediately after the acquire (the current choice) or
> immediately before the release (which would also work from a purely
> technical viewpoint, but I much prefer the current choice).
> 
> Or am I missing your point?

So lots of little confusions added up to complete fail :-{

Mostly I think it was that UNLOCK x + LOCK x are fully ordered (where I
forgot: but not against uninvolved CPUs) and RELEASE/ACQUIRE are
transitive (where I forgot: RELEASE/ACQUIRE _chains_ are transitive, but
again not against uninvolved CPUs).

Which leads me to think I would like to suggest alternative rules for
RELEASE/ACQUIRE (to replace those Will suggested; as I think those are
partly responsible for my confusion).

 - RELEASE -> ACQUIRE is fully ordered (but not a full barrier) when
   they operate on the same variable and the ACQUIRE reads from the
   RELEASE. Notably, RELEASE/ACQUIRE are RCpc and lack transitivity.

 - RELEASE -> ACQUIRE can be upgraded to a full barrier (including
   transitivity) using smp_mb__release_acquire(), either before RELEASE
   or after ACQUIRE (but consistently [*]).

 - RELEASE -> ACQUIRE _chains_ (on shared variables) preserve causality
   (because each link is fully ordered) but are not transitive.

And I think that in the past few weeks we've been using transitive
ambiguously, the definition we have in Documentation/memory-barriers.txt
is a _strong_ transitivity, where we can make guarantees about CPUs not
directly involved.

What we have here (due to RCpc) is a weak form of transitivity, which,
while it preserves the natural concept of causality, does not extend to
other CPUs.

So we could go around and call them 'strong' and 'weak' transitivity,
but I suspect its easier for everyone involved if we come up with
separate terms (less room for error if we accidentally omit the
'strong/weak' qualifier).


[*] Do we want to take that choice away and go for:
smp_mb__after_release_acquire() ?
Will Deacon Oct. 9, 2015, 9:40 a.m. UTC | #12
On Fri, Oct 09, 2015 at 10:31:38AM +0200, Peter Zijlstra wrote:
> On Thu, Oct 08, 2015 at 02:44:39PM -0700, Paul E. McKenney wrote:
> > On Thu, Oct 08, 2015 at 01:16:38PM +0200, Peter Zijlstra wrote:
> > > On Thu, Oct 08, 2015 at 02:50:36PM +1100, Michael Ellerman wrote:
> > > > On Wed, 2015-10-07 at 08:25 -0700, Paul E. McKenney wrote:
> > > 
> > > > > Currently, we do need smp_mb__after_unlock_lock() to be after the
> > > > > acquisition on PPC -- putting it between the unlock and the lock
> > > > > of course doesn't cut it for the cross-thread unlock/lock case.
> > > 
> > > This ^, that makes me think I don't understand
> > > smp_mb__after_unlock_lock.
> > > 
> > > How is:
> > > 
> > > 	UNLOCK x
> > > 	smp_mb__after_unlock_lock()
> > > 	LOCK y
> > > 
> > > a problem? That's still a full barrier.
> > 
> > The problem is that I need smp_mb__after_unlock_lock() to give me
> > transitivity even if the UNLOCK happened on one CPU and the LOCK
> > on another.  For that to work, the smp_mb__after_unlock_lock() needs
> > to be either immediately after the acquire (the current choice) or
> > immediately before the release (which would also work from a purely
> > technical viewpoint, but I much prefer the current choice).
> > 
> > Or am I missing your point?
> 
> So lots of little confusions added up to complete fail :-{
> 
> Mostly I think it was that UNLOCK x + LOCK x are fully ordered (where I
> forgot: but not against uninvolved CPUs) and RELEASE/ACQUIRE are
> transitive (where I forgot: RELEASE/ACQUIRE _chains_ are transitive, but
> again not against uninvolved CPUs).
> 
> Which leads me to think I would like to suggest alternative rules for
> RELEASE/ACQUIRE (to replace those Will suggested; as I think those are
> partly responsible for my confusion).

Yeah, sorry. I originally used the phrase "fully ordered" but changed it
to "full barrier", which has stronger transitivity (newly understood
definition) requirements that I didn't intend.

RELEASE -> ACQUIRE should be used for message passing between two CPUs
and not have ordering effects on other observers unless they're part of
the RELEASE -> ACQUIRE chain.

>  - RELEASE -> ACQUIRE is fully ordered (but not a full barrier) when
>    they operate on the same variable and the ACQUIRE reads from the
>    RELEASE. Notably, RELEASE/ACQUIRE are RCpc and lack transitivity.

Are we explicit about the difference between "fully ordered" and "full
barrier" somewhere else, because this looks like it will confuse people.

>  - RELEASE -> ACQUIRE can be upgraded to a full barrier (including
>    transitivity) using smp_mb__release_acquire(), either before RELEASE
>    or after ACQUIRE (but consistently [*]).

Hmm, but we don't actually need this for RELEASE -> ACQUIRE, afaict. This
is just needed for UNLOCK -> LOCK, and is exactly what RCU is currently
using (for PPC only).

Stepping back a second, I believe that there are three cases:


 RELEASE X -> ACQUIRE Y (same CPU)
   * Needs a barrier on TSO architectures for full ordering

 UNLOCK X -> LOCK Y (same CPU)
   * Needs a barrier on PPC for full ordering

 RELEASE X -> ACQUIRE X (different CPUs)
 UNLOCK X -> ACQUIRE X (different CPUs)
   * Fully ordered everywhere...
   * ... but needs a barrier on PPC to become a full barrier


so maybe it makes more sense to split out the local and inter-cpu ordering
with something like:

  smp_mb__after_release_acquire()
  smp_mb__after_release_acquire_local()

then the first one directly replaces smp_mb__after_unlock_lock, and is
only defined for PPC, whereas the second one is also defined for TSO archs.
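
A rough usage sketch of the two (the surrounding code is made up):

	/* Inter-CPU: the lock may have been released on another CPU. */
	spin_lock(&lock);
	smp_mb__after_release_acquire();	/* replaces smp_mb__after_unlock_lock() */

	/* Local: upgrade this CPU's RELEASE -> ACQUIRE to full ordering. */
	smp_store_release(&x, 1);
	r1 = smp_load_acquire(&y);
	smp_mb__after_release_acquire_local();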

> >  - RELEASE -> ACQUIRE _chains_ (on shared variables) preserve causality
>    (because each link is fully ordered) but are not transitive.

Yup, and that's the same for UNLOCK -> LOCK, too.

> And I think that in the past few weeks we've been using transitive
> ambiguously, the definition we have in Documentation/memory-barriers.txt
> is a _strong_ transitivity, where we can make guarantees about CPUs not
> directly involved.
> 
> What we have here (due to RCpc) is a weak form of transitivity, which,
> while it preserves the natural concept of causality, does not extend to
> other CPUs.
> 
> So we could go around and call them 'strong' and 'weak' transitivity,
> but I suspect its easier for everyone involved if we come up with
> separate terms (less room for error if we accidentally omit the
> 'strong/weak' qualifier).

Surely the general case is message passing and so "transitivity" should
just refer to chains of RELEASE -> ACQUIRE? Then "strong transitivity"
could refer to the far more complicated (imo) case that is synonymous
with "full barrier".

Will
Will Deacon Oct. 9, 2015, 9:51 a.m. UTC | #13
Hi Paul,

On Thu, Oct 08, 2015 at 03:17:16PM -0700, Paul E. McKenney wrote:
> On Thu, Oct 08, 2015 at 01:59:38PM +0100, Will Deacon wrote:
> > I thought Paul was talking about something like this case:
> > 
> > CPU A     CPU B              CPU C
> > foo = 1
> > UNLOCK x
> >           LOCK x
> >           (RELEASE) bar = 1
> >                              ACQUIRE bar = 1
> >                              READ_ONCE foo = 0
> 
> More like this:
> 
> CPU A			CPU B			CPU C
> WRITE_ONCE(foo, 1);
> UNLOCK x
> 			LOCK x
> 			r1 = READ_ONCE(bar);
> 						WRITE_ONCE(bar, 1);
> 						smp_mb();
> 						r2 = READ_ONCE(foo);
> 
> This can result in r1==0 && r2==0.

Thank you, that is extremely enlightening :)

> > I think we need a PPC litmus test illustrating the inter-thread, same
> > lock failure case when smp_mb__after_unlock_lock is not present so that
> > we can reason about this properly. Paul?
> 
> Please see above.  ;-)
> 
> The corresponding litmus tests are below.

How do people feel about including these in memory-barriers.txt? I find
them considerably easier to read than our current kernel code + list of
possible orderings + wall of text, but there's a good chance that my
brain has been corrupted from staring at this stuff for too long.

The only snag is the ppc assembly code, but it's not *too* horrific ;)

> PPC lock-2thread-WR-barrier.litmus
> ""
> (*
>  * Does 3.0 Linux-kernel Power lock-unlock provide local 
>  * barrier that orders prior stores against subsequent loads,
>  * if the unlock and lock happen on different threads?
>  * This version uses lwsync instead of isync.
>  *)
> (* 23-July-2013: ppcmem says "Sometimes" *)
> {
> l=1;
> 0:r1=1;          0:r4=x;         0:r10=0;          0:r12=l;
> 1:r1=1; 1:r3=42; 1:r4=x; 1:r5=y; 1:r10=0; 1:r11=0; 1:r12=l;
> 2:r1=1;          2:r4=x; 2:r5=y;
> }
>  P0             | P1                 | P2;
>  stw r1,0(r4)   | lwarx r11,r10,r12  | stw r1,0(r5) ;
>  lwsync         | cmpwi r11,0        | lwsync       ;
>  stw r10,0(r12) | bne Fail1          | lwz r7,0(r4) ;
>                 | stwcx. r1,r10,r12  | ;
>                 | bne Fail1          | ;
>                 | isync              | ;
>                 | lwz r3,0(r5)       | ;
>                 | Fail1:             | ;
> 
> 
> exists
> (1:r3=0 /\ 2:r7=0)

We could also include a link to the ppcmem/herd web frontends and your
lwn.net article. (ppcmem is already linked, but it's not obvious that
you can run litmus tests in your browser).

Will
Peter Zijlstra Oct. 9, 2015, 11:02 a.m. UTC | #14
On Fri, Oct 09, 2015 at 10:40:39AM +0100, Will Deacon wrote:
> Stepping back a second, I believe that there are three cases:
> 
> 
>  RELEASE X -> ACQUIRE Y (same CPU)
>    * Needs a barrier on TSO architectures for full ordering
	+PPC

>  UNLOCK X -> LOCK Y (same CPU)
>    * Needs a barrier on PPC for full ordering


>  RELEASE X -> ACQUIRE X (different CPUs)
    * Fully ordered everywhere...
    * ... but needs a barrier on TSO + PPC to become a full barrier

>  UNLOCK X -> ACQUIRE X (different CPUs)

s/ACQUIRE/LOCK/ ?

>    * Fully ordered everywhere...
>    * ... but needs a barrier on PPC to become a full barrier

If you really meant ACQUIRE, then x86 also needs a barrier in order to
upgrade, seeing how our unlock is equivalent to smp_store_release(). Our
LOCK otoh is far heavier than smp_load_acquire() and would result in
different rules.

And I'm not sure the "(different CPUs)" bit makes sense, as the same is
true if they're on the same CPU.
Peter Zijlstra Oct. 9, 2015, 11:12 a.m. UTC | #15
On Fri, Oct 09, 2015 at 10:40:39AM +0100, Will Deacon wrote:
> > Which leads me to think I would like to suggest alternative rules for
> > RELEASE/ACQUIRE (to replace those Will suggested; as I think those are
> > partly responsible for my confusion).
> 
> Yeah, sorry. I originally used the phrase "fully ordered" but changed it
> to "full barrier", which has stronger transitivity (newly understood
> definition) requirements that I didn't intend.

> Are we explicit about the difference between "fully ordered" and "full
> barrier" somewhere else, because this looks like it will confuse people.

I suspect we don't.

> >  - RELEASE -> ACQUIRE can be upgraded to a full barrier (including
> >    transitivity) using smp_mb__release_acquire(), either before RELEASE
> >    or after ACQUIRE (but consistently [*]).
> 
> Hmm, but we don't actually need this for RELEASE -> ACQUIRE, afaict. This
> is just needed for UNLOCK -> LOCK, and is exactly what RCU is currently
> using (for PPC only).

No, we do need that. RELEASE/ACQUIRE is RCpc for TSO as well as PPC.

UNLOCK/LOCK is only RCpc for PPC, the rest of the world has RCsc for
UNLOCK/LOCK.

The reason RELEASE/ACQUIRE differ from UNLOCK/LOCK is the fundamental
difference between ACQUIRE and LOCK.

Where ACQUIRE really is just a LOAD, LOCK ends up fundamentally being a
RmW and a control dependency.
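
Sketched out (grossly simplified, hypothetical lock structure):

	/* ACQUIRE: nothing more than a load. */
	r1 = smp_load_acquire(&x);

	/* LOCK: a read-modify-write whose outcome gates progress; the
	 * loop branch is the control dependency. */
	while (atomic_cmpxchg(&l->val, 0, 1) != 0)
		cpu_relax();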


Now, if you want to upgrade your RCpc RELEASE/ACQUIRE to RCsc, you need
to do that on the inside (either after ACQUIRE or before RELEASE), this
is crucial (as per Paul's argument) for the case where the RELEASE and
ACQUIRE happen on different CPUs.

IFF RELEASE and ACQUIRE happen on the _same_ CPU, then it doesn't
matter and you can place the barrier in any of the 3 possible locations
(before RELEASE, between RELEASE and ACQUIRE, after ACQUIRE).
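
That is, exactly one of the three (variable names made up):

	smp_mb__release_acquire();	/* (1) before the RELEASE, or */
	smp_store_release(&x, 1);
	smp_mb__release_acquire();	/* (2) between the two, or */
	r1 = smp_load_acquire(&y);
	smp_mb__release_acquire();	/* (3) after the ACQUIRE */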


Peter Zijlstra Oct. 9, 2015, 11:13 a.m. UTC | #16
On Fri, Oct 09, 2015 at 10:40:39AM +0100, Will Deacon wrote:
> 
> >  - RELEASE -> ACQUIRE _chains_ (on shared variables) preserve causality
> >    (because each link is fully ordered) but are not transitive.
> 
> Yup, and that's the same for UNLOCK -> LOCK, too.

Agreed, except RELEASE/ACQUIRE is more RCpc than UNLOCK/LOCK.

IFF we can get UNLOCK/LOCK as RCsc the chains are strongly transitive,
unlike the RELEASE/ACQUIRE chains, which will be weakly so.


Peter Zijlstra Oct. 9, 2015, 11:25 a.m. UTC | #17
On Fri, Oct 09, 2015 at 10:51:29AM +0100, Will Deacon wrote:
> > The corresponding litmus tests are below.
> 
> How do people feel about including these in memory-barriers.txt? I find
> them considerably easier to read than our current kernel code + list of
> possible orderings + wall of text, but there's a good chance that my
> brain has been corrupted from staring at this stuff for too long.

Your brain is corrupt (but then, so probably is mine, just differently
so).

I've not yet mastered the knack of reading those things; then again, I
suspect it's not too hard, just not something I've had the time to play
with.
Will Deacon Oct. 9, 2015, 12:41 p.m. UTC | #18
On Fri, Oct 09, 2015 at 01:02:46PM +0200, Peter Zijlstra wrote:
> On Fri, Oct 09, 2015 at 10:40:39AM +0100, Will Deacon wrote:
> > Stepping back a second, I believe that there are three cases:
> > 
> > 
> >  RELEASE X -> ACQUIRE Y (same CPU)
> >    * Needs a barrier on TSO architectures for full ordering
> 	+PPC
> 
> >  UNLOCK X -> LOCK Y (same CPU)
> >    * Needs a barrier on PPC for full ordering
> 
> 
> >  RELEASE X -> ACQUIRE X (different CPUs)
>     * Fully ordered everywhere...
>     * ... but needs a barrier on TSO + PPC to become a full barrier
> 
> >  UNLOCK X -> ACQUIRE X (different CPUs)
> 
> s/ACQUIRE/LOCK/ ?

Yes, sorry.

Will
Will Deacon Oct. 9, 2015, 12:51 p.m. UTC | #19
On Fri, Oct 09, 2015 at 01:12:02PM +0200, Peter Zijlstra wrote:
> On Fri, Oct 09, 2015 at 10:40:39AM +0100, Will Deacon wrote:
> > > Which leads me to think I would like to suggest alternative rules for
> > > RELEASE/ACQUIRE (to replace those Will suggested; as I think those are
> > > partly responsible for my confusion).
> > 
> > Yeah, sorry. I originally used the phrase "fully ordered" but changed it
> > to "full barrier", which has stronger transitivity (newly understood
> > definition) requirements that I didn't intend.
> 
> > Are we explicit about the difference between "fully ordered" and "full
> > barrier" somewhere else, because this looks like it will confuse people.
> 
> I suspect we don't.
> 
> > >  - RELEASE -> ACQUIRE can be upgraded to a full barrier (including
> > >    transitivity) using smp_mb__release_acquire(), either before RELEASE
> > >    or after ACQUIRE (but consistently [*]).
> > 
> > Hmm, but we don't actually need this for RELEASE -> ACQUIRE, afaict. This
> > is just needed for UNLOCK -> LOCK, and is exactly what RCU is currently
> > using (for PPC only).
> 
> No, we do need that. RELEASE/ACQUIRE is RCpc for TSO as well as PPC.
> 
> UNLOCK/LOCK is only RCpc for PPC, the rest of the world has RCsc for
> UNLOCK/LOCK.
> 
> The reason RELEASE/ACQUIRE differ from UNLOCK/LOCK is the fundamental
> difference between ACQUIRE and LOCK.

But they don't actually differ in the kernel memory model we have right
now, thanks to PPC (we can't be stronger than the weakest implementation).
That's the whole reason we've got this unlock_lock mess!

> Where ACQUIRE really is just a LOAD, LOCK ends up fundamentally being a
> RmW and a control dependency.

Have you checked that this is true for the recent RELEASE/ACQUIRE
conversions in things like the qrwlock? In particular, we should annotate
those control dependencies to make them glaringly obvious if we want to
rely on sequentially-consistent locks (and also Alpha may need that).

> Now, if you want to upgrade your RCpc RELEASE/ACQUIRE to RCsc, you need
> to do that on the inside (either after ACQUIRE or before RELEASE), this
> is crucial (as per Paul's argument) for the case where the RELEASE and
> ACQUIRE happen on different CPUs.
> 
> IFF RELEASE and ACQUIRE happen on the _same_ CPU, then it doesn't
> matter and you can place the barrier in any of the 3 possible locations
> (before RELEASE, between RELEASE and ACQUIRE, after ACQUIRE).

Right, but these two need to be different barriers so that we don't
penalise TSO when UNLOCK -> LOCK ordering is required. That's why I was
proposing the local variant of smp_mb__after_release_acquire().

I think we're in agreement about the barriers we need, we just need to
name them (and then I'll cook a patch and we can GOTO 10).

Will
Peter Zijlstra Oct. 9, 2015, 1:06 p.m. UTC | #20
On Fri, Oct 09, 2015 at 01:51:11PM +0100, Will Deacon wrote:
> On Fri, Oct 09, 2015 at 01:12:02PM +0200, Peter Zijlstra wrote:
> > On Fri, Oct 09, 2015 at 10:40:39AM +0100, Will Deacon wrote:
> > > > Which leads me to think I would like to suggest alternative rules for
> > > > RELEASE/ACQUIRE (to replace those Will suggested; as I think those are
> > > > partly responsible for my confusion).
> > > 
> > > Yeah, sorry. I originally used the phrase "fully ordered" but changed it
> > > to "full barrier", which has stronger transitivity (newly understood
> > > definition) requirements that I didn't intend.
> > 
> > > Are we explicit about the difference between "fully ordered" and "full
> > > barrier" somewhere else, because this looks like it will confuse people.
> > 
> > I suspect we don't.
> > 
> > > >  - RELEASE -> ACQUIRE can be upgraded to a full barrier (including
> > > >    transitivity) using smp_mb__release_acquire(), either before RELEASE
> > > >    or after ACQUIRE (but consistently [*]).
> > > 
> > > Hmm, but we don't actually need this for RELEASE -> ACQUIRE, afaict. This
> > > is just needed for UNLOCK -> LOCK, and is exactly what RCU is currently
> > > using (for PPC only).
> > 
> > No, we do need that. RELEASE/ACQUIRE is RCpc for TSO as well as PPC.
> > 
> > UNLOCK/LOCK is only RCpc for PPC, the rest of the world has RCsc for
> > UNLOCK/LOCK.
> > 
> > The reason RELEASE/ACQUIRE differ from UNLOCK/LOCK is the fundamental
> > difference between ACQUIRE and LOCK.
> 
> But they don't actually differ in the kernel memory model we have right
> now, thanks to PPC (we can't be stronger than the weakest implementation).
> That's the whole reason we've got this unlock_lock mess!

Correct, which is why I've suggested to separate UNLOCK/LOCK from
RELEASE/ACQUIRE (again).

Even if only PPC is RCpc for locks, this means we need to have different
upgrade barriers (or suffer superfluous full barriers on TSO archs,
which I think we all want to avoid).

> > Where ACQUIRE really is just a LOAD, LOCK ends up fundamentally being a
> > RmW and a control dependency.
> 
> Have you checked that this is true for the recent RELEASE/ACQUIRE
> conversions in things like the qrwlock? In particular, we should annotate
> those control dependencies to make them glaringly obvious if we want to
> rely on sequentially-consistent locks (and also Alpha may need that).

I have not, let me make a note of that.

> > Now, if you want to upgrade your RCpc RELEASE/ACQUIRE to RCsc, you need
> > to do that on the inside (either after ACQUIRE or before RELEASE), this
> > is crucial (as per Paul's argument) for the case where the RELEASE and
> > ACQUIRE happen on different CPUs.
> > 
> > IFF RELEASE and ACQUIRE happen on the _same_ CPU, then it doesn't
> > matter and you can place the barrier in any of the 3 possible locations
> > (before RELEASE, between RELEASE and ACQUIRE, after ACQUIRE).
> 
> Right, but these two need to be different barriers so that we don't
> penalise TSO when UNLOCK -> LOCK ordering is required. That's why I was
> proposing the local variant of smp_mb__after_release_acquire().
> 
> I think we're in agreement about the barriers we need, we just need to
> name them (and then I'll cook a patch and we can GOTO 10).

 smp_mb__after_unlock_lock()
 smp_mb__after_release_acquire()

Would work, unless of course we can convince the PPC people to go RCsc
on their locks -- which per the benchmark result posted is fairly
painful :/

Then again, I do sympathise with them not wanting to find all the bugs
for being the odd duck.
Paul E. McKenney Oct. 9, 2015, 5:21 p.m. UTC | #21
On Fri, Oct 09, 2015 at 10:40:39AM +0100, Will Deacon wrote:
> On Fri, Oct 09, 2015 at 10:31:38AM +0200, Peter Zijlstra wrote:
> > On Thu, Oct 08, 2015 at 02:44:39PM -0700, Paul E. McKenney wrote:
> > > On Thu, Oct 08, 2015 at 01:16:38PM +0200, Peter Zijlstra wrote:
> > > > On Thu, Oct 08, 2015 at 02:50:36PM +1100, Michael Ellerman wrote:
> > > > > On Wed, 2015-10-07 at 08:25 -0700, Paul E. McKenney wrote:
> > > > 
> > > > > > Currently, we do need smp_mb__after_unlock_lock() to be after the
> > > > > > acquisition on PPC -- putting it between the unlock and the lock
> > > > > > of course doesn't cut it for the cross-thread unlock/lock case.
> > > > 
> > > > This ^, that makes me think I don't understand
> > > > smp_mb__after_unlock_lock.
> > > > 
> > > > How is:
> > > > 
> > > > 	UNLOCK x
> > > > 	smp_mb__after_unlock_lock()
> > > > 	LOCK y
> > > > 
> > > > a problem? That's still a full barrier.
> > > 
> > > The problem is that I need smp_mb__after_unlock_lock() to give me
> > > transitivity even if the UNLOCK happened on one CPU and the LOCK
> > > on another.  For that to work, the smp_mb__after_unlock_lock() needs
> > > to be either immediately after the acquire (the current choice) or
> > > immediately before the release (which would also work from a purely
> > > technical viewpoint, but I much prefer the current choice).
> > > 
> > > Or am I missing your point?
> > 
> > So lots of little confusions added up to complete fail :-{

I know that feeling!

> > Mostly I think it was that UNLOCK x + LOCK x are fully ordered (where I
> > forgot: but not against uninvolved CPUs) and RELEASE/ACQUIRE are
> > transitive (where I forgot: RELEASE/ACQUIRE _chains_ are transitive, but
> > again not against uninvolved CPUs).
> > 
> > Which leads me to think I would like to suggest alternative rules for
> > RELEASE/ACQUIRE (to replace those Will suggested; as I think those are
> > partly responsible for my confusion).
> 
> Yeah, sorry. I originally used the phrase "fully ordered" but changed it
> to "full barrier", which has stronger transitivity (newly understood
> definition) requirements that I didn't intend.
> 
> RELEASE -> ACQUIRE should be used for message passing between two CPUs
> and not have ordering effects on other observers unless they're part of
> the RELEASE -> ACQUIRE chain.
> 
> >  - RELEASE -> ACQUIRE is fully ordered (but not a full barrier) when
> >    they operate on the same variable and the ACQUIRE reads from the
> >    RELEASE. Notably, RELEASE/ACQUIRE are RCpc and lack transitivity.

RELEASE->ACQUIRE has local transitivity/causality.  However, it is more
than just message passing:

	void thread0(void)
	{
		r1 = READ_ONCE(a);
		WRITE_ONCE(b, 1);
		smp_store_release(&c, 1);
	}

	void thread1(void)
	{
		r2 = smp_load_acquire(&c);
		WRITE_ONCE(a, 1);
		r3 = READ_ONCE(b);
		smp_store_release(&d, 1);
	}

	void thread2(void)
	{
		r4 = smp_load_acquire(&d);
		WRITE_ONCE(a, 2);
		r5 = READ_ONCE(b);
	}

After the dust settles, if r2==1&&r4==1, we should also have
r1==0&&r3==1&&r5==1.  And here is the PowerPC ppcmem/herd model:

------------------------------------------------------------------------

PPC LinuxAcqRelLocal
""
(*
 * Does a powerpc lwsync-based Acq-Rel chain provide local causality?
 * October 9, 2015: Yes.
 *)
{
a=0; b=0; c=0; d=0;
0:r0=1; 0:r11=a; 0:r12=b; 0:r13=c; 0:r14=d;
1:r0=1; 1:r11=a; 1:r12=b; 1:r13=c; 1:r14=d;
2:r0=2; 2:r11=a; 2:r12=b; 2:r13=c; 2:r14=d;
}
 P0            | P1            | P2            ;
 lwz r1,0(r11) | lwz r2,0(r13) | lwz r4,0(r14) ;
 stw r0,0(r12) | lwsync        | lwsync        ;
 lwsync        | stw r0,0(r11) | stw r0,0(r11) ;
 stw r0,0(r13) | lwz r3,0(r12) | lwz r5,0(r12) ;
               | lwsync        |               ;
               | stw r0,0(r14) |               ;
exists
(1:r2=1 /\ 2:r4=1 /\ (0:r1=1 \/ 0:r1=2 \/ 1:r3=0 \/ 2:r5=0))

------------------------------------------------------------------------

This confirms the local causality/transitivity.

So let's add an uninvolved observer thread, which will also require adding
another variable to avoid ordering via cache coherence:

	void thread0(void)
	{
		r1 = READ_ONCE(a);
		WRITE_ONCE(b, 1);
		smp_store_release(&c, 1);
	}

	void thread1(void)
	{
		r2 = smp_load_acquire(&c);
		WRITE_ONCE(a, 1);
		r3 = READ_ONCE(b);
		smp_store_release(&d, 1);
	}

	void thread2(void)
	{
		r4 = smp_load_acquire(&d);
		WRITE_ONCE(a, 2);
		r5 = READ_ONCE(b);
		r6 = READ_ONCE(e);
	}

	void thread3(void) /* uninvolved observer */
	{
		WRITE_ONCE(e, 1);
		smp_mb();
		r7 = READ_ONCE(b);
	}

Can we see r2==1&&r4==1&&r6==0&&r7==0?  Here is the PowerPC model:

------------------------------------------------------------------------

PPC LinuxAcqRelGlobal
""
(*
 * Does a powerpc lwsync-based Acq-Rel chain provide global causality?
 * October 9, 2015: No.
 *)
{
a=0; b=0; c=0; d=0; e=0;
0:r0=1; 0:r11=a; 0:r12=b; 0:r13=c; 0:r14=d; 0:r15=e;
1:r0=1; 1:r11=a; 1:r12=b; 1:r13=c; 1:r14=d; 1:r15=e;
2:r0=2; 2:r11=a; 2:r12=b; 2:r13=c; 2:r14=d; 2:r15=e;
3:r0=1; 3:r11=a; 3:r12=b; 3:r13=c; 3:r14=d; 3:r15=e;
}
 P0            | P1            | P2            | P3            ;
 lwz r1,0(r11) | lwz r2,0(r13) | lwz r4,0(r14) | stw r0,0(r15) ;
 stw r0,0(r12) | lwsync        | lwsync        | sync          ;
 lwsync        | stw r0,0(r11) | stw r0,0(r11) | lwz r7,0(r12) ;
 stw r0,0(r13) | lwz r3,0(r12) | lwz r5,0(r12) |               ;
               | lwsync        | lwz r6,0(r15) |               ;
               | stw r0,0(r14) |               |               ;
exists
(1:r2=1 /\ 2:r4=1 /\ 2:r6=0 /\ 3:r7=0)

------------------------------------------------------------------------

And, as expected, P3 can see misordering because it is not involved
in the acquire-release chain.

> Are we explicit about the difference between "fully ordered" and "full
> barrier" somewhere else, because this looks like it will confuse people.

Probably not.  We need something meaning "local transitivity" on the
one hand and "global transitivity" on the other.  I suppose we could use
"causality" for the local case and "transitivity" for the global case,
but I do not believe that normal formal/mathematical usage matches this.

The term "single-copy atomic" could reasonably be used to mean "local
transitivity" and "multi-copy atomic" to mean "global transitivity",
but those are all too easy to confuse with each other, require a lot of
knowledge to understand, and don't exactly roll off the tongue.

The phrases "locally visible causality" and "globally visible causality"
should be (relatively) easy to understand, but don't exactly roll off
the tongue, either.

Maybe LVC and GVC?

LVC:  Locally visible causality:  If you are in the causal chain,
	you are guaranteed to see the ordering.  Otherwise, you
	might not.

GVC: Globally visible causality:  You are guaranteed to see the
	ordering whether or not you are in the causal chain.

I could of course poll the memory-ordering academics and get more options.  ;-)

> >  - RELEASE -> ACQUIRE can be upgraded to a full barrier (including
> >    transitivity) using smp_mb__release_acquire(), either before RELEASE
> >    or after ACQUIRE (but consistently [*]).
> 
> Hmm, but we don't actually need this for RELEASE -> ACQUIRE, afaict. This
> is just needed for UNLOCK -> LOCK, and is exactly what RCU is currently
> using (for PPC only).

If by "don't actually need this" you mean "don't currently know of a
use case", I agree.  For now, anyway.  ;-)

> Stepping back a second, I believe that there are three cases:
> 
> 
>  RELEASE X -> ACQUIRE Y (same CPU)
>    * Needs a barrier on TSO architectures for full ordering

Agreed, assuming these restatements match your views:

Compiler barrier() for TSO for ordering to be visible within the causal chain.

Memory-barrier instruction for TSO for ordering to be visible globally.
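
As a concrete sketch (hedged; M and N are assumed independent
variables, and the barrier is the one proposed in this patch):

	*A = a;
	smp_store_release(&M, 1);	/* RELEASE M */
	smp_mb__release_acquire();	/* smp_mb() on TSO and Power */
	r1 = smp_load_acquire(&N);	/* ACQUIRE N */
	*B = b;

Without the barrier, TSO still permits the RELEASE store to M and the
ACQUIRE load from N to be reordered with respect to each other
(store-load reordering), which is why even x86 needs smp_mb() here
for full ordering.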

>  UNLOCK X -> LOCK Y (same CPU)
>    * Needs a barrier on PPC for full ordering

Where "full ordering" is what I have been calling "globally visible
ordering" (or transitivity or causality or whatever), agreed.

>  RELEASE X -> ACQUIRE X (different CPUs)
>  UNLOCK X -> ACQUIRE X (different CPUs)
>    * Fully ordered everywhere...
>    * ... but needs a barrier on PPC to become a full barrier

Agreed, assuming these restatements match your views:

With compiler barrier (for TSO) or lwsync-capable memory-barrier instruction
(for weakly ordered systems), ordering is visible within the causal chain.

With compiler barrier (for TSO) or full memory-barrier instruction
(for weakly ordered systems), ordering is visible globally.

> so maybe it makes more sense to split out the local and inter-cpu ordering
> with something like:
> 
>   smp_mb__after_release_acquire()
>   smp_mb__after_release_acquire_local()
> 
> then the first one directly replaces smp_mb__after_unlock_lock, and is
> only defined for PPC, whereas the second one is also defined for TSO archs.

I don't see the need for the _local() variant.  We already have
smp_store_release() and smp_load_acquire(), both of which imply any
barriers needed for the case of locally visible causality.
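
For instance, the usual message-passing idiom needs nothing more (a
sketch; data and flag are assumed variables, both initially zero):

	/* CPU 0 */
	WRITE_ONCE(data, 42);
	smp_store_release(&flag, 1);

	/* CPU 1 */
	while (!smp_load_acquire(&flag))
		cpu_relax();
	r1 = READ_ONCE(data);	/* guaranteed to observe 42 */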

> >  - RELEASE -> ACQUIRE _chains_ (on shared variables) preserve causality,
> >    (because each link is fully ordered) but are not transitive.
> 
> Yup, and that's the same for UNLOCK -> LOCK, too.

Any transitive ordering is visible only locally, within the chain.
For global visibility, you need smp_mb__after_release_acquire().

> > And I think that in the past few weeks we've been using transitive
> > ambiguously, the definition we have in Documentation/memory-barriers.txt
> > is a _strong_ transitivity, where we can make guarantees about CPUs not
> > directly involved.
> > 
> > What we have here (due to RCpc) is a weak form of transitivity, which,
> > while it preserves the natural concept of causality, does not extend to
> > other CPUs.
> > 
> > So we could go around and call them 'strong' and 'weak' transitivity,
> > but I suspect its easier for everyone involved if we come up with
> > separate terms (less room for error if we accidentally omit the
> > 'strong/weak' qualifier).
> 
> Surely the general case is message passing and so "transitivity" should
> just refer to chains of RELEASE -> ACQUIRE? Then "strong transitivity"
> could refer to the far more complicated (imo) case that is synonymous
> with "full barrier".

We should have fun agreeing on the best words to describe all this, but
agreed.

One nit:  RELEASE->ACQUIRE chains can do more than message passing.
Anything before the RELEASE, be it a load or a store, is visible to
anything after any ACQUIRE further down the chain.  Here a prior load is
"visible" to a subsequent store in the sense that the store cannot affect
the value loaded.  This is consistent with the C11/C++11 memory model.
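
For example (a sketch; x and flag are assumed variables, both
initially zero):

	void thread0(void)
	{
		r1 = READ_ONCE(x);		/* load before the RELEASE */
		smp_store_release(&flag, 1);
	}

	void thread1(void)
	{
		if (smp_load_acquire(&flag))	/* reads from the RELEASE */
			WRITE_ONCE(x, 1);	/* store after the ACQUIRE */
	}

If thread1's ACQUIRE reads the value stored by thread0's RELEASE, then
thread1's store cannot be the source of thread0's r1, and r1 == 0 is
the only possible outcome in that case.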

Seem reasonable?

							Thanx, Paul

Paul E. McKenney Oct. 9, 2015, 5:43 p.m. UTC | #22
On Fri, Oct 09, 2015 at 10:51:29AM +0100, Will Deacon wrote:
> Hi Paul,
> 
> On Thu, Oct 08, 2015 at 03:17:16PM -0700, Paul E. McKenney wrote:
> > On Thu, Oct 08, 2015 at 01:59:38PM +0100, Will Deacon wrote:
> > > I thought Paul was talking about something like this case:
> > > 
> > > CPU A     CPU B              CPU C
> > > foo = 1
> > > UNLOCK x
> > >           LOCK x
> > >           (RELEASE) bar = 1
> > >                              ACQUIRE bar = 1
> > >                              READ_ONCE foo = 0
> > 
> > More like this:
> > 
> > CPU A			CPU B			CPU C
> > WRITE_ONCE(foo, 1);
> > UNLOCK x
> > 			LOCK x
> > 			r1 = READ_ONCE(bar);
> > 						WRITE_ONCE(bar, 1);
> > 						smp_mb();
> > 						r2 = READ_ONCE(foo);
> > 
> > This can result in r1==0 && r2==0.
> 
> Thank you, that is extremely enlightening :)
> 
> > > I think we need a PPC litmus test illustrating the inter-thread, same
> > > lock failure case when smp_mb__after_unlock_lock is not present so that
> > > we can reason about this properly. Paul?
> > 
> > Please see above.  ;-)
> > 
> > The corresponding litmus tests are below.
> 
> How do people feel about including these in memory-barriers.txt? I find
> them considerably easier to read than our current kernel code + list of
> possible orderings + wall of text, but there's a good chance that my
> brain has been corrupted from staring at this stuff for too long.
> 
> The only snag is the ppc assembly code, but it's not *too* horrific ;)

Maybe we should include them as separate files in Documentation/litmus
or some such.  We could then use a litmus-test style with Linux-kernel
C code, and reference litmus tests for various architectures.  Maybe
Documentation/litmus/{arm,arm64,powerpc,x86}/ and so on.

For example, the first example in memory-barriers.txt is this:

------------------------------------------------------------------------
	CPU 1		CPU 2
	===============	===============
	{ A == 1; B == 2 }
	A = 3;		x = B;
	B = 4;		y = A;

The set of accesses as seen by the memory system in the middle can be arranged
in 24 different combinations:

	STORE A=3,	STORE B=4,	y=LOAD A->3,	x=LOAD B->4
	STORE A=3,	STORE B=4,	x=LOAD B->4,	y=LOAD A->3
	STORE A=3,	y=LOAD A->3,	STORE B=4,	x=LOAD B->4
	STORE A=3,	y=LOAD A->3,	x=LOAD B->2,	STORE B=4
	STORE A=3,	x=LOAD B->2,	STORE B=4,	y=LOAD A->3
	STORE A=3,	x=LOAD B->2,	y=LOAD A->3,	STORE B=4
	STORE B=4,	STORE A=3,	y=LOAD A->3,	x=LOAD B->4
	STORE B=4, ...
	...

and can thus result in four different combinations of values:

	x == 2, y == 1
	x == 2, y == 3
	x == 4, y == 1
	x == 4, y == 3
------------------------------------------------------------------------

Maybe this changes to:

------------------------------------------------------------------------
Linux MP
""
{
a=1; b=2;
}
 P0               | P1                ;
 WRITE_ONCE(a, 3) | r1 = READ_ONCE(b) ;
 WRITE_ONCE(b, 4) | r2 = READ_ONCE(a) ;

exists (1:r1=4 /\ 1:r2=3)
------------------------------------------------------------------------

We can then state that this assertion can fail.  We could include
either ppcmem or herd output along with the litmus tests, which would
allow the curious to see a full list of the possible outcomes.

> > PPC lock-2thread-WR-barrier.litmus
> > ""
> > (*
> >  * Does 3.0 Linux-kernel Power lock-unlock provide local 
> >  * barrier that orders prior stores against subsequent loads,
> >  * if the unlock and lock happen on different threads?
> >  * This version uses lwsync instead of isync.
> >  *)
> > (* 23-July-2013: ppcmem says "Sometimes" *)
> > {
> > l=1;
> > 0:r1=1;          0:r4=x;         0:r10=0;          0:r12=l;
> > 1:r1=1; 1:r3=42; 1:r4=x; 1:r5=y; 1:r10=0; 1:r11=0; 1:r12=l;
> > 2:r1=1;          2:r4=x; 2:r5=y;
> > }
> >  P0             | P1                 | P2;
> >  stw r1,0(r4)   | lwarx r11,r10,r12  | stw r1,0(r5) ;
> >  lwsync         | cmpwi r11,0        | lwsync       ;
> >  stw r10,0(r12) | bne Fail1          | lwz r7,0(r4) ;
> >                 | stwcx. r1,r10,r12  | ;
> >                 | bne Fail1          | ;
> >                 | isync              | ;
> >                 | lwz r3,0(r5)       | ;
> >                 | Fail1:             | ;
> > 
> > 
> > exists
> > (1:r3=0 /\ 2:r7=0)
> 
> We could also include a link to the ppcmem/herd web frontends and your
> lwn.net article. (ppcmem is already linked, but it's not obvious that
> you can run litmus tests in your browser).

I bet that the URLs for the web frontends are not stable long term.
Don't get me wrong, PPCMEM/ARMMEM has been there for me for a goodly
number of years, but professors do occasionally move from one institution
to another.  For but one example, Susmit Sarkar is now at University
of St. Andrews rather than at Cambridge.

So to make this work, we probably need to be thinking in terms of
asking the researchers for permission to include their ocaml code in the
Linux-kernel source tree.  I would be strongly in favor of this, actually.

Thoughts?

							Thanx, Paul

Paul E. McKenney Oct. 9, 2015, 5:44 p.m. UTC | #23
On Fri, Oct 09, 2015 at 01:25:54PM +0200, Peter Zijlstra wrote:
> On Fri, Oct 09, 2015 at 10:51:29AM +0100, Will Deacon wrote:
> > > The corresponding litmus tests are below.
> > 
> > How do people feel about including these in memory-barriers.txt? I find
> > them considerably easier to read than our current kernel code + list of
> > possible orderings + wall of text, but there's a good chance that my
> > brain has been corrupted from staring at this stuff for too long.
> 
> Your brain is corrupt (but then, so probably is mine, just differently
> so).
> 
> I've not yet mastered the knack of reading those things; then again, I
> suspect its not too hard, just not something I've had the time to play
> with.

It does take a bit of practice, but is worth the effort.  Of course,
I -would- say that.  ;-)

							Thanx, Paul

Will Deacon Oct. 9, 2015, 6:33 p.m. UTC | #24
On Fri, Oct 09, 2015 at 10:43:27AM -0700, Paul E. McKenney wrote:
> On Fri, Oct 09, 2015 at 10:51:29AM +0100, Will Deacon wrote:
> > How do people feel about including these in memory-barriers.txt? I find
> > them considerably easier to read than our current kernel code + list of
> > possible orderings + wall of text, but there's a good chance that my
> > brain has been corrupted from staring at this stuff for too long.
> > 
> > The only snag is the ppc assembly code, but it's not *too* horrific ;)
> 
> Maybe we should include them as separate files in Documentation/litmus
> or some such.  We could then use a litmus-test style with Linux-kernel
> C code, and reference litmus tests for various architectures.  Maybe
> Documentation/litmus/{arm,arm64,powerpc,x86}/ and so on.

I think that would be useful, particularly if we had a way to "compile"
a kernel litmus test into one for a particular architecture.

> For example, the first example in memory-barriers.txt is this:
> 
> ------------------------------------------------------------------------
> 	CPU 1		CPU 2
> 	===============	===============
> 	{ A == 1; B == 2 }
> 	A = 3;		x = B;
> 	B = 4;		y = A;
> 
> The set of accesses as seen by the memory system in the middle can be arranged
> in 24 different combinations:
> 
> 	STORE A=3,	STORE B=4,	y=LOAD A->3,	x=LOAD B->4
> 	STORE A=3,	STORE B=4,	x=LOAD B->4,	y=LOAD A->3
> 	STORE A=3,	y=LOAD A->3,	STORE B=4,	x=LOAD B->4
> 	STORE A=3,	y=LOAD A->3,	x=LOAD B->2,	STORE B=4
> 	STORE A=3,	x=LOAD B->2,	STORE B=4,	y=LOAD A->3
> 	STORE A=3,	x=LOAD B->2,	y=LOAD A->3,	STORE B=4
> 	STORE B=4,	STORE A=3,	y=LOAD A->3,	x=LOAD B->4
> 	STORE B=4, ...
> 	...
> 
> and can thus result in four different combinations of values:
> 
> 	x == 2, y == 1
> 	x == 2, y == 3
> 	x == 4, y == 1
> 	x == 4, y == 3
> ------------------------------------------------------------------------
> 
> Maybe this changes to:
> 
> ------------------------------------------------------------------------
> Linux MP
> ""
> {
> a=1; b=2;
> }
>  P0               | P1                ;
>  WRITE_ONCE(a, 3) | r1 = READ_ONCE(b) ;
>  WRITE_ONCE(b, 4) | r2 = READ_ONCE(a) ;
> 
> exists (1:r1=4 /\ 1:r2=3)
> ------------------------------------------------------------------------
> 
> We can then state that this assertion can fail.  We could include
> either ppcmem or herd output along with the litmus tests, which would
> allow the curious to see a full list of the possible outcomes.

More importantly, it would allow them to make small changes to the test
and see what the outcome is, without us having to spawn another
thread-of-death on LKML :)

> > We could also include a link to the ppcmem/herd web frontends and your
> > lwn.net article. (ppcmem is already linked, but it's not obvious that
> > you can run litmus tests in your browser).
> 
> I bet that the URLs for the web frontends are not stable long term.
> Don't get me wrong, PPCMEM/ARMMEM has been there for me for a goodly
> number of years, but professors do occasionally move from one institution
> to another.  For but one example, Susmit Sarkar is now at University
> of St. Andrews rather than at Cambridge.
> 
> So to make this work, we probably need to be thinking in terms of
> asking the researchers for permission to include their ocaml code in the
> Linux-kernel source tree.  I would be strongly in favor of this, actually.
> 
> Thoughts?

I'm extremely hesitant to import a bunch of dubiously licensed, academic
ocaml code into the kernel. Even if we did, who would maintain it?

A better solution might be to host a mirror of the code on kernel.org,
along with a web front-end for people to play with (the tests we're talking
about here do seem to run ok in my browser).

Will
Paul E. McKenney Oct. 12, 2015, 11:30 p.m. UTC | #25
On Fri, Oct 09, 2015 at 07:33:28PM +0100, Will Deacon wrote:
> On Fri, Oct 09, 2015 at 10:43:27AM -0700, Paul E. McKenney wrote:
> > On Fri, Oct 09, 2015 at 10:51:29AM +0100, Will Deacon wrote:
> > > How do people feel about including these in memory-barriers.txt? I find
> > > them considerably easier to read than our current kernel code + list of
> > > possible orderings + wall of text, but there's a good chance that my
> > > brain has been corrupted from staring at this stuff for too long.
> > > 
> > > The only snag is the ppc assembly code, but it's not *too* horrific ;)
> > 
> > Maybe we should include them as separate files in Documentation/litmus
> > or some such.  We could then use a litmus-test style with Linux-kernel
> > C code, and reference litmus tests for various architectures.  Maybe
> > Documentation/litmus/{arm,arm64,powerpc,x86}/ and so on.
> 
> I think that would be useful, particularly if we had a way to "compile"
> a kernel litmus test into one for a particular architecture.

Very useful!  Probably some limitations, especially for loops and lock
acquisition, but what else is new?

> > For example, the first example in memory-barriers.txt is this:
> > 
> > ------------------------------------------------------------------------
> > 	CPU 1		CPU 2
> > 	===============	===============
> > 	{ A == 1; B == 2 }
> > 	A = 3;		x = B;
> > 	B = 4;		y = A;
> > 
> > The set of accesses as seen by the memory system in the middle can be arranged
> > in 24 different combinations:
> > 
> > 	STORE A=3,	STORE B=4,	y=LOAD A->3,	x=LOAD B->4
> > 	STORE A=3,	STORE B=4,	x=LOAD B->4,	y=LOAD A->3
> > 	STORE A=3,	y=LOAD A->3,	STORE B=4,	x=LOAD B->4
> > 	STORE A=3,	y=LOAD A->3,	x=LOAD B->2,	STORE B=4
> > 	STORE A=3,	x=LOAD B->2,	STORE B=4,	y=LOAD A->3
> > 	STORE A=3,	x=LOAD B->2,	y=LOAD A->3,	STORE B=4
> > 	STORE B=4,	STORE A=3,	y=LOAD A->3,	x=LOAD B->4
> > 	STORE B=4, ...
> > 	...
> > 
> > and can thus result in four different combinations of values:
> > 
> > 	x == 2, y == 1
> > 	x == 2, y == 3
> > 	x == 4, y == 1
> > 	x == 4, y == 3
> > ------------------------------------------------------------------------
> > 
> > Maybe this changes to:
> > 
> > ------------------------------------------------------------------------
> > Linux MP
> > ""
> > {
> > a=1; b=2;
> > }
> >  P0               | P1                ;
> >  WRITE_ONCE(a, 3) | r1 = READ_ONCE(b) ;
> >  WRITE_ONCE(b, 4) | r2 = READ_ONCE(a) ;
> > 
> > exists (1:r1=4 /\ 1:r2=3)
> > ------------------------------------------------------------------------
> > 
> > We can then state that this assertion can fail.  We could include
> > either ppcmem or herd output along with the litmus tests, which would
> > allow the curious to see a full list of the possible outcomes.
> 
> More importantly, it would allow them to make small changes to the test
> and see what the outcome is, without us having to spawn another
> thread-of-death on LKML :)

Or perhaps we could at least hope for thread-of-tool-supported-death
on LKML.  ;-)

> > > We could also include a link to the ppcmem/herd web frontends and your
> > > lwn.net article. (ppcmem is already linked, but it's not obvious that
> > > you can run litmus tests in your browser).
> > 
> > I bet that the URLs for the web frontends are not stable long term.
> > Don't get me wrong, PPCMEM/ARMMEM has been there for me for a goodly
> > number of years, but professors do occasionally move from one institution
> > to another.  For but one example, Susmit Sarkar is now at University
> > of St. Andrews rather than at Cambridge.
> > 
> > So to make this work, we probably need to be thinking in terms of
> > asking the researchers for permission to include their ocaml code in the
> > Linux-kernel source tree.  I would be strongly in favor of this, actually.
> > 
> > Thoughts?
> 
> I'm extremely hesitant to import a bunch of dubiously licensed, academic
> ocaml code into the kernel. Even if we did, who would maintain it?
> 
> A better solution might be to host a mirror of the code on kernel.org,
> along with a web front-end for people to play with (the tests we're talking
> about here do seem to run ok in my browser).

I am not too worried about how this happens, but we should avoid
constraining the work of our academic partners.  The reason I was thinking
in terms of in the kernel was to avoid version-synchronization issues.
"Wait, this is Linux kernel v4.17, which means that you need to use
version 8.3.5.1 of the tooling...  And with these four patches as well."

							Thanx, Paul

Boqun Feng Oct. 19, 2015, 1:17 a.m. UTC | #26
On Fri, Oct 09, 2015 at 10:40:39AM +0100, Will Deacon wrote:
> On Fri, Oct 09, 2015 at 10:31:38AM +0200, Peter Zijlstra wrote:
[snip]
> > 
> > So lots of little confusions added up to complete fail :-{
> > 
> > Mostly I think it was the UNLOCK x + LOCK x are fully ordered (where I
> > forgot: but not against uninvolved CPUs) and RELEASE/ACQUIRE are
> > transitive (where I forgot: RELEASE/ACQUIRE _chains_ are transitive, but
> > again not against uninvolved CPUs).
> > 
> > Which leads me to think I would like to suggest alternative rules for
> > RELEASE/ACQUIRE (to replace those Will suggested; as I think those are
> > partly responsible for my confusion).
> 
> Yeah, sorry. I originally used the phrase "fully ordered" but changed it
> to "full barrier", which has stronger transitivity (newly understood
> definition) requirements that I didn't intend.
> 
> RELEASE -> ACQUIRE should be used for message passing between two CPUs
> and not have ordering effects on other observers unless they're part of
> the RELEASE -> ACQUIRE chain.
> 
> >  - RELEASE -> ACQUIRE is fully ordered (but not a full barrier) when
> >    they operate on the same variable and the ACQUIRE reads from the
> >    RELEASE. Notable, RELEASE/ACQUIRE are RCpc and lack transitivity.
> 
> Are we explicit about the difference between "fully ordered" and "full
> barrier" somewhere else, because this looks like it will confuse people.
> 

This is confusing me right now. ;-)

Let's use a simple example for only one primitive.  As I understand it,
if we say a primitive A is "fully ordered", we actually mean:

1.	The memory operations preceding(in program order) A can't be
	reordered after the memory operations following(in PO) A.

and

2.	The memory operation(s) in A can't be reordered before the
	memory operations preceding(in PO) A and after the memory
	operations following(in PO) A.

If we say A is a "full barrier", we actually mean:

1.	The memory operations preceding(in program order) A can't be
	reordered after the memory operations following(in PO) A.

and

2.	The memory ordering guarantee in #1 is visible globally.

Is that correct? Or is "full barrier" stronger than I understand,
i.e. there is a third property of "full barrier":

3.	The memory operation(s) in A can't be reordered before the
	memory operations preceding(in PO) A and after the memory
	operations following(in PO) A.

IOW, is "full barrier" a stronger version of "fully ordered" or not?

Regards,
Boqun

> >  - RELEASE -> ACQUIRE can be upgraded to a full barrier (including
> >    transitivity) using smp_mb__release_acquire(), either before RELEASE
> >    or after ACQUIRE (but consistently [*]).
> 
> Hmm, but we don't actually need this for RELEASE -> ACQUIRE, afaict. This
> is just needed for UNLOCK -> LOCK, and is exactly what RCU is currently
> using (for PPC only).
> 
> Stepping back a second, I believe that there are three cases:
> 
> 
>  RELEASE X -> ACQUIRE Y (same CPU)
>    * Needs a barrier on TSO architectures for full ordering
> 
>  UNLOCK X -> LOCK Y (same CPU)
>    * Needs a barrier on PPC for full ordering
> 
>  RELEASE X -> ACQUIRE X (different CPUs)
>  UNLOCK X -> ACQUIRE X (different CPUs)
>    * Fully ordered everywhere...
>    * ... but needs a barrier on PPC to become a full barrier
> 
>
Peter Zijlstra Oct. 19, 2015, 10:23 a.m. UTC | #27
On Mon, Oct 19, 2015 at 09:17:18AM +0800, Boqun Feng wrote:
> This is confusing me right now. ;-)
> 
> Let's use a simple example for only one primitive.  As I understand it,
> if we say a primitive A is "fully ordered", we actually mean:
> 
> 1.	The memory operations preceding(in program order) A can't be
> 	reordered after the memory operations following(in PO) A.
> 
> and
> 
> 2.	The memory operation(s) in A can't be reordered before the
> 	memory operations preceding(in PO) A and after the memory
> 	operations following(in PO) A.
> 
> If we say A is a "full barrier", we actually mean:
> 
> 1.	The memory operations preceding(in program order) A can't be
> 	reordered after the memory operations following(in PO) A.
> 
> and
> 
> 2.	The memory ordering guarantee in #1 is visible globally.
> 
> Is that correct? Or is "full barrier" stronger than I understand,
> i.e. there is a third property of "full barrier":
> 
> 3.	The memory operation(s) in A can't be reordered before the
> 	memory operations preceding(in PO) A and after the memory
> 	operations following(in PO) A.
> 
> IOW, is "full barrier" a stronger version of "fully ordered" or not?

Yes, that was how I used it.

Now of course, the big question is whether we want to promote this
usage or come up with a different set of words describing this stuff.

I think separating the ordering from the transitivity is useful, for we
can then talk about and specify them independently.

That is, we can say:

	LOAD-ACQUIRE: orders LOAD->{LOAD,STORE}
	              weak transitivity (RCpc)

	MB: orders {LOAD,STORE}->{LOAD,STORE} (fully ordered)
	    strong transitivity (RCsc)

etc..
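
The same notation extends naturally to the other primitives, for
example (a sketch, same conventions):

	STORE-RELEASE: orders {LOAD,STORE}->STORE
	               weak transitivity (RCpc)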

Also, in the above I used weak and strong transitivity, but that too is
of course up for grabs.
Boqun Feng Oct. 20, 2015, 7:35 a.m. UTC | #28
On Mon, Oct 19, 2015 at 12:23:24PM +0200, Peter Zijlstra wrote:
> On Mon, Oct 19, 2015 at 09:17:18AM +0800, Boqun Feng wrote:
> > This is confusing me right now. ;-)
> > 
> > Let's use a simple example for only one primitive.  As I understand it,
> > if we say a primitive A is "fully ordered", we actually mean:
> > 
> > 1.	The memory operations preceding(in program order) A can't be
> > 	reordered after the memory operations following(in PO) A.
> > 
> > and
> > 
> > 2.	The memory operation(s) in A can't be reordered before the
> > 	memory operations preceding(in PO) A and after the memory
> > 	operations following(in PO) A.
> > 
> > If we say A is a "full barrier", we actually mean:
> > 
> > 1.	The memory operations preceding(in program order) A can't be
> > 	reordered after the memory operations following(in PO) A.
> > 
> > and
> > 
> > 2.	The memory ordering guarantee in #1 is visible globally.
> > 
> > Is that correct? Or is "full barrier" stronger than I understand,
> > i.e. there is a third property of "full barrier":
> > 
> > 3.	The memory operation(s) in A can't be reordered before the
> > 	memory operations preceding(in PO) A and after the memory
> > 	operations following(in PO) A.
> > 
> > IOW, is "full barrier" a stronger version of "fully ordered" or not?
> 
> Yes, that was how I used it.
> 
> Now of course; the big question is do we want to promote this usage or
> come up with a different set of words describing this stuff.
> 
> I think separating the ordering from the transitivity is useful, for we
> can then talk about and specify them independently.
> 

Great idea! 

> That is, we can say:
> 
> 	LOAD-ACQUIRE: orders LOAD->{LOAD,STORE}
> 	              weak transitivity (RCpc)
> 
> 	MB: orders {LOAD,STORE}->{LOAD,STORE} (fully ordered)
> 	    strong transitivity (RCsc)
> 

It would be helpful to have this kind of description for each
primitive mentioned in memory-barriers.txt, which, IMO, is better than
descriptions like the following:

"""
Any atomic operation that modifies some state in memory and returns information
about the state (old or new) implies an SMP-conditional general memory barrier
(smp_mb()) on each side of the actual operation (with the exception of
"""

I'm assuming that the arrow "->" stands for program order, and that
the word "orders" means that a primitive guarantees some of the program
order becomes the memory-operation order, so that the description above
can be rewritten as:

value-returning atomics:
	orders {LOAD,STORE}->RmW(atomic operation)->{LOAD,STORE}
	strong transitivity
	
This is much simpler and clearer for discussion and reasoning.
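
In code, that description corresponds to something like the following
sketch (v, x and y are assumed variables):

	WRITE_ONCE(x, 1);		/* cannot be reordered after the RmW */
	r1 = atomic_inc_return(&v);	/* implies smp_mb() on each side */
	r2 = READ_ONCE(y);		/* cannot be reordered before the RmW */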

Regards,
Boqun

> etc..
> 
> Also, in the above I used weak and strong transitivity, but that too is
> of course up for grabs.
Boqun Feng Oct. 20, 2015, 2:20 p.m. UTC | #29
On Mon, Oct 12, 2015 at 04:30:48PM -0700, Paul E. McKenney wrote:
> On Fri, Oct 09, 2015 at 07:33:28PM +0100, Will Deacon wrote:
> > On Fri, Oct 09, 2015 at 10:43:27AM -0700, Paul E. McKenney wrote:
> > > On Fri, Oct 09, 2015 at 10:51:29AM +0100, Will Deacon wrote:
[snip]
> 
> > > > We could also include a link to the ppcmem/herd web frontends and your
> > > > lwn.net article. (ppcmem is already linked, but it's not obvious that
> > > > you can run litmus tests in your browser).
> > > 
> > > I bet that the URLs for the web frontends are not stable long term.
> > > Don't get me wrong, PPCMEM/ARMMEM has been there for me for a goodly
> > > number of years, but professors do occasionally move from one institution
> > > to another.  For but one example, Susmit Sarkar is now at University
> > > of St. Andrews rather than at Cambridge.
> > > 
> > > So to make this work, we probably need to be thinking in terms of
> > > asking the researchers for permission to include their ocaml code in the
> > > Linux-kernel source tree.  I would be strongly in favor of this, actually.
> > > 
> > > Thoughts?
> > 
> > I'm extremely hesitant to import a bunch of dubiously licensed, academic
> > ocaml code into the kernel. Even if we did, who would maintain it?
> > 
> > A better solution might be to host a mirror of the code on kernel.org,
> > along with a web front-end for people to play with (the tests we're talking
> > about here do seem to run ok in my browser).
> 
> I am not too worried about how this happens, but we should avoid
> constraining the work of our academic partners.  The reason I was thinking
> in terms of in the kernel was to avoid version-synchronization issues.
> "Wait, this is Linux kernel v4.17, which means that you need to use
> version 8.3.5.1 of the tooling...  And with these four patches as well."
> 

Maybe including only the models' code (arm.cat, ppc.cat, etc.) in the
kernel, rather than the whole code base, could also solve the
version-synchronization problem to some degree, and would avoid
maintaining the whole tool.  I'm assuming that changes to the
verifier's code, as opposed to the models' code, are unlikely to
change the result of a litmus test.

Regards,
Boqun
Paul E. McKenney Oct. 20, 2015, 11:34 p.m. UTC | #30
On Mon, Oct 19, 2015 at 09:17:18AM +0800, Boqun Feng wrote:
> On Fri, Oct 09, 2015 at 10:40:39AM +0100, Will Deacon wrote:
> > On Fri, Oct 09, 2015 at 10:31:38AM +0200, Peter Zijlstra wrote:
> [snip]
> > > 
> > > So lots of little confusions added up to complete fail :-{
> > > 
> > > Mostly I think it was the UNLOCK x + LOCK x are fully ordered (where I
> > > forgot: but not against uninvolved CPUs) and RELEASE/ACQUIRE are
> > > transitive (where I forgot: RELEASE/ACQUIRE _chains_ are transitive, but
> > > again not against uninvolved CPUs).
> > > 
> > > Which leads me to think I would like to suggest alternative rules for
> > > RELEASE/ACQUIRE (to replace those Will suggested; as I think those are
> > > partly responsible for my confusion).
> > 
> > Yeah, sorry. I originally used the phrase "fully ordered" but changed it
> > to "full barrier", which has stronger transitivity (newly understood
> > definition) requirements that I didn't intend.
> > 
> > RELEASE -> ACQUIRE should be used for message passing between two CPUs
> > and not have ordering effects on other observers unless they're part of
> > the RELEASE -> ACQUIRE chain.
> > 
> > >  - RELEASE -> ACQUIRE is fully ordered (but not a full barrier) when
> > >    they operate on the same variable and the ACQUIRE reads from the
> > >    RELEASE. Notable, RELEASE/ACQUIRE are RCpc and lack transitivity.
> > 
> > Are we explicit about the difference between "fully ordered" and "full
> > barrier" somewhere else, because this looks like it will confuse people.
> > 
> 
> This is confusing me right now. ;-)
> 
> Let's use a simple example for only one primitive.  As I understand it,
> if we say a primitive A is "fully ordered", we actually mean:
> 
> 1.	The memory operations preceding(in program order) A can't be
> 	reordered after the memory operations following(in PO) A.
> 
> and
> 
> 2.	The memory operation(s) in A can't be reordered before the
> 	memory operations preceding(in PO) A and after the memory
> 	operations following(in PO) A.
> 
> If we say A is a "full barrier", we actually mean:
> 
> 1.	The memory operations preceding(in program order) A can't be
> 	reordered after the memory operations following(in PO) A.
> 
> and
> 
> 2.	The memory ordering guarantee in #1 is visible globally.
> 
> Is that correct? Or is "full barrier" stronger than I understand,
> i.e. there is a third property of "full barrier":
> 
> 3.	The memory operation(s) in A can't be reordered before the
> 	memory operations preceding(in PO) A and after the memory
> 	operations following(in PO) A.
> 
> IOW, is "full barrier" a stronger version of "fully ordered" or not?

There is also the question of whether the barrier forces ordering
of unrelated stores, everything initially zero and all accesses
READ_ONCE() or WRITE_ONCE():

	P0		P1		P2		P3
	X = 1;		Y = 1;		r1 = X;		r3 = Y;
					some_barrier();	some_barrier();
					r2 = Y;		r4 = X;

P2's and P3's ordering could be globally visible without requiring
P0's and P1's independent stores to be ordered, for example, if you
used smp_rmb() for some_barrier().  In contrast, if we used smp_mb()
for barrier, everyone would agree on the order of P0's and P1's stores.
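
In the litmus style used earlier in this thread, the smp_mb() variant
of that test might be written as follows (a sketch); with smp_mb() the
listed outcome is forbidden, whereas an smp_rmb() version allows it on
Power:

	Linux IRIW
	""
	{
	x=0; y=0;
	}
	 P0               | P1               | P2                | P3                ;
	 WRITE_ONCE(x, 1) | WRITE_ONCE(y, 1) | r1 = READ_ONCE(x) | r3 = READ_ONCE(y) ;
	                  |                  | smp_mb()          | smp_mb()          ;
	                  |                  | r2 = READ_ONCE(y) | r4 = READ_ONCE(x) ;

	exists (2:r1=1 /\ 2:r2=0 /\ 3:r3=1 /\ 3:r4=0)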

There are actually a fair number of different combinations of
aspects of memory ordering.  We will need to choose wisely.  ;-)

My hope is that the store-ordering gets folded into the globally
visible transitive level.  Especially given that I have not (yet)
seen any algorithms used in production that relied on the ordering of
independent stores.

							Thanx, Paul

> Regards,
> Boqun
> 
> > >  - RELEASE -> ACQUIRE can be upgraded to a full barrier (including
> > >    transitivity) using smp_mb__release_acquire(), either before RELEASE
> > >    or after ACQUIRE (but consistently [*]).
> > 
> > Hmm, but we don't actually need this for RELEASE -> ACQUIRE, afaict. This
> > is just needed for UNLOCK -> LOCK, and is exactly what RCU is currently
> > using (for PPC only).
> > 
> > Stepping back a second, I believe that there are three cases:
> > 
> > 
> >  RELEASE X -> ACQUIRE Y (same CPU)
> >    * Needs a barrier on TSO architectures for full ordering
> > 
> >  UNLOCK X -> LOCK Y (same CPU)
> >    * Needs a barrier on PPC for full ordering
> > 
> >  RELEASE X -> ACQUIRE X (different CPUs)
> >  UNLOCK X -> ACQUIRE X (different CPUs)
> >    * Fully ordered everywhere...
> >    * ... but needs a barrier on PPC to become a full barrier
> > 
> > 


Peter Zijlstra Oct. 21, 2015, 8:24 a.m. UTC | #31
On Tue, Oct 20, 2015 at 04:34:51PM -0700, Paul E. McKenney wrote:
> There is also the question of whether the barrier forces ordering
> of unrelated stores, everything initially zero and all accesses
> READ_ONCE() or WRITE_ONCE():
> 
> 	P0		P1		P2		P3
> 	X = 1;		Y = 1;		r1 = X;		r3 = Y;
> 					some_barrier();	some_barrier();
> 					r2 = Y;		r4 = X;
> 
> P2's and P3's ordering could be globally visible without requiring
> P0's and P1's independent stores to be ordered, for example, if you
> used smp_rmb() for some_barrier().  In contrast, if we used smp_mb()
> for barrier, everyone would agree on the order of P0's and P1's stores.

Oh!?

> There are actually a fair number of different combinations of
> aspects of memory ordering.  We will need to choose wisely.  ;-)
> 
> My hope is that the store-ordering gets folded into the globally
> visible transitive level.  Especially given that I have not (yet)
> seen any algorithms used in production that relied on the ordering of
> independent stores.

I would hope not, that's quite insane.
David Laight Oct. 21, 2015, 4:04 p.m. UTC | #32
From: Paul E. McKenney
> Sent: 21 October 2015 00:35
...
> There is also the question of whether the barrier forces ordering
> of unrelated stores, everything initially zero and all accesses
> READ_ONCE() or WRITE_ONCE():
> 
> 	P0		P1		P2		P3
> 	X = 1;		Y = 1;		r1 = X;		r3 = Y;
> 					some_barrier();	some_barrier();
> 					r2 = Y;		r4 = X;
> 
> P2's and P3's ordering could be globally visible without requiring
> P0's and P1's independent stores to be ordered, for example, if you
> used smp_rmb() for some_barrier().  In contrast, if we used smp_mb()
> for barrier, everyone would agree on the order of P0's and P1's stores.
> 
> There are actually a fair number of different combinations of
> aspects of memory ordering.  We will need to choose wisely.  ;-)

My thoughts on this are that most code probably isn't performance critical
enough to be using anything other than normal locks for inter-cpu
synchronisation.
Certainly most people are likely to get it wrong somewhere.
So you want a big red sticker saying 'Don't try to be too clever'.

Also, without examples of why things go wrong (e.g. memory_order_consume()
and Alpha) it is difficult to understand the differences between
all the barriers (etc).

OTOH device driver code may need things slightly stronger than
barrier() (which I think is asm(:::"memory")) to sequence accesses
to hardware devices (and memory the hardware reads), but without
having a strong barrier in every ioread/write() access.
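
A typical instance of that driver pattern might look like the
following sketch (the descriptor and doorbell names are assumptions):

	desc->status = DESC_READY;		/* store to coherent memory */
	wmb();					/* order it before the MMIO write */
	writel_relaxed(1, priv->doorbell);	/* tell the device to look */

With plain writel(), the barrier is generally implied, which is
exactly the kind of per-access cost being described above.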

	David
Paul E. McKenney Oct. 21, 2015, 7:29 p.m. UTC | #33
On Wed, Oct 21, 2015 at 10:24:52AM +0200, Peter Zijlstra wrote:
> On Tue, Oct 20, 2015 at 04:34:51PM -0700, Paul E. McKenney wrote:
> > There is also the question of whether the barrier forces ordering
> > of unrelated stores, everything initially zero and all accesses
> > READ_ONCE() or WRITE_ONCE():
> > 
> > 	P0		P1		P2		P3
> > 	X = 1;		Y = 1;		r1 = X;		r3 = Y;
> > 					some_barrier();	some_barrier();
> > 					r2 = Y;		r4 = X;
> > 
> > P2's and P3's ordering could be globally visible without requiring
> > P0's and P1's independent stores to be ordered, for example, if you
> > used smp_rmb() for some_barrier().  In contrast, if we used smp_mb()
> > for barrier, everyone would agree on the order of P0's and P1's stores.
> 
> Oh!?

Behold sequential consistency, worshipped fervently by a surprisingly
large number of people!  Something about legacy proof methods, as near
as I can tell.  ;-)

> > There are actually a fair number of different combinations of
> > aspects of memory ordering.  We will need to choose wisely.  ;-)
> > 
> > My hope is that the store-ordering gets folded into the globally
> > visible transitive level.  Especially given that I have not (yet)
> > seen any algorithms used in production that relied on the ordering of
> > independent stores.
> 
> I would hope not, that's quite insane.

Your point being?  ;-)

							Thanx, Paul

Paul E. McKenney Oct. 21, 2015, 7:34 p.m. UTC | #34
On Wed, Oct 21, 2015 at 04:04:04PM +0000, David Laight wrote:
> From: Paul E. McKenney
> > Sent: 21 October 2015 00:35
> ...
> > There is also the question of whether the barrier forces ordering
> > of unrelated stores, everything initially zero and all accesses
> > READ_ONCE() or WRITE_ONCE():
> > 
> > 	P0		P1		P2		P3
> > 	X = 1;		Y = 1;		r1 = X;		r3 = Y;
> > 					some_barrier();	some_barrier();
> > 					r2 = Y;		r4 = X;
> > 
> > P2's and P3's ordering could be globally visible without requiring
> > P0's and P1's independent stores to be ordered, for example, if you
> > used smp_rmb() for some_barrier().  In contrast, if we used smp_mb()
> > for barrier, everyone would agree on the order of P0's and P1's stores.
> > 
> > There are actually a fair number of different combinations of
> > aspects of memory ordering.  We will need to choose wisely.  ;-)
> 
> My thoughts on this are that most code probably isn't performance critical
> enough to be using anything other than normal locks for inter-cpu
> synchronisation.
> Certainly most people are likely to get it wrong somewhere.
> So you want a big red sticker saying 'Don't try to be too clever'.

I am afraid that I would run out of red stickers rather quickly,
given the large number of ways that one can shoot oneself in the
foot, even when single-threaded.

> Also without examples of why things go wrong (eg member_consumer()
> and alpha) it is difficult to understand the differences between
> all the barriers (etc).

Not just the hardware peculiarities.  It is also important to understand
the common use cases.

> OTOH device driver code may need things slightly stronger than
> barrier() (which I think is asm(:::"memory")) to sequence accesses
> to hardware devices (and memory the hardware reads), but without
> having a strong barrier in every ioread/write() access.

There are more memory models than you can shake a stick at, so yes,
we do have to choose carefully.  And yes, it does get more complex
when you add MMIO, and no, I don't know of any formal model that
takes MMIO into account.

							Thanx, Paul

Peter Zijlstra Oct. 21, 2015, 7:36 p.m. UTC | #35
On Wed, Oct 21, 2015 at 12:29:23PM -0700, Paul E. McKenney wrote:
> On Wed, Oct 21, 2015 at 10:24:52AM +0200, Peter Zijlstra wrote:
> > On Tue, Oct 20, 2015 at 04:34:51PM -0700, Paul E. McKenney wrote:
> > > There is also the question of whether the barrier forces ordering
> > > of unrelated stores, everything initially zero and all accesses
> > > READ_ONCE() or WRITE_ONCE():
> > > 
> > > 	P0		P1		P2		P3
> > > 	X = 1;		Y = 1;		r1 = X;		r3 = Y;
> > > 					some_barrier();	some_barrier();
> > > 					r2 = Y;		r4 = X;
> > > 
> > > P2's and P3's ordering could be globally visible without requiring
> > > P0's and P1's independent stores to be ordered, for example, if you
> > > used smp_rmb() for some_barrier().  In contrast, if we used smp_mb()
> > > for barrier, everyone would agree on the order of P0's and P1's stores.
> > 
> > Oh!?
> 
> Behold sequential consistency, worshipped fervently by a surprisingly
> large number of people!  Something about legacy proof methods, as near
> as I can tell.  ;-)

But how can smp_mb() guarantee anything about P[01]? There is but the
single store, which can race against P[23] arbitrarily. There is nothing
to order.

Maybe I'm confused again..
Paul E. McKenney Oct. 21, 2015, 7:56 p.m. UTC | #36
On Wed, Oct 21, 2015 at 09:36:44PM +0200, Peter Zijlstra wrote:
> On Wed, Oct 21, 2015 at 12:29:23PM -0700, Paul E. McKenney wrote:
> > On Wed, Oct 21, 2015 at 10:24:52AM +0200, Peter Zijlstra wrote:
> > > On Tue, Oct 20, 2015 at 04:34:51PM -0700, Paul E. McKenney wrote:
> > > > There is also the question of whether the barrier forces ordering
> > > > of unrelated stores, everything initially zero and all accesses
> > > > READ_ONCE() or WRITE_ONCE():
> > > > 
> > > > 	P0		P1		P2		P3
> > > > 	X = 1;		Y = 1;		r1 = X;		r3 = Y;
> > > > 					some_barrier();	some_barrier();
> > > > 					r2 = Y;		r4 = X;
> > > > 
> > > > P2's and P3's ordering could be globally visible without requiring
> > > > P0's and P1's independent stores to be ordered, for example, if you
> > > > used smp_rmb() for some_barrier().  In contrast, if we used smp_mb()
> > > > for barrier, everyone would agree on the order of P0's and P1's stores.
> > > 
> > > Oh!?
> > 
> > Behold sequential consistency, worshipped fervently by a surprisingly
> > large number of people!  Something about legacy proof methods, as near
> > as I can tell.  ;-)
> 
> But how can smp_mb() guarantee anything about P[01]? There is but the
> single store, which can race against P[23] arbitrarily. There is nothing
> to order.

Indeed, if your barrier is only acting locally, there is no way that
you can order the two stores.  However, some barriers act non-locally,
and this non-local action can order the stores.  This non-locality is
"cumulativity" in the PowerPC docs.

And x86 can also order P0's and P1's stores, courtesy of the "T" in "TSO".

> Maybe I'm confused again..

They say that confusion is the most productive frame of mind...

							Thanx, Paul


Patch

diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index 2ba8461b0631..0f69f1dc74fc 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -459,11 +459,18 @@  And a couple of implicit varieties:
      RELEASE on that same variable are guaranteed to be visible.  In other
      words, within a given variable's critical section, all accesses of all
      previous critical sections for that variable are guaranteed to have
-     completed.
+     completed.  If the RELEASE and ACQUIRE operations act on independent
+     variables, an smp_mb__release_acquire() barrier can be placed between
+     them to upgrade the sequence to a full barrier.
 
      This means that ACQUIRE acts as a minimal "acquire" operation and
      RELEASE acts as a minimal "release" operation.
 
+A subset of the atomic operations described in atomic_ops.txt have ACQUIRE
+and RELEASE variants in addition to fully-ordered and relaxed (no barrier
+semantics) definitions.  For compound atomics performing both a load and
+a store, ACQUIRE semantics apply only to the load and RELEASE semantics
+only to the store portion of the operation.
 
 Memory barriers are only required where there's a possibility of interaction
 between two CPUs or between a CPU and a device.  If it can be guaranteed that
@@ -1895,6 +1902,23 @@  the RELEASE would simply complete, thereby avoiding the deadlock.
 	a sleep-unlock race, but the locking primitive needs to resolve
 	such races properly in any case.
 
+Where the RELEASE and ACQUIRE operations are performed by the same CPU,
+ordering can be enforced by use of an smp_mb__release_acquire() barrier:
+
+	*A = a;
+	RELEASE M
+	smp_mb__release_acquire();
+	ACQUIRE N
+	*B = b;
+
+in which case, the only permitted sequences are:
+
+	STORE *A, RELEASE M, ACQUIRE N, STORE *B
+	STORE *A, ACQUIRE N, RELEASE M, STORE *B
+
+Note that smp_mb__release_acquire() has no effect on ACQUIRE or RELEASE
+operations performed by other CPUs.
+
 Locks and semaphores may not provide any guarantee of ordering on UP compiled
 systems, and so cannot be counted on in such a situation to actually achieve
 anything at all - especially with respect to I/O accesses - unless combined
diff --git a/arch/ia64/include/asm/barrier.h b/arch/ia64/include/asm/barrier.h
index df896a1c41d3..9dceee6c2f20 100644
--- a/arch/ia64/include/asm/barrier.h
+++ b/arch/ia64/include/asm/barrier.h
@@ -77,6 +77,7 @@  do {									\
 	___p1;								\
 })
 
+#define smp_mb__release_acquire()	smp_mb()
 #define smp_store_mb(var, value)	do { WRITE_ONCE(var, value); mb(); } while (0)
 
 /*
diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h
index 0eca6efc0631..919624634d0a 100644
--- a/arch/powerpc/include/asm/barrier.h
+++ b/arch/powerpc/include/asm/barrier.h
@@ -87,6 +87,7 @@  do {									\
 	___p1;								\
 })
 
+#define smp_mb__release_acquire()   smp_mb()
 #define smp_mb__before_atomic()     smp_mb()
 #define smp_mb__after_atomic()      smp_mb()
 #define smp_mb__before_spinlock()   smp_mb()
diff --git a/arch/s390/include/asm/barrier.h b/arch/s390/include/asm/barrier.h
index d48fe0162331..0c150b5fdd1c 100644
--- a/arch/s390/include/asm/barrier.h
+++ b/arch/s390/include/asm/barrier.h
@@ -53,4 +53,6 @@  do {									\
 	___p1;								\
 })
 
+#define smp_mb__release_acquire()	smp_mb()
+
 #endif /* __ASM_BARRIER_H */
diff --git a/arch/sparc/include/asm/barrier_64.h b/arch/sparc/include/asm/barrier_64.h
index 14a928601657..4ae875cd9e78 100644
--- a/arch/sparc/include/asm/barrier_64.h
+++ b/arch/sparc/include/asm/barrier_64.h
@@ -71,7 +71,8 @@  do {									\
 	___p1;								\
 })
 
-#define smp_mb__before_atomic()	barrier()
-#define smp_mb__after_atomic()	barrier()
+#define smp_mb__release_acquire()	smp_mb()
+#define smp_mb__before_atomic()		barrier()
+#define smp_mb__after_atomic()		barrier()
 
 #endif /* !(__SPARC64_BARRIER_H) */
diff --git a/arch/x86/include/asm/barrier.h b/arch/x86/include/asm/barrier.h
index 0681d2532527..1c61ad251e0e 100644
--- a/arch/x86/include/asm/barrier.h
+++ b/arch/x86/include/asm/barrier.h
@@ -85,6 +85,8 @@  do {									\
 	___p1;								\
 })
 
+#define smp_mb__release_acquire()	smp_mb()
+
 #endif
 
 /* Atomic operations are already serializing on x86 */
diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
index b42afada1280..61ae95199397 100644
--- a/include/asm-generic/barrier.h
+++ b/include/asm-generic/barrier.h
@@ -119,5 +119,9 @@  do {									\
 	___p1;								\
 })
 
+#ifndef smp_mb__release_acquire
+#define smp_mb__release_acquire()	do { } while (0)
+#endif
+
 #endif /* !__ASSEMBLY__ */
 #endif /* __ASM_GENERIC_BARRIER_H */