
arm64: spinlock: serialise spin_unlock_wait against concurrent lockers

Message ID 1448624646-15863-1-git-send-email-will.deacon@arm.com
State Accepted
Commit d86b8da04dfa4771a68bdbad6c424d40f22f0d14

Commit Message

Will Deacon Nov. 27, 2015, 11:44 a.m. UTC
Boqun Feng reported a rather nasty ordering issue with spin_unlock_wait
on architectures implementing spin_lock with LL/SC sequences and acquire
semantics:

 | CPU 1                   CPU 2                     CPU 3
 | ==================      ====================      ==============
 |                                                   spin_unlock(&lock);
 |                         spin_lock(&lock):
 |                           r1 = *lock; // r1 == 0;
 |                         o = READ_ONCE(object); // reordered here
 | object = NULL;
 | smp_mb();
 | spin_unlock_wait(&lock);
 |                           *lock = 1;
 | smp_mb();
 | o->dead = true;
 |                         if (o) // true
 |                           BUG_ON(o->dead); // true!!

The crux of the problem is that spin_unlock_wait(&lock) can return on
CPU 1 whilst CPU 2 is in the process of taking the lock. This can be
resolved by upgrading spin_unlock_wait to a LOCK operation, forcing it
to serialise against a concurrent locker and giving it acquire semantics
in the process (although it is not at all clear whether this is needed -
different callers seem to assume different things about the barrier
semantics and architectures are similarly disjoint in their
implementations of the macro).

This patch implements spin_unlock_wait using an LL/SC sequence with
acquire semantics on arm64. For v8.1 systems with the LSE atomics, the
exclusive writeback is omitted, since the spin_lock operation is
indivisible and no intermediate state can be observed.
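
For illustration only, here is a standalone C11 sketch of the difference
between a read-only wait (roughly what the old macro did) and a wait that
performs a write-back. The names (sketch_spinlock_t and friends) and the
"zero means unlocked" layout are simplifications invented here; the real
arm64 lock is a ticket lock, and the actual implementation is the LL/SC
and LSE asm in the patch below.

	#include <stdatomic.h>

	/* Illustrative lock word: 0 == unlocked (the real arm64 lock is a
	 * ticket lock with owner/next halves). */
	typedef struct { _Atomic unsigned int val; } sketch_spinlock_t;

	/* Read-only wait: can return while another CPU is midway through
	 * taking the lock, which is the bug in the litmus test above. */
	static void sketch_unlock_wait_readonly(sketch_spinlock_t *lock)
	{
		while (atomic_load_explicit(&lock->val, memory_order_acquire) != 0)
			;	/* spin */
	}

	/* LOCK-like wait: write the observed "unlocked" value straight back,
	 * so a concurrent locker forces a retry. This only approximates the
	 * LDAXR/STXR exclusive pair in the patch, which C11 cannot express
	 * directly. */
	static void sketch_unlock_wait_locklike(sketch_spinlock_t *lock)
	{
		unsigned int v;

		for (;;) {
			v = atomic_load_explicit(&lock->val, memory_order_acquire);
			if (v != 0)
				continue;	/* still held */
			if (atomic_compare_exchange_weak_explicit(&lock->val, &v, v,
								  memory_order_acq_rel,
								  memory_order_acquire))
				return;
		}
	}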

Signed-off-by: Will Deacon <will.deacon@arm.com>

---
 arch/arm64/include/asm/spinlock.h | 23 +++++++++++++++++++++--
 1 file changed, 21 insertions(+), 2 deletions(-)

-- 
2.1.4



Comments

Will Deacon Dec. 1, 2015, 4:32 p.m. UTC | #1
On Tue, Dec 01, 2015 at 08:40:17AM +0800, Boqun Feng wrote:
> Hi Will,

Hi Boqun,

> On Fri, Nov 27, 2015 at 11:44:06AM +0000, Will Deacon wrote:
> > Boqun Feng reported a rather nasty ordering issue with spin_unlock_wait
> > on architectures implementing spin_lock with LL/SC sequences and acquire
> > semantics:
> > 
> >  | CPU 1                   CPU 2                     CPU 3
> >  | ==================      ====================      ==============
> >  |                                                   spin_unlock(&lock);
> >  |                         spin_lock(&lock):
> >  |                           r1 = *lock; // r1 == 0;
> >  |                         o = READ_ONCE(object); // reordered here
> >  | object = NULL;
> >  | smp_mb();
> >  | spin_unlock_wait(&lock);
> >  |                           *lock = 1;
> >  | smp_mb();
> >  | o->dead = true;
> >  |                         if (o) // true
> >  |                           BUG_ON(o->dead); // true!!
> > 
> > The crux of the problem is that spin_unlock_wait(&lock) can return on
> > CPU 1 whilst CPU 2 is in the process of taking the lock. This can be
> > resolved by upgrading spin_unlock_wait to a LOCK operation, forcing it
> 
> I wonder whether upgrading it to a LOCK operation is necessary. Please
> see below.
> 
> > to serialise against a concurrent locker and giving it acquire semantics
> > in the process (although it is not at all clear whether this is needed -
> > different callers seem to assume different things about the barrier
> > semantics and architectures are similarly disjoint in their
> > implementations of the macro).
> > 
> > This patch implements spin_unlock_wait using an LL/SC sequence with
> > acquire semantics on arm64. For v8.1 systems with the LSE atomics, the
> 
> IIUC, you implement this with acquire semantics because a LOCK requires
> acquire semantics, right? I get that spin_unlock_wait() becoming a LOCK
> will surely simplify our analysis, because LOCK->LOCK is always globally
> ordered. But for this particular problem, it seems only a relaxed LL/SC
> loop suffices, and the use of spin_unlock_wait() in do_exit() only
> requires a control dependency, which could be fulfilled by a relaxed
> LL/SC loop. So the acquire semantics may not be necessary here.
> 
> Am I missing something subtle here which is the reason you want to
> upgrade spin_unlock_wait() to a LOCK?


There are two things going on here:

  (1) Having spin_unlock_wait do a complete LL/SC sequence (i.e. writing
      to the lock, like a LOCK operation would)

  (2) Giving it ACQUIRE semantics

(2) is not necessary to fix the problem you described. However, LOCK
operations do imply ACQUIRE, and it's not clear to me what the ordering
semantics are supposed to be for spin_unlock_wait.
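
To make the distinction concrete, here is a standalone C11 sketch, not
the arm64 code: sketch_unlock_wait and its bool parameter are invented
names, and the lock word is simplified to "zero means unlocked". Point
(1) is the write-back itself; point (2) is only the memory order chosen
for the successful exchange.

	#include <stdatomic.h>
	#include <stdbool.h>

	static void sketch_unlock_wait(_Atomic unsigned int *lock, bool acquire)
	{
		unsigned int v;

		for (;;) {
			v = atomic_load_explicit(lock, memory_order_relaxed);
			if (v != 0)
				continue;	/* still held; keep waiting */

			/* (1): write the observed "unlocked" value back,
			 * failing (and retrying) if a locker races with us. */
			if (atomic_compare_exchange_weak_explicit(lock, &v, v,
					/* (2): ACQUIRE vs relaxed on success */
					acquire ? memory_order_acquire
						: memory_order_relaxed,
					memory_order_relaxed))
				return;
		}
	}

C11 cannot express the exclusive-monitor reservation, so the CAS of the
same value only approximates the LDAXR/STXR pair in the patch.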

I'd really like to keep this as simple as possible. Introducing yet
another magic barrier macro that isn't well understood or widely used
feels like a major step backwards to me, so I opted to upgrade
spin_unlock_wait to LOCK on arm64 so that I no longer have to worry
about it.

Will

Will Deacon Dec. 1, 2015, 4:40 p.m. UTC | #2
On Mon, Nov 30, 2015 at 04:58:39PM +0100, Peter Zijlstra wrote:
> On Fri, Nov 27, 2015 at 11:44:06AM +0000, Will Deacon wrote:
> > Boqun Feng reported a rather nasty ordering issue with spin_unlock_wait
> > on architectures implementing spin_lock with LL/SC sequences and acquire
> > semantics:
> > 
> >  | CPU 1                   CPU 2                     CPU 3
> >  | ==================      ====================      ==============
> >  |                                                   spin_unlock(&lock);
> >  |                         spin_lock(&lock):
> >  |                           r1 = *lock; // r1 == 0;
> >  |                         o = READ_ONCE(object); // reordered here
> >  | object = NULL;
> >  | smp_mb();
> >  | spin_unlock_wait(&lock);
> >  |                           *lock = 1;
> >  | smp_mb();
> >  | o->dead = true;
> >  |                         if (o) // true
> >  |                           BUG_ON(o->dead); // true!!
> > 
> > The crux of the problem is that spin_unlock_wait(&lock) can return on
> > CPU 1 whilst CPU 2 is in the process of taking the lock. This can be
> > resolved by upgrading spin_unlock_wait to a LOCK operation, forcing it
> > to serialise against a concurrent locker and giving it acquire semantics
> > in the process (although it is not at all clear whether this is needed -
> > different callers seem to assume different things about the barrier
> > semantics and architectures are similarly disjoint in their
> > implementations of the macro).
> 
> Do we want to go do a note with spin_unlock_wait() in
> include/linux/spinlock.h warning about these subtle issues for the next
> arch that thinks this is a 'trivial' thing to implement?


Could do, but I still need agreement from Paul on the solution before I
can describe it in core code. At the moment, the semantics are,
unfortunately, arch-specific.

Paul -- did you have any more thoughts about this? I ended up at:

  https://lkml.org/lkml/2015/11/16/343

and then ran out of ideas.

Will

Will Deacon Dec. 3, 2015, 4:32 p.m. UTC | #3
Hi Peter, Paul,

Firstly, thanks for writing that up. I agree that you have something
that can work in theory, but see below.

On Thu, Dec 03, 2015 at 02:28:39PM +0100, Peter Zijlstra wrote:
> On Wed, Dec 02, 2015 at 04:11:41PM -0800, Paul E. McKenney wrote:
> > This looks architecture-agnostic to me:
> > 
> > a.	TSO systems have smp_mb__after_unlock_lock() be a no-op, and
> > 	have a read-only implementation for spin_unlock_wait().
> > 
> > b.	Small-scale weakly ordered systems can also have
> > 	smp_mb__after_unlock_lock() be a no-op, but must instead
> > 	have spin_unlock_wait() acquire the lock and immediately
> > 	release it, or some optimized implementation of this.
> > 
> > c.	Large-scale weakly ordered systems are required to define
> > 	smp_mb__after_unlock_lock() as smp_mb(), but can have a
> > 	read-only implementation of spin_unlock_wait().
> 
> This would still require all relevant spin_lock() sites to be annotated
> with smp_mb__after_unlock_lock(), which is going to be a painful (no
> warning when done wrong) exercise and expensive (added MBs all over the
> place).
> 
> But yes, I think the proposal is technically sound, just not quite sure
> how far we'll want to push this.


When I said that the solution isn't architecture-agnostic, I was referring
to the fact that spin_unlock_wait acts as a LOCK operation on all
architectures other than those using case (c) above. You've resolved
this by requiring smp_mb__after_unlock_lock() after every LOCK, but I
share Peter's concerns that this isn't going to work in practice because:

  1. The smp_mb__after_unlock_lock additions aren't necessarily in the
     code where the spin_unlock_wait() is being added, so it's going to
     require a lot of diligence for developers to get this right

  2. Only PowerPC is going to see the (very occasional) failures, so
     testing this is nigh on impossible :(

  3. We've now made the kernel memory model even more difficult to
     understand, so people might not even bother with this addition

Will

Will Deacon Dec. 4, 2015, 4:24 p.m. UTC | #4
On Fri, Dec 04, 2015 at 08:07:06AM -0800, Paul E. McKenney wrote:
> On Fri, Dec 04, 2015 at 10:21:10AM +0100, Peter Zijlstra wrote:
> > On Thu, Dec 03, 2015 at 09:22:07AM -0800, Paul E. McKenney wrote:
> > > >   2. Only PowerPC is going to see the (very occasional) failures, so
> > > >      testing this is nigh on impossible :(
> > > 
> > > Indeed, we clearly cannot rely on normal testing, witness rcutorture
> > > failing to find the missing smp_mb__after_unlock_lock() instances that
> > > Peter found by inspection.  So I believe that augmented testing is
> > > required, perhaps as suggested above.
> > 
> > To be fair, those were in debug code and non critical for correctness
> > per se. That is, at worst the debug print would've observed an incorrect
> > value.
> 
> True enough, but there is still risk from people repurposing debug code
> for non-debug uses.  Still, thank you, I don't feel -quite- so bad about
> rcutorture's failure to find these.  ;-)


It's the ones that it's yet to find that you should be worried about,
and the debug code is all fixed ;)

Will

Will Deacon Dec. 11, 2015, 9:46 a.m. UTC | #5
On Fri, Dec 11, 2015 at 04:09:11PM +0800, Boqun Feng wrote:
> In conclusion, we have more than half of the uses working well already,
> and each of the fix-needed ones has only one related critical section
> and only one related data access in it. So on PPC, I think my proposal
> won't need more smp_mb() instances to fix all current use cases than
> adding smp_mb__after_unlock_lock() after the lock acquisition in each
> related lock critical section.
> 
> Of course, my proposal needs buy-in from both PPC and ARM64v8, so
> Paul and Will, what do you think? ;-)


I already queued the change promoting it to LOCK for arm64. It makes the
semantics easy to understand and I've failed to measure any difference
in performance. It's also robust against any future users of the macro
and matches what other architectures do.

Will

Will Deacon Dec. 11, 2015, 1:54 p.m. UTC | #6
On Fri, Dec 11, 2015 at 05:42:26AM -0800, Paul E. McKenney wrote:
> On Fri, Dec 11, 2015 at 09:46:52AM +0000, Will Deacon wrote:
> > On Fri, Dec 11, 2015 at 04:09:11PM +0800, Boqun Feng wrote:
> > > In conclusion, we have more than half of the uses working well already,
> > > and each of the fix-needed ones has only one related critical section
> > > and only one related data access in it. So on PPC, I think my proposal
> > > won't need more smp_mb() instances to fix all current use cases than
> > > adding smp_mb__after_unlock_lock() after the lock acquisition in each
> > > related lock critical section.
> > > 
> > > Of course, my proposal needs buy-in from both PPC and ARM64v8, so
> > > Paul and Will, what do you think? ;-)
> > 
> > I already queued the change promoting it to LOCK for arm64. It makes the
> > semantics easy to understand and I've failed to measure any difference
> > in performance. It's also robust against any future users of the macro
> > and matches what other architectures do.
> 
> What size system did you do your performance testing on?


A tiny system by your standards (4 clusters of 2 CPUs), but my take for
arm64 is that either wfe-based ll/sc loops scale sufficiently or you
build with the new atomic instructions (that don't have an issue here).

I have a bigger system (10s of cores) I can try with, but I don't
currently have the ability to run mainline on it.

Will


Patch

diff --git a/arch/arm64/include/asm/spinlock.h b/arch/arm64/include/asm/spinlock.h
index c85e96d174a5..fc9682bfe002 100644
--- a/arch/arm64/include/asm/spinlock.h
+++ b/arch/arm64/include/asm/spinlock.h
@@ -26,9 +26,28 @@ 
  * The memory barriers are implicit with the load-acquire and store-release
  * instructions.
  */
+static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
+{
+	unsigned int tmp;
+	arch_spinlock_t lockval;
 
-#define arch_spin_unlock_wait(lock) \
-	do { while (arch_spin_is_locked(lock)) cpu_relax(); } while (0)
+	asm volatile(
+"	sevl\n"
+"1:	wfe\n"
+"2:	ldaxr	%w0, %2\n"
+"	eor	%w1, %w0, %w0, ror #16\n"
+"	cbnz	%w1, 1b\n"
+	ARM64_LSE_ATOMIC_INSN(
+	/* LL/SC */
+"	stxr	%w1, %w0, %2\n"
+"	cbnz	%w1, 2b\n", /* Serialise against any concurrent lockers */
+	/* LSE atomics */
+"	nop\n"
+"	nop\n")
+	: "=&r" (lockval), "=&r" (tmp), "+Q" (*lock)
+	:
+	: "memory");
+}
 
 #define arch_spin_lock_flags(lock, flags) arch_spin_lock(lock)