
[10/10] locking/qspinlock: Elide back-to-back RELEASE operations with smp_wmb()

Message ID 1522947547-24081-11-git-send-email-will.deacon@arm.com
State Superseded
Series kernel/locking: qspinlock improvements

Commit Message

Will Deacon April 5, 2018, 4:59 p.m. UTC
The qspinlock slowpath must ensure that the MCS node is fully initialised
before it can be reached by another CPU. This is currently enforced
by using a RELEASE operation when updating the tail and also when linking
the node into the waitqueue (since the control dependency off xchg_tail
does not provide sufficient ordering -- see 95bcade33a8a
("locking/qspinlock: Ensure node is initialised before updating prev->next")).

Back-to-back RELEASE operations may be expensive on some architectures,
particularly those that implement them using fences under the hood. We
can replace the two RELEASE operations with a single smp_wmb() fence and
use RELAXED operations for the subsequent publishing of the node.
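
For illustration, the resulting pattern looks roughly as follows (a condensed
sketch of the patched slowpath rather than a verbatim extract):

  /* Publisher: initialise the MCS node, then publish it with relaxed ops. */
  node->locked = 0;
  node->next = NULL;
  smp_wmb();                            /* order @node's init before both publications */
  old = xchg_tail(lock, tail);          /* now relaxed: publishes the updated tail */
  if (old & _Q_TAIL_MASK) {
          prev = decode_tail(old);
          WRITE_ONCE(prev->next, node); /* links @node into the waitqueue */
  }

  /*
   * Consumers reach @node either by decoding the tail they get back from
   * xchg_tail() or by dereferencing their own ->next pointer, so they rely
   * on dependency ordering rather than on a matching acquire barrier.
   */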

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>

---
 kernel/locking/qspinlock.c | 32 +++++++++++++++-----------------
 1 file changed, 15 insertions(+), 17 deletions(-)

-- 
2.1.4

Comments

Peter Zijlstra April 5, 2018, 5:28 p.m. UTC | #1
On Thu, Apr 05, 2018 at 05:59:07PM +0100, Will Deacon wrote:
> @@ -340,12 +341,17 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
>  		goto release;
>  
>  	/*
> +	 * Ensure that the initialisation of @node is complete before we
> +	 * publish the updated tail and potentially link @node into the
> +	 * waitqueue.
> +	 */
> +	smp_wmb();

Maybe an explicit note to where the matching barrier lives..
Will Deacon April 6, 2018, 11:34 a.m. UTC | #2
On Thu, Apr 05, 2018 at 07:28:08PM +0200, Peter Zijlstra wrote:
> On Thu, Apr 05, 2018 at 05:59:07PM +0100, Will Deacon wrote:
> > @@ -340,12 +341,17 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
> >  		goto release;
> >  
> >  	/*
> > +	 * Ensure that the initialisation of @node is complete before we
> > +	 * publish the updated tail and potentially link @node into the
> > +	 * waitqueue.
> > +	 */
> > +	smp_wmb();
> 
> Maybe an explicit note to where the matching barrier lives..

Oh man, that's not a simple thing to write: there isn't a matching barrier!

Instead, we rely on dependency ordering for two cases:

  * We access a node by decoding the tail we get back from the xchg

- or -

  * We access a node by following our own ->next pointer

I could say something like:

  "Pairs with dependency ordering from both xchg_tail and explicit
   dereferences of node->next"

but it's a bit cryptic :(

Will
Andrea Parri April 6, 2018, 1:05 p.m. UTC | #3
Hi Will,

On Fri, Apr 06, 2018 at 12:34:36PM +0100, Will Deacon wrote:
> On Thu, Apr 05, 2018 at 07:28:08PM +0200, Peter Zijlstra wrote:
> > On Thu, Apr 05, 2018 at 05:59:07PM +0100, Will Deacon wrote:
> > > @@ -340,12 +341,17 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
> > >  		goto release;
> > >  
> > >  	/*
> > > +	 * Ensure that the initialisation of @node is complete before we
> > > +	 * publish the updated tail and potentially link @node into the
> > > +	 * waitqueue.
> > > +	 */
> > > +	smp_wmb();
> > 
> > Maybe an explicit note to where the matching barrier lives..
> 
> Oh man, that's not a simple thing to write: there isn't a matching barrier!
> 
> Instead, we rely on dependency ordering for two cases:
> 
>   * We access a node by decoding the tail we get back from the xchg
> 
> - or -
> 
>   * We access a node by following our own ->next pointer
> 
> I could say something like:
> 
>   "Pairs with dependency ordering from both xchg_tail and explicit
>    dereferences of node->next"
> 
> but it's a bit cryptic :(

Agreed. ;)  It might be helpful to instead include a snippet to highlight
the memory accesses/dependencies involved; IIUC,

/*
 * Pairs with dependency ordering from both xchg_tail and explicit/?
 * dereferences of node->next:
 *
 *   CPU0
 *
 *   /* get node0, encode node0 in tail */
 *   pv_init_node(node0);
 *     ((struct pv_node *)node0)->cpu   = smp_processor_id();
 *     ((struct pv_node *)node0)->state = vcpu_running;
 *
 *   smp_wmb();
 *   old = xchg_tail(lock, tail);
 *
 *   CPU1:
 *
 *   /* get node1, encode tail from node1 */
 *   old = xchg_tail(lock, tail);   // = tail corresponding to node0
 *                                  // heads an addr. dependency
 *   /* decode old in prev */
 *   pv_wait_node(node1, prev);
 *     READ ((struct pv_node *)prev)->cpu   // addr. dependent read
 *     READ ((struct pv_node *)prev)->state // addr. dependent read
 *
 * [More details for the case "following our own ->next pointer" you
 *  mentioned above.]
 */

CPU1 would also have:

   WRITE_ONCE(prev->next, node1); // addr. dependent write

but I'm not sure how this pairs: does this belong to the second
case above? Can you elaborate on that?

  Andrea


> 
> Will
Will Deacon April 6, 2018, 3:27 p.m. UTC | #4
Hi Andrea,

On Fri, Apr 06, 2018 at 03:05:12PM +0200, Andrea Parri wrote:
> On Fri, Apr 06, 2018 at 12:34:36PM +0100, Will Deacon wrote:
> > I could say something like:
> > 
> >   "Pairs with dependency ordering from both xchg_tail and explicit
> >    dereferences of node->next"
> > 
> > but it's a bit cryptic :(
> 
> Agreed. ;)  It might be helpful to instead include a snippet to highlight
> the memory accesses/dependencies involved; IIUC,
> 
> /*
>  * Pairs with dependency ordering from both xchg_tail and explicit/?
>  * dereferences of node->next:
>  *
>  *   CPU0
>  *
>  *   /* get node0, encode node0 in tail */
>  *   pv_init_node(node0);
>  *     ((struct pv_node *)node0)->cpu   = smp_processor_id();
>  *     ((struct pv_node *)node0)->state = vcpu_running;

I'd probably ignore the PV case here and just focus on the native init
of count/locked/next.

>  *   smp_wmb();
>  *   old = xchg_tail(lock, tail);
>  *
>  *   CPU1:
>  *
>  *   /* get node1, encode tail from node1 */
>  *   old = xchg_tail(lock, tail);   // = tail corresponding to node0
>  *                                  // heads an addr. dependency
>  *   /* decode old in prev */
>  *   pv_wait_node(node1, prev);

Similarly here -- the dependency is through decode_tail.

>  *     READ ((struct pv_node *)prev)->cpu   // addr. dependent read
>  *     READ ((struct pv_node *)prev)->state // addr. dependent read
>  *
>  * [More details for the case "following our own ->next pointer" you
>  *  mentioned above.]
>  */
> 
> CPU1 would also have:
> 
>    WRITE_ONCE(prev->next, node1); // addr. dependent write
> 
> but I'm not sure how this pairs: does this belong to the second
> case above? Can you elaborate on that?

This is dependent on the result of decode_tail, so it's still the first
case. The second case is when we queued into an empty tail but somebody
later queued behind us, so we don't find them until we're claiming the
lock:

  if (!next)
  	next = smp_cond_load_relaxed(&node->next, (VAL));

  arch_mcs_spin_unlock_contended(&next->locked);

here, this is all straightforward address dependencies rather than the
arithmetic in decode_tail.

Will
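
To make that second case concrete, the accesses involved are roughly the
following (an illustrative sketch in the style of the snippet quoted above,
condensed from the slowpath, with the two CPUs labelled A and B):

  /* CPU A: queued onto an empty tail, now claiming the lock. */
  if (!next)
          next = smp_cond_load_relaxed(&node->next, (VAL)); /* read own ->next            */
  arch_mcs_spin_unlock_contended(&next->locked);            /* addr. dependent access of *next */

  /* CPU B: queues behind CPU A. */
  node->locked = 0;                 /* initialise @node                        */
  node->next = NULL;
  smp_wmb();                        /* order the init before publishing @node  */
  old = xchg_tail(lock, tail);      /* old encodes CPU A's node                */
  prev = decode_tail(old);
  WRITE_ONCE(prev->next, node);     /* make @node reachable via CPU A's ->next */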
Andrea Parri April 6, 2018, 3:49 p.m. UTC | #5
On Fri, Apr 06, 2018 at 04:27:45PM +0100, Will Deacon wrote:
> Hi Andrea,
> 
> On Fri, Apr 06, 2018 at 03:05:12PM +0200, Andrea Parri wrote:
> > On Fri, Apr 06, 2018 at 12:34:36PM +0100, Will Deacon wrote:
> > > I could say something like:
> > > 
> > >   "Pairs with dependency ordering from both xchg_tail and explicit
> > >    dereferences of node->next"
> > > 
> > > but it's a bit cryptic :(
> > 
> > Agreed. ;)  It might be helpful to instead include a snippet to highlight
> > the memory accesses/dependencies involved; IIUC,
> > 
> > /*
> >  * Pairs with dependency ordering from both xchg_tail and explicit/?
> >  * dereferences of node->next:
> >  *
> >  *   CPU0
> >  *
> >  *   /* get node0, encode node0 in tail */
> >  *   pv_init_node(node0);
> >  *     ((struct pv_node *)node0)->cpu   = smp_processor_id();
> >  *     ((struct pv_node *)node0)->state = vcpu_running;
> 
> I'd probably ignore the PV case here and just focus on the native init
> of count/locked/next.
> 
> >  *   smp_wmb();
> >  *   old = xchg_tail(lock, tail);
> >  *
> >  *   CPU1:
> >  *
> >  *   /* get node1, encode tail from node1 */
> >  *   old = xchg_tail(lock, tail);   // = tail corresponding to node0
> >  *                                  // heads an addr. dependency
> >  *   /* decode old in prev */
> >  *   pv_wait_node(node1, prev);
> 
> Similarly here -- the dependency is through decode_tail.
> 
> >  *     READ ((struct pv_node *)prev)->cpu   // addr. dependent read
> >  *     READ ((struct pv_node *)prev)->state // addr. dependent read
> >  *
> >  * [More details for the case "following our own ->next pointer" you
> >  *  mentioned above.]
> >  */
> > 
> > CPU1 would also have:
> > 
> >    WRITE_ONCE(prev->next, node1); // addr. dependent write
> > 
> > but I'm not sure how this pairs: does this belong to the second
> > case above? Can you elaborate on that?
> 
> This is dependent on the result of decode_tail, so it's still the first
> case. The second case is when we queued into an empty tail but somebody
> later queued behind us, so we don't find them until we're claiming the
> lock:
> 
>   if (!next)
>   	next = smp_cond_load_relaxed(&node->next, (VAL));
> 
>   arch_mcs_spin_unlock_contended(&next->locked);
> 
> here, this is all straightforward address dependencies rather than the
> arithmetic in decode_tail.

Got it. Thanks!

  Andrea


> 
> Will
Boqun Feng April 7, 2018, 5:47 a.m. UTC | #6
On Thu, Apr 05, 2018 at 05:59:07PM +0100, Will Deacon wrote:
> The qspinlock slowpath must ensure that the MCS node is fully initialised
> before it can be reached by another CPU. This is currently enforced
> by using a RELEASE operation when updating the tail and also when linking
> the node into the waitqueue (since the control dependency off xchg_tail
> does not provide sufficient ordering -- see 95bcade33a8a
> ("locking/qspinlock: Ensure node is initialised before updating prev->next")).
> 
> Back-to-back RELEASE operations may be expensive on some architectures,
> particularly those that implement them using fences under the hood. We
> can replace the two RELEASE operations with a single smp_wmb() fence and
> use RELAXED operations for the subsequent publishing of the node.
> 
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Ingo Molnar <mingo@kernel.org>
> Signed-off-by: Will Deacon <will.deacon@arm.com>
> ---
>  kernel/locking/qspinlock.c | 32 +++++++++++++++-----------------
>  1 file changed, 15 insertions(+), 17 deletions(-)
> 
> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> index 3ad8786a47e2..42c61f7b37c5 100644
> --- a/kernel/locking/qspinlock.c
> +++ b/kernel/locking/qspinlock.c
> @@ -141,10 +141,10 @@ static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
>  static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
>  {
>  	/*
> -	 * Use release semantics to make sure that the MCS node is properly
> -	 * initialized before changing the tail code.
> +	 * We can use relaxed semantics since the caller ensures that the
> +	 * MCS node is properly initialized before updating the tail.
>  	 */
> -	return (u32)xchg_release(&lock->tail,
> +	return (u32)xchg_relaxed(&lock->tail,
>  				 tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
>  }
>  
> @@ -178,10 +178,11 @@ static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
>  	for (;;) {
>  		new = (val & _Q_LOCKED_PENDING_MASK) | tail;
>  		/*
> -		 * Use release semantics to make sure that the MCS node is
> -		 * properly initialized before changing the tail code.
> +		 * We can use relaxed semantics since the caller ensures that
> +		 * the MCS node is properly initialized before updating the
> +		 * tail.
>  		 */
> -		old = atomic_cmpxchg_release(&lock->val, val, new);
> +		old = atomic_cmpxchg_relaxed(&lock->val, val, new);
>  		if (old == val)
>  			break;
>  
> @@ -340,12 +341,17 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
>  		goto release;
>  
>  	/*
> +	 * Ensure that the initialisation of @node is complete before we
> +	 * publish the updated tail and potentially link @node into the

I think it might be better if we mention exactly where we "publish the
updated tail" and "link @node", how about:

	* publish the updated tail via xchg_tail() and potentially link
	* @node into the waitqueue via WRITE_ONCE(->next,..) below.

and also add comments below like:

> +	 * waitqueue.
> +	 */
> +	smp_wmb();
> +
> +	/*
>  	 * We have already touched the queueing cacheline; don't bother with
>  	 * pending stuff.
>  	 *
>  	 * p,*,* -> n,*,*
> -	 *
> -	 * RELEASE, such that the stores to @node must be complete.

	* publish the updated tail

>  	 */
>  	old = xchg_tail(lock, tail);
>  	next = NULL;
> @@ -356,15 +362,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
>  	 */
>  	if (old & _Q_TAIL_MASK) {
>  		prev = decode_tail(old);
> -
> -		/*
> -		 * We must ensure that the stores to @node are observed before
> -		 * the write to prev->next. The address dependency from
> -		 * xchg_tail is not sufficient to ensure this because the read
> -		 * component of xchg_tail is unordered with respect to the
> -		 * initialisation of @node.
> -		 */
> -		smp_store_release(&prev->next, node);

		/* Eventually link @node to the wait queue */

Thoughts?

Regards,
Boqun

> +		WRITE_ONCE(prev->next, node);
>  
>  		pv_wait_node(node, prev);
>  		arch_mcs_spin_lock_contended(&node->locked);
> -- 
> 2.1.4
> 
Will Deacon April 9, 2018, 10:47 a.m. UTC | #7
Hi Boqun,

On Sat, Apr 07, 2018 at 01:47:11PM +0800, Boqun Feng wrote:
> On Thu, Apr 05, 2018 at 05:59:07PM +0100, Will Deacon wrote:
> > @@ -340,12 +341,17 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
> >  		goto release;
> >  
> >  	/*
> > +	 * Ensure that the initialisation of @node is complete before we
> > +	 * publish the updated tail and potentially link @node into the
> 
> I think it might be better if we mention exactly where we "publish the
> updated tail" and "link @node", how about:
> 
> 	* publish the updated tail via xchg_tail() and potentially link
> 	* @node into the waitqueue via WRITE_ONCE(->next,..) below.
> 
> and also add comments below like:
> 
> > +	 * waitqueue.
> > +	 */
> > +	smp_wmb();
> > +
> > +	/*
> >  	 * We have already touched the queueing cacheline; don't bother with
> >  	 * pending stuff.
> >  	 *
> >  	 * p,*,* -> n,*,*
> > -	 *
> > -	 * RELEASE, such that the stores to @node must be complete.
> 
> 	* publish the updated tail
> 
> >  	 */
> >  	old = xchg_tail(lock, tail);
> >  	next = NULL;
> > @@ -356,15 +362,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
> >  	 */
> >  	if (old & _Q_TAIL_MASK) {
> >  		prev = decode_tail(old);
> > -
> > -		/*
> > -		 * We must ensure that the stores to @node are observed before
> > -		 * the write to prev->next. The address dependency from
> > -		 * xchg_tail is not sufficient to ensure this because the read
> > -		 * component of xchg_tail is unordered with respect to the
> > -		 * initialisation of @node.
> > -		 */
> > -		smp_store_release(&prev->next, node);
> 
> 		/* Eventually link @node to the wait queue */
> 
> Thoughts?

I'll make some changes along these lines for v2. Thanks!

Will
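
Pulling the thread's suggestions together, the commented slowpath in a respin
might end up looking something like this (an illustrative sketch assembled
from the proposals above, not the actual v2):

	/*
	 * Ensure that the initialisation of @node is complete before we
	 * publish the updated tail via xchg_tail() and potentially link
	 * @node into the waitqueue via WRITE_ONCE(prev->next, node) below.
	 *
	 * Pairs with dependency ordering on the reader side: both the tail
	 * decoded from xchg_tail() and the explicit dereferences of
	 * node->next head address dependencies.
	 */
	smp_wmb();
	...
	if (old & _Q_TAIL_MASK) {
		prev = decode_tail(old);

		/* Eventually link @node into the waitqueue. */
		WRITE_ONCE(prev->next, node);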

Patch

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 3ad8786a47e2..42c61f7b37c5 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -141,10 +141,10 @@  static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
 static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
 {
 	/*
-	 * Use release semantics to make sure that the MCS node is properly
-	 * initialized before changing the tail code.
+	 * We can use relaxed semantics since the caller ensures that the
+	 * MCS node is properly initialized before updating the tail.
 	 */
-	return (u32)xchg_release(&lock->tail,
+	return (u32)xchg_relaxed(&lock->tail,
 				 tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
 }
 
@@ -178,10 +178,11 @@  static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
 	for (;;) {
 		new = (val & _Q_LOCKED_PENDING_MASK) | tail;
 		/*
-		 * Use release semantics to make sure that the MCS node is
-		 * properly initialized before changing the tail code.
+		 * We can use relaxed semantics since the caller ensures that
+		 * the MCS node is properly initialized before updating the
+		 * tail.
 		 */
-		old = atomic_cmpxchg_release(&lock->val, val, new);
+		old = atomic_cmpxchg_relaxed(&lock->val, val, new);
 		if (old == val)
 			break;
 
@@ -340,12 +341,17 @@  void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 		goto release;
 
 	/*
+	 * Ensure that the initialisation of @node is complete before we
+	 * publish the updated tail and potentially link @node into the
+	 * waitqueue.
+	 */
+	smp_wmb();
+
+	/*
 	 * We have already touched the queueing cacheline; don't bother with
 	 * pending stuff.
 	 *
 	 * p,*,* -> n,*,*
-	 *
-	 * RELEASE, such that the stores to @node must be complete.
 	 */
 	old = xchg_tail(lock, tail);
 	next = NULL;
@@ -356,15 +362,7 @@  void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	 */
 	if (old & _Q_TAIL_MASK) {
 		prev = decode_tail(old);
-
-		/*
-		 * We must ensure that the stores to @node are observed before
-		 * the write to prev->next. The address dependency from
-		 * xchg_tail is not sufficient to ensure this because the read
-		 * component of xchg_tail is unordered with respect to the
-		 * initialisation of @node.
-		 */
-		smp_store_release(&prev->next, node);
+		WRITE_ONCE(prev->next, node);
 
 		pv_wait_node(node, prev);
 		arch_mcs_spin_lock_contended(&node->locked);