[tip/core/rcu,16/23] rcu: Prevent initialization-time quiescent-state race

Message ID 1346350718-30937-16-git-send-email-paulmck@linux.vnet.ibm.com
State New

Commit Message

Paul E. McKenney Aug. 30, 2012, 6:18 p.m.
From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

Now that the grace-period initialization procedure is preemptible, it is
subject to the following race on systems whose rcu_node tree contains
more than one node:

1.	CPU 31 starts initializing the grace period, including the
	first leaf rcu_node structures, and is then preempted.

2.	CPU 0 refers to the first leaf rcu_node structure, and notes
	that a new grace period has started.  It passes through a
	quiescent state shortly thereafter, and informs the RCU core
	of this rite of passage.

3.	CPU 0 enters an RCU read-side critical section, acquiring
	a pointer to an RCU-protected data item.

4.	CPU 31 removes the data item referenced by CPU 0 from the
	data structure, and registers an RCU callback in order to
	free it.

5.	CPU 31 resumes initializing the grace period, including its
	own rcu_node structure.  It invokes rcu_start_gp_per_cpu(),
	which advances all callbacks, including the one registered
	in #4 above, to be handled by the current grace period.

6.	The remaining CPUs pass through quiescent states and inform
	the RCU core, but CPU 0 remains in its RCU read-side critical
	section, still referencing the now-removed data item.

7.	The grace period completes and all the callbacks are invoked,
	including the one that frees the data item that CPU 0 is still
	referencing.  Oops!!!

This commit therefore moves the callback handling to precede initialization
of any of the rcu_node structures, thus avoiding this race.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcutree.c |   33 +++++++++++++++++++--------------
 1 files changed, 19 insertions(+), 14 deletions(-)
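
The callback-list change described above can be sketched with a small
userspace model.  This is only an illustration, not the kernel code: struct
cb_list, cb_enqueue(), cb_count(), advance_old(), advance_new() and
advance_one_position() are hypothetical names made up for this sketch, and
only the RCU_*_TAIL indices and the nxtlist/nxttail[] fields are taken from
the patch.  advance_one_position() is a simplified model of the one-position
advance that the new comment in rcu_gp_init() refers to.

```c
#include <assert.h>
#include <stddef.h>

/* Segment indices, named as in the patch; each marks the tail of a segment. */
enum {
	RCU_DONE_TAIL,       /* grace period elapsed, ready to invoke */
	RCU_WAIT_TAIL,       /* waiting for the current grace period */
	RCU_NEXT_READY_TAIL, /* waiting for the next grace period */
	RCU_NEXT_TAIL,       /* not yet assigned to a grace period */
	RCU_NEXT_SIZE
};

struct cb {
	struct cb *next;
};

/* Hypothetical stand-in for the callback-list fields of struct rcu_data. */
struct cb_list {
	struct cb *nxtlist;
	struct cb **nxttail[RCU_NEXT_SIZE];
};

static void cb_list_init(struct cb_list *l)
{
	l->nxtlist = NULL;
	for (int i = 0; i < RCU_NEXT_SIZE; i++)
		l->nxttail[i] = &l->nxtlist;
}

/* New callbacks always land in the RCU_NEXT_TAIL segment. */
static void cb_enqueue(struct cb_list *l, struct cb *c)
{
	c->next = NULL;
	*l->nxttail[RCU_NEXT_TAIL] = c;
	l->nxttail[RCU_NEXT_TAIL] = &c->next;
}

/* Count the callbacks in the segment that ends at nxttail[seg]. */
static int cb_count(struct cb_list *l, int seg)
{
	assert(seg >= 0 && seg < RCU_NEXT_SIZE);
	struct cb **start = seg == RCU_DONE_TAIL ? &l->nxtlist
						 : l->nxttail[seg - 1];
	int n = 0;

	for (struct cb **p = start; p != l->nxttail[seg]; p = &(*p)->next)
		n++;
	return n;
}

/* Removed lines: everything queued so far waits on the current grace period. */
static void advance_old(struct cb_list *l)
{
	l->nxttail[RCU_NEXT_READY_TAIL] = l->nxttail[RCU_NEXT_TAIL];
	l->nxttail[RCU_WAIT_TAIL] = l->nxttail[RCU_NEXT_TAIL];
}

/* Added lines: move RCU_NEXT_TAIL callbacks to RCU_NEXT_READY_TAIL only. */
static void advance_new(struct cb_list *l)
{
	l->nxttail[RCU_NEXT_READY_TAIL] = l->nxttail[RCU_NEXT_TAIL];
}

/* Simplified model of advancing every segment by one position. */
static void advance_one_position(struct cb_list *l)
{
	l->nxttail[RCU_DONE_TAIL] = l->nxttail[RCU_WAIT_TAIL];
	l->nxttail[RCU_WAIT_TAIL] = l->nxttail[RCU_NEXT_READY_TAIL];
	l->nxttail[RCU_NEXT_READY_TAIL] = l->nxttail[RCU_NEXT_TAIL];
}
```

In this model, calling advance_new() before a late cb_enqueue() (step 4) and
then advance_one_position() leaves the late callback in the
RCU_NEXT_READY_TAIL segment, so it waits for the next grace period.  Calling
advance_old() after the late cb_enqueue() (step 5) instead puts it in the
RCU_WAIT_TAIL segment, which is the situation the race depends on.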

Comments

Josh Triplett Sept. 3, 2012, 9:37 a.m. | #1
On Thu, Aug 30, 2012 at 11:18:31AM -0700, Paul E. McKenney wrote:
> From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> 
> Now that the grace-period initialization procedure is preemptible, it is
> subject to the following race on systems whose rcu_node tree contains
> more than one node:
> 
> 1.	CPU 31 starts initializing the grace period, including the
> 	first leaf rcu_node structures, and is then preempted.
> 
> 2.	CPU 0 refers to the first leaf rcu_node structure, and notes
> 	that a new grace period has started.  It passes through a
> 	quiescent state shortly thereafter, and informs the RCU core
> 	of this rite of passage.
> 
> 3.	CPU 0 enters an RCU read-side critical section, acquiring
> 	a pointer to an RCU-protected data item.
> 
> 4.	CPU 31 removes the data item referenced by CPU 0 from the
> 	data structure, and registers an RCU callback in order to
> 	free it.
> 
> 5.	CPU 31 resumes initializing the grace period, including its
> 	own rcu_node structure.  It invokes rcu_start_gp_per_cpu(),
> 	which advances all callbacks, including the one registered
> 	in #4 above, to be handled by the current grace period.
> 
> 6.	The remaining CPUs pass through quiescent states and inform
> 	the RCU core, but CPU 0 remains in its RCU read-side critical
> 	section, still referencing the now-removed data item.
> 
> 7.	The grace period completes and all the callbacks are invoked,
> 	including the one that frees the data item that CPU 0 is still
> 	referencing.  Oops!!!
> 
> This commit therefore moves the callback handling to precede initialization
> of any of the rcu_node structures, thus avoiding this race.

I don't think it makes sense to introduce and subsequently fix a race in
the same patch series. :)

Could you squash this patch into the one moving grace-period
initialization into a kthread?

- Josh Triplett

> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> ---
>  kernel/rcutree.c |   33 +++++++++++++++++++--------------
>  1 files changed, 19 insertions(+), 14 deletions(-)
> 
> diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> index 55f20fd..d435009 100644
> --- a/kernel/rcutree.c
> +++ b/kernel/rcutree.c
> @@ -1028,20 +1028,6 @@ rcu_start_gp_per_cpu(struct rcu_state *rsp, struct rcu_node *rnp, struct rcu_dat
>  	/* Prior grace period ended, so advance callbacks for current CPU. */
>  	__rcu_process_gp_end(rsp, rnp, rdp);
>  
> -	/*
> -	 * Because this CPU just now started the new grace period, we know
> -	 * that all of its callbacks will be covered by this upcoming grace
> -	 * period, even the ones that were registered arbitrarily recently.
> -	 * Therefore, advance all outstanding callbacks to RCU_WAIT_TAIL.
> -	 *
> -	 * Other CPUs cannot be sure exactly when the grace period started.
> -	 * Therefore, their recently registered callbacks must pass through
> -	 * an additional RCU_NEXT_READY stage, so that they will be handled
> -	 * by the next RCU grace period.
> -	 */
> -	rdp->nxttail[RCU_NEXT_READY_TAIL] = rdp->nxttail[RCU_NEXT_TAIL];
> -	rdp->nxttail[RCU_WAIT_TAIL] = rdp->nxttail[RCU_NEXT_TAIL];
> -
>  	/* Set state so that this CPU will detect the next quiescent state. */
>  	__note_new_gpnum(rsp, rnp, rdp);
>  }
> @@ -1068,6 +1054,25 @@ static int rcu_gp_init(struct rcu_state *rsp)
>  	rsp->gpnum++;
>  	trace_rcu_grace_period(rsp->name, rsp->gpnum, "start");
>  	record_gp_stall_check_time(rsp);
> +
> +	/*
> +	 * Because this CPU just now started the new grace period, we
> +	 * know that all of its callbacks will be covered by this upcoming
> +	 * grace period, even the ones that were registered arbitrarily
> +	 * recently.    Therefore, advance all RCU_NEXT_TAIL callbacks
> +	 * to RCU_NEXT_READY_TAIL.  When the CPU later recognizes the
> +	 * start of the new grace period, it will advance all callbacks
> +	 * one position, which will cause all of its current outstanding
> +	 * callbacks to be handled by the newly started grace period.
> +	 *
> +	 * Other CPUs cannot be sure exactly when the grace period started.
> +	 * Therefore, their recently registered callbacks must pass through
> +	 * an additional RCU_NEXT_READY stage, so that they will be handled
> +	 * by the next RCU grace period.
> +	 */
> +	rdp = __this_cpu_ptr(rsp->rda);
> +	rdp->nxttail[RCU_NEXT_READY_TAIL] = rdp->nxttail[RCU_NEXT_TAIL];
> +
>  	raw_spin_unlock_irqrestore(&rnp->lock, flags);
>  
>  	/* Exclude any concurrent CPU-hotplug operations. */
> -- 
> 1.7.8
>
Paul E. McKenney Sept. 5, 2012, 6:19 p.m. | #2
On Mon, Sep 03, 2012 at 02:37:42AM -0700, Josh Triplett wrote:
> On Thu, Aug 30, 2012 at 11:18:31AM -0700, Paul E. McKenney wrote:
> > From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> > 
> > Now that the grace-period initialization procedure is preemptible, it is
> > subject to the following race on systems whose rcu_node tree contains
> > more than one node:
> > 
> > 1.	CPU 31 starts initializing the grace period, including the
> > 	first leaf rcu_node structures, and is then preempted.
> > 
> > 2.	CPU 0 refers to the first leaf rcu_node structure, and notes
> > 	that a new grace period has started.  It passes through a
> > 	quiescent state shortly thereafter, and informs the RCU core
> > 	of this rite of passage.
> > 
> > 3.	CPU 0 enters an RCU read-side critical section, acquiring
> > 	a pointer to an RCU-protected data item.
> > 
> > 4.	CPU 31 removes the data item referenced by CPU 0 from the
> > 	data structure, and registers an RCU callback in order to
> > 	free it.
> > 
> > 5.	CPU 31 resumes initializing the grace period, including its
> > 	own rcu_node structure.  It invokes rcu_start_gp_per_cpu(),
> > 	which advances all callbacks, including the one registered
> > 	in #4 above, to be handled by the current grace period.
> > 
> > 6.	The remaining CPUs pass through quiescent states and inform
> > 	the RCU core, but CPU 0 remains in its RCU read-side critical
> > 	section, still referencing the now-removed data item.
> > 
> > 7.	The grace period completes and all the callbacks are invoked,
> > 	including the one that frees the data item that CPU 0 is still
> > 	referencing.  Oops!!!
> > 
> > This commit therefore moves the callback handling to precede initialization
> > of any of the rcu_node structures, thus avoiding this race.
> 
> I don't think it makes sense to introduce and subsequently fix a race in
> the same patch series. :)
> 
> Could you squash this patch into the one moving grace-period
> initialization into a kthread?

I tried that, and got a surprisingly large set of conflicts.  Ah, OK,
the problem is that breaking up rcu_gp_kthread() into subfunctions
did enough code motion to defeat straightforward rebasing.  Is there
some way to tell "git rebase" about such code motion, or would this
need to be carried out carefully by hand?

							Thanx, Paul

> - Josh Triplett
> 
> > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > ---
> >  kernel/rcutree.c |   33 +++++++++++++++++++--------------
> >  1 files changed, 19 insertions(+), 14 deletions(-)
> > 
> > diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> > index 55f20fd..d435009 100644
> > --- a/kernel/rcutree.c
> > +++ b/kernel/rcutree.c
> > @@ -1028,20 +1028,6 @@ rcu_start_gp_per_cpu(struct rcu_state *rsp, struct rcu_node *rnp, struct rcu_dat
> >  	/* Prior grace period ended, so advance callbacks for current CPU. */
> >  	__rcu_process_gp_end(rsp, rnp, rdp);
> >  
> > -	/*
> > -	 * Because this CPU just now started the new grace period, we know
> > -	 * that all of its callbacks will be covered by this upcoming grace
> > -	 * period, even the ones that were registered arbitrarily recently.
> > -	 * Therefore, advance all outstanding callbacks to RCU_WAIT_TAIL.
> > -	 *
> > -	 * Other CPUs cannot be sure exactly when the grace period started.
> > -	 * Therefore, their recently registered callbacks must pass through
> > -	 * an additional RCU_NEXT_READY stage, so that they will be handled
> > -	 * by the next RCU grace period.
> > -	 */
> > -	rdp->nxttail[RCU_NEXT_READY_TAIL] = rdp->nxttail[RCU_NEXT_TAIL];
> > -	rdp->nxttail[RCU_WAIT_TAIL] = rdp->nxttail[RCU_NEXT_TAIL];
> > -
> >  	/* Set state so that this CPU will detect the next quiescent state. */
> >  	__note_new_gpnum(rsp, rnp, rdp);
> >  }
> > @@ -1068,6 +1054,25 @@ static int rcu_gp_init(struct rcu_state *rsp)
> >  	rsp->gpnum++;
> >  	trace_rcu_grace_period(rsp->name, rsp->gpnum, "start");
> >  	record_gp_stall_check_time(rsp);
> > +
> > +	/*
> > +	 * Because this CPU just now started the new grace period, we
> > +	 * know that all of its callbacks will be covered by this upcoming
> > +	 * grace period, even the ones that were registered arbitrarily
> > +	 * recently.    Therefore, advance all RCU_NEXT_TAIL callbacks
> > +	 * to RCU_NEXT_READY_TAIL.  When the CPU later recognizes the
> > +	 * start of the new grace period, it will advance all callbacks
> > +	 * one position, which will cause all of its current outstanding
> > +	 * callbacks to be handled by the newly started grace period.
> > +	 *
> > +	 * Other CPUs cannot be sure exactly when the grace period started.
> > +	 * Therefore, their recently registered callbacks must pass through
> > +	 * an additional RCU_NEXT_READY stage, so that they will be handled
> > +	 * by the next RCU grace period.
> > +	 */
> > +	rdp = __this_cpu_ptr(rsp->rda);
> > +	rdp->nxttail[RCU_NEXT_READY_TAIL] = rdp->nxttail[RCU_NEXT_TAIL];
> > +
> >  	raw_spin_unlock_irqrestore(&rnp->lock, flags);
> >  
> >  	/* Exclude any concurrent CPU-hotplug operations. */
> > -- 
> > 1.7.8
> > 
Josh Triplett Sept. 5, 2012, 6:55 p.m. | #3
On Wed, Sep 05, 2012 at 11:19:20AM -0700, Paul E. McKenney wrote:
> On Mon, Sep 03, 2012 at 02:37:42AM -0700, Josh Triplett wrote:
> > On Thu, Aug 30, 2012 at 11:18:31AM -0700, Paul E. McKenney wrote:
> > > From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> > > 
> > > Now that the grace-period initialization procedure is preemptible, it is
> > > subject to the following race on systems whose rcu_node tree contains
> > > more than one node:
> > > 
> > > 1.	CPU 31 starts initializing the grace period, including the
> > > 	first leaf rcu_node structures, and is then preempted.
> > > 
> > > 2.	CPU 0 refers to the first leaf rcu_node structure, and notes
> > > 	that a new grace period has started.  It passes through a
> > > 	quiescent state shortly thereafter, and informs the RCU core
> > > 	of this rite of passage.
> > > 
> > > 3.	CPU 0 enters an RCU read-side critical section, acquiring
> > > 	a pointer to an RCU-protected data item.
> > > 
> > > 4.	CPU 31 removes the data item referenced by CPU 0 from the
> > > 	data structure, and registers an RCU callback in order to
> > > 	free it.
> > > 
> > > 5.	CPU 31 resumes initializing the grace period, including its
> > > 	own rcu_node structure.  It invokes rcu_start_gp_per_cpu(),
> > > 	which advances all callbacks, including the one registered
> > > 	in #4 above, to be handled by the current grace period.
> > > 
> > > 6.	The remaining CPUs pass through quiescent states and inform
> > > 	the RCU core, but CPU 0 remains in its RCU read-side critical
> > > 	section, still referencing the now-removed data item.
> > > 
> > > 7.	The grace period completes and all the callbacks are invoked,
> > > 	including the one that frees the data item that CPU 0 is still
> > > 	referencing.  Oops!!!
> > > 
> > > This commit therefore moves the callback handling to precede initialization
> > > of any of the rcu_node structures, thus avoiding this race.
> > 
> > I don't think it makes sense to introduce and subsequently fix a race in
> > the same patch series. :)
> > 
> > Could you squash this patch into the one moving grace-period
> > initialization into a kthread?
> 
> I tried that, and got a surprisingly large set of conflicts.  Ah, OK,
> the problem is that breaking up rcu_gp_kthread() into subfunctions
> did enough code motion to defeat straightforward rebasing.  Is there
> some way to tell "git rebase" about such code motion, or would this
> need to be carried out carefully by hand?

To the extent rebase knows how to handle that, I think it does so
automatically as part of merge attempts.  Fortunately, in this case, the
change consists of moving two lines of code and their attached comment,
which seems easy enough to change in the original code; you'll then get
a conflict on the commit that moves the newly fixed code (easily
resolved by moving the change to the new code), and conflicts on any
changes next to the change in the new code (hopefully handled by
three-way merge, and if not then easily fixed by keeping the new lines).

- Josh Triplett
Paul E. McKenney Sept. 5, 2012, 7:49 p.m. | #4
On Wed, Sep 05, 2012 at 11:55:34AM -0700, Josh Triplett wrote:
> On Wed, Sep 05, 2012 at 11:19:20AM -0700, Paul E. McKenney wrote:
> > On Mon, Sep 03, 2012 at 02:37:42AM -0700, Josh Triplett wrote:
> > > On Thu, Aug 30, 2012 at 11:18:31AM -0700, Paul E. McKenney wrote:
> > > > From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> > > > 
> > > > Now that the grace-period initialization procedure is preemptible, it is
> > > > subject to the following race on systems whose rcu_node tree contains
> > > > more than one node:
> > > > 
> > > > 1.	CPU 31 starts initializing the grace period, including the
> > > > 	first leaf rcu_node structures, and is then preempted.
> > > > 
> > > > 2.	CPU 0 refers to the first leaf rcu_node structure, and notes
> > > > 	that a new grace period has started.  It passes through a
> > > > 	quiescent state shortly thereafter, and informs the RCU core
> > > > 	of this rite of passage.
> > > > 
> > > > 3.	CPU 0 enters an RCU read-side critical section, acquiring
> > > > 	a pointer to an RCU-protected data item.
> > > > 
> > > > 4.	CPU 31 removes the data item referenced by CPU 0 from the
> > > > 	data structure, and registers an RCU callback in order to
> > > > 	free it.
> > > > 
> > > > 5.	CPU 31 resumes initializing the grace period, including its
> > > > 	own rcu_node structure.  It invokes rcu_start_gp_per_cpu(),
> > > > 	which advances all callbacks, including the one registered
> > > > 	in #4 above, to be handled by the current grace period.
> > > > 
> > > > 6.	The remaining CPUs pass through quiescent states and inform
> > > > 	the RCU core, but CPU 0 remains in its RCU read-side critical
> > > > 	section, still referencing the now-removed data item.
> > > > 
> > > > 7.	The grace period completes and all the callbacks are invoked,
> > > > 	including the one that frees the data item that CPU 0 is still
> > > > 	referencing.  Oops!!!
> > > > 
> > > > This commit therefore moves the callback handling to precede initialization
> > > > of any of the rcu_node structures, thus avoiding this race.
> > > 
> > > I don't think it makes sense to introduce and subsequently fix a race in
> > > the same patch series. :)
> > > 
> > > Could you squash this patch into the one moving grace-period
> > > initialization into a kthread?
> > 
> > I tried that, and got a surprisingly large set of conflicts.  Ah, OK,
> > the problem is that breaking up rcu_gp_kthread() into subfunctions
> > did enough code motion to defeat straightforward rebasing.  Is there
> > some way to tell "git rebase" about such code motion, or would this
> > need to be carried out carefully by hand?
> 
> To the extent rebase knows how to handle that, I think it does so
> automatically as part of merge attempts.  Fortunately, in this case, the
> change consists of moving two lines of code and their attached comment,
> which seems easy enough to change in the original code; you'll then get
> a conflict on the commit that moves the newly fixed code (easily
> resolved by moving the change to the new code), and conflicts on any
> changes next to the change in the new code (hopefully handled by
> three-way merge, and if not then easily fixed by keeping the new lines).

Good point, perhaps if I do the code movement manually and use multiple
rebases it will go more easily.

							Thanx, Paul
Peter Zijlstra Sept. 6, 2012, 2:21 p.m. | #5
On Wed, 2012-09-05 at 11:19 -0700, Paul E. McKenney wrote:
> I tried that, and got a surprisingly large set of conflicts.  Ah, OK,
> the problem is that breaking up rcu_gp_kthread() into subfunctions
> did enough code motion to defeat straightforward rebasing.  Is there
> some way to tell "git rebase" about such code motion, or would this
> need to be carried out carefully by hand? 

The alternative is doing that rebase by hand and in the process make
that code movement patch (6) obsolete by making patches (1) and (3)
introduce the code in the final form :-)

Yay for less patches :-)
Paul E. McKenney Sept. 6, 2012, 4:18 p.m. | #6
On Thu, Sep 06, 2012 at 04:21:30PM +0200, Peter Zijlstra wrote:
> On Wed, 2012-09-05 at 11:19 -0700, Paul E. McKenney wrote:
> > I tried that, and got a surprisingly large set of conflicts.  Ah, OK,
> > the problem is that breaking up rcu_gp_kthread() into subfunctions
> > did enough code motion to defeat straightforward rebasing.  Is there
> > some way to tell "git rebase" about such code motion, or would this
> > need to be carried out carefully by hand? 
> 
> The alternative is doing that rebase by hand and in the process make
> that code movement patch (6) obsolete by making patches (1) and (3)
> introduce the code in the final form :-)
> 
> Yay for less patches :-)

Actually, my original intent was that patches 1-6 be one patch.
The need to locate a nasty bug caused me to split it up.  So the best
approach is to squash patches 1-6 together with the related patches.

							Thanx, paul
Peter Zijlstra Sept. 6, 2012, 4:22 p.m. | #7
On Thu, 2012-09-06 at 09:18 -0700, Paul E. McKenney wrote:
> On Thu, Sep 06, 2012 at 04:21:30PM +0200, Peter Zijlstra wrote:
> > On Wed, 2012-09-05 at 11:19 -0700, Paul E. McKenney wrote:
> > > I tried that, and got a surprisingly large set of conflicts.  Ah, OK,
> > > the problem is that breaking up rcu_gp_kthread() into subfunctions
> > > did enough code motion to defeat straightforward rebasing.  Is there
> > > some way to tell "git rebase" about such code motion, or would this
> > > need to be carried out carefully by hand? 
> > 
> > The alternative is doing that rebase by hand and in the process make
> > that code movement patch (6) obsolete by making patches (1) and (3)
> > introduce the code in the final form :-)
> > 
> > Yay for less patches :-)
> 
> Actually, my original intent was that patches 1-6 be one patch.
> The need to locate a nasty bug caused me to split it up.  So the best
> approach is to squash patches 1-6 together with the related patches.

I didn't mind the smaller steps, but patches like 6 which move newly
introduced code around are weird. As are patches fixing bugs introduced
in previous patches (of the same series).

Patch

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 55f20fd..d435009 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -1028,20 +1028,6 @@  rcu_start_gp_per_cpu(struct rcu_state *rsp, struct rcu_node *rnp, struct rcu_dat
 	/* Prior grace period ended, so advance callbacks for current CPU. */
 	__rcu_process_gp_end(rsp, rnp, rdp);
 
-	/*
-	 * Because this CPU just now started the new grace period, we know
-	 * that all of its callbacks will be covered by this upcoming grace
-	 * period, even the ones that were registered arbitrarily recently.
-	 * Therefore, advance all outstanding callbacks to RCU_WAIT_TAIL.
-	 *
-	 * Other CPUs cannot be sure exactly when the grace period started.
-	 * Therefore, their recently registered callbacks must pass through
-	 * an additional RCU_NEXT_READY stage, so that they will be handled
-	 * by the next RCU grace period.
-	 */
-	rdp->nxttail[RCU_NEXT_READY_TAIL] = rdp->nxttail[RCU_NEXT_TAIL];
-	rdp->nxttail[RCU_WAIT_TAIL] = rdp->nxttail[RCU_NEXT_TAIL];
-
 	/* Set state so that this CPU will detect the next quiescent state. */
 	__note_new_gpnum(rsp, rnp, rdp);
 }
@@ -1068,6 +1054,25 @@  static int rcu_gp_init(struct rcu_state *rsp)
 	rsp->gpnum++;
 	trace_rcu_grace_period(rsp->name, rsp->gpnum, "start");
 	record_gp_stall_check_time(rsp);
+
+	/*
+	 * Because this CPU just now started the new grace period, we
+	 * know that all of its callbacks will be covered by this upcoming
+	 * grace period, even the ones that were registered arbitrarily
+	 * recently.    Therefore, advance all RCU_NEXT_TAIL callbacks
+	 * to RCU_NEXT_READY_TAIL.  When the CPU later recognizes the
+	 * start of the new grace period, it will advance all callbacks
+	 * one position, which will cause all of its current outstanding
+	 * callbacks to be handled by the newly started grace period.
+	 *
+	 * Other CPUs cannot be sure exactly when the grace period started.
+	 * Therefore, their recently registered callbacks must pass through
+	 * an additional RCU_NEXT_READY stage, so that they will be handled
+	 * by the next RCU grace period.
+	 */
+	rdp = __this_cpu_ptr(rsp->rda);
+	rdp->nxttail[RCU_NEXT_READY_TAIL] = rdp->nxttail[RCU_NEXT_TAIL];
+
 	raw_spin_unlock_irqrestore(&rnp->lock, flags);
 
 	/* Exclude any concurrent CPU-hotplug operations. */