[RFC,tip/core/rcu,1/6] rcu: Stabilize use of num_online_cpus() for GP short circuit

Message ID 1335199347-13926-1-git-send-email-paulmck@linux.vnet.ibm.com
State New
Headers show

Commit Message

Paul E. McKenney April 23, 2012, 4:42 p.m.
From: "Paul E. McKenney" <paul.mckenney@linaro.org>

The rcu_blocking_is_gp() function tests to see if there is only one
online CPU, and if so, synchronize_sched() and friends become no-ops.
However, for larger systems, num_online_cpus() scans a large vector,
and might be preempted while doing so.  While preempted, any number
of CPUs might come online and go offline, potentially resulting in
num_online_cpus() returning 1 when there never had only been one
CPU online.  This would result in a too-short RCU grace period, which
could in turn result in total failure.

This commit therefore protects the num_online_cpus() with preempt_disable().

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 include/linux/rcutree.h |    7 ++++++-
 1 files changed, 6 insertions(+), 1 deletions(-)

Comments

Srivatsa S. Bhat April 24, 2012, 3:35 p.m. | #1
On 04/23/2012 10:12 PM, Paul E. McKenney wrote:

> From: "Paul E. McKenney" <paul.mckenney@linaro.org>
> 
> The rcu_blocking_is_gp() function tests to see if there is only one
> online CPU, and if so, synchronize_sched() and friends become no-ops.
> However, for larger systems, num_online_cpus() scans a large vector,


Venki had posted a patch to optimize that by using a variable, so that we
don't calculate the value each and every time.
http://thread.gmane.org/gmane.linux.kernel/1240569/focus=1246659

However, unfortunately there was some confusion around that patch and
even though it made it to akpm's tree and stayed there briefly, it didn't
go upstream. Venki had attempted to resolve the confusion here:
http://thread.gmane.org/gmane.linux.kernel/1240569/focus=1260702

> and might be preempted while doing so.  While preempted, any number
> of CPUs might come online and go offline, potentially resulting in
> num_online_cpus() returning 1 when there never had only been one
> CPU online.  This would result in a too-short RCU grace period, which
> could in turn result in total failure.
> 


So, if I understand correctly, the "too-short RCU grace period being
troublesome" case occurs when we move from UP to SMP (ie., RCU thought
that the system is in UP judging by the value returned by num_online_cpus(),
but underneath, some CPUs got onlined in the meantime and hence the system
now became SMP).

> This commit therefore protects the num_online_cpus() with preempt_disable().
> 
> Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> ---
>  include/linux/rcutree.h |    7 ++++++-
>  1 files changed, 6 insertions(+), 1 deletions(-)
> 
> diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h
> index e8ee5dd..997179f 100644
> --- a/include/linux/rcutree.h
> +++ b/include/linux/rcutree.h
> @@ -101,8 +101,13 @@ extern void rcu_sched_force_quiescent_state(void);
>  /* A context switch is a grace period for RCU-sched and RCU-bh. */
>  static inline int rcu_blocking_is_gp(void)
>  {
> +	int ret;
> +


Wouldn't it be more appropriate to return a bool? (and of course change the
function prototype as well..)

>  	might_sleep();  /* Check for RCU read-side critical section. */
> -	return num_online_cpus() == 1;
> +	preempt_disable();  /* Prevent CPU hotplug from confusing us. */


Preempt disable only prevents CPU offline from occurring concurrently.  So
even if we use preempt_disable/enable(), new CPUs can always come online!
And going by the above concern about RCU grace periods being too-short, I
guess we should also protect against CPUs from coming online concurrently
right?

[By the way, we don't really need to protect against CPU offlining, because
an SMP -> UP switch is harmless; we'll just miss a fast-path in the worst
case, but the code correctness is still guaranteed.]

> +	ret = num_online_cpus() == 1;
> +	preempt_enable();
> +	return ret;
>  }
>


Maybe I am missing something, but how is this code safe? Isn't it still racy?
Even if we use:

{
	might_sleep();
	get_online_cpus(); /* Prevent both CPU offline and CPU online */
	ret = num_online_cpus() == 1;
	put_online_cpus();
			<===============
	return ret;
}

What prevents a CPU Hotplug from happening right before we return from this
function, thereby rendering our num_online_cpus() calculation out-dated and
hence useless?

Regards,
Srivatsa S. Bhat
Paul E. McKenney April 24, 2012, 4:50 p.m. | #2
On Tue, Apr 24, 2012 at 09:05:20PM +0530, Srivatsa S. Bhat wrote:

Thank you for looking this over!

> On 04/23/2012 10:12 PM, Paul E. McKenney wrote:
> 
> > From: "Paul E. McKenney" <paul.mckenney@linaro.org>
> > 
> > The rcu_blocking_is_gp() function tests to see if there is only one
> > online CPU, and if so, synchronize_sched() and friends become no-ops.
> > However, for larger systems, num_online_cpus() scans a large vector,
> 
> 
> Venki had posted a patch to optimize that by using a variable, so that we
> don't calculate the value each and every time.
> http://thread.gmane.org/gmane.linux.kernel/1240569/focus=1246659
> 
> However, unfortunately there was some confusion around that patch and
> even though it made it to akpm's tree and stayed there briefly, it didn't
> go upstream. Venki had attempted to resolve the confusion here:
> http://thread.gmane.org/gmane.linux.kernel/1240569/focus=1260702

Having a single variable tracking the online state would be good,
but as you say it isn't there yet.

> > and might be preempted while doing so.  While preempted, any number
> > of CPUs might come online and go offline, potentially resulting in
> > num_online_cpus() returning 1 when there never had only been one
> > CPU online.  This would result in a too-short RCU grace period, which
> > could in turn result in total failure.
> > 
> 
> 
> So, if I understand correctly, the "too-short RCU grace period being
> troublesome" case occurs when we move from UP to SMP (ie., RCU thought
> that the system is in UP judging by the value returned by num_online_cpus(),
> but underneath, some CPUs got onlined in the meantime and hence the system
> now became SMP).

No, the UP-to-SMP case is OK.  If we were UP at any point during the
execution of synchronize_rcu(), a non-preemptible RCU grace period can
be a no-op.

The problematic case is instead the one where we were SMP throughout,
but rcu_blocking_is_gp() mistakenly decides that we were UP.  For example,
consider the following sequence of events, based on the commit log's
sentence "While preempted, any number of CPUs might come online and go
offline, potentially resulting in num_online_cpus() returning 1 when
there never had only been one CPU online":

o	CPUs 100 and 150 are initially online, with a long-running RCU
	read-side critical section on CPU 100 and rcu_blocking_is_gp()
	initially running on CPU 150.

o	The rcu_blocking_is_gp() function checks the bits for CPUs
	0-63, and counts zero online CPUs.

o	CPU 1 comes online.

o	The rcu_blocking_is_gp() function checks the bits for CPUs
	64-127, and counts one online CPUs, for a total thus far
	of one CPU online..

o	CPU 150 goes offline.  Ah, but it cannot do this, because
	this is non-preemptible RCU, which means that the RCU
	read-side critical section has disabled preemption on
	CPU 100, which prevents CPU 150 from going offline, which
	prevents this scenario from occurring.

	So, yes, rcu_blocking_is_gp() can be fooled into incorrectly
	stating that the system has only one CPU (or even that it has
	only zero CPUs), but not while there is actually a non-preemptible
	RCU read-side critical section running.  Yow!

	I clearly had not thought this change through far enough,
	thank you for calling it to my attention!

So I could replace this patch with a patch that adds a comment
explaining why this works.  Though this patch might be simpler and
easier to understand.  ;-)  But not so good for real-time response
on large systems, I suppose.

And rcu_blocking_is_gp() is called only from synchronize_sched() and
synchronize_rcu_bh(), so it is safe for all current callers.  But it is
defined publicly, so I should move it to somewhere like kernel/rcutree.c
to keep new users from being fatally confused and disappointed.

I can also change the comparison from "num_online_cpus() == 1" to
"num_online_cpus() <= 1".  That should at least serve as a caution to
people who might attempt to use it where it shouldn't be used.  ;-)

> > This commit therefore protects the num_online_cpus() with preempt_disable().
> > 
> > Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
> > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > ---
> >  include/linux/rcutree.h |    7 ++++++-
> >  1 files changed, 6 insertions(+), 1 deletions(-)
> > 
> > diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h
> > index e8ee5dd..997179f 100644
> > --- a/include/linux/rcutree.h
> > +++ b/include/linux/rcutree.h
> > @@ -101,8 +101,13 @@ extern void rcu_sched_force_quiescent_state(void);
> >  /* A context switch is a grace period for RCU-sched and RCU-bh. */
> >  static inline int rcu_blocking_is_gp(void)
> >  {
> > +	int ret;
> > +
> 
> 
> Wouldn't it be more appropriate to return a bool? (and of course change the
> function prototype as well..)

Sounds like a separate patch, but yes it does.

> >  	might_sleep();  /* Check for RCU read-side critical section. */
> > -	return num_online_cpus() == 1;
> > +	preempt_disable();  /* Prevent CPU hotplug from confusing us. */
> 
> 
> Preempt disable only prevents CPU offline from occurring concurrently.  So
> even if we use preempt_disable/enable(), new CPUs can always come online!
> And going by the above concern about RCU grace periods being too-short, I
> guess we should also protect against CPUs from coming online concurrently
> right?
> 
> [By the way, we don't really need to protect against CPU offlining, because
> an SMP -> UP switch is harmless; we'll just miss a fast-path in the worst
> case, but the code correctness is still guaranteed.]

Yes, CPUs coming online are harmless, so it is OK that disabling preemption
does not prevent this.

> > +	ret = num_online_cpus() == 1;
> > +	preempt_enable();
> > +	return ret;
> >  }
> 
> Maybe I am missing something, but how is this code safe? Isn't it still racy?
> Even if we use:
> 
> {
> 	might_sleep();
> 	get_online_cpus(); /* Prevent both CPU offline and CPU online */
> 	ret = num_online_cpus() == 1;
> 	put_online_cpus();
> 			<===============
> 	return ret;
> }
> 
> What prevents a CPU Hotplug from happening right before we return from this
> function, thereby rendering our num_online_cpus() calculation out-dated and
> hence useless?

Nothing prevents CPU hotplug from happening at the indicated point,
but the function is still safe.

The reason for this safety is that if we were UP at any time during the
execution of synchronize_rcu(), we had to have been running on that single
CPU, which means that a non-preemptible RCU read-side critical section
could not possibly have also been running.  Because non-preemptible
RCU read-side critical sections run with preemption disabled, this in
turn means that all prior RCU read-side critical sections have to have
completed, which means that synchronize_rcu() doesn't need to do anything.

							Thanx, Paul
Srivatsa S. Bhat April 24, 2012, 5:46 p.m. | #3
On 04/24/2012 10:20 PM, Paul E. McKenney wrote:

> On Tue, Apr 24, 2012 at 09:05:20PM +0530, Srivatsa S. Bhat wrote:
> 
>>
>>> From: "Paul E. McKenney" <paul.mckenney@linaro.org>
>>>
>>> The rcu_blocking_is_gp() function tests to see if there is only one
>>> online CPU, and if so, synchronize_sched() and friends become no-ops.
>>> However, for larger systems, num_online_cpus() scans a large vector,
>>
>>
>> Venki had posted a patch to optimize that by using a variable, so that we
>> don't calculate the value each and every time.
>> http://thread.gmane.org/gmane.linux.kernel/1240569/focus=1246659
>>
>> However, unfortunately there was some confusion around that patch and
>> even though it made it to akpm's tree and stayed there briefly, it didn't
>> go upstream. Venki had attempted to resolve the confusion here:
>> http://thread.gmane.org/gmane.linux.kernel/1240569/focus=1260702
> 
> Having a single variable tracking the online state would be good,
> but as you say it isn't there yet.
> 
>>> and might be preempted while doing so.  While preempted, any number
>>> of CPUs might come online and go offline, potentially resulting in
>>> num_online_cpus() returning 1 when there never had only been one
>>> CPU online.  This would result in a too-short RCU grace period, which
>>> could in turn result in total failure.
>>
>>
[...]

> 
> The problematic case is instead the one where we were SMP throughout,
> but rcu_blocking_is_gp() mistakenly decides that we were UP.  For example,
> consider the following sequence of events, based on the commit log's
> sentence "While preempted, any number of CPUs might come online and go
> offline, potentially resulting in num_online_cpus() returning 1 when
> there never had only been one CPU online":
> 


Oh, I didn't think in the direction illustrated below when reading that
sentence.. :-(

> o	CPUs 100 and 150 are initially online, with a long-running RCU
> 	read-side critical section on CPU 100 and rcu_blocking_is_gp()
> 	initially running on CPU 150.
> 
> o	The rcu_blocking_is_gp() function checks the bits for CPUs
> 	0-63, and counts zero online CPUs.
> 
> o	CPU 1 comes online.
> 
> o	The rcu_blocking_is_gp() function checks the bits for CPUs
> 	64-127, and counts one online CPUs, for a total thus far
> 	of one CPU online..
> 
> o	CPU 150 goes offline.  Ah, but it cannot do this, because
> 	this is non-preemptible RCU, which means that the RCU
> 	read-side critical section has disabled preemption on
> 	CPU 100, which prevents CPU 150 from going offline, which
> 	prevents this scenario from occurring.
> 
> 	So, yes, rcu_blocking_is_gp() can be fooled into incorrectly
> 	stating that the system has only one CPU (or even that it has
> 	only zero CPUs), but not while there is actually a non-preemptible
> 	RCU read-side critical section running.  Yow!
>


Awesome :-)

 
> 	I clearly had not thought this change through far enough,
> 	thank you for calling it to my attention!
> 
> So I could replace this patch with a patch that adds a comment
> explaining why this works.


Yes, that would be great..

>  Though this patch might be simpler and
> easier to understand.  ;-)


Oh well, but I completely missed the intention behind the patch!
So I guess a comment would be better ;-)

>  But not so good for real-time response
> on large systems, I suppose.
> 
> And rcu_blocking_is_gp() is called only from synchronize_sched() and
> synchronize_rcu_bh(), so it is safe for all current callers.  But it is
> defined publicly, so I should move it to somewhere like kernel/rcutree.c
> to keep new users from being fatally confused and disappointed.
> 
> I can also change the comparison from "num_online_cpus() == 1" to
> "num_online_cpus() <= 1".  That should at least serve as a caution to
> people who might attempt to use it where it shouldn't be used.  ;-)
> 


Hehe, yeah!

Regards,
Srivatsa S. Bhat
Rusty Russell May 7, 2012, 3:47 a.m. | #4
On Tue, 24 Apr 2012 09:50:56 -0700, "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> On Tue, Apr 24, 2012 at 09:05:20PM +0530, Srivatsa S. Bhat wrote:
> 
> Thank you for looking this over!
> 
> > On 04/23/2012 10:12 PM, Paul E. McKenney wrote:
> > 
> > > From: "Paul E. McKenney" <paul.mckenney@linaro.org>
> > > 
> > > The rcu_blocking_is_gp() function tests to see if there is only one
> > > online CPU, and if so, synchronize_sched() and friends become no-ops.
> > > However, for larger systems, num_online_cpus() scans a large vector,
> > 
> > 
> > Venki had posted a patch to optimize that by using a variable, so that we
> > don't calculate the value each and every time.
> > http://thread.gmane.org/gmane.linux.kernel/1240569/focus=1246659
> > 
> > However, unfortunately there was some confusion around that patch and
> > even though it made it to akpm's tree and stayed there briefly, it didn't
> > go upstream. Venki had attempted to resolve the confusion here:
> > http://thread.gmane.org/gmane.linux.kernel/1240569/focus=1260702
> 
> Having a single variable tracking the online state would be good,
> but as you say it isn't there yet.

I think you could have a flag set by a notifier, but you'd have to
re-think the races...

Cheers,
Rusty.

Patch

diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h
index e8ee5dd..997179f 100644
--- a/include/linux/rcutree.h
+++ b/include/linux/rcutree.h
@@ -101,8 +101,13 @@  extern void rcu_sched_force_quiescent_state(void);
 /* A context switch is a grace period for RCU-sched and RCU-bh. */
 static inline int rcu_blocking_is_gp(void)
 {
+	int ret;
+
 	might_sleep();  /* Check for RCU read-side critical section. */
-	return num_online_cpus() == 1;
+	preempt_disable();  /* Prevent CPU hotplug from confusing us. */
+	ret = num_online_cpus() == 1;
+	preempt_enable();
+	return ret;
 }
 
 extern void rcu_scheduler_starting(void);