[RFC] rcu: Make __rcu_read_lock() inlinable

Message ID 20120325205249.GA29528@linux.vnet.ibm.com
State New
Headers show

Commit Message

Paul E. McKenney March 25, 2012, 8:52 p.m.
The preemptible-RCU implementations of __rcu_read_lock() have not been
inlinable due to task_struct references that ran afoul of include-file
dependencies.  Fix this (as suggested by Linus) by moving the task_struct
->rcu_read_lock_nesting field to a per-CPU variable that is saved and
restored at context-switch time.  With this change, __rcu_read_lock()
references only that new per-CPU variable, and so may be safely
inlined.  This change also allows some code that was duplicated in
kernel/rcutree_plugin.h and kernel/rcutiny_plugin.h to be merged into
include/linux/rcupdate.h.

This same approach unfortunately cannot be used on __rcu_read_unlock()
because it also references the task_struct ->rcu_read_unlock_special
field, to which cross-task access is required by rcu_boost().  This
function will be handled in a separate commit, if need be.

The TREE_RCU variant passes modest rcutorture runs, while TINY_RCU still
has a few bugs.  Peter Zijlstra might have some thoughts on hooking into
the scheduler.  Disallows use of RCU from within the architecture-specific
switch_to() function, which probably runs afoul of tracing for at least
some architectures.  There probably are still a few other bugs, as well.

TREE_RCU should be OK for experimental usage.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Not-yet-signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Not-yet-signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---

 arch/um/drivers/mconsole_kern.c |    2 
 include/linux/init_task.h       |    2 
 include/linux/rcupdate.h        |   66 +++++++++++++++++++++++++++++++-
 include/linux/rcutiny.h         |    7 ---
 include/linux/rcutree.h         |   12 -----
 include/linux/sched.h           |    4 -
 kernel/rcupdate.c               |    4 +
 kernel/rcutiny_plugin.h         |   82 ++++++++++++++++------------------------
 kernel/rcutree_plugin.h         |   72 +++++++++++++++++------------------
 kernel/sched/core.c             |    2 
 10 files changed, 145 insertions(+), 108 deletions(-)

The preemptible-RCU implementations of __rcu_read_lock() have not been
inlinable due to task_struct references that ran afoul of include-file
dependencies.  Fix this (as suggested by Linus) by moving the task_struct
->rcu_read_lock_nesting field to a per-CPU variable that is saved and
restored at context-switch time.  With this change, __rcu_read_lock()
references only that new per-CPU variable, and so may be safely
inlined.  This change also allows some code that was duplicated in
kernel/rcutree_plugin.h and kernel/rcutiny_plugin.h to be merged into
include/linux/rcupdate.h.

This same approach unfortunately cannot be used on __rcu_read_unlock()
because it also references the task_struct ->rcu_read_unlock_special
field, to which cross-task access is required by rcu_boost().  This
function will be handled in a separate commit, if need be.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

Comments

Peter Zijlstra March 26, 2012, 7:54 a.m. | #1
On Sun, 2012-03-25 at 13:52 -0700, Paul E. McKenney wrote:
> The preemptible-RCU implementations of __rcu_read_lock() have not been
> inlinable due to task_struct references that ran afoul of include-file
> dependencies.  Fix this (as suggested by Linus) by moving the task_struct
> ->rcu_read_lock_nesting field to a per-CPU variable that is saved and
> restored at context-switch time.  With this change, __rcu_read_lock()
> references only that new per-CPU variable, and so may be safely
> inlined.  This change also allows some code that was duplicated in
> kernel/rcutree_plugin.h and kernel/rcutiny_plugin.h to be merged into
> include/linux/rcupdate.h.
> 
> This same approach unfortunately cannot be used on __rcu_read_unlock()
> because it also references the task_struct ->rcu_read_unlock_special
> field, to which cross-task access is required by rcu_boost().  This
> function will be handled in a separate commit, if need be.
> 
> The TREE_RCU variant passes modest rcutorture runs, while TINY_RCU still
> has a few bugs.  Peter Zijlstra might have some thoughts on hooking into
> the scheduler.  Disallows use of RCU from within the architecture-specific
> switch_to() function, which probably runs afoul of tracing for at least
> some architectures.  There probably are still a few other bugs, as well.
> 
> TREE_RCU should be OK for experimental usage.

Right, so I really dislike adding this cache-miss to the context switch
path, that said, rcu is used often enough that the savings on
rcu_read_lock() might just come out in favour of this.. but it would be
very nice to have some numbers.

Also,

>  /*
> + * Save the incoming task's value for rcu_read_lock_nesting at the
> + * end of a context switch.  There can be no process-state RCU read-side
> + * critical sections between the call to rcu_switch_from() and to
> + * rcu_switch_to().  Interrupt-level RCU read-side critical sections are
> + * OK because rcu_read_unlock_special() takes early exits when called
> + * at interrupt level.
> + */
> +void rcu_switch_from(void)
> +{
> +	current->rcu_read_lock_nesting_save =
> +		__this_cpu_read(rcu_read_lock_nesting);
> +	barrier();
> +	__this_cpu_write(rcu_read_lock_nesting, 0);
> +}

Since rcu_switch_to() will again write rcu_read_lock_nesting, what's the
point of setting it to zero?

Also, that barrier(), there's a clear dependency between the operations
how can the compiler mess that up?

> +/*
> + * Restore the incoming task's value for rcu_read_lock_nesting at the
> + * end of a context switch.
>   */
> +void rcu_switch_to(void)
>  {
> +	__this_cpu_write(rcu_read_lock_nesting,
> +			 current->rcu_read_lock_nesting_save);
> +	barrier();
> +	current->rcu_read_lock_nesting_save = 0;
>  }

Similar, a future rcu_switch_from() will again over-write
current->rcu_read_lock_nesting_save, what's the point of clearing it?


> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2051,7 +2051,9 @@ context_switch(struct rq *rq, struct task_struct *prev,
>  #endif
>  
>  	/* Here we just switch the register state and the stack. */
> +	rcu_switch_from();
>  	switch_to(prev, next, prev);
> +	rcu_switch_to();
>  
>  	barrier();
>  	/*

So why not save one call and do:

	switch_to(prev, next, prev);
 	rcu_switch_to(prev, next);

and have

void rcu_switch_to(struct task_struct *prev, struct task_struct *next)
{
	prev->rcu_read_lock_nesting_save = __this_cpu_read(rcu_read_lock_nesting);
	__this_cpu_write(rcu_read_lock_nesting) = next->rcu_read_lock_nesting_save;
}

preferably as an inline function so we can avoid all calls.
Paul E. McKenney March 26, 2012, 6:32 p.m. | #2
On Mon, Mar 26, 2012 at 09:54:44AM +0200, Peter Zijlstra wrote:
> On Sun, 2012-03-25 at 13:52 -0700, Paul E. McKenney wrote:
> > The preemptible-RCU implementations of __rcu_read_lock() have not been
> > inlinable due to task_struct references that ran afoul of include-file
> > dependencies.  Fix this (as suggested by Linus) by moving the task_struct
> > ->rcu_read_lock_nesting field to a per-CPU variable that is saved and
> > restored at context-switch time.  With this change, __rcu_read_lock()
> > references only that new per-CPU variable, and so may be safely
> > inlined.  This change also allows some code that was duplicated in
> > kernel/rcutree_plugin.h and kernel/rcutiny_plugin.h to be merged into
> > include/linux/rcupdate.h.
> > 
> > This same approach unfortunately cannot be used on __rcu_read_unlock()
> > because it also references the task_struct ->rcu_read_unlock_special
> > field, to which cross-task access is required by rcu_boost().  This
> > function will be handled in a separate commit, if need be.
> > 
> > The TREE_RCU variant passes modest rcutorture runs, while TINY_RCU still
> > has a few bugs.  Peter Zijlstra might have some thoughts on hooking into
> > the scheduler.  Disallows use of RCU from within the architecture-specific
> > switch_to() function, which probably runs afoul of tracing for at least
> > some architectures.  There probably are still a few other bugs, as well.
> > 
> > TREE_RCU should be OK for experimental usage.
> 
> Right, so I really dislike adding this cache-miss to the context switch
> path, that said, rcu is used often enough that the savings on
> rcu_read_lock() might just come out in favour of this.. but it would be
> very nice to have some numbers.

I need to get it into known-good shape before evaluating, but yes, some
justification is clearly required.

> Also,
> 
> >  /*
> > + * Save the incoming task's value for rcu_read_lock_nesting at the
> > + * end of a context switch.  There can be no process-state RCU read-side
> > + * critical sections between the call to rcu_switch_from() and to
> > + * rcu_switch_to().  Interrupt-level RCU read-side critical sections are
> > + * OK because rcu_read_unlock_special() takes early exits when called
> > + * at interrupt level.
> > + */
> > +void rcu_switch_from(void)
> > +{
> > +	current->rcu_read_lock_nesting_save =
> > +		__this_cpu_read(rcu_read_lock_nesting);
> > +	barrier();
> > +	__this_cpu_write(rcu_read_lock_nesting, 0);
> > +}
> 
> Since rcu_switch_to() will again write rcu_read_lock_nesting, what's the
> point of setting it to zero?
> 
> Also, that barrier(), there's a clear dependency between the operations
> how can the compiler mess that up?

Both were debugging assists which I have now removed.

> > +/*
> > + * Restore the incoming task's value for rcu_read_lock_nesting at the
> > + * end of a context switch.
> >   */
> > +void rcu_switch_to(void)
> >  {
> > +	__this_cpu_write(rcu_read_lock_nesting,
> > +			 current->rcu_read_lock_nesting_save);
> > +	barrier();
> > +	current->rcu_read_lock_nesting_save = 0;
> >  }
> 
> Similar, a future rcu_switch_from() will again over-write
> current->rcu_read_lock_nesting_save, what's the point of clearing it?

I removed that one as well, again, debug code.

> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -2051,7 +2051,9 @@ context_switch(struct rq *rq, struct task_struct *prev,
> >  #endif
> >  
> >  	/* Here we just switch the register state and the stack. */
> > +	rcu_switch_from();
> >  	switch_to(prev, next, prev);
> > +	rcu_switch_to();
> >  
> >  	barrier();
> >  	/*
> 
> So why not save one call and do:
> 
> 	switch_to(prev, next, prev);
>  	rcu_switch_to(prev, next);
> 
> and have
> 
> void rcu_switch_to(struct task_struct *prev, struct task_struct *next)
> {
> 	prev->rcu_read_lock_nesting_save = __this_cpu_read(rcu_read_lock_nesting);
> 	__this_cpu_write(rcu_read_lock_nesting) = next->rcu_read_lock_nesting_save;
> }
> 
> preferably as an inline function so we can avoid all calls.

I could inline them into sched.h, if you are agreeable.

I am a bit concerned about putting them both together because I am betting
that at least some of the architectures having tracing in switch_to(),
which I currently do not handle well.  At the moment, the ways I can
think of to handle it well require saving before the switch and restoring
afterwards.  Otherwise, I can end up with the ->rcu_read_unlock_special
flags getting associated with the wrong RCU read-side critical section,
as happened last year.

Preemption is disabled at this point, correct?

Hmmm...  One way that I could reduce the overhead that preemptible RCU
imposes on the scheduler would be to move the task_struct queuing from
its current point upon entry to the scheduler to just before switch_to().
(The _bh and _sched quiescent states still need to remain at scheduler
entry.)  That would mean that RCU would not queue tasks that entered
the scheduler, but did not actually do a context switch.

Would that be helpful?

							Thanx, Paul
Peter Zijlstra March 26, 2012, 6:47 p.m. | #3
On Mon, 2012-03-26 at 11:32 -0700, Paul E. McKenney wrote:
> 
> I could inline them into sched.h, if you are agreeable.

Sure, or put it in kernel/sched/core.c.

> I am a bit concerned about putting them both together because I am betting
> that at least some of the architectures having tracing in switch_to(),
> which I currently do not handle well. 

I would hope not.. there's a generic trace_sched_switch() and
switch_to() is supposed to be black magic. I'd be fine breaking that as
long as we can detect it.

>  At the moment, the ways I can
> think of to handle it well require saving before the switch and restoring
> afterwards.  Otherwise, I can end up with the ->rcu_read_unlock_special
> flags getting associated with the wrong RCU read-side critical section,
> as happened last year.
> 
> Preemption is disabled at this point, correct?

Yeah, and soon we'll have interrupts disabled as well (on all archs,
currently only ARM has interrupts enabled at this point).

> Hmmm...  One way that I could reduce the overhead that preemptible RCU
> imposes on the scheduler would be to move the task_struct queuing from
> its current point upon entry to the scheduler to just before switch_to().
> (The _bh and _sched quiescent states still need to remain at scheduler
> entry.)  That would mean that RCU would not queue tasks that entered
> the scheduler, but did not actually do a context switch.

That would make sense anyhow, right? No point in queueing a task if you
didn't actually switch away from it.

> Would that be helpful?

For now that's preemptible rcu only, and as such a somewhat niche
feature (iirc its not enabled in the big distros) so I don't think it
matters too much. But yeah, would be nice.
Steven Rostedt March 26, 2012, 6:53 p.m. | #4
On Mon, 2012-03-26 at 11:32 -0700, Paul E. McKenney wrote:

> Preemption is disabled at this point, correct?
> 
Is this a philosophical question? You're asking if preemption is
disabled while a task is currently being preempted ;-)

-- Steve
Paul E. McKenney March 26, 2012, 11:43 p.m. | #5
On Mon, Mar 26, 2012 at 02:53:28PM -0400, Steven Rostedt wrote:
> On Mon, 2012-03-26 at 11:32 -0700, Paul E. McKenney wrote:
> 
> > Preemption is disabled at this point, correct?
> > 
> Is this a philosophical question? You're asking if preemption is
> disabled while a task is currently being preempted ;-)

That was my first reaction as well, but never underestimate Peter's
cleverness.  ;-)

							Thanx, Paul
Paul E. McKenney March 27, 2012, 5:15 a.m. | #6
On Mon, Mar 26, 2012 at 08:47:10PM +0200, Peter Zijlstra wrote:
> On Mon, 2012-03-26 at 11:32 -0700, Paul E. McKenney wrote:
> > 
> > I could inline them into sched.h, if you are agreeable.
> 
> Sure, or put it in kernel/sched/core.c.

That was my first thought, but there is a use of switch_to() in
arch/um/drivers/mconsole_kern.c.

> > I am a bit concerned about putting them both together because I am betting
> > that at least some of the architectures having tracing in switch_to(),
> > which I currently do not handle well. 
> 
> I would hope not.. there's a generic trace_sched_switch() and
> switch_to() is supposed to be black magic. I'd be fine breaking that as
> long as we can detect it.

Hmmm...  I am not yet sure whether it is easier to make RCU use legal
in switch_to() or to detect it.  I am inclined to take whatever course
is easiest, which is likely to make it legal.  :-/

> >  At the moment, the ways I can
> > think of to handle it well require saving before the switch and restoring
> > afterwards.  Otherwise, I can end up with the ->rcu_read_unlock_special
> > flags getting associated with the wrong RCU read-side critical section,
> > as happened last year.
> > 
> > Preemption is disabled at this point, correct?
> 
> Yeah, and soon we'll have interrupts disabled as well (on all archs,
> currently only ARM has interrupts enabled at this point).

Good to know!

> > Hmmm...  One way that I could reduce the overhead that preemptible RCU
> > imposes on the scheduler would be to move the task_struct queuing from
> > its current point upon entry to the scheduler to just before switch_to().
> > (The _bh and _sched quiescent states still need to remain at scheduler
> > entry.)  That would mean that RCU would not queue tasks that entered
> > the scheduler, but did not actually do a context switch.
> 
> That would make sense anyhow, right? No point in queueing a task if you
> didn't actually switch away from it.

Also it would simplify the save and restore operation, I believe.

> > Would that be helpful?
> 
> For now that's preemptible rcu only, and as such a somewhat niche
> feature (iirc its not enabled in the big distros) so I don't think it
> matters too much. But yeah, would be nice.

OK, let me see what works best.

							Thanx, Paul
Lai Jiangshan March 27, 2012, 8:06 a.m. | #7
On 03/26/2012 04:52 AM, Paul E. McKenney wrote:

> +void rcu_switch_from(void)
>  {
> -	current->rcu_read_lock_nesting++;
> -	barrier();  /* needed if we ever invoke rcu_read_lock in rcutree.c */
> +	current->rcu_read_lock_nesting_save =
> +		__this_cpu_read(rcu_read_lock_nesting);
> +	barrier();
> +	__this_cpu_write(rcu_read_lock_nesting, 0);

-	__this_cpu_write(rcu_read_lock_nesting, 0);
+	__this_cpu_write(rcu_read_lock_nesting, 1);

if prev or next task has non-zero rcu_read_unlock_special,
"__this_cpu_write(rcu_read_lock_nesting, 1)" will prevent wrong qs reporting
when rcu_read_unlock() is called in any interrupt/tracing while doing switch_to().

> +}
> +
> +/*
> + * Restore the incoming task's value for rcu_read_lock_nesting at the
> + * end of a context switch.
> + */
> +void rcu_switch_to(void)
> +{
> +	__this_cpu_write(rcu_read_lock_nesting,
> +			 current->rcu_read_lock_nesting_save);
> +	barrier();
> +	current->rcu_read_lock_nesting_save = 0;
>  }

-	barrier();
-	current->rcu_read_lock_nesting_save = 0;

rcu_read_lock_nesting_save is set but not used before next set here, just remove it.


I don't like it hooks too much into scheduler.

Approaches:
0) stay using function call
1) hook into kbuild(https://lkml.org/lkml/2011/3/27/170,https://lkml.org/lkml/2011/3/27/171)
2) hook into scheduler(still need more works for rcu_read_unlock())
3) Add rcu_read_lock_nesting to thread_info like preempt_count
4) resolve header-file dependence

For me
3=4>1>2>0

Thanks,
Lai
Steven Rostedt March 27, 2012, 12:26 p.m. | #8
On Mon, 2012-03-26 at 22:15 -0700, Paul E. McKenney wrote:

> Hmmm...  I am not yet sure whether it is easier to make RCU use legal
> in switch_to() or to detect it.  I am inclined to take whatever course
> is easiest, which is likely to make it legal.  :-/

We could just declare that we do not allow tracepoints in arch specific
"switch_to" code. Then you shouldn't need to worry about RCU in
switch_to().

sched_rcu can still work there correct? That is, a synchronize_sched()
should not be affected. As that is needed for the function tracing, and
that may be called within a switch_to.

-- Steve
Paul E. McKenney March 27, 2012, 4:39 p.m. | #9
On Tue, Mar 27, 2012 at 08:26:07AM -0400, Steven Rostedt wrote:
> On Mon, 2012-03-26 at 22:15 -0700, Paul E. McKenney wrote:
> 
> > Hmmm...  I am not yet sure whether it is easier to make RCU use legal
> > in switch_to() or to detect it.  I am inclined to take whatever course
> > is easiest, which is likely to make it legal.  :-/
> 
> We could just declare that we do not allow tracepoints in arch specific
> "switch_to" code. Then you shouldn't need to worry about RCU in
> switch_to().

Heh.  I expect that to work about as well as the earlier declaration
that RCU not be used in the idle loop.  ;-)

> sched_rcu can still work there correct? That is, a synchronize_sched()
> should not be affected. As that is needed for the function tracing, and
> that may be called within a switch_to.

Yep, good point.

							Thanx, Paul
Paul E. McKenney March 27, 2012, 4:46 p.m. | #10
On Tue, Mar 27, 2012 at 04:06:08PM +0800, Lai Jiangshan wrote:
> On 03/26/2012 04:52 AM, Paul E. McKenney wrote:
> 
> > +void rcu_switch_from(void)
> >  {
> > -	current->rcu_read_lock_nesting++;
> > -	barrier();  /* needed if we ever invoke rcu_read_lock in rcutree.c */
> > +	current->rcu_read_lock_nesting_save =
> > +		__this_cpu_read(rcu_read_lock_nesting);
> > +	barrier();
> > +	__this_cpu_write(rcu_read_lock_nesting, 0);
> 
> -	__this_cpu_write(rcu_read_lock_nesting, 0);
> +	__this_cpu_write(rcu_read_lock_nesting, 1);
> 
> if prev or next task has non-zero rcu_read_unlock_special,
> "__this_cpu_write(rcu_read_lock_nesting, 1)" will prevent wrong qs reporting
> when rcu_read_unlock() is called in any interrupt/tracing while doing switch_to().

This is one approach that I have been considering.  I am concerned about
interactions with ->rcu_read_unlock_special, however.  The approach that I
am favoring at the moment is to save and restore ->rcu_read_unlock_special
from another per-CPU variable, which would allow that per-CPU variable to
be zeroed at this point.  Then because there can be no preemption at this
point in the code, execution would stay out of rcu_read_unlock_special()
for the duration of the context switch.

> > +}
> > +
> > +/*
> > + * Restore the incoming task's value for rcu_read_lock_nesting at the
> > + * end of a context switch.
> > + */
> > +void rcu_switch_to(void)
> > +{
> > +	__this_cpu_write(rcu_read_lock_nesting,
> > +			 current->rcu_read_lock_nesting_save);
> > +	barrier();
> > +	current->rcu_read_lock_nesting_save = 0;
> >  }
> 
> -	barrier();
> -	current->rcu_read_lock_nesting_save = 0;
> 
> rcu_read_lock_nesting_save is set but not used before next set here, just remove it.

Yep, as noted earlier.

> I don't like it hooks too much into scheduler.
> 
> Approaches:
> 0) stay using function call
> 1) hook into kbuild(https://lkml.org/lkml/2011/3/27/170,https://lkml.org/lkml/2011/3/27/171)
> 2) hook into scheduler(still need more works for rcu_read_unlock())
> 3) Add rcu_read_lock_nesting to thread_info like preempt_count
> 4) resolve header-file dependence
> 
> For me
> 3=4>1>2>0

The advantage of the current per-CPU-variable approach is that it
permits x86 to reduce rcu_read_lock() to a single instruction, so it
seems worthwhile persuing it.  In addition, having RCU-preempt hook
at switch_to() eliminates needless task queuing in the case where the
scheduler is entered, but no context switch actually takes place.

							Thanx, Paul

Patch

diff --git a/arch/um/drivers/mconsole_kern.c b/arch/um/drivers/mconsole_kern.c
index c70e047..a98693f 100644
--- a/arch/um/drivers/mconsole_kern.c
+++ b/arch/um/drivers/mconsole_kern.c
@@ -704,7 +704,9 @@  static void stack_proc(void *arg)
 	struct task_struct *from = current, *to = arg;
 
 	to->thread.saved_task = from;
+	rcu_switch_from();
 	switch_to(from, to, from);
+	rcu_switch_to();
 }
 
 /*
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 9c66b1a..6148cd6 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -105,7 +105,7 @@  extern struct group_info init_groups;
 #endif
 #ifdef CONFIG_PREEMPT_RCU
 #define INIT_TASK_RCU_PREEMPT(tsk)					\
-	.rcu_read_lock_nesting = 0,					\
+	.rcu_read_lock_nesting_save = 0,				\
 	.rcu_read_unlock_special = 0,					\
 	.rcu_node_entry = LIST_HEAD_INIT(tsk.rcu_node_entry),		\
 	INIT_TASK_RCU_TREE_PREEMPT()					\
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 9372174..2921217 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -43,6 +43,7 @@ 
 #include <linux/completion.h>
 #include <linux/debugobjects.h>
 #include <linux/compiler.h>
+#include <linux/percpu.h>
 
 #ifdef CONFIG_RCU_TORTURE_TEST
 extern int rcutorture_runnable; /* for sysctl */
@@ -144,7 +145,19 @@  extern void synchronize_sched(void);
 
 #ifdef CONFIG_PREEMPT_RCU
 
-extern void __rcu_read_lock(void);
+DECLARE_PER_CPU(int, rcu_read_lock_nesting);
+
+/*
+ * Preemptible-RCU implementation for rcu_read_lock().  Just increment
+ * the per-CPU rcu_read_lock_nesting: Shared state and per-task state will
+ * be updated if we block.
+ */
+static inline void __rcu_read_lock(void)
+{
+	__this_cpu_inc(rcu_read_lock_nesting);
+	barrier(); /* Keep code within RCU read-side critical section. */
+}
+
 extern void __rcu_read_unlock(void);
 void synchronize_rcu(void);
 
@@ -154,7 +167,42 @@  void synchronize_rcu(void);
  * nesting depth, but makes sense only if CONFIG_PREEMPT_RCU -- in other
  * types of kernel builds, the rcu_read_lock() nesting depth is unknowable.
  */
-#define rcu_preempt_depth() (current->rcu_read_lock_nesting)
+#define rcu_preempt_depth() (__this_cpu_read(rcu_read_lock_nesting))
+
+/*
+ * Check for a running RCU reader on the current CPU.  If used from
+ * TINY_PREEMPT_RCU, works globally, as there can be but one running
+ * RCU reader at a time in that case.  ;-)
+ *
+ * Returns zero if there are no running readers.  Returns a positive
+ * number if there is at least one reader within its RCU read-side
+ * critical section.  Returns a negative number if an outermost reader
+ * is in the midst of exiting from its RCU read-side critical section
+ *
+ * This differs from rcu_preempt_depth() in throwing a build error
+ * if used from under !CONFIG_PREEMPT_RCU.
+ */
+static inline int rcu_preempt_running_reader(void)
+{
+	return __this_cpu_read(rcu_read_lock_nesting);
+}
+
+/*
+ * Check for a task exiting while in a preemptible-RCU read-side
+ * critical section, clean up if so.  No need to issue warnings,
+ * as debug_check_no_locks_held() already does this if lockdep
+ * is enabled.  To be called from the task-exit path, and nowhere else.
+ */
+static inline void exit_rcu(void)
+{
+	if (likely(__this_cpu_read(rcu_read_lock_nesting) == 0))
+		return;
+	__this_cpu_write(rcu_read_lock_nesting, 1);
+	__rcu_read_unlock();
+}
+
+void rcu_switch_from(void);
+void rcu_switch_to(void);
 
 #else /* #ifdef CONFIG_PREEMPT_RCU */
 
@@ -178,9 +226,23 @@  static inline int rcu_preempt_depth(void)
 	return 0;
 }
 
+static inline void rcu_switch_from(void)
+{
+}
+
+static inline void rcu_switch_to(void)
+{
+}
+
+static inline void exit_rcu(void)
+{
+}
+
 #endif /* #else #ifdef CONFIG_PREEMPT_RCU */
 
 /* Internal to kernel */
+extern void rcu_switch_from(void);
+extern void rcu_switch_to(void);
 extern void rcu_sched_qs(int cpu);
 extern void rcu_bh_qs(int cpu);
 extern void rcu_check_callbacks(int cpu, int user);
diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h
index e93df77..ef39878 100644
--- a/include/linux/rcutiny.h
+++ b/include/linux/rcutiny.h
@@ -87,11 +87,7 @@  static inline void kfree_call_rcu(struct rcu_head *head,
 
 #ifdef CONFIG_TINY_RCU
 
-static inline void rcu_preempt_note_context_switch(void)
-{
-}
-
-static inline void exit_rcu(void)
+static void rcu_preempt_note_context_switch(void)
 {
 }
 
@@ -103,7 +99,6 @@  static inline int rcu_needs_cpu(int cpu)
 #else /* #ifdef CONFIG_TINY_RCU */
 
 void rcu_preempt_note_context_switch(void);
-extern void exit_rcu(void);
 int rcu_preempt_needs_cpu(void);
 
 static inline int rcu_needs_cpu(int cpu)
diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h
index e8ee5dd..782a8ab 100644
--- a/include/linux/rcutree.h
+++ b/include/linux/rcutree.h
@@ -45,18 +45,6 @@  static inline void rcu_virt_note_context_switch(int cpu)
 	rcu_note_context_switch(cpu);
 }
 
-#ifdef CONFIG_TREE_PREEMPT_RCU
-
-extern void exit_rcu(void);
-
-#else /* #ifdef CONFIG_TREE_PREEMPT_RCU */
-
-static inline void exit_rcu(void)
-{
-}
-
-#endif /* #else #ifdef CONFIG_TREE_PREEMPT_RCU */
-
 extern void synchronize_rcu_bh(void);
 extern void synchronize_sched_expedited(void);
 extern void synchronize_rcu_expedited(void);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index e692aba..f24bf6a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1275,7 +1275,7 @@  struct task_struct {
 	cpumask_t cpus_allowed;
 
 #ifdef CONFIG_PREEMPT_RCU
-	int rcu_read_lock_nesting;
+	int rcu_read_lock_nesting_save;
 	char rcu_read_unlock_special;
 	struct list_head rcu_node_entry;
 #endif /* #ifdef CONFIG_PREEMPT_RCU */
@@ -1868,7 +1868,7 @@  extern void task_clear_jobctl_pending(struct task_struct *task,
 
 static inline void rcu_copy_process(struct task_struct *p)
 {
-	p->rcu_read_lock_nesting = 0;
+	p->rcu_read_lock_nesting_save = 0;
 	p->rcu_read_unlock_special = 0;
 #ifdef CONFIG_TREE_PREEMPT_RCU
 	p->rcu_blocked_node = NULL;
diff --git a/kernel/rcupdate.c b/kernel/rcupdate.c
index a86f174..91b8623 100644
--- a/kernel/rcupdate.c
+++ b/kernel/rcupdate.c
@@ -51,6 +51,10 @@ 
 
 #include "rcu.h"
 
+#ifdef CONFIG_PREEMPT_RCU
+DEFINE_PER_CPU(int, rcu_read_lock_nesting);
+#endif /* #ifdef CONFIG_PREEMPT_RCU */
+
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 static struct lock_class_key rcu_lock_key;
 struct lockdep_map rcu_lock_map =
diff --git a/kernel/rcutiny_plugin.h b/kernel/rcutiny_plugin.h
index 22ecea0..b842e8d 100644
--- a/kernel/rcutiny_plugin.h
+++ b/kernel/rcutiny_plugin.h
@@ -145,25 +145,6 @@  static int rcu_cpu_blocking_cur_gp(void)
 }
 
 /*
- * Check for a running RCU reader.  Because there is only one CPU,
- * there can be but one running RCU reader at a time.  ;-)
- *
- * Returns zero if there are no running readers.  Returns a positive
- * number if there is at least one reader within its RCU read-side
- * critical section.  Returns a negative number if an outermost reader
- * is in the midst of exiting from its RCU read-side critical section
- *
- * Returns zero if there are no running readers.  Returns a positive
- * number if there is at least one reader within its RCU read-side
- * critical section.  Returns a negative number if an outermost reader
- * is in the midst of exiting from its RCU read-side critical section.
- */
-static int rcu_preempt_running_reader(void)
-{
-	return current->rcu_read_lock_nesting;
-}
-
-/*
  * Check for preempted RCU readers blocking any grace period.
  * If the caller needs a reliable answer, it must disable hard irqs.
  */
@@ -522,21 +503,40 @@  void rcu_preempt_note_context_switch(void)
 	 * grace period, then the fact that the task has been enqueued
 	 * means that current grace period continues to be blocked.
 	 */
+	t->rcu_read_lock_nesting_save =
+		__this_cpu_read(rcu_read_lock_nesting);
+	__this_cpu_write(rcu_read_lock_nesting, 0);
 	rcu_preempt_cpu_qs();
 	local_irq_restore(flags);
 }
 
 /*
- * Tiny-preemptible RCU implementation for rcu_read_lock().
- * Just increment ->rcu_read_lock_nesting, shared state will be updated
- * if we block.
+ * Save the incoming task's value for rcu_read_lock_nesting at the
+ * end of a context switch.  There can be no process-state RCU read-side
+ * critical sections between the call to rcu_switch_from() and to
+ * rcu_switch_to().  Interrupt-level RCU read-side critical sections are
+ * OK because rcu_read_unlock_special() takes early exits when called
+ * at interrupt level.
+ */
+void rcu_switch_from(void)
+{
+	current->rcu_read_lock_nesting_save =
+		__this_cpu_read(rcu_read_lock_nesting);
+	barrier();
+	__this_cpu_write(rcu_read_lock_nesting, 0);
+}
+
+/*
+ * Restore the incoming task's value for rcu_read_lock_nesting at the
+ * end of a context switch.
  */
-void __rcu_read_lock(void)
+void rcu_switch_to(void)
 {
-	current->rcu_read_lock_nesting++;
-	barrier();  /* needed if we ever invoke rcu_read_lock in rcutiny.c */
+	__this_cpu_write(rcu_read_lock_nesting,
+			 current->rcu_read_lock_nesting_save);
+	barrier();
+	current->rcu_read_lock_nesting_save = 0;
 }
-EXPORT_SYMBOL_GPL(__rcu_read_lock);
 
 /*
  * Handle special cases during rcu_read_unlock(), such as needing to
@@ -628,7 +628,7 @@  static noinline void rcu_read_unlock_special(struct task_struct *t)
 
 /*
  * Tiny-preemptible RCU implementation for rcu_read_unlock().
- * Decrement ->rcu_read_lock_nesting.  If the result is zero (outermost
+ * Decrement rcu_read_lock_nesting.  If the result is zero (outermost
  * rcu_read_unlock()) and ->rcu_read_unlock_special is non-zero, then
  * invoke rcu_read_unlock_special() to clean up after a context switch
  * in an RCU read-side critical section and other special cases.
@@ -638,21 +638,21 @@  void __rcu_read_unlock(void)
 	struct task_struct *t = current;
 
 	barrier();  /* needed if we ever invoke rcu_read_unlock in rcutiny.c */
-	if (t->rcu_read_lock_nesting != 1)
-		--t->rcu_read_lock_nesting;
+	if (__this_cpu_read(rcu_read_lock_nesting) != 1)
+		__this_cpu_dec(rcu_read_lock_nesting);
 	else {
-		t->rcu_read_lock_nesting = INT_MIN;
+		__this_cpu_write(rcu_read_lock_nesting, INT_MIN);
 		barrier();  /* assign before ->rcu_read_unlock_special load */
 		if (unlikely(ACCESS_ONCE(t->rcu_read_unlock_special)))
 			rcu_read_unlock_special(t);
 		barrier();  /* ->rcu_read_unlock_special load before assign */
-		t->rcu_read_lock_nesting = 0;
+		__this_cpu_write(rcu_read_lock_nesting, 0);
 	}
 #ifdef CONFIG_PROVE_LOCKING
 	{
-		int rrln = ACCESS_ONCE(t->rcu_read_lock_nesting);
+		int rln = __this_cpu_read(rcu_read_lock_nesting);
 
-		WARN_ON_ONCE(rrln < 0 && rrln > INT_MIN / 2);
+		WARN_ON_ONCE(rln < 0 && rln > INT_MIN / 2);
 	}
 #endif /* #ifdef CONFIG_PROVE_LOCKING */
 }
@@ -851,22 +851,6 @@  int rcu_preempt_needs_cpu(void)
 	return rcu_preempt_ctrlblk.rcb.rcucblist != NULL;
 }
 
-/*
- * Check for a task exiting while in a preemptible -RCU read-side
- * critical section, clean up if so.  No need to issue warnings,
- * as debug_check_no_locks_held() already does this if lockdep
- * is enabled.
- */
-void exit_rcu(void)
-{
-	struct task_struct *t = current;
-
-	if (t->rcu_read_lock_nesting == 0)
-		return;
-	t->rcu_read_lock_nesting = 1;
-	__rcu_read_unlock();
-}
-
 #else /* #ifdef CONFIG_TINY_PREEMPT_RCU */
 
 #ifdef CONFIG_RCU_TRACE
diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index c023464..8a19210 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -160,7 +160,7 @@  static void rcu_preempt_note_context_switch(int cpu)
 	struct rcu_data *rdp;
 	struct rcu_node *rnp;
 
-	if (t->rcu_read_lock_nesting > 0 &&
+	if (__this_cpu_read(rcu_read_lock_nesting) > 0 &&
 	    (t->rcu_read_unlock_special & RCU_READ_UNLOCK_BLOCKED) == 0) {
 
 		/* Possibly blocking in an RCU read-side critical section. */
@@ -208,7 +208,7 @@  static void rcu_preempt_note_context_switch(int cpu)
 				       ? rnp->gpnum
 				       : rnp->gpnum + 1);
 		raw_spin_unlock_irqrestore(&rnp->lock, flags);
-	} else if (t->rcu_read_lock_nesting < 0 &&
+	} else if (__this_cpu_read(rcu_read_lock_nesting) < 0 &&
 		   t->rcu_read_unlock_special) {
 
 		/*
@@ -233,16 +233,32 @@  static void rcu_preempt_note_context_switch(int cpu)
 }
 
 /*
- * Tree-preemptible RCU implementation for rcu_read_lock().
- * Just increment ->rcu_read_lock_nesting, shared state will be updated
- * if we block.
+ * Save the incoming task's value for rcu_read_lock_nesting at the
+ * end of a context switch.  There can be no process-state RCU read-side
+ * critical sections between the call to rcu_switch_from() and to
+ * rcu_switch_to().  Interrupt-level RCU read-side critical sections are
+ * OK because rcu_read_unlock_special() takes early exits when called
+ * at interrupt level.
  */
-void __rcu_read_lock(void)
+void rcu_switch_from(void)
 {
-	current->rcu_read_lock_nesting++;
-	barrier();  /* needed if we ever invoke rcu_read_lock in rcutree.c */
+	current->rcu_read_lock_nesting_save =
+		__this_cpu_read(rcu_read_lock_nesting);
+	barrier();
+	__this_cpu_write(rcu_read_lock_nesting, 0);
+}
+
+/*
+ * Restore the incoming task's value for rcu_read_lock_nesting at the
+ * end of a context switch.
+ */
+void rcu_switch_to(void)
+{
+	__this_cpu_write(rcu_read_lock_nesting,
+			 current->rcu_read_lock_nesting_save);
+	barrier();
+	current->rcu_read_lock_nesting_save = 0;
 }
-EXPORT_SYMBOL_GPL(__rcu_read_lock);
 
 /*
  * Check for preempted RCU readers blocking the current grace period
@@ -420,7 +436,7 @@  static noinline void rcu_read_unlock_special(struct task_struct *t)
 
 /*
  * Tree-preemptible RCU implementation for rcu_read_unlock().
- * Decrement ->rcu_read_lock_nesting.  If the result is zero (outermost
+ * Decrement rcu_read_lock_nesting.  If the result is zero (outermost
  * rcu_read_unlock()) and ->rcu_read_unlock_special is non-zero, then
  * invoke rcu_read_unlock_special() to clean up after a context switch
  * in an RCU read-side critical section and other special cases.
@@ -429,22 +445,22 @@  void __rcu_read_unlock(void)
 {
 	struct task_struct *t = current;
 
-	if (t->rcu_read_lock_nesting != 1)
-		--t->rcu_read_lock_nesting;
+	if (__this_cpu_read(rcu_read_lock_nesting) != 1)
+		__this_cpu_dec(rcu_read_lock_nesting);
 	else {
 		barrier();  /* critical section before exit code. */
-		t->rcu_read_lock_nesting = INT_MIN;
+		__this_cpu_write(rcu_read_lock_nesting, INT_MIN);
 		barrier();  /* assign before ->rcu_read_unlock_special load */
 		if (unlikely(ACCESS_ONCE(t->rcu_read_unlock_special)))
 			rcu_read_unlock_special(t);
 		barrier();  /* ->rcu_read_unlock_special load before assign */
-		t->rcu_read_lock_nesting = 0;
+		__this_cpu_write(rcu_read_lock_nesting, 0);
 	}
 #ifdef CONFIG_PROVE_LOCKING
 	{
-		int rrln = ACCESS_ONCE(t->rcu_read_lock_nesting);
+		int rln = __this_cpu_read(rcu_read_lock_nesting);
 
-		WARN_ON_ONCE(rrln < 0 && rrln > INT_MIN / 2);
+		WARN_ON_ONCE(rln < 0 && rln > INT_MIN / 2);
 	}
 #endif /* #ifdef CONFIG_PROVE_LOCKING */
 }
@@ -668,11 +684,11 @@  static void rcu_preempt_check_callbacks(int cpu)
 {
 	struct task_struct *t = current;
 
-	if (t->rcu_read_lock_nesting == 0) {
+	if (!rcu_preempt_running_reader()) {
 		rcu_preempt_qs(cpu);
 		return;
 	}
-	if (t->rcu_read_lock_nesting > 0 &&
+	if (rcu_preempt_running_reader() > 0 &&
 	    per_cpu(rcu_preempt_data, cpu).qs_pending)
 		t->rcu_read_unlock_special |= RCU_READ_UNLOCK_NEED_QS;
 }
@@ -969,22 +985,6 @@  static void __init __rcu_init_preempt(void)
 	rcu_init_one(&rcu_preempt_state, &rcu_preempt_data);
 }
 
-/*
- * Check for a task exiting while in a preemptible-RCU read-side
- * critical section, clean up if so.  No need to issue warnings,
- * as debug_check_no_locks_held() already does this if lockdep
- * is enabled.
- */
-void exit_rcu(void)
-{
-	struct task_struct *t = current;
-
-	if (t->rcu_read_lock_nesting == 0)
-		return;
-	t->rcu_read_lock_nesting = 1;
-	__rcu_read_unlock();
-}
-
 #else /* #ifdef CONFIG_TREE_PREEMPT_RCU */
 
 static struct rcu_state *rcu_state = &rcu_sched_state;
@@ -1018,8 +1018,8 @@  void rcu_force_quiescent_state(void)
 EXPORT_SYMBOL_GPL(rcu_force_quiescent_state);
 
 /*
- * Because preemptible RCU does not exist, we never have to check for
- * CPUs being in quiescent states.
+ * Because preemptible RCU does not exist, we never have to save or restore
+ * its state upon context switch.
  */
 static void rcu_preempt_note_context_switch(int cpu)
 {
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5255c9d..9f10b76 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2051,7 +2051,9 @@  context_switch(struct rq *rq, struct task_struct *prev,
 #endif
 
 	/* Here we just switch the register state and the stack. */
+	rcu_switch_from();
 	switch_to(prev, next, prev);
+	rcu_switch_to();
 
 	barrier();
 	/*