[tip/core/rcu,11/26] rcu: Exit RCU extended QS on user preemption

Message ID 1346360743-3628-11-git-send-email-paulmck@linux.vnet.ibm.com
State New
Headers show

Commit Message

Paul E. McKenney Aug. 30, 2012, 9:05 p.m.
From: Frederic Weisbecker <fweisbec@gmail.com>

When exceptions or irq are about to resume userspace, if
the task needs to be rescheduled, the arch low level code
calls schedule() directly.

At that time we may be in extended quiescent state from RCU
POV: the exception is not anymore protected inside
rcu_user_exit() - rcu_user_enter() and the irq has called
rcu_irq_exit() already.

Create a new API schedule_user() that calls schedule() inside
rcu_user_exit()-rcu_user_enter() in order to protect it. Archs
will need to rely on it now to implement user preemption safely.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/sched/core.c |    7 +++++++
 1 files changed, 7 insertions(+), 0 deletions(-)

Comments

Josh Triplett Aug. 31, 2012, 11:52 p.m. | #1
On Thu, Aug 30, 2012 at 02:05:28PM -0700, Paul E. McKenney wrote:
> From: Frederic Weisbecker <fweisbec@gmail.com>
> 
> When exceptions or irq are about to resume userspace, if
> the task needs to be rescheduled, the arch low level code
> calls schedule() directly.
> 
> At that time we may be in extended quiescent state from RCU
> POV: the exception is not anymore protected inside
> rcu_user_exit() - rcu_user_enter() and the irq has called
> rcu_irq_exit() already.
> 
> Create a new API schedule_user() that calls schedule() inside
> rcu_user_exit()-rcu_user_enter() in order to protect it. Archs
> will need to rely on it now to implement user preemption safely.
> 
> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
> Cc: Alessio Igor Bogani <abogani@kernel.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Avi Kivity <avi@redhat.com>
> Cc: Chris Metcalf <cmetcalf@tilera.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Geoff Levand <geoff@infradead.org>
> Cc: Gilad Ben Yossef <gilad@benyossef.com>
> Cc: Hakan Akkan <hakanakkan@gmail.com>
> Cc: H. Peter Anvin <hpa@zytor.com>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Josh Triplett <josh@joshtriplett.org>
> Cc: Kevin Hilman <khilman@ti.com>
> Cc: Max Krasnyansky <maxk@qualcomm.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Stephen Hemminger <shemminger@vyatta.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

Reviewed-by: Josh Triplett <josh@joshtriplett.org>

>  kernel/sched/core.c |    7 +++++++
>  1 files changed, 7 insertions(+), 0 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 0bd599b..e841dfc 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3463,6 +3463,13 @@ asmlinkage void __sched schedule(void)
>  }
>  EXPORT_SYMBOL(schedule);
>  
> +asmlinkage void __sched schedule_user(void)
> +{
> +	rcu_user_exit();
> +	schedule();
> +	rcu_user_enter();
> +}
> +
>  /**
>   * schedule_preempt_disabled - called with preemption disabled
>   *
> -- 
> 1.7.8
>
Peter Zijlstra Sept. 6, 2012, 5:02 p.m. | #2
On Thu, 2012-08-30 at 14:05 -0700, Paul E. McKenney wrote:
> From: Frederic Weisbecker <fweisbec@gmail.com>
> 
> When exceptions or irq are about to resume userspace, if
> the task needs to be rescheduled, the arch low level code
> calls schedule() directly.
> 
> At that time we may be in extended quiescent state from RCU
> POV: the exception is not anymore protected inside
> rcu_user_exit() - rcu_user_enter() and the irq has called
> rcu_irq_exit() already.
> 
> Create a new API schedule_user() that calls schedule() inside
> rcu_user_exit()-rcu_user_enter() in order to protect it. Archs
> will need to rely on it now to implement user preemption safely.

> ---
>  kernel/sched/core.c |    7 +++++++
>  1 files changed, 7 insertions(+), 0 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 0bd599b..e841dfc 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3463,6 +3463,13 @@ asmlinkage void __sched schedule(void)
>  }
>  EXPORT_SYMBOL(schedule);
>  
> +asmlinkage void __sched schedule_user(void)
> +{
> +	rcu_user_exit();
> +	schedule();
> +	rcu_user_enter();
> +}


OK, so colour me unconvinced.. why are we doing this?

Typically when we call schedule nr_running != 1 (we need current to be
running and a possible target to switch to).

So I'd prefer to simply have schedule() disable all this adaptive tick
nonsense and leave it at that.
Peter Zijlstra Sept. 6, 2012, 5:13 p.m. | #3
On Thu, 2012-09-06 at 19:02 +0200, Peter Zijlstra wrote:
> On Thu, 2012-08-30 at 14:05 -0700, Paul E. McKenney wrote:
> > From: Frederic Weisbecker <fweisbec@gmail.com>
> > 
> > When exceptions or irq are about to resume userspace, if
> > the task needs to be rescheduled, the arch low level code
> > calls schedule() directly.
> > 
> > At that time we may be in extended quiescent state from RCU
> > POV: the exception is not anymore protected inside
> > rcu_user_exit() - rcu_user_enter() and the irq has called
> > rcu_irq_exit() already.
> > 
> > Create a new API schedule_user() that calls schedule() inside
> > rcu_user_exit()-rcu_user_enter() in order to protect it. Archs
> > will need to rely on it now to implement user preemption safely.
> 
> > ---
> >  kernel/sched/core.c |    7 +++++++
> >  1 files changed, 7 insertions(+), 0 deletions(-)
> > 
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 0bd599b..e841dfc 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -3463,6 +3463,13 @@ asmlinkage void __sched schedule(void)
> >  }
> >  EXPORT_SYMBOL(schedule);
> >  
> > +asmlinkage void __sched schedule_user(void)
> > +{
> > +	rcu_user_exit();
> > +	schedule();
> > +	rcu_user_enter();
> > +}
> 
> 
> OK, so colour me unconvinced.. why are we doing this?
> 
> Typically when we call schedule nr_running != 1 (we need current to be
> running and a possible target to switch to).
> 
> So I'd prefer to simply have schedule() disable all this adaptive tick
> nonsense and leave it at that.

In fact, the only way to get here is through ttwu(), which would have
done the nr_running increment and should have disabled all this adaptive
stuff.

So again,.. why?
Frederic Weisbecker Sept. 10, 2012, 8:26 p.m. | #4
On Thu, Sep 06, 2012 at 07:13:11PM +0200, Peter Zijlstra wrote:
> On Thu, 2012-09-06 at 19:02 +0200, Peter Zijlstra wrote:
> > On Thu, 2012-08-30 at 14:05 -0700, Paul E. McKenney wrote:
> > > From: Frederic Weisbecker <fweisbec@gmail.com>
> > > 
> > > When exceptions or irq are about to resume userspace, if
> > > the task needs to be rescheduled, the arch low level code
> > > calls schedule() directly.
> > > 
> > > At that time we may be in extended quiescent state from RCU
> > > POV: the exception is not anymore protected inside
> > > rcu_user_exit() - rcu_user_enter() and the irq has called
> > > rcu_irq_exit() already.
> > > 
> > > Create a new API schedule_user() that calls schedule() inside
> > > rcu_user_exit()-rcu_user_enter() in order to protect it. Archs
> > > will need to rely on it now to implement user preemption safely.
> > 
> > > ---
> > >  kernel/sched/core.c |    7 +++++++
> > >  1 files changed, 7 insertions(+), 0 deletions(-)
> > > 
> > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > index 0bd599b..e841dfc 100644
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -3463,6 +3463,13 @@ asmlinkage void __sched schedule(void)
> > >  }
> > >  EXPORT_SYMBOL(schedule);
> > >  
> > > +asmlinkage void __sched schedule_user(void)
> > > +{
> > > +	rcu_user_exit();
> > > +	schedule();
> > > +	rcu_user_enter();
> > > +}
> > 
> > 
> > OK, so colour me unconvinced.. why are we doing this?
> > 
> > Typically when we call schedule nr_running != 1 (we need current to be
> > running and a possible target to switch to).
> > 
> > So I'd prefer to simply have schedule() disable all this adaptive tick
> > nonsense and leave it at that.
> 
> In fact, the only way to get here is through ttwu(), which would have
> done the nr_running increment and should have disabled all this adaptive
> stuff.
> 
> So again,.. why?

Ok, indeed if the ttwu happened locally or even remotely through an IPI, the
tick engine would stop the tick and exit that RCU extended quiescent state
before we reach that place.

I just want to be sure I'm covering every case. There are some places
around that call set_need_resched() even if no task is waiting for the CPU
(RCU is such an example). In this case the nohz engine won't exit the RCU
user mode before schedule().

Another example: a CPU does a remote wake up so it does set_need_resched()
and sends the IPI. The target CPU could see the TIF_RESCHED before resuming
userspace and call schedule_user() -> schedule() while the IPI has not
yet arrived. In this case we need that rcu_user_exit() before schedule might
make any use of RCU.

Also when the task returns from schedule(), if it's alone in the runqueue,
the rcu_user_enter() that follows can be helpful to stop the tick and
enter our RCU quiescent state.

Also I'm considering this RCU user extended quiescent state as a standalone
feature for now. Indeed the only user of it is the adaptive tickless thing.
But I'm treating that RCU part independantly for now so that it makes it
easier to merge, piecewise.

When the adaptive tickless stuff arrives, these rcu_user...() APIs will
be replaced by some user_enter() / user_exit() hooks that will all be
driven by the nohz subsystem. When we get there the rcu_user_enter()
rcu_user_exit() won't be unconditional anymore.
Peter Zijlstra Sept. 12, 2012, 9:33 a.m. | #5
On Mon, 2012-09-10 at 22:26 +0200, Frederic Weisbecker wrote:

> > > OK, so colour me unconvinced.. why are we doing this?
> > > 
> > > Typically when we call schedule nr_running != 1 (we need current to be
> > > running and a possible target to switch to).
> > > 
> > > So I'd prefer to simply have schedule() disable all this adaptive tick
> > > nonsense and leave it at that.
> > 
> > In fact, the only way to get here is through ttwu(), which would have
> > done the nr_running increment and should have disabled all this adaptive
> > stuff.
> > 
> > So again,.. why?
> 
> Ok, indeed if the ttwu happened locally or even remotely through an IPI, the
> tick engine would stop the tick and exit that RCU extended quiescent state
> before we reach that place.
> 
> I just want to be sure I'm covering every case. There are some places
> around that call set_need_resched() even if no task is waiting for the CPU
> (RCU is such an example). In this case the nohz engine won't exit the RCU
> user mode before schedule().
> 
> Another example: a CPU does a remote wake up so it does set_need_resched()
> and sends the IPI. The target CPU could see the TIF_RESCHED before resuming
> userspace and call schedule_user() -> schedule() while the IPI has not
> yet arrived. In this case we need that rcu_user_exit() before schedule might
> make any use of RCU.
> 
> Also when the task returns from schedule(), if it's alone in the runqueue,
> the rcu_user_enter() that follows can be helpful to stop the tick and
> enter our RCU quiescent state.

So all that should have been in the Changelog.

So the only real problem I see is non IPI remote wakeups, those cannot
actually restart the tick on the target cpu like we want.

The rest could all be fixed to not need this.

Hrmm. still don't like the patch, but at least I see why its needed.

> Also I'm considering this RCU user extended quiescent state as a standalone
> feature for now. Indeed the only user of it is the adaptive tickless thing.
> But I'm treating that RCU part independantly for now so that it makes it
> easier to merge, piecewise.

But its semantics are very tightly tied to the requirements of adaptive
tick, so arguing about one without the other doesn't really make sense.
Frederic Weisbecker Sept. 12, 2012, 12:06 p.m. | #6
On Wed, Sep 12, 2012 at 11:33:33AM +0200, Peter Zijlstra wrote:
> On Mon, 2012-09-10 at 22:26 +0200, Frederic Weisbecker wrote:
> 
> > > > OK, so colour me unconvinced.. why are we doing this?
> > > > 
> > > > Typically when we call schedule nr_running != 1 (we need current to be
> > > > running and a possible target to switch to).
> > > > 
> > > > So I'd prefer to simply have schedule() disable all this adaptive tick
> > > > nonsense and leave it at that.
> > > 
> > > In fact, the only way to get here is through ttwu(), which would have
> > > done the nr_running increment and should have disabled all this adaptive
> > > stuff.
> > > 
> > > So again,.. why?
> > 
> > Ok, indeed if the ttwu happened locally or even remotely through an IPI, the
> > tick engine would stop the tick and exit that RCU extended quiescent state
> > before we reach that place.
> > 
> > I just want to be sure I'm covering every case. There are some places
> > around that call set_need_resched() even if no task is waiting for the CPU
> > (RCU is such an example). In this case the nohz engine won't exit the RCU
> > user mode before schedule().
> > 
> > Another example: a CPU does a remote wake up so it does set_need_resched()
> > and sends the IPI. The target CPU could see the TIF_RESCHED before resuming
> > userspace and call schedule_user() -> schedule() while the IPI has not
> > yet arrived. In this case we need that rcu_user_exit() before schedule might
> > make any use of RCU.
> > 
> > Also when the task returns from schedule(), if it's alone in the runqueue,
> > the rcu_user_enter() that follows can be helpful to stop the tick and
> > enter our RCU quiescent state.
> 
> So all that should have been in the Changelog.

Right, I'm fixing that.

> 
> So the only real problem I see is non IPI remote wakeups, those cannot
> actually restart the tick on the target cpu like we want.

Ok so let identify and validate the corner cases so that I can report them confidently
in the changelog:

1) This can happen if something calls set_need_resched() while no other task is
on the runqueue.

2) Remote wake up done but we haven't yet received the schedule IPI.

3) Non IPI remote wakeup you're referring above, I'm not sure
what you mean though.

> 
> The rest could all be fixed to not need this.
> 
> Hrmm. still don't like the patch, but at least I see why its needed.
> 
> > Also I'm considering this RCU user extended quiescent state as a standalone
> > feature for now. Indeed the only user of it is the adaptive tickless thing.
> > But I'm treating that RCU part independantly for now so that it makes it
> > easier to merge, piecewise.
> 
> But its semantics are very tightly tied to the requirements of adaptive
> tick, so arguing about one without the other doesn't really make sense.
> 

Yeah it's true. Now all of these hooks have been required in practice with
the adaptive tick code running.

Also keeping these hooks unconditional, until the adaptive tick code comes in
and drives that conditionally, is also a nice way to runtime selftest that code (check
there is no missing rcu_user_exit() call for example, easy to check if RCU lockdep
is enabled).
Peter Zijlstra Sept. 12, 2012, 12:41 p.m. | #7
On Wed, 2012-09-12 at 14:06 +0200, Frederic Weisbecker wrote:
> 
> 1) This can happen if something calls set_need_resched() while no other task is
> on the runqueue.

People really shouldn't be doing that... I think I know why RCU does
this, but yuck. I also think RCU can avoid doing this, but its a toss up
if that's worth the trouble.

> 2) Remote wake up done but we haven't yet received the schedule IPI.
> 
> 3) Non IPI remote wakeup you're referring above, I'm not sure
> what you mean though.

Well there's two ways of doing remote wakeups, one is doing the wakeup
from the waking cpu and sending an IPI over to reschedule iff you need
wakeup-preemption, the other is queueing the task remotely and sending
an IPI to do the wakeup on the remote cpu.

The former has the problem, the latter not. 

See ttwu_queue().

We could of course mandate that all remote wakeups to special nohz cpus
get queued. That would just leave us with RCU and it would simply not
send resched IPIs to extended quiescent CPUs anyway, right?

So at that point all return to user schedule() calls have nr_running > 1
and the tick is running and RCU is not in extended quiescent state.
Since either we had nr_running > 1 and pre and post state are the same,
or we had nr_running == 1 and we just got a fresh wakeup pushing it to
2, the wakeup will have executed on our cpu and have re-started the tick
and kicked RCU into active gear again.

We cannot hit return to user schedule() with nr_running == 0, simply
because in that case there's no userspace to return to, only the idle
thread and that's very much not userspace :-)

Hmm ?
Peter Zijlstra Sept. 12, 2012, 12:52 p.m. | #8
On Wed, 2012-09-12 at 14:41 +0200, Peter Zijlstra wrote:

> We could of course mandate that all remote wakeups to special nohz cpus
> get queued. That would just leave us with RCU and it would simply not
> send resched IPIs to extended quiescent CPUs anyway, right?
> 
> So at that point all return to user schedule() calls have nr_running > 1
> and the tick is running and RCU is not in extended quiescent state.
> Since either we had nr_running > 1 and pre and post state are the same,
> or we had nr_running == 1 and we just got a fresh wakeup pushing it to
> 2, the wakeup will have executed on our cpu and have re-started the tick
> and kicked RCU into active gear again.
> 
> We cannot hit return to user schedule() with nr_running == 0, simply
> because in that case there's no userspace to return to, only the idle
> thread and that's very much not userspace :-)
> 
> Hmm ?

Crap.. this will screw over -rt, since the wakeups batch the IPI can
take forever so we had to disable this.

Bugger it.. I so detest this patch.
Frederic Weisbecker Sept. 12, 2012, 1:52 p.m. | #9
On Wed, Sep 12, 2012 at 02:41:36PM +0200, Peter Zijlstra wrote:
> On Wed, 2012-09-12 at 14:06 +0200, Frederic Weisbecker wrote:
> > 
> > 1) This can happen if something calls set_need_resched() while no other task is
> > on the runqueue.
> 
> People really shouldn't be doing that... I think I know why RCU does
> this, but yuck. I also think RCU can avoid doing this, but its a toss up
> if that's worth the trouble.

There are other places that do this. Look at:

$ git grep set_need_resched drivers

drivers/gpu/drm/i915/i915_gem.c:                set_need_resched();
drivers/gpu/drm/ttm/ttm_bo_vm.c:                        set_need_resched();
drivers/gpu/drm/ttm/ttm_bo_vm.c:                        set_need_resched();
drivers/gpu/drm/udl/udl_gem.c:          set_need_resched();

> 
> > 2) Remote wake up done but we haven't yet received the schedule IPI.
> > 
> > 3) Non IPI remote wakeup you're referring above, I'm not sure
> > what you mean though.
> 
> Well there's two ways of doing remote wakeups, one is doing the wakeup
> from the waking cpu and sending an IPI over to reschedule iff you need
> wakeup-preemption, the other is queueing the task remotely and sending
> an IPI to do the wakeup on the remote cpu.
> 
> The former has the problem, the latter not. 

In the former case, if we don't need preemption, we don't call resched_task()
and TIF_RESCHED is not set. So the arch code simply doesn't call schedule_user().

If we need wakeup-preemption, then the problem becomes the 2) above.

Am I missing something?

> 
> See ttwu_queue().
> 
> We could of course mandate that all remote wakeups to special nohz cpus
> get queued.

In any case, I think this is good idea to force remote wake ups in nohz cpus,
at least when rq->nr_running becomes 2.

In my draft branch, I send an IPI from inc_nr_running() when nr_running becomes 2,
so this covers every rq enqueuing scenario, not only wake up. But if we force
queued wakeups, I can avoid sending that specific IPI in wake up cases.

> That would just leave us with RCU and it would simply not
> send resched IPIs to extended quiescent CPUs anyway, right?

RCU doesn't send anymore IPIs to kick out CPUs that are delaying grace
periods. That was not really useful because it wasn't calling set_need_resched(task_cur(cpu))
before doing that. And if it was doing that, we would have missed some bugs
by runtime fixing culprits of stalls.

But RCU calls set_need_resched() from other places like rcu_pending() that
is called from rcu_check_callbacks(). IIRC, this is called
from the tick.

So if RCU sets TIF_RESCHED from the tick, we may call schedule_user()
before that tick resumes userspace.

The other one is on stall detection.

> So at that point all return to user schedule() calls have nr_running > 1
> and the tick is running and RCU is not in extended quiescent state.
> Since either we had nr_running > 1 and pre and post state are the same,
> or we had nr_running == 1 and we just got a fresh wakeup pushing it to
> 2, the wakeup will have executed on our cpu and have re-started the tick
> and kicked RCU into active gear again.

If we can guarantee that, but we have yet to make it clear with set_need_resched()
callers, then we can certainly remove the rcu_user_exit() call in that function.
RCU lockdep would detect what we forgot to think about anyway.

But the rcu_user_enter() call is still valid in the end of schedule_user(). It's going
to be useful only if we stop the tick right after returning from that schedule call
though. The possible window is very thin so it's probably not that worth the optimization.
We can still wait for another tick to stop the timer.

> 
> We cannot hit return to user schedule() with nr_running == 0, simply
> because in that case there's no userspace to return to, only the idle
> thread and that's very much not userspace :-)

Sure :)
 
> Hmm ?

Also I forgot one thing: if CONFIG_RCU_USER_QS is not set, I need to call schedule()
directly instead of schedule_user(). We don't need that intermediate call in this
configuration.
Frederic Weisbecker Sept. 12, 2012, 1:54 p.m. | #10
On Wed, Sep 12, 2012 at 02:52:40PM +0200, Peter Zijlstra wrote:
> On Wed, 2012-09-12 at 14:41 +0200, Peter Zijlstra wrote:
> 
> > We could of course mandate that all remote wakeups to special nohz cpus
> > get queued. That would just leave us with RCU and it would simply not
> > send resched IPIs to extended quiescent CPUs anyway, right?
> > 
> > So at that point all return to user schedule() calls have nr_running > 1
> > and the tick is running and RCU is not in extended quiescent state.
> > Since either we had nr_running > 1 and pre and post state are the same,
> > or we had nr_running == 1 and we just got a fresh wakeup pushing it to
> > 2, the wakeup will have executed on our cpu and have re-started the tick
> > and kicked RCU into active gear again.
> > 
> > We cannot hit return to user schedule() with nr_running == 0, simply
> > because in that case there's no userspace to return to, only the idle
> > thread and that's very much not userspace :-)
> > 
> > Hmm ?
> 
> Crap.. this will screw over -rt, since the wakeups batch the IPI can
> take forever so we had to disable this.

I don't know that part of -rt. Probably we can deal with that later once
we have some upstream code in place?

> 
> Bugger it.. I so detest this patch.
Peter Zijlstra Sept. 12, 2012, 2:09 p.m. | #11
On Wed, 2012-09-12 at 15:54 +0200, Frederic Weisbecker wrote:
> On Wed, Sep 12, 2012 at 02:52:40PM +0200, Peter Zijlstra wrote:
> > On Wed, 2012-09-12 at 14:41 +0200, Peter Zijlstra wrote:
> > 
> > > We could of course mandate that all remote wakeups to special nohz cpus
> > > get queued. That would just leave us with RCU and it would simply not
> > > send resched IPIs to extended quiescent CPUs anyway, right?
> > > 
> > > So at that point all return to user schedule() calls have nr_running > 1
> > > and the tick is running and RCU is not in extended quiescent state.
> > > Since either we had nr_running > 1 and pre and post state are the same,
> > > or we had nr_running == 1 and we just got a fresh wakeup pushing it to
> > > 2, the wakeup will have executed on our cpu and have re-started the tick
> > > and kicked RCU into active gear again.
> > > 
> > > We cannot hit return to user schedule() with nr_running == 0, simply
> > > because in that case there's no userspace to return to, only the idle
> > > thread and that's very much not userspace :-)
> > > 
> > > Hmm ?
> > 
> > Crap.. this will screw over -rt, since the wakeups batch the IPI can
> > take forever so we had to disable this.
> 
> I don't know that part of -rt. Probably we can deal with that later once
> we have some upstream code in place?

Yeah maybe.. its going to be hard though, anyway, I'm going to stare at
wtf drm is doing for a bit :-)
Paul E. McKenney Sept. 12, 2012, 3:21 p.m. | #12
On Wed, Sep 12, 2012 at 02:41:36PM +0200, Peter Zijlstra wrote:
> On Wed, 2012-09-12 at 14:06 +0200, Frederic Weisbecker wrote:
> > 
> > 1) This can happen if something calls set_need_resched() while no other task is
> > on the runqueue.
> 
> People really shouldn't be doing that... I think I know why RCU does
> this, but yuck. I also think RCU can avoid doing this, but its a toss up
> if that's worth the trouble.

It used to do this whenever it had to force quiescent states on a given
CPU more than one time in a given grace period.  I have removed this
(on your advice), and the only remaining use is during RCU CPU stall
warnings, where a given CPU has refused to pass through a quiescent
state for more than a minute.

I could remove that one as well, if it would help.  Though it would
be nice to have -some- way to kick the CPU in that case.  Thoughts?

> > 2) Remote wake up done but we haven't yet received the schedule IPI.
> > 
> > 3) Non IPI remote wakeup you're referring above, I'm not sure
> > what you mean though.
> 
> Well there's two ways of doing remote wakeups, one is doing the wakeup
> from the waking cpu and sending an IPI over to reschedule iff you need
> wakeup-preemption, the other is queueing the task remotely and sending
> an IPI to do the wakeup on the remote cpu.
> 
> The former has the problem, the latter not. 
> 
> See ttwu_queue().
> 
> We could of course mandate that all remote wakeups to special nohz cpus
> get queued. That would just leave us with RCU and it would simply not
> send resched IPIs to extended quiescent CPUs anyway, right?

Indeed it never has done this, at least not in the absence of bugs.

							Thanx, Paul

> So at that point all return to user schedule() calls have nr_running > 1
> and the tick is running and RCU is not in extended quiescent state.
> Since either we had nr_running > 1 and pre and post state are the same,
> or we had nr_running == 1 and we just got a fresh wakeup pushing it to
> 2, the wakeup will have executed on our cpu and have re-started the tick
> and kicked RCU into active gear again.
> 
> We cannot hit return to user schedule() with nr_running == 0, simply
> because in that case there's no userspace to return to, only the idle
> thread and that's very much not userspace :-)
> 
> Hmm ?
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>

Patch

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0bd599b..e841dfc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3463,6 +3463,13 @@  asmlinkage void __sched schedule(void)
 }
 EXPORT_SYMBOL(schedule);
 
+asmlinkage void __sched schedule_user(void)
+{
+	rcu_user_exit();
+	schedule();
+	rcu_user_enter();
+}
+
 /**
  * schedule_preempt_disabled - called with preemption disabled
  *