
[RFC/RFT,0/2] Forced idle and Non-RCU local softirq pending

Message ID 20221215184300.1592872-1-srinivas.pandruvada@linux.intel.com
State New

Commit Message

Srinivas Pandruvada Dec. 15, 2022, 6:42 p.m. UTC
Linux has supported idle injection for a while. To inject idle time,
play_idle_precise() is used.
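
[Editor's note: for context, this is roughly how a caller such as
drivers/powercap/idle_inject.c drives play_idle_precise() from a per-CPU
FIFO kthread; the helper name below is illustrative, not from this series.]

	#include <linux/cpu.h>		/* play_idle_precise() */
	#include <linux/ktime.h>	/* NSEC_PER_USEC */

	/*
	 * Force this CPU idle for idle_us microseconds, allowing at most
	 * latency_us of exit latency. Must run from a kthread pinned to
	 * the target CPU, as powercap/idle_inject arranges.
	 */
	static void inject_idle_once(u64 idle_us, u64 latency_us)
	{
		play_idle_precise(idle_us * NSEC_PER_USEC,
				  latency_us * NSEC_PER_USEC);
	}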

When idle time is injected using play_idle_precise(), there are a couple of issues:

1. Sometimes there are warnings in the kernel log:

[147777.095484] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #08!!!
[147777.099719] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #288!!!
[147777.103725] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #288!!!

2. Softirq processing is delayed

A sample kernel trace is in the commit log of patch 0001.

There were offline discussions with Frederic and Peter on this issue.
Frederic sent a test patch with some todos, which I tried to address.
The solution proposed here is that if ksoftirqd has work pending, break out
of the do_idle() loop and give it 1 jiffy to run via schedule_timeout(1).
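
In outline, the fix looks like this (condensed from the diffs discussed
below; timer setup, PF_IDLE and preemption handling omitted):

	do {
		/* Stay idle until the injection timer fires or ksoftirqd
		 * becomes runnable on this CPU. */
		while (!READ_ONCE(it.done) &&
		       !task_is_running(__this_cpu_read(ksoftirqd)))
			do_idle();

		/* Give ksoftirqd 1 jiffy to get a chance to start its job. */
		if (!READ_ONCE(it.done) &&
		    task_is_running(__this_cpu_read(ksoftirqd))) {
			__set_current_state(TASK_UNINTERRUPTIBLE);
			schedule_timeout(1);
		}
	} while (!READ_ONCE(it.done));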

The conversation is pasted here to establish context:

On Sat, Sep 18, 2021 at 08:55:48AM +0200, Peter Zijlstra wrote:
> On Fri, Sep 17, 2021 at 11:42:21PM +0200, Frederic Weisbecker wrote:
> > On Mon, Sep 13, 2021 at 02:58:59AM +0000, Pandruvada, Srinivas wrote:
> > > Hi Frederic,
> > > 
> > > Peter suggested to contact you regarding some issues with force idle
> > > and softirqs. You may have some changes in work or suggestions.
> > > 
> > > We are trying to use idle injection on some CPUs for thermal and
> > > performance reasons. This is done via Linux idle_injection interface
> > > (powercap/idle_inject.c) which calls scheduler function
> > > play_idle_precise(). This results in calling can_stop_idle_tick() via
> > > tick_nohz_idle_stop_tick(), which results in printing of:
> > > 
> > > [  185.765383] NOHZ tick-stop error: Non-RCU local softirq work is pending, CPU 207 handler #08!!!
> > > 
> > > 
> > > So when the tick is about to be stopped, either this work needs to be
> > > migrated or we wait for the softirq to be executed and then stop the
> > > tick on the CPU. Please suggest.
> > 
> > You can't blindly migrate softirqs because they often depend on the CPU they
> > are queued on. So you need to wait for them to execute.
> > 
> > As for how to adapt the warning to take idle injection into consideration,
> > I need to understand something first: how come we reach this path without
> > need_resched() set?
> 
> It might be set, but the idle inject thread wins over ksoftirqd, it
> being FIFO.

Ah ok.

> > Also looking at play_idle_precise(), we only ever escape the idle loop once
> > the idle inject timer has fired. The need for resched is never checked to break
> > the loop.
> 
> do_idle() has a schedule() loop in it; it will happily schedule.

Oops, forgot my basics...

> The thing is that the idle injection thread is typically the highest
> priority runnable thread and as such will starve things (on purpose).
> 
> Only higher prio FIFO, any DEADLINE or the STOP thread can effectively
> preempt idle injection (and actual IRQs of course).

I see... In fact need_resched() shouldn't even be set in this case I guess...

> 
> So I suppose an IRQ can happen, not finish the softirq in its tail, try
> and punt to ksoftirqd and not get scheduled because idle (injection)
> wins on priority.
> 
> The question is what do we want to do there... we could just run the
> softirq crap from the idle injection thread, seeing how the work
> shouldn't be there in the first place, but since it is, it needs
> doing.
> 
> Feels gross tho...

How about this other gross solution (untested)?:



> > How about this other gross solution (untested)?:
> > 
> It causes an NMI watchdog warning because of a lockup on the CPU where
> the idle injection is done. Attached the dump.
> 
> I have to add the following diff on top to avoid the lockup. With this
> I don't see the "NOHZ tick-stop error: Non-RCU local softirq work is
> pending," warning.
> 
> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index a747a36330a8..e1ec5157a671 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -394,13 +394,18 @@ void play_idle_precise(u64 duration_ns, u64 latency_ns)
>                 while (!READ_ONCE(it.done) && !task_is_running(__this_cpu_read(ksoftirqd)))
>                         do_idle();
>  
> +               hrtimer_cancel(&it.timer);
> +
>                 cpuidle_use_deepest_state(0);
>                 current->flags &= ~PF_IDLE;
>  
>                 preempt_fold_need_resched();
>                 preempt_enable();
>                 /* Give ksoftirqd 1 jiffy to get a chance to start its job */
> +               if (!READ_ONCE(it.done) && task_is_running(__this_cpu_read(ksoftirqd))) {
> +                       __set_current_state(TASK_UNINTERRUPTIBLE);
>                         schedule_timeout(1);
> +               }
>         } while (!READ_ONCE(it.done));
>  }
>  EXPORT_SYMBOL_GPL(play_idle_precise);

Ah right.

Also, beware of a few details:

1) This can loop forever if there is long and sustained softirq activity.
   So we need to define some timeout. This also means play_idle_precise() should
   return some error.

<Patch 0002 adds a maximum duration limit as a parameter.>

2) Do you need to make that loop interruptible? I don't know if the idle
   injection request comes directly from userspace or from some kernel thread.
<This is done via a kernel thread. User space can't interrupt it; it can
change the idle percentage, which will be picked up the next time by
powercap/idle_inject.>

3) Do you need to subtract the time spent waiting for softirq execution from
   the idle injection time? Probably not, I guess it depends on the role played
   by this idle injection but I figured I should ask.
<It doesn't need to be that accurate, as this is for thermal control.>

4) An interrupt can fire in the middle of the idle injection, raising a softirq.
   In this case you need to re-inject the remaining idle time.
   eg: Imagine you program a 3 second idle injection. You sleep 1 second, an
   interrupt fires and raises a softirq, you schedule out, then once the softirq
   is handled you need to reprogram 2 seconds.
<Handled in patch 0001 by restarting the timer for the remaining time; see the
sketch after this list.>

5) We still need to handle __cpuidle sections.
<Added a FIXME for the __cpuidle handling, but I don't understand the need
for noinstr and instrumentation_begin()/end().>
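
[Editor's sketch for point 4, not taken from the series: capture the
unexpired part of the injection window and re-arm the timer with it.
hrtimer_get_remaining() is a real API; the structure around it is only a
guess at what patch 0001 does.]

	/* On an early exit from the idle loop, re-inject only what is left. */
	ktime_t remaining = hrtimer_get_remaining(&it.timer);

	hrtimer_cancel(&it.timer);
	if (ktime_to_ns(remaining) > 0)
		hrtimer_start(&it.timer, remaining,
			      HRTIMER_MODE_REL_PINNED_HARD);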


Frederic Weisbecker (1):
  sched/core: Check and schedule ksoftirq

Srinivas Pandruvada (1):
  sched/core: Define max duration to play_precise_idle()

 drivers/powercap/idle_inject.c |  4 ++-
 include/linux/cpu.h            |  4 +--
 kernel/sched/idle.c            | 66 ++++++++++++++++++++++++----------
 3 files changed, 52 insertions(+), 22 deletions(-)

Comments

Peter Zijlstra Dec. 16, 2022, 11:19 a.m. UTC | #1
On Thu, Dec 15, 2022 at 10:42:59AM -0800, Srinivas Pandruvada wrote:
> +		/* Give ksoftirqd 1 jiffy to get a chance to start its job */
> +		if (!READ_ONCE(it.done) && task_is_running(__this_cpu_read(ksoftirqd))) {
> +			__set_current_state(TASK_UNINTERRUPTIBLE);
> +			schedule_timeout(1);
> +		}

That's absolutely disgusting :-/
Srinivas Pandruvada Dec. 16, 2022, 4:58 p.m. UTC | #2
On Fri, 2022-12-16 at 12:19 +0100, Peter Zijlstra wrote:
> On Thu, Dec 15, 2022 at 10:42:59AM -0800, Srinivas Pandruvada wrote:
> > +               /* Give ksoftirqd 1 jiffy to get a chance to start its job */
> > +               if (!READ_ONCE(it.done) && task_is_running(__this_cpu_read(ksoftirqd))) {
> > +                       __set_current_state(TASK_UNINTERRUPTIBLE);
> > +                       schedule_timeout(1);
> > +               }
> 
> That's absolutely disgusting :-/
What is the alternative? Process softirqs in this task's context just this
once?

Thanks,
Srinivas
Frederic Weisbecker Dec. 16, 2022, 10:07 p.m. UTC | #3
On Fri, Dec 16, 2022 at 12:19:34PM +0100, Peter Zijlstra wrote:
> On Thu, Dec 15, 2022 at 10:42:59AM -0800, Srinivas Pandruvada wrote:
> > +		/* Give ksoftirqd 1 jiffy to get a chance to start its job */
> > +		if (!READ_ONCE(it.done) && task_is_running(__this_cpu_read(ksoftirqd))) {
> > +			__set_current_state(TASK_UNINTERRUPTIBLE);
> > +			schedule_timeout(1);
> > +		}
> 
> That's absolutely disgusting :-/

I know, and I hate checking task_is_running(__this_cpu_read(ksoftirqd))
everywhere in idle. And in fact it doesn't work because some cpuidle drivers
also do need_resched() checks.

I guess that either we assume that the idle injection is more important
than serving softirqs and we shut down the warnings accordingly, or we
arrange for idle injection to have a lower prio than ksoftirqd.

Thoughts?
Sebastian Andrzej Siewior Dec. 19, 2022, 11:46 a.m. UTC | #4
On 2022-12-19 12:33:22 [+0100], Peter Zijlstra wrote:
> ksoftirqd is typically a CFS task while idle injection is required to be
> a FIFO (typically FIFO-1) task -- so that would require lifting
> ksoftirqd to FIFO and that has other problems.
> 
> I'm guessing the problem case is where idle injection comes in while
> ksoftirqd is running (and preempted), because in that case you cannot run
> the softirq stuff in-place.
> 
> The 'right' thing to do would be to PI boost ksoftirqd in that case I
> suppose. Perhaps something like so, it would boost ksoftirqd when it's
> running, and otherwise runs the ksoftirqd thing in-situ.
> 
> I've not really thought hard about this though, so perhaps I completely
> wrecked things.

I don't know why you intend to run ksoftirqd, but in general it breaks RT
left and right and we attempt to avoid running ksoftirqd as much as
possible. If it runs and you go and boost it, then it probably gets even
worse from an RT point of view.

ksoftirqd runs softirqs from hardirq context. Everything else is handled
within a local_bh_disable()+enable loop. We already have the
boost-ksoftirqd hammer, which is the per-CPU BKL called
local_bh_disable(). In general everything should be moved out of it.
For timers we have the ktimerd thread, which needs cleaning up.

Sebastian
Peter Zijlstra Dec. 19, 2022, 2:56 p.m. UTC | #5
On Mon, Dec 19, 2022 at 12:46:54PM +0100, Sebastian Andrzej Siewior wrote:
> On 2022-12-19 12:33:22 [+0100], Peter Zijlstra wrote:
> > ksoftirqd is typically a CFS task while idle injection is required to be
> > a FIFO (typically FIFO-1) task -- so that would require lifting
> > ksoftirqd to FIFO and that has other problems.
> > 
> > I'm guessing the problem case is where idle injection comes in while
> > ksoftirqd is running (and preempted), because in that case you cannot run
> > the softirq stuff in-place.
> > 
> > The 'right' thing to do would be to PI boost ksoftirqd in that case I
> > suppose. Perhaps something like so, it would boost ksoftirqd when it's
> > running, and otherwise runs the ksoftirqd thing in-situ.
> > 
> > I've not really thought hard about this though, so perhaps I completely
> > wrecked things.
> 
> I don't know why you intend to run ksoftirqd but in general it breaks RT

So the upstream problem is where a softirq is punted to ksoftirqd (obviously
from the hardirq tail) and idle-injection comes in and either:

 - runs before ksoftirq had a chance to run, or
 - preempts ksoftirqd.

In both cases we'll appear to go idle with softirqs pending -- which
triggers a pesky warning, because that is obviously undesirable.

In the first case we can run 'ksoftirqd' in-context, but then we need to
serialize against the actual ksoftirqd. In the second case we need to
serialize against ksoftirqd and ensure it is complete before proceeding
with going 'idle'.

Since idle-injection runs FIFO and ksoftirqd typically does not, some
form of PI is required.

> left and right and we attempt to avoid running ksoftirqd as much as
> possible. If it runs and you go and boost it then it probably gets even
> worse from RT point of view.

So if you never run ksoftirqd, the above problem doesn't happen. If
ksoftirqd *can* run, you still need to deal with it somehow. Boosting
ksoftirqd to whatever priority the idle-injection thread has should not
be worse than anything the idle-injection already does, no?

Specifically, without idle-injection involved the patch as proposed (+-
fixes required to make it compile and work) should be a no-op.

> ksoftirqd runs softirqs from hardirq context. Everything else is handled
> within a local_bh_disable()+enable loop. We already have the
> boost-ksoftirqd hammer, which is the per-CPU BKL called
> local_bh_disable(). In general everything should be moved out of it.
> For timers we have the ktimerd thread, which needs cleaning up.
Peter Zijlstra Dec. 20, 2022, 8:51 p.m. UTC | #6
On Mon, Dec 19, 2022 at 02:54:55PM -0800, srinivas pandruvada wrote:

> But after ksoftirqd_run_end(), which will re-enable local irqs, this may
> further cause more softirqs pending. Here RCU softirq entry continues

Right you are.. what about if we spell the idle.c thing like so instead?

Then we repeat the softirq thing every time we drop out of the idle loop
for a reschedule.

diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index f26ab2675f7d..6dce49813bcc 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -381,8 +381,13 @@ void play_idle_precise(u64 duration_ns, u64 latency_ns)
 	hrtimer_start(&it.timer, ns_to_ktime(duration_ns),
 		      HRTIMER_MODE_REL_PINNED_HARD);
 
-	while (!READ_ONCE(it.done))
+	while (!READ_ONCE(it.done)) {
+		rt_mutex_lock(&per_cpu(ksoftirq_lock, cpu));
+		__run_ksoftirqd(smp_processor_id());
+		rt_mutex_unlock(&per_cpu(ksoftirq_lock, cpu));
+
 		do_idle();
+	}
 
 	cpuidle_use_deepest_state(0);
 	current->flags &= ~PF_IDLE;
Peter Zijlstra Dec. 20, 2022, 9:18 p.m. UTC | #7
On Tue, Dec 20, 2022 at 09:51:09PM +0100, Peter Zijlstra wrote:
> On Mon, Dec 19, 2022 at 02:54:55PM -0800, srinivas pandruvada wrote:
> 
> > But after ksoftirqd_run_end(), which will re-enable local irqs, this may
> > further cause more softirqs pending. Here RCU softirq entry continues
> 
> Right you are.. what about if we spell the idle.c thing like so instead?
> 
> Then we repeat the softirq thing every time we drop out of the idle loop
> for a reschedule.

Uff, that obviously can't work because we already have preemption
disabled here, this needs a bit more work. I think it's possible to
re-arrange things a bit.

I'll try and have a look tomorrow, but the kids have their xmas play at
school so who knows what I'll get done.

> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index f26ab2675f7d..6dce49813bcc 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -381,8 +381,13 @@ void play_idle_precise(u64 duration_ns, u64 latency_ns)
>  	hrtimer_start(&it.timer, ns_to_ktime(duration_ns),
>  		      HRTIMER_MODE_REL_PINNED_HARD);
>  
> -	while (!READ_ONCE(it.done))
> +	while (!READ_ONCE(it.done)) {
> +		rt_mutex_lock(&per_cpu(ksoftirq_lock, cpu));
> +		__run_ksoftirqd(smp_processor_id());
> +		rt_mutex_unlock(&per_cpu(ksoftirq_lock, cpu));
> +
>  		do_idle();
> +	}
>  
>  	cpuidle_use_deepest_state(0);
>  	current->flags &= ~PF_IDLE;
Srinivas Pandruvada Jan. 10, 2023, 2:33 a.m. UTC | #8
On Tue, 2022-12-20 at 22:18 +0100, Peter Zijlstra wrote:
> On Tue, Dec 20, 2022 at 09:51:09PM +0100, Peter Zijlstra wrote:
> > On Mon, Dec 19, 2022 at 02:54:55PM -0800, srinivas pandruvada wrote:
> > 
> > > But after ksoftirqd_run_end(), which will re-enable local irqs, this may
> > > further cause more softirqs pending. Here RCU softirq entry continues
> > 
> > Right you are.. what about if we spell the idle.c thing like so instead?
> > 
> > Then we repeat the softirq thing every time we drop out of the idle loop
> > for a reschedule.
> 
> Uff, that obviously can't work because we already have preemption
> disabled here, this needs a bit more work. I think it's possible to
> re-arrange things a bit.
That didn't work.
Also, when __do_softirq() returns, softirqs can be pending again. I think
that if local_softirq_pending() is set, we can break out of the do_idle()
loop, as in the sketch below.
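
[Editor's sketch of this suggestion, not code from the thread: extend the
loop-break predicate that patch 0001 adds (idle_loop_should_break() in the
diff below) with a pending-softirq check.]

	static bool idle_loop_should_break(void)
	{
		/* Leave the forced-idle loop if a reschedule is due, if
		 * ksoftirqd is runnable, or if softirqs are pending on
		 * this CPU (local_softirq_pending(), <linux/interrupt.h>). */
		return need_resched() ||
		       task_is_running(__this_cpu_read(ksoftirqd)) ||
		       local_softirq_pending();
	}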

Thanks,
Srinivas

> 
> I'll try and have a look tomorrow, but the kids have their xmas play at
> school so who knows what I'll get done.
> 
> > diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> > index f26ab2675f7d..6dce49813bcc 100644
> > --- a/kernel/sched/idle.c
> > +++ b/kernel/sched/idle.c
> > @@ -381,8 +381,13 @@ void play_idle_precise(u64 duration_ns, u64 latency_ns)
> >         hrtimer_start(&it.timer, ns_to_ktime(duration_ns),
> >                       HRTIMER_MODE_REL_PINNED_HARD);
> >  
> > -       while (!READ_ONCE(it.done))
> > +       while (!READ_ONCE(it.done)) {
> > +               rt_mutex_lock(&per_cpu(ksoftirq_lock, cpu));
> > +               __run_ksoftirqd(smp_processor_id());
> > +               rt_mutex_unlock(&per_cpu(ksoftirq_lock, cpu));
> > +
> >                 do_idle();
> > +       }
> >  
> >         cpuidle_use_deepest_state(0);
> >         current->flags &= ~PF_IDLE;

Patch

diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index d17b0a5ce6ac..882c48441469 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -52,6 +52,12 @@  static int __init cpu_idle_nopoll_setup(char *__unused)
 __setup("hlt", cpu_idle_nopoll_setup);
 #endif
 
+/* FIXME: handle __cpuidle / instrumentation_begin()/end() */
+static bool idle_loop_should_break(void)
+{
+	return need_resched() || task_is_running(__this_cpu_read(ksoftirqd));
+}
+
 static noinline int __cpuidle cpu_idle_poll(void)
 {
 	trace_cpu_idle(0, smp_processor_id());
@@ -59,7 +65,7 @@  static noinline int __cpuidle cpu_idle_poll(void)
 	rcu_idle_enter();
 	local_irq_enable();
 
-	while (!tif_need_resched() &&
+	while (!idle_loop_should_break() &&
 	       (cpu_idle_force_poll || tick_check_broadcast_expired()))
 		cpu_relax();
 
@@ -177,7 +183,7 @@  static void cpuidle_idle_call(void)
 	 * Check if the idle task must be rescheduled. If it is the
 	 * case, exit the function after re-enabling the local irq.
 	 */
-	if (need_resched()) {
+	if (idle_loop_should_break()) {
 		local_irq_enable();
 		return;
 	}
@@ -279,7 +285,7 @@  static void do_idle(void)
 	__current_set_polling();
 	tick_nohz_idle_enter();
 
-	while (!need_resched()) {
+	while (!idle_loop_should_break()) {
 		rmb();
 
 		local_irq_disable();
@@ -373,25 +379,31 @@  void play_idle_precise(u64 duration_ns, u64 latency_ns)
 	WARN_ON_ONCE(!duration_ns);
 	WARN_ON_ONCE(current->mm);
 
-	rcu_sleep_check();
-	preempt_disable();
-	current->flags |= PF_IDLE;
-	cpuidle_use_deepest_state(latency_ns);
+	do {
+		rcu_sleep_check();
+		preempt_disable();
+		current->flags |= PF_IDLE;
+		cpuidle_use_deepest_state(latency_ns);
 
-	it.done = 0;
-	hrtimer_init_on_stack(&it.timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_HARD);
-	it.timer.function = idle_inject_timer_fn;
-	hrtimer_start(&it.timer, ns_to_ktime(duration_ns),
-		      HRTIMER_MODE_REL_PINNED_HARD);
+		it.done = 0;
+		hrtimer_init_on_stack(&it.timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_HARD);
+		it.timer.function = idle_inject_timer_fn;
+		hrtimer_start(&it.timer, ns_to_ktime(duration_ns),
+			      HRTIMER_MODE_REL_PINNED_HARD);
 
-	while (!READ_ONCE(it.done))
-		do_idle();
+		while (!READ_ONCE(it.done) && !task_is_running(__this_cpu_read(ksoftirqd)))
+			do_idle();
+
+		cpuidle_use_deepest_state(0);
+		current->flags &= ~PF_IDLE;
 
-	cpuidle_use_deepest_state(0);
-	current->flags &= ~PF_IDLE;
+		preempt_fold_need_resched();
+		preempt_enable();
 
-	preempt_fold_need_resched();
-	preempt_enable();
+		/* Give ksoftirqd 1 jiffy to get a chance to start its job */
+		if (!READ_ONCE(it.done) && task_is_running(__this_cpu_read(ksoftirqd)))
+			schedule_timeout(1);
+	} while (!READ_ONCE(it.done));
 }
 EXPORT_SYMBOL_GPL(play_idle_precise);