
[RFC/RFT,0/2] Forced idle and Non-RCU local softirq pending

Message ID 20221215184300.1592872-1-srinivas.pandruvada@linux.intel.com
State New

Commit Message

Srinivas Pandruvada Dec. 15, 2022, 6:42 p.m. UTC
Linux has supported idle injection for a while. To inject idle time,
play_idle_precise() is used.
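
[Editor's note: for context, this is roughly how a caller such as
drivers/powercap/idle_inject.c drives play_idle_precise() from a per-CPU
FIFO kthread; the helper name below is illustrative, not from this series.]

	#include <linux/cpu.h>		/* play_idle_precise() */
	#include <linux/ktime.h>	/* NSEC_PER_USEC */

	/*
	 * Force this CPU idle for idle_us microseconds, allowing at most
	 * latency_us of exit latency. Must run from a kthread pinned to
	 * the target CPU, as powercap/idle_inject arranges.
	 */
	static void inject_idle_once(u64 idle_us, u64 latency_us)
	{
		play_idle_precise(idle_us * NSEC_PER_USEC,
				  latency_us * NSEC_PER_USEC);
	}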

When idle time is injected using play_idle_precise(), there are a couple of issues:

1. Sometimes there are warnings in the kernel log:

[147777.095484] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #08!!!
[147777.099719] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #288!!!
[147777.103725] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #288!!!

2. Softirq processing is delayed

A sample kernel trace is in the commit log of patch 0001.

There were offline discussions with Frederic and Peter on this issue.
Frederic sent a test patch with some todos, which I tried to address.
The solution proposed here is that if ksoftirqd has work pending, break out
of the do_idle() loop and give it 1 jiffy to run via schedule_timeout(1).
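
In outline, the fix looks like this (condensed from the diffs discussed
below; timer setup, PF_IDLE and preemption handling omitted):

	do {
		/* Stay idle until the injection timer fires or ksoftirqd
		 * becomes runnable on this CPU. */
		while (!READ_ONCE(it.done) &&
		       !task_is_running(__this_cpu_read(ksoftirqd)))
			do_idle();

		/* Give ksoftirqd 1 jiffy to get a chance to start its job. */
		if (!READ_ONCE(it.done) &&
		    task_is_running(__this_cpu_read(ksoftirqd))) {
			__set_current_state(TASK_UNINTERRUPTIBLE);
			schedule_timeout(1);
		}
	} while (!READ_ONCE(it.done));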

The conversation is pasted here to establish context:

On Sat, Sep 18, 2021 at 08:55:48AM +0200, Peter Zijlstra wrote:
> On Fri, Sep 17, 2021 at 11:42:21PM +0200, Frederic Weisbecker wrote:
> > On Mon, Sep 13, 2021 at 02:58:59AM +0000, Pandruvada, Srinivas wrote:
> > > Hi Frederic,
> > > 
> > > Peter suggested to contact you regarding some issues with force idle
> > > and softirqs. You may have some changes in work or suggestions.
> > > 
> > > We are trying to use idle injection on some CPUs for thermal and
> > > performance reasons. This is done via Linux idle_injection interface
> > > (powercap/idle_inject.c) which calls scheduler function
> > > play_idle_precise(). This results in calling can_stop_idle_tick() via
> > > tick_nohz_idle_stop_tick(), which results in printing of:
> > > 
> > > [  185.765383] NOHZ tick-stop error: Non-RCU local softirq work is pending, CPU 207 handler #08!!!
> > > 
> > > 
> > > So when the tick is about to be stopped, either this work needs to be
> > > migrated or we wait for the softirq to be executed and then stop the
> > > tick on the CPU. Please suggest.
> > 
> > You can't blindly migrate softirqs because they often depend on the CPU they
> > are queued on. So you need to wait for them to execute.
> > 
> > As for how to adapt the warning to take idle injection into consideration,
> > I need to understand something first: how come we reach this path without
> > need_resched() set?
> 
> It might be set, but the idle inject thread wins over ksoftirqd, it
> being FIFO.

Ah ok.

> > Also looking at play_idle_precise(), we only ever escape the idle loop once
> > the idle inject timer has fired. The need for resched is never checked to break
> > the loop.
> 
> do_idle() has a schedule() loop in it; it will happily schedule.

Oops, forgot my basics...

> The thing is that the idle injection thread is typically the highest
> priority runnable thread and as such will starve things (on purpose).
> 
> Only higher prio FIFO, any DEADLINE or the STOP thread can effectively
> preempt idle injection (and actual IRQs of course).

I see... In fact need_resched() shouldn't even be set in this case I guess...

> 
> So I suppose an IRQ can happen, not finish the softirq in its tail, try
> and punt to ksoftirqd and not get scheduled because idle (injection)
> wins on priority.
> 
> The question is what do we want to do there... we could just run the
> softirq crap from the idle injection thread, seeing how the work
> shouldn't be there in the first place, but since it is, it needs
> doing.
> 
> Feels gross tho...

How about this other gross solution (untested)?:



> > How about this other gross solution (untested)?:
> > 
> It causes an NMI watchdog warning because of a lockup on the CPU where
> the idle injection is done. Attached the dump.
> 
> I have to add the following diff on top to avoid the lockup. With this
> I don't see the "NOHZ tick-stop error: Non-RCU local softirq work is
> pending," warning.
> 
> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index a747a36330a8..e1ec5157a671 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -394,13 +394,18 @@ void play_idle_precise(u64 duration_ns, u64 latency_ns)
>                 while (!READ_ONCE(it.done) && !task_is_running(__this_cpu_read(ksoftirqd)))
>                         do_idle();
>  
> +               hrtimer_cancel(&it.timer);
> +
>                 cpuidle_use_deepest_state(0);
>                 current->flags &= ~PF_IDLE;
>  
>                 preempt_fold_need_resched();
>                 preempt_enable();
>                 /* Give ksoftirqd 1 jiffy to get a chance to start its job */
> +               if (!READ_ONCE(it.done) && task_is_running(__this_cpu_read(ksoftirqd))) {
> +                       __set_current_state(TASK_UNINTERRUPTIBLE);
>                         schedule_timeout(1);
> +               }
>         } while (!READ_ONCE(it.done));
>  }
>  EXPORT_SYMBOL_GPL(play_idle_precise);

Ah right.

Also, beware of a few details:

1) This can loop forever if there is long and sustained softirq activity.
   So we need to define some timeout. This also means play_idle_precise() should
   return some error.

<Patch 0002 adds a maximum duration limit as a parameter.>

2) Do you need to make that loop interruptible? I don't know if the idle
   injection request comes directly from userspace or from some kernel thread.
<This is done via a kernel thread. User space can't interrupt it; it can
change the idle percentage, which will be picked up the next time by
powercap/idle_inject.>

3) Do you need to subtract the time spent waiting for softirq execution from
   the idle injection time? Probably not, I guess it depends on the role played
   by this idle injection but I figured I should ask.
<It doesn't need to be that accurate, as this is for thermal control.>

4) An interrupt can fire in the middle of the idle injection, raising a softirq.
   In this case you need to re-inject the remaining idle time.
   eg: Imagine you program a 3 second idle injection. You sleep 1 second, an
   interrupt fires and raises a softirq, you schedule out, then once the softirq
   is handled you need to reprogram 2 seconds.
<Handled in patch 0001 by restarting the timer for the remaining time; see the
sketch after this list.>

5) We still need to handle __cpuidle sections.
<Added a FIXME for the __cpuidle handling, but I don't understand the need
for noinstr and instrumentation_begin()/end().>
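
[Editor's sketch for point 4, not taken from the series: capture the
unexpired part of the injection window and re-arm the timer with it.
hrtimer_get_remaining() is a real API; the structure around it is only a
guess at what patch 0001 does.]

	/* On an early exit from the idle loop, re-inject only what is left. */
	ktime_t remaining = hrtimer_get_remaining(&it.timer);

	hrtimer_cancel(&it.timer);
	if (ktime_to_ns(remaining) > 0)
		hrtimer_start(&it.timer, remaining,
			      HRTIMER_MODE_REL_PINNED_HARD);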


Frederic Weisbecker (1):
  sched/core: Check and schedule ksoftirq

Srinivas Pandruvada (1):
  sched/core: Define max duration to play_precise_idle()

 drivers/powercap/idle_inject.c |  4 ++-
 include/linux/cpu.h            |  4 +--
 kernel/sched/idle.c            | 66 ++++++++++++++++++++++++----------
 3 files changed, 52 insertions(+), 22 deletions(-)

Comments

Peter Zijlstra Dec. 16, 2022, 11:19 a.m. UTC | #1
On Thu, Dec 15, 2022 at 10:42:59AM -0800, Srinivas Pandruvada wrote:
> +		/* Give ksoftirqd 1 jiffy to get a chance to start its job */
> +		if (!READ_ONCE(it.done) && task_is_running(__this_cpu_read(ksoftirqd))) {
> +			__set_current_state(TASK_UNINTERRUPTIBLE);
> +			schedule_timeout(1);
> +		}

That's absolutely disgusting :-/
Srinivas Pandruvada Dec. 16, 2022, 4:58 p.m. UTC | #2
On Fri, 2022-12-16 at 12:19 +0100, Peter Zijlstra wrote:
> On Thu, Dec 15, 2022 at 10:42:59AM -0800, Srinivas Pandruvada wrote:
> > +               /* Give ksoftirqd 1 jiffy to get a chance to start its job */
> > +               if (!READ_ONCE(it.done) && task_is_running(__this_cpu_read(ksoftirqd))) {
> > +                       __set_current_state(TASK_UNINTERRUPTIBLE);
> > +                       schedule_timeout(1);
> > +               }
> 
> That's absolutely disgusting :-/
What is the alternative? Process softirqs in this task's context just this
once?

Thanks,
Srinivas
Frederic Weisbecker Dec. 16, 2022, 10:07 p.m. UTC | #3
On Fri, Dec 16, 2022 at 12:19:34PM +0100, Peter Zijlstra wrote:
> On Thu, Dec 15, 2022 at 10:42:59AM -0800, Srinivas Pandruvada wrote:
> > +		/* Give ksoftirqd 1 jiffy to get a chance to start its job */
> > +		if (!READ_ONCE(it.done) && task_is_running(__this_cpu_read(ksoftirqd))) {
> > +			__set_current_state(TASK_UNINTERRUPTIBLE);
> > +			schedule_timeout(1);
> > +		}
> 
> That's absolutely disgusting :-/

I know, and I hate checking task_is_running(__this_cpu_read(ksoftirqd))
everywhere in idle. And in fact it doesn't work because some cpuidle drivers
also do need_resched() checks.

I guess that either we assume that the idle injection is more important
than serving softirqs and we shut down the warnings accordingly, or we
arrange for idle injection to have a lower prio than ksoftirqd.

Thoughts?
Sebastian Andrzej Siewior Dec. 19, 2022, 11:46 a.m. UTC | #4
On 2022-12-19 12:33:22 [+0100], Peter Zijlstra wrote:
> ksoftirqd is typically a CFS task while idle injection is required to be
> a FIFO (typically FIFO-1) task -- so that would require lifting
> ksoftirqd to FIFO and that has other problems.
> 
> I'm guessing the problem case is where idle injection comes in while
> ksoftirqd is running (and preempted), because in that case you cannot run
> the softirq stuff in-place.
> 
> The 'right' thing to do would be to PI boost ksoftirqd in that case I
> suppose. Perhaps something like so, it would boost ksoftirqd when it's
> running, and otherwise runs the ksoftirqd thing in-situ.
> 
> I've not really thought hard about this though, so perhaps I completely
> wrecked things.

I don't know why you intend to run ksoftirqd, but in general it breaks RT
left and right and we attempt to avoid running ksoftirqd as much as
possible. If it runs and you go and boost it, then it probably gets even
worse from an RT point of view.

ksoftirqd runs softirqs from hardirq context. Everything else is handled
within a local_bh_disable()+enable loop. We already have the
boost-ksoftirqd hammer, which is the per-CPU BKL called
local_bh_disable(). In general everything should be moved out of it.
For timers we have the ktimerd thread, which needs cleaning up.

Sebastian
Peter Zijlstra Dec. 19, 2022, 2:56 p.m. UTC | #5
On Mon, Dec 19, 2022 at 12:46:54PM +0100, Sebastian Andrzej Siewior wrote:
> On 2022-12-19 12:33:22 [+0100], Peter Zijlstra wrote:
> > ksoftirqd is typically a CFS task while idle injection is required to be
> > a FIFO (typically FIFO-1) task -- so that would require lifting
> > ksoftirqd to FIFO and that has other problems.
> > 
> > I'm guessing the problem case is where idle injection comes in while
> > ksoftirqd is running (and preempted), because in that case you cannot run
> > the softirq stuff in-place.
> > 
> > The 'right' thing to do would be to PI boost ksoftirqd in that case I
> > suppose. Perhaps something like so, it would boost ksoftirqd when it's
> > running, and otherwise runs the ksoftirqd thing in-situ.
> > 
> > I've not really thought hard about this though, so perhaps I completely
> > wrecked things.
> 
> I don't know why you intend to run ksoftirqd but in general it breaks RT

So the upstream problem is where a softirq is punted to ksoftirqd (obviously
from the hardirq tail) and idle-injection comes in and either:

 - runs before ksoftirq had a chance to run, or
 - preempts ksoftirqd.

In both cases we'll appear to go idle with softirqs pending -- which
triggers a pesky warning, because that is obviously undesirable.

In the first case we can run 'ksoftirqd' in-context, but then we need to
serialize against the actual ksoftirqd. In the second case we need to
serialize against ksoftirqd and ensure it is complete before proceeding
with going 'idle'.

Since idle-injection runs FIFO and ksoftirqd typically does not, some
form of PI is required.

> left and right and we attempt to avoid running ksoftirqd as much as
> possible. If it runs and you go and boost it then it probably gets even
> worse from RT point of view.

So if you never run ksoftirqd, the above problem doesn't happen. If
ksoftirqd *can* run, you still need to deal with it somehow. Boosting
ksoftirqd to whatever priority the idle-injection thread has should not
be worse than anything the idle-injection already does, no?

Specifically, without idle-injection involved the patch as proposed (+-
fixes required to make it compile and work) should be a no-op.

> ksoftirqd runs softirqs from hardirq context. Everything else is handled
> within a local_bh_disable()+enable loop. We already have the
> boost-ksoftirqd hammer, which is the per-CPU BKL called
> local_bh_disable(). In general everything should be moved out of it.
> For timers we have the ktimerd thread, which needs cleaning up.
Peter Zijlstra Dec. 20, 2022, 8:51 p.m. UTC | #6
On Mon, Dec 19, 2022 at 02:54:55PM -0800, srinivas pandruvada wrote:

> But after ksoftirqd_run_end(), which will re-enable local irqs, this may
> further cause more softirqs pending. Here RCU softirq entry continues

Right you are.. what about if we spell the idle.c thing like so instead?

Then we repeat the softirq thing every time we drop out of the idle loop
for a reschedule.

diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index f26ab2675f7d..6dce49813bcc 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -381,8 +381,13 @@ void play_idle_precise(u64 duration_ns, u64 latency_ns)
 	hrtimer_start(&it.timer, ns_to_ktime(duration_ns),
 		      HRTIMER_MODE_REL_PINNED_HARD);
 
-	while (!READ_ONCE(it.done))
+	while (!READ_ONCE(it.done)) {
+		rt_mutex_lock(&per_cpu(ksoftirq_lock, cpu));
+		__run_ksoftirqd(smp_processor_id());
+		rt_mutex_unlock(&per_cpu(ksoftirq_lock, cpu));
+
 		do_idle();
+	}
 
 	cpuidle_use_deepest_state(0);
 	current->flags &= ~PF_IDLE;
Peter Zijlstra Dec. 20, 2022, 9:18 p.m. UTC | #7
On Tue, Dec 20, 2022 at 09:51:09PM +0100, Peter Zijlstra wrote:
> On Mon, Dec 19, 2022 at 02:54:55PM -0800, srinivas pandruvada wrote:
> 
> > But after ksoftirqd_run_end(), which will re-enable local irqs, this may
> > further cause more softirqs pending. Here RCU softirq entry continues
> 
> Right you are.. what about if we spell the idle.c thing like so instead?
> 
> Then we repeat the softirq thing every time we drop out of the idle loop
> for a reschedule.

Uff, that obviously can't work because we already have preemption
disabled here, this needs a bit more work. I think it's possible to
re-arrange things a bit.

I'll try and have a look tomorrow, but the kids have their xmas play at
school so who knows what I'll get done.

> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index f26ab2675f7d..6dce49813bcc 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -381,8 +381,13 @@ void play_idle_precise(u64 duration_ns, u64 latency_ns)
>  	hrtimer_start(&it.timer, ns_to_ktime(duration_ns),
>  		      HRTIMER_MODE_REL_PINNED_HARD);
>  
> -	while (!READ_ONCE(it.done))
> +	while (!READ_ONCE(it.done)) {
> +		rt_mutex_lock(&per_cpu(ksoftirq_lock, cpu));
> +		__run_ksoftirqd(smp_processor_id());
> +		rt_mutex_unlock(&per_cpu(ksoftirq_lock, cpu));
> +
>  		do_idle();
> +	}
>  
>  	cpuidle_use_deepest_state(0);
>  	current->flags &= ~PF_IDLE;
Srinivas Pandruvada Jan. 10, 2023, 2:33 a.m. UTC | #8
On Tue, 2022-12-20 at 22:18 +0100, Peter Zijlstra wrote:
> On Tue, Dec 20, 2022 at 09:51:09PM +0100, Peter Zijlstra wrote:
> > On Mon, Dec 19, 2022 at 02:54:55PM -0800, srinivas pandruvada wrote:
> > 
> > > But after ksoftirqd_run_end(), which will re-enable local irqs, this may
> > > further cause more softirqs pending. Here RCU softirq entry continues
> > 
> > Right you are.. what about if we spell the idle.c thing like so instead?
> > 
> > Then we repeat the softirq thing every time we drop out of the idle loop
> > for a reschedule.
> 
> Uff, that obviously can't work because we already have preemption
> disabled here, this needs a bit more work. I think it's possible to
> re-arrange things a bit.
That didn't work.
Also, when __do_softirq() returns, softirqs can be pending again. I think
that if local_softirq_pending() is set, we can break out of the do_idle()
loop, as in the sketch below.
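
[Editor's sketch of this suggestion, not code from the thread: extend the
loop-break predicate that patch 0001 adds (idle_loop_should_break() in the
diff below) with a pending-softirq check.]

	static bool idle_loop_should_break(void)
	{
		/* Leave the forced-idle loop if a reschedule is due, if
		 * ksoftirqd is runnable, or if softirqs are pending on
		 * this CPU (local_softirq_pending(), <linux/interrupt.h>). */
		return need_resched() ||
		       task_is_running(__this_cpu_read(ksoftirqd)) ||
		       local_softirq_pending();
	}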

Thanks,
Srinivas

> 
> I'll try and have a look tomorrow, but the kids have their xmas play at
> school so who knows what I'll get done.
> 
> > diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> > index f26ab2675f7d..6dce49813bcc 100644
> > --- a/kernel/sched/idle.c
> > +++ b/kernel/sched/idle.c
> > @@ -381,8 +381,13 @@ void play_idle_precise(u64 duration_ns, u64 latency_ns)
> >         hrtimer_start(&it.timer, ns_to_ktime(duration_ns),
> >                       HRTIMER_MODE_REL_PINNED_HARD);
> >  
> > -       while (!READ_ONCE(it.done))
> > +       while (!READ_ONCE(it.done)) {
> > +               rt_mutex_lock(&per_cpu(ksoftirq_lock, cpu));
> > +               __run_ksoftirqd(smp_processor_id());
> > +               rt_mutex_unlock(&per_cpu(ksoftirq_lock, cpu));
> > +
> >                 do_idle();
> > +       }
> >  
> >         cpuidle_use_deepest_state(0);
> >         current->flags &= ~PF_IDLE;

Patch

diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index d17b0a5ce6ac..882c48441469 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -52,6 +52,12 @@  static int __init cpu_idle_nopoll_setup(char *__unused)
 __setup("hlt", cpu_idle_nopoll_setup);
 #endif
 
+/* FIXME: handle __cpuidle / instrumentation_begin()/end() */
+static bool idle_loop_should_break(void)
+{
+	return need_resched() || task_is_running(__this_cpu_read(ksoftirqd));
+}
+
 static noinline int __cpuidle cpu_idle_poll(void)
 {
 	trace_cpu_idle(0, smp_processor_id());
@@ -59,7 +65,7 @@  static noinline int __cpuidle cpu_idle_poll(void)
 	rcu_idle_enter();
 	local_irq_enable();
 
-	while (!tif_need_resched() &&
+	while (!idle_loop_should_break() &&
 	       (cpu_idle_force_poll || tick_check_broadcast_expired()))
 		cpu_relax();
 
@@ -177,7 +183,7 @@  static void cpuidle_idle_call(void)
 	 * Check if the idle task must be rescheduled. If it is the
 	 * case, exit the function after re-enabling the local irq.
 	 */
-	if (need_resched()) {
+	if (idle_loop_should_break()) {
 		local_irq_enable();
 		return;
 	}
@@ -279,7 +285,7 @@  static void do_idle(void)
 	__current_set_polling();
 	tick_nohz_idle_enter();
 
-	while (!need_resched()) {
+	while (!idle_loop_should_break()) {
 		rmb();
 
 		local_irq_disable();
@@ -373,25 +379,31 @@  void play_idle_precise(u64 duration_ns, u64 latency_ns)
 	WARN_ON_ONCE(!duration_ns);
 	WARN_ON_ONCE(current->mm);
 
-	rcu_sleep_check();
-	preempt_disable();
-	current->flags |= PF_IDLE;
-	cpuidle_use_deepest_state(latency_ns);
+	do {
+		rcu_sleep_check();
+		preempt_disable();
+		current->flags |= PF_IDLE;
+		cpuidle_use_deepest_state(latency_ns);
 
-	it.done = 0;
-	hrtimer_init_on_stack(&it.timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_HARD);
-	it.timer.function = idle_inject_timer_fn;
-	hrtimer_start(&it.timer, ns_to_ktime(duration_ns),
-		      HRTIMER_MODE_REL_PINNED_HARD);
+		it.done = 0;
+		hrtimer_init_on_stack(&it.timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_HARD);
+		it.timer.function = idle_inject_timer_fn;
+		hrtimer_start(&it.timer, ns_to_ktime(duration_ns),
+			      HRTIMER_MODE_REL_PINNED_HARD);
 
-	while (!READ_ONCE(it.done))
-		do_idle();
+		while (!READ_ONCE(it.done) && !task_is_running(__this_cpu_read(ksoftirqd)))
+			do_idle();
+
+		cpuidle_use_deepest_state(0);
+		current->flags &= ~PF_IDLE;
 
-	cpuidle_use_deepest_state(0);
-	current->flags &= ~PF_IDLE;
+		preempt_fold_need_resched();
+		preempt_enable();
 
-	preempt_fold_need_resched();
-	preempt_enable();
+		/* Give ksoftirqd 1 jiffy to get a chance to start its job */
+		if (!READ_ONCE(it.done) && task_is_running(__this_cpu_read(ksoftirqd)))
+			schedule_timeout(1);
+	} while (!READ_ONCE(it.done));
 }
 EXPORT_SYMBOL_GPL(play_idle_precise);