diff mbox

Future of NOHZ full/isolation development (was Re: [NOHZ] Remove scheduler_tick_max_deferment)

Message ID CAKohpok3U_8zQurs219fOmwERFrnZ0sOE-YUc7B+fKYfFLmEpg@mail.gmail.com
State New
Headers show

Commit Message

Viresh Kumar Nov. 12, 2014, 6:11 a.m. UTC
On 11 November 2014 22:45, Frederic Weisbecker <fweisbec@gmail.com> wrote:

> Here is a summarized list:
>
> * Unbound workqueues affinity (to housekeeper)
> * Unbound timers affinity (to housekeeper)
> * 1 Hz residual scheduler tick offlining to housekeeper
> * Fix some scheduler accounting that don't even work with 1 Hz: cpu load
>   accounting, rt_scale, load balancing, etc...
> * Lighten the syscall path and get rid of cputime accounting + RCU hooks
>   for people who want isolation + fast syscalls and faults.
> * Work on non-affinable workqueues
> * Work on non-affinable timers
> * ...

+ spurious interrupts with NOHZ_FULL on all architectures which break isolation
but doesn't get caught with traces. Can be observed with this:

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Comments

Frederic Weisbecker Nov. 12, 2014, 1:54 p.m. UTC | #1
On Wed, Nov 12, 2014 at 11:41:09AM +0530, Viresh Kumar wrote:
> On 11 November 2014 22:45, Frederic Weisbecker <fweisbec@gmail.com> wrote:
> 
> > Here is a summarized list:
> >
> > * Unbound workqueues affinity (to housekeeper)
> > * Unbound timers affinity (to housekeeper)
> > * 1 Hz residual scheduler tick offlining to housekeeper
> > * Fix some scheduler accounting that don't even work with 1 Hz: cpu load
> >   accounting, rt_scale, load balancing, etc...
> > * Lighten the syscall path and get rid of cputime accounting + RCU hooks
> >   for people who want isolation + fast syscalls and faults.
> > * Work on non-affinable workqueues
> > * Work on non-affinable timers
> > * ...
> 
> + spurious interrupts with NOHZ_FULL on all architectures which break isolation
> but doesn't get caught with traces. Can be observed with this:
> 
> diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
> index 481fa54..91d490d 100644
> --- a/kernel/time/hrtimer.c
> +++ b/kernel/time/hrtimer.c
> @@ -1244,7 +1244,8 @@ void hrtimer_interrupt(struct clock_event_device *dev)
>  {
>         struct hrtimer_cpu_base *cpu_base = this_cpu_ptr(&hrtimer_bases);
>         ktime_t expires_next, now, entry_time, delta;
> -       int i, retries = 0;
> +       int i, retries = 0, count = 0;
> +       static int total_spurious;
> 
>         BUG_ON(!cpu_base->hres_active);
>         cpu_base->nr_events++;
> @@ -1304,10 +1305,14 @@ void hrtimer_interrupt(struct clock_event_device *dev)
>                                 break;
>                         }
> 
> +                       count++;
>                         __run_hrtimer(timer, &basenow);
>                 }
>         }
> 
> +       if (!count)
> +               pr_err("____%s: Totalspurious: %d\n", __func__,
> ++total_spurious);
> +

I'd rather leave that to tracepoints. Like trace_hrtimer_spurious().

Or better yet: have trace_hrtimer_interrupt() which we can compare against
trace_hrtimer_expire_entry/exit() to check if any hrtimer callback have run
in the interrupt. This way we avoid workarounds like the above count.

>         /*
>          * Store the new expiry value so the migration code can verify
>          * against it.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Viresh Kumar Nov. 12, 2014, 2:56 p.m. UTC | #2
On 12 November 2014 19:24, Frederic Weisbecker <fweisbec@gmail.com> wrote:
> I'd rather leave that to tracepoints. Like trace_hrtimer_spurious().

Yeah, it was just to prove things right on the console without getting
into traces.

> Or better yet: have trace_hrtimer_interrupt() which we can compare against
> trace_hrtimer_expire_entry/exit() to check if any hrtimer callback have run
> in the interrupt. This way we avoid workarounds like the above count.

Yeah, I also believe we better add this debug information to mainline kernel.
I will try to get a patch for that soon.

Would it be recommended to add both trace points?
i.e. trace_hrtimer_interrupt() and trace_hrtimer_spurious() ?

--
viresh
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Peter Zijlstra Nov. 12, 2014, 3:06 p.m. UTC | #3
On Wed, Nov 12, 2014 at 08:26:05PM +0530, Viresh Kumar wrote:
> On 12 November 2014 19:24, Frederic Weisbecker <fweisbec@gmail.com> wrote:
> > I'd rather leave that to tracepoints. Like trace_hrtimer_spurious().
> 
> Yeah, it was just to prove things right on the console without getting
> into traces.
> 
> > Or better yet: have trace_hrtimer_interrupt() which we can compare against
> > trace_hrtimer_expire_entry/exit() to check if any hrtimer callback have run
> > in the interrupt. This way we avoid workarounds like the above count.
> 
> Yeah, I also believe we better add this debug information to mainline kernel.
> I will try to get a patch for that soon.
> 
> Would it be recommended to add both trace points?
> i.e. trace_hrtimer_interrupt() and trace_hrtimer_spurious() 

I don't think you need to add anything. We already have tracepoints for
every single interrupt (and therefore also for the hrtimer one) and we
have expiry tracepoints.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Viresh Kumar Nov. 12, 2014, 3:16 p.m. UTC | #4
On 12 November 2014 20:36, Peter Zijlstra <peterz@infradead.org> wrote:
> I don't think you need to add anything. We already have tracepoints for
> every single interrupt (and therefore also for the hrtimer one) and we
> have expiry tracepoints.

I will crosscheck this again but as far as I remember these spurious interrupts
aren't caught with tracers currently. Not even the irq_enter/exit ones.

I saw that piece of code long back (and obviously don't understand it
completely). But the problem was that because these are spurious
clockevent interrupts, we don't have anything to do on that interrupt.
No tick/sched/softirq processing. And the irq-handling code returns before
even calling trace_irq_enter().

And that's why I had to add those prints initially.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Viresh Kumar Nov. 13, 2014, 7:22 a.m. UTC | #5
On 12 November 2014 20:46, Viresh Kumar <viresh.kumar@linaro.org> wrote:
> On 12 November 2014 20:36, Peter Zijlstra <peterz@infradead.org> wrote:
>> I don't think you need to add anything. We already have tracepoints for
>> every single interrupt (and therefore also for the hrtimer one) and we
>> have expiry tracepoints.
>
> I will crosscheck this again but as far as I remember these spurious interrupts
> aren't caught with tracers currently. Not even the irq_enter/exit ones.

So yes, we do get irq_handler_entry/exit traces for the clockevent device.

The problem here is that people miss these as there are no warnings issued
here. And so nobody tried to resolve these spurious interrupts.

This is how it looks on a ARM machine (With trace_hrtimer_spurious), though
it is relevant for all architectures which emulate ONESHOT mode over
PERIODIC mode:

          <idle>-0     [001] d.h2   719.642085: irq_handler_entry:
irq=30 name=arch_timer
          <idle>-0     [001] d.h3   719.646506: hrtimer_spurious: count:105
          <idle>-0     [001] d.h2   719.646511: irq_handler_exit:
irq=30 ret=handled
          <idle>-0     [001] d..3   719.646522: hrtimer_start:
hrtimer=ee7bb740 function=tick_sched_timer expires=719640000000
softexpires=719640000000
          <idle>-0     [001] d..3   719.646545: hrtimer_cancel: hrtimer=ee7bb740
          <idle>-0     [001] d.h2   719.647086: irq_handler_entry:
irq=30 name=arch_timer
          <idle>-0     [001] d.h3   719.651540: hrtimer_spurious: count:106
          <idle>-0     [001] d.h2   719.651545: irq_handler_exit:
irq=30 ret=handled
          <idle>-0     [001] d..3   719.651554: hrtimer_start:
hrtimer=ee7bb740 function=tick_sched_timer expires=719645000000
softexpires=719645000000
          <idle>-0     [001] d..3   719.651576: hrtimer_cancel: hrtimer=ee7bb740
          <idle>-0     [001] d.h2   719.652084: irq_handler_entry:
irq=30 name=arch_timer
          <idle>-0     [001] d.h3   719.656571: hrtimer_spurious: count:107
          <idle>-0     [001] d.h2   719.656575: irq_handler_exit:
irq=30 ret=handled
          <idle>-0     [001] d..3   719.656583: hrtimer_start:
hrtimer=ee7bb740 function=tick_sched_timer expires=719650000000
softexpires=719650000000
          <idle>-0     [001] d..3   719.656601: hrtimer_cancel: hrtimer=ee7bb740
          <idle>-0     [001] d.h2   719.657084: irq_handler_entry:
irq=30 name=arch_timer
          <idle>-0     [001] d.h3   719.661611: hrtimer_spurious: count:108
          <idle>-0     [001] d.h2   719.661616: irq_handler_exit:
irq=30 ret=handled
          <idle>-0     [001] d..3   719.661627: hrtimer_start:
hrtimer=ee7bb740 function=tick_sched_timer expires=719655000000
softexpires=719655000000
          <idle>-0     [001] d..3   719.661652: hrtimer_cancel: hrtimer=ee7bb740
          <idle>-0     [001] d.h2   719.662086: irq_handler_entry:
irq=30 name=arch_timer
          <idle>-0     [001] d.h3   719.666643: hrtimer_spurious: count:109
          <idle>-0     [001] d.h2   719.666648: irq_handler_exit:
irq=30 ret=handled
          <idle>-0     [001] d..3   719.666657: hrtimer_start:
hrtimer=ee7bb740 function=tick_sched_timer expires=719660000000
softexpires=719660000000
          <idle>-0     [001] d..3   719.666680: hrtimer_cancel: hrtimer=ee7bb740
          <idle>-0     [001] d.h2   719.667084: irq_handler_entry:
irq=30 name=arch_timer
          <idle>-0     [001] d.h3   719.671678: hrtimer_spurious: count:110
          <idle>-0     [001] d.h2   719.671682: irq_handler_exit:
irq=30 ret=handled
          <idle>-0     [001] d..3   719.671693: hrtimer_start:
hrtimer=ee7bb740 function=tick_sched_timer expires=719665000000
softexpires=719665000000
          <idle>-0     [001] d..3   719.671714: hrtimer_cancel: hrtimer=ee7bb740
          <idle>-0     [001] d.h2   719.672085: irq_handler_entry:
irq=30 name=arch_timer
          <idle>-0     [001] d.h3   719.676712: hrtimer_spurious: count:111
          <idle>-0     [001] d.h2   719.676717: irq_handler_exit:
irq=30 ret=handled
          <idle>-0     [001] d..3   719.676726: hrtimer_start:
hrtimer=ee7bb740 function=tick_sched_timer expires=719670000000
softexpires=719670000000
          <idle>-0     [001] d..3   719.676749: hrtimer_cancel: hrtimer=ee7bb740
          <idle>-0     [001] d.h2   719.677087: irq_handler_entry:
irq=30 name=arch_timer
          <idle>-0     [001] d.h3   719.681745: hrtimer_spurious: count:112
          <idle>-0     [001] d.h2   719.681750: irq_handler_exit:
irq=30 ret=handled
          <idle>-0     [001] d..3   719.681759: hrtimer_start:
hrtimer=ee7bb740 function=tick_sched_timer expires=719675000000
softexpires=719675000000
          <idle>-0     [001] d..3   719.681780: hrtimer_cancel: hrtimer=ee7bb740
          <idle>-0     [001] d.h2   719.682084: irq_handler_entry:
irq=30 name=arch_timer
          <idle>-0     [001] d.h3   719.686783: hrtimer_spurious: count:113
          <idle>-0     [001] d.h2   719.686787: irq_handler_exit:
irq=30 ret=handled
          <idle>-0     [001] d..3   719.686797: hrtimer_start:
hrtimer=ee7bb740 function=tick_sched_timer expires=719680000000
softexpires=719680000000
          <idle>-0     [001] d..3   719.686819: hrtimer_cancel: hrtimer=ee7bb740


Normally this happens on a non-spurious interrupt:

 is-cpu-isolated-1963  [000] d.h1   719.642080: irq_handler_entry:
irq=30 name=arch_timer
 is-cpu-isolated-1963  [000] d.h2   719.642086: hrtimer_cancel: hrtimer=ee7b3740
 is-cpu-isolated-1963  [000] d.h1   719.642089: hrtimer_expire_entry:
hrtimer=ee7b3740 function=tick_sched_timer now=719635008926
 is-cpu-isolated-1963  [000] d.h1   719.642095: softirq_raise: vec=1
[action=TIMER]
 is-cpu-isolated-1963  [000] d.h1   719.642099: softirq_raise: vec=9
[action=RCU]
 is-cpu-isolated-1963  [000] d.h2   719.642104: sched_stat_runtime:
comm=trace-isolation pid=1963 runtime=348042 [ns] vruntime=7804506025
[ns]
 is-cpu-isolated-1963  [000] d.h1   719.642111: hrtimer_expire_exit:
hrtimer=ee7b3740
 is-cpu-isolated-1963  [000] d.h2   719.642115: hrtimer_start:
hrtimer=ee7b3740 function=tick_sched_timer expires=719640000000
softexpires=719640000000
 is-cpu-isolated-1963  [000] d.h1   719.642121: irq_handler_exit:
irq=30 ret=handled

.. followed by softirq processing.

Another trace point with trace_hrtimer_spurious() might end up being useful :)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
diff mbox

Patch

diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 481fa54..91d490d 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -1244,7 +1244,8 @@  void hrtimer_interrupt(struct clock_event_device *dev)
 {
        struct hrtimer_cpu_base *cpu_base = this_cpu_ptr(&hrtimer_bases);
        ktime_t expires_next, now, entry_time, delta;
-       int i, retries = 0;
+       int i, retries = 0, count = 0;
+       static int total_spurious;

        BUG_ON(!cpu_base->hres_active);
        cpu_base->nr_events++;
@@ -1304,10 +1305,14 @@  void hrtimer_interrupt(struct clock_event_device *dev)
                                break;
                        }

+                       count++;
                        __run_hrtimer(timer, &basenow);
                }
        }

+       if (!count)
+               pr_err("____%s: Totalspurious: %d\n", __func__,
++total_spurious);
+
        /*
         * Store the new expiry value so the migration code can verify
         * against it.