[RFC,2/3] cpuidle,teo: Improve NOHZ management

Message ID 20230728145808.902892871@infradead.org
State New
Series cpuidle,teo: Improve TEO tick decisions

Commit Message

Peter Zijlstra July 28, 2023, 2:55 p.m. UTC
With cpuidle having added a TICK bucket, TEO now accounts all idles of
tick length or longer there. This means an informed decision can be made
about stopping the tick: if the sum of 'hits + intercepts' for all states
below the TICK bucket exceeds 50% of the total number of recorded events,
it is most likely that the tick will not be reached this time around
either, so stopping the tick doesn't make sense.

If the tick is not going to be stopped, don't bother calling
tick_nohz_get_sleep_length() and assume the idle duration is no longer
than a tick (this could be improved by still looking at the currently
pending timers).

Since this extra state is available, remove the state_count-based early
decisions.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 drivers/cpuidle/governors/teo.c |   97 ++++++++++++++--------------------------
 1 file changed, 34 insertions(+), 63 deletions(-)
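
For reference, the gist of the new tick handling as it appears in the patch
below (SHORT_TICK_NSEC is not defined in this patch; presumably it comes from
the earlier patch in the series that adds the TICK bucket):

	/* Should we stop the tick? */
	for (i = 1; i < drv->state_count; i++) {
		struct teo_bin *prev_bin = &cpu_data->state_bins[i-1];
		struct cpuidle_state *s = &drv->states[i];

		tick_sum += prev_bin->intercepts;
		tick_sum += prev_bin->hits;

		if (s->target_residency_ns >= SHORT_TICK_NSEC)
			break;
	}

	if (2*tick_sum > cpu_data->total)
		*stop_tick = false;

	/* If we do stop the tick, ask for the next timer. */
	if (*stop_tick)
		duration_ns = tick_nohz_get_sleep_length(&delta_tick);
	cpu_data->sleep_length_ns = duration_ns;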

Comments

Rafael J. Wysocki July 31, 2023, 10:17 a.m. UTC | #1
On Sat, Jul 29, 2023 at 12:01 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Jul 28, 2023 at 06:56:24PM +0200, Rafael J. Wysocki wrote:
> > On Fri, Jul 28, 2023 at 5:01 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> > > @@ -276,11 +276,11 @@ static void teo_update(struct cpuidle_dr
> > >
> > >                 cpu_data->total += bin->hits + bin->intercepts;
> > >
> > > -               if (target_residency_ns <= cpu_data->sleep_length_ns) {
> > > +               if (target_residency_ns <= cpu_data->sleep_length_ns)
> > >                         idx_timer = i;
> > > -                       if (target_residency_ns <= measured_ns)
> > > -                               idx_duration = i;
> > > -               }
> > > +
> > > +               if (target_residency_ns <= measured_ns)
> > > +                       idx_duration = i;
> >
> > I'm not quite sure what happens here.
>
> Oh, I couldn't convince myself that measured_ns <= sleep_length_ns. If
> measured was longer we still want the higher index.
>
> But yeah, I forgot I had that hunk in.
>
> > >         }
> > >
> > >         i = cpu_data->next_recent_idx++;
> > > @@ -362,11 +362,12 @@ static int teo_select(struct cpuidle_dri
> > >         unsigned int recent_sum = 0;
> > >         unsigned int idx_hit_sum = 0;
> > >         unsigned int hit_sum = 0;
> > > +       unsigned int tick_sum = 0;
> > >         int constraint_idx = 0;
> > >         int idx0 = 0, idx = -1;
> > >         bool alt_intercepts, alt_recent;
> > >         ktime_t delta_tick;
> > > -       s64 duration_ns;
> > > +       s64 duration_ns = TICK_NSEC;
> > >         int i;
> > >
> > >         if (dev->last_state_idx >= 0) {
> > > @@ -376,36 +377,26 @@ static int teo_select(struct cpuidle_dri
> > >
> > >         cpu_data->time_span_ns = local_clock();
> > >
> > > -       duration_ns = tick_nohz_get_sleep_length(&delta_tick);
> > > -       cpu_data->sleep_length_ns = duration_ns;
> > > +       /* Should we stop the tick? */
> >
> > Who's we?  I'd prefer something like "Should the tick be stopped?"
> > here (analogously below).
>
> Sure.
>
> > > +       for (i = 1; i < drv->state_count; i++) {
> > > +               struct teo_bin *prev_bin = &cpu_data->state_bins[i-1];
> > > +               struct cpuidle_state *s = &drv->states[i];
> > >
> > > -       /* Check if there is any choice in the first place. */
> > > -       if (drv->state_count < 2) {
> > > -               idx = 0;
> > > -               goto end;
> > > -       }
> > > -       if (!dev->states_usage[0].disable) {
> > > -               idx = 0;
> > > -               if (drv->states[1].target_residency_ns > duration_ns)
> > > -                       goto end;
> > > -       }
> > > +               tick_sum += prev_bin->intercepts;
> > > +               tick_sum += prev_bin->hits;
> > >
> > > -       cpu_data->utilized = teo_cpu_is_utilized(dev->cpu, cpu_data);
> > > -       /*
> > > -        * If the CPU is being utilized over the threshold and there are only 2
> > > -        * states to choose from, the metrics need not be considered, so choose
> > > -        * the shallowest non-polling state and exit.
> > > -        */
> > > -       if (drv->state_count < 3 && cpu_data->utilized) {
> > > -               for (i = 0; i < drv->state_count; ++i) {
> > > -                       if (!dev->states_usage[i].disable &&
> > > -                           !(drv->states[i].flags & CPUIDLE_FLAG_POLLING)) {
> > > -                               idx = i;
> > > -                               goto end;
> > > -                       }
> > > -               }
> > > +               if (s->target_residency_ns >= SHORT_TICK_NSEC)
> > > +                       break;
> > >         }
> > >
> > > +       if (2*tick_sum > cpu_data->total)
> > > +               *stop_tick = false;
> >
> > This means "if over 50% of all the events fall into the buckets below
> > the tick period length, don't stop the tick".  Fair enough, but this
> > covers long-term only and what about the most recent events?  I think
> > that they need to be taken into account here too.
>
> From looking at a few traces, this 'long' term is around 8-10 samples,
> which I figured was quick enough.
>
> Note that DECAY_SHIFT is 3, while the pulse is 10 bits, so 3-4 cycles
> will drain most of the history when there's a distinct phase shift.
>
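
(For reference, a toy user-space simulation of that decay across a phase
shift. It assumes the scheme from teo.c: every bin decays by one eighth
(x -= x >> DECAY_SHIFT) on each update and the bucket the event falls into
gets a PULSE of 1024 added. Illustration only, not kernel code.)

#include <stdio.h>

#define DECAY_SHIFT	3
#define PULSE		1024

int main(void)
{
	/* Start in steady state with all history in the "tick or longer" bucket. */
	unsigned int below_tick = 0;
	unsigned int tick_plus = PULSE << DECAY_SHIFT;	/* steady-state sum: 8192 */
	int n;

	for (n = 1; n <= 10; n++) {
		unsigned int total;

		/* Decay both buckets, then account a short (below tick) idle. */
		below_tick -= below_tick >> DECAY_SHIFT;
		tick_plus -= tick_plus >> DECAY_SHIFT;
		below_tick += PULSE;

		total = below_tick + tick_plus;
		printf("event %2d: below_tick=%4u tick_plus=%4u -> %s the tick\n",
		       n, below_tick, tick_plus,
		       2 * below_tick > total ? "keep" : "stop");
	}
	return 0;
}
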
> That said, I did look at the 'recent' metrics and those seem geared
> towards the intercepts, while I think hits+intercepts makes more sense
> here. Given it adjusted quickly enough, I didn't put more time into it.
>
> > > +
> > > +       /* If we do stop the tick, ask for the next timer. */
> > > +       if (*stop_tick)
> > > +               duration_ns = tick_nohz_get_sleep_length(&delta_tick);
> > > +       cpu_data->sleep_length_ns = duration_ns;
> >
> > If the decision is made to retain the tick and the time to the closest
> > tick event is very small, it would be better to refine the state
> > selection so as to avoid returning a state with the target residency
> > above that time (which essentially is wasting energy).  That's what
> > delta_tick above is for, but now tick_nohz_get_sleep_length() is never
> > called when the tick is not going to be stopped.
>
> Right, so I did ponder using something like
> ktime_sub(tick_nohz_get_next_hrtimer(), now) instead of TICK_NSEC to get
> a more accurate measure, but I didn't do so yet.
>
> > Besides, if I'm not mistaken, setting sleep_length_ns to TICK_NSEC
> > every time the tick is not stopped will not really work on systems
> > where there are real idle states with target residencies beyond
> > TICK_NSEC.
>
> It does work; you really don't want to select such a state if the tick
> is still active -- you'll never get your residency. Such a state should
> really only be used when the tick is off.
>
> > > +
> > >         /*
> > >          * Find the deepest idle state whose target residency does not exceed
> > >          * the current sleep length and the deepest idle state not deeper than
> > > @@ -446,13 +437,13 @@ static int teo_select(struct cpuidle_dri
> > >                 idx_recent_sum = recent_sum;
> > >         }
> > >
> > > -       /* Avoid unnecessary overhead. */
> > > -       if (idx < 0) {
> > > -               idx = 0; /* No states enabled, must use 0. */
> > > -               goto end;
> > > -       } else if (idx == idx0) {
> > > -               goto end;
> > > -       }
> > > +       /* No states enabled, must use 0 */
> > > +       if (idx < 0)
> > > +               return 0;
> > > +
> > > +       /* No point looking for something shallower than the first enabled state */
> > > +       if (idx == idx0)
> > > +               return idx;
> > >
> > >         /*
> > >          * If the sum of the intercepts metric for all of the idle states
> > > @@ -541,29 +532,9 @@ static int teo_select(struct cpuidle_dri
> > >          * If the CPU is being utilized over the threshold, choose a shallower
> > >          * non-polling state to improve latency
> > >          */
> > > -       if (cpu_data->utilized)
> > > +       if (teo_cpu_is_utilized(dev->cpu, cpu_data))
> > >                 idx = teo_find_shallower_state(drv, dev, idx, duration_ns, true);
> > >
> > > -end:
> > > -       /*
> > > -        * Don't stop the tick if the selected state is a polling one or if the
> > > -        * expected idle duration is shorter than the tick period length.
> > > -        */
> > > -       if (((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) ||
> > > -           duration_ns < TICK_NSEC) && !tick_nohz_tick_stopped()) {
> > > -               *stop_tick = false;
> > > -
> > > -               /*
> > > -                * The tick is not going to be stopped, so if the target
> > > -                * residency of the state to be returned is not within the time
> > > -                * till the closest timer including the tick, try to correct
> > > -                * that.
> > > -                */
> > > -               if (idx > idx0 &&
> > > -                   drv->states[idx].target_residency_ns > delta_tick)
> > > -                       idx = teo_find_shallower_state(drv, dev, idx, delta_tick, false);
> > > -       }
> > > -
> > >         return idx;
> > >  }
> >
> > Overall, I think that the problem with calling
> > tick_nohz_get_sleep_length() is limited to the cases when the CPU is
> > almost fully loaded, so the overall amount of idle time on it is tiny.
> > I would rather use a special path for those cases and I would register
> > all of the wakeups as "intercepts" in those cases.
>
> I'm not sure what you're proposing.

Something really simple like:

1. Check sched_cpu_util() (which is done by teo anyway).
2. If that is around 90% of the maximum CPU capacity, select the first
non-polling idle state and be done (don't stop the tick, as per my other
reply earlier today).

In 2, tick_nohz_get_sleep_length() need not be called, because the
idle state selection doesn't depend on it, and sleep_length_ns can be
set to KTIME_MAX (or indeed anything greater than TICK_NSEC), because
if the CPU is woken up by the tick, teo_reflect() will take care of
it, and otherwise the wakeup should be recorded as an "intercept".
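
A minimal sketch of that fast path (the teo_cpu_mostly_busy() helper and its
~90% threshold are made up for illustration; this is not part of the posted
patch):

static bool teo_cpu_mostly_busy(int cpu)	/* hypothetical helper */
{
	unsigned long util = sched_cpu_util(cpu);
	unsigned long cap = arch_scale_cpu_capacity(cpu);

	/* Treat ~87.5% of capacity as "mostly busy"; close enough to 90%. */
	return util >= cap - (cap >> 3);
}

	/* Early in teo_select(), before any of the metrics are consulted: */
	if (teo_cpu_mostly_busy(dev->cpu)) {
		/* Pick the shallowest enabled non-polling state, if any. */
		for (i = 0; i < drv->state_count; i++) {
			if (!dev->states_usage[i].disable &&
			    !(drv->states[i].flags & CPUIDLE_FLAG_POLLING)) {
				*stop_tick = false;
				/* Anything greater than TICK_NSEC will do. */
				cpu_data->sleep_length_ns = KTIME_MAX;
				return i;
			}
		}
		/* No such state: fall through to the normal selection path. */
	}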

> If we track the tick+ bucket -- as we must in order to say anything useful
> about it -- then we can decide the tick state before (as I do here) calling
> sleep_length().
>
> The timer-pull rework from Anna-Maria unfortunately makes the
> tick_nohz_get_sleep_length() thing excessively expensive and it really
> doesn't make sense to call it when we retain the tick.
>
> It's all a bit of a chicken-and-egg situation: cpuidle wants to know when
> the next timer is, but working out when that is in turn requires knowing
> whether the tick stays. We need to break that somehow -- I propose by not
> calling it when we know we'll keep the tick.

By selecting a state whose target residency will not be met, we lose
on both energy and performance, so doing this really should be
avoided, unless the state is really shallow in which case there may be
no time for making this consideration.
Peter Zijlstra July 31, 2023, 12:02 p.m. UTC | #2
On Mon, Jul 31, 2023 at 12:17:27PM +0200, Rafael J. Wysocki wrote:

> Something really simple like:
> 
> 1. Check sched_cpu_util() (which is done by teo anyway).
> 2. If that is around 90% of the maximum CPU capacity, select the first
> non-polling idle state and be done (don't stop the tick as my other
> replay earlier today).

So I really don't like using cpu_util() here; yes, 90% is a high number,
but it doesn't say *anything* about the idle duration. Remember, this is
a 32ms window, so 90% of that is 28.8ms.

(not entirely accurate, since it's an exponential average, but that
doesn't change the overall argument, only some of the particulars)

That is, 90% util, at best, says there is no idle longer than 3.2 ms.
But that is still vastly longer than pretty much all residencies. Heck,
that is still 3 ticks' worth at HZ=1000. So 90% util should not
preclude disabling the tick (at HZ=1000).

Now, typically this won't be the case, and at 90% you'll have lots of
small idles adding up to 3.2ms total idle. But the point is, you can't
tell the difference. And as such util is a horrible measure to use for
cpuidle.
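
(To put numbers on that budget argument, a stand-alone calculation assuming
a 32 ms window and a 1 ms tick; the utilization figures are illustrative:)

#include <stdio.h>

int main(void)
{
	const double window_ms = 32.0;	/* PELT-ish averaging window */
	const double tick_ms = 1.0;	/* HZ=1000 */
	int pct;

	for (pct = 50; pct <= 95; pct += 5) {
		double idle_ms = (100 - pct) * window_ms / 100.0;

		printf("util %2d%% -> at most %4.1f ms idle (~%2.0f ticks at HZ=1000)\n",
		       pct, idle_ms, idle_ms / tick_ms);
	}
	return 0;
}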

> > If we track the tick+ bucket -- as we must in order to say anything useful
> > about it -- then we can decide the tick state before (as I do here) calling
> > sleep_length().
> >
> > The timer-pull rework from Anna-Maria unfortunately makes the
> > tick_nohz_get_sleep_length() thing excessively expensive and it really
> > doesn't make sense to call it when we retain the tick.
> >
> > It's all a bit of a chicken-and-egg situation: cpuidle wants to know when
> > the next timer is, but working out when that is in turn requires knowing
> > whether the tick stays. We need to break that somehow -- I propose by not
> > calling it when we know we'll keep the tick.
> 
> By selecting a state whose target residency will not be met, we lose
> on both energy and performance, so doing this really should be
> avoided, unless the state is really shallow in which case there may be
> no time for making this consideration.

I'm not sure how that relates to what I propose above. By adding the
tick+ bucket we have more historical information relative to the tick
boundary; how does that make us select states whose residency we won't
meet?

Patch

--- a/drivers/cpuidle/governors/teo.c
+++ b/drivers/cpuidle/governors/teo.c
@@ -139,6 +139,7 @@ 
 #include <linux/sched/clock.h>
 #include <linux/sched/topology.h>
 #include <linux/tick.h>
+#include "../cpuidle.h"
 
 /*
  * The number of bits to shift the CPU's capacity by in order to determine
@@ -197,7 +198,6 @@  struct teo_cpu {
 	int next_recent_idx;
 	int recent_idx[NR_RECENT];
 	unsigned long util_threshold;
-	bool utilized;
 };
 
 static DEFINE_PER_CPU(struct teo_cpu, teo_cpus);
@@ -276,11 +276,11 @@  static void teo_update(struct cpuidle_dr
 
 		cpu_data->total += bin->hits + bin->intercepts;
 
-		if (target_residency_ns <= cpu_data->sleep_length_ns) {
+		if (target_residency_ns <= cpu_data->sleep_length_ns)
 			idx_timer = i;
-			if (target_residency_ns <= measured_ns)
-				idx_duration = i;
-		}
+
+		if (target_residency_ns <= measured_ns)
+			idx_duration = i;
 	}
 
 	i = cpu_data->next_recent_idx++;
@@ -362,11 +362,12 @@  static int teo_select(struct cpuidle_dri
 	unsigned int recent_sum = 0;
 	unsigned int idx_hit_sum = 0;
 	unsigned int hit_sum = 0;
+	unsigned int tick_sum = 0;
 	int constraint_idx = 0;
 	int idx0 = 0, idx = -1;
 	bool alt_intercepts, alt_recent;
 	ktime_t delta_tick;
-	s64 duration_ns;
+	s64 duration_ns = TICK_NSEC;
 	int i;
 
 	if (dev->last_state_idx >= 0) {
@@ -376,36 +377,26 @@  static int teo_select(struct cpuidle_dri
 
 	cpu_data->time_span_ns = local_clock();
 
-	duration_ns = tick_nohz_get_sleep_length(&delta_tick);
-	cpu_data->sleep_length_ns = duration_ns;
+	/* Should we stop the tick? */
+	for (i = 1; i < drv->state_count; i++) {
+		struct teo_bin *prev_bin = &cpu_data->state_bins[i-1];
+		struct cpuidle_state *s = &drv->states[i];
 
-	/* Check if there is any choice in the first place. */
-	if (drv->state_count < 2) {
-		idx = 0;
-		goto end;
-	}
-	if (!dev->states_usage[0].disable) {
-		idx = 0;
-		if (drv->states[1].target_residency_ns > duration_ns)
-			goto end;
-	}
+		tick_sum += prev_bin->intercepts;
+		tick_sum += prev_bin->hits;
 
-	cpu_data->utilized = teo_cpu_is_utilized(dev->cpu, cpu_data);
-	/*
-	 * If the CPU is being utilized over the threshold and there are only 2
-	 * states to choose from, the metrics need not be considered, so choose
-	 * the shallowest non-polling state and exit.
-	 */
-	if (drv->state_count < 3 && cpu_data->utilized) {
-		for (i = 0; i < drv->state_count; ++i) {
-			if (!dev->states_usage[i].disable &&
-			    !(drv->states[i].flags & CPUIDLE_FLAG_POLLING)) {
-				idx = i;
-				goto end;
-			}
-		}
+		if (s->target_residency_ns >= SHORT_TICK_NSEC)
+			break;
 	}
 
+	if (2*tick_sum > cpu_data->total)
+		*stop_tick = false;
+
+	/* If we do stop the tick, ask for the next timer. */
+	if (*stop_tick)
+		duration_ns = tick_nohz_get_sleep_length(&delta_tick);
+	cpu_data->sleep_length_ns = duration_ns;
+
 	/*
 	 * Find the deepest idle state whose target residency does not exceed
 	 * the current sleep length and the deepest idle state not deeper than
@@ -446,13 +437,13 @@  static int teo_select(struct cpuidle_dri
 		idx_recent_sum = recent_sum;
 	}
 
-	/* Avoid unnecessary overhead. */
-	if (idx < 0) {
-		idx = 0; /* No states enabled, must use 0. */
-		goto end;
-	} else if (idx == idx0) {
-		goto end;
-	}
+	/* No states enabled, must use 0 */
+	if (idx < 0)
+		return 0;
+
+	/* No point looking for something shallower than the first enabled state */
+	if (idx == idx0)
+		return idx;
 
 	/*
 	 * If the sum of the intercepts metric for all of the idle states
@@ -541,29 +532,9 @@  static int teo_select(struct cpuidle_dri
 	 * If the CPU is being utilized over the threshold, choose a shallower
 	 * non-polling state to improve latency
 	 */
-	if (cpu_data->utilized)
+	if (teo_cpu_is_utilized(dev->cpu, cpu_data))
 		idx = teo_find_shallower_state(drv, dev, idx, duration_ns, true);
 
-end:
-	/*
-	 * Don't stop the tick if the selected state is a polling one or if the
-	 * expected idle duration is shorter than the tick period length.
-	 */
-	if (((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) ||
-	    duration_ns < TICK_NSEC) && !tick_nohz_tick_stopped()) {
-		*stop_tick = false;
-
-		/*
-		 * The tick is not going to be stopped, so if the target
-		 * residency of the state to be returned is not within the time
-		 * till the closest timer including the tick, try to correct
-		 * that.
-		 */
-		if (idx > idx0 &&
-		    drv->states[idx].target_residency_ns > delta_tick)
-			idx = teo_find_shallower_state(drv, dev, idx, delta_tick, false);
-	}
-
 	return idx;
 }