[RFC,1/3] cpuidle: Inject tick boundary state

Message ID 20230728145808.835742568@infradead.org
State Superseded
Series cpuidle,teo: Improve TEO tick decisions

Commit Message

Peter Zijlstra July 28, 2023, 2:55 p.m. UTC
In order to facilitate governors that track history in idle-state
buckets (TEO) in making a useful decision about NOHZ, make sure we have
a bucket that counts tick-and-longer sleeps.

In order to be inclusive of the tick itself -- after all, if we do not
disable NOHZ we'll sleep for a full tick -- the actual boundary should
be just short of a full tick.

IOW, when registering the idle-states, add one that is always
disabled, just to have a bucket.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 drivers/cpuidle/cpuidle.h |    2 +
 drivers/cpuidle/driver.c  |   48 +++++++++++++++++++++++++++++++++++++++++++++-
 include/linux/cpuidle.h   |    2 -
 3 files changed, 50 insertions(+), 2 deletions(-)

Comments

Peter Zijlstra July 29, 2023, 8:44 a.m. UTC | #1
On Fri, Jul 28, 2023 at 05:36:55PM +0200, Rafael J. Wysocki wrote:
> On Fri, Jul 28, 2023 at 5:01 PM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > In order to facilitate governors that track history in idle-state
> > buckets (TEO) in making a useful decision about NOHZ, make sure we have
> > a bucket that counts tick-and-longer sleeps.
> >
> > In order to be inclusive of the tick itself -- after all, if we do not
> > disable NOHZ we'll sleep for a full tick -- the actual boundary should
> > be just short of a full tick.
> >
> > IOW, when registering the idle-states, add one that is always
> > disabled, just to have a bucket.
> 
> This extra bucket can be created in the governor itself, can't it?

I couldn't find a nice spot for the governor to add idle-states.

Rafael J. Wysocki July 31, 2023, 8:01 a.m. UTC | #2
On Sat, Jul 29, 2023 at 10:44 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Jul 28, 2023 at 05:36:55PM +0200, Rafael J. Wysocki wrote:
> > On Fri, Jul 28, 2023 at 5:01 PM Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > In order to facilitate governors that track history in idle-state
> > > buckets (TEO) in making a useful decision about NOHZ, make sure we have
> > > a bucket that counts tick-and-longer sleeps.
> > >
> > > In order to be inclusive of the tick itself -- after all, if we do not
> > > disable NOHZ we'll sleep for a full tick -- the actual boundary should
> > > be just short of a full tick.
> > >
> > > IOW, when registering the idle-states, add one that is always
> > > disabled, just to have a bucket.
> >
> > This extra bucket can be created in the governor itself, can't it?
>
> I couldn't find a nice spot for the governor to add idle-states.

Well, I've thought this through and recalled a couple of things and my
conclusion is that the decision whether or not to stop the tick really
depends on the idle state choice.

There are three cases:

1. The selected idle state is shallow (that is, its target residency
is below the tick period length), but it is not the deepest one.
2. The selected idle state is shallow, but it is the deepest one (or
at least the deepest enabled one).
3. The selected idle state is deep (that is, its target residency is
above the tick period length).

In case 1, the tick should not be stopped so as to prevent the CPU
from spending too much time in a suboptimal idle state.

In case 3, the tick needs to be stopped, because otherwise the target
residency of the selected state would not be met.

Case 2 is somewhat of a mixed bag, but generally speaking stopping the
tick doesn't hurt if the selected idle state is the deepest one,
because in that case the governor kind of expects a significant exit
latency anyway.  If it is not the deepest one (which is disabled),
it's better to let the tick tick.
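
A minimal sketch of that three-way decision, as a C helper with
hypothetical names (idx is the selected state, deepest the deepest
enabled one); this is not the actual governor code, just the three
cases above spelled out:

#include <linux/cpuidle.h>
#include <linux/jiffies.h>	/* TICK_NSEC */

/* Hypothetical helper sketching cases 1-3 above. */
static bool sketch_stop_tick(struct cpuidle_driver *drv, int idx, int deepest)
{
	/* Case 3: deep state; stop the tick or the residency cannot be met. */
	if (drv->states[idx].target_residency_ns >= TICK_NSEC)
		return true;

	/* Case 2: shallow but deepest enabled state; stopping does not hurt. */
	if (idx == deepest)
		return true;

	/* Case 1: shallow, with deeper states available; keep the tick. */
	return false;
}
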
Rafael J. Wysocki July 31, 2023, 5:27 p.m. UTC | #3
On Mon, Jul 31, 2023 at 6:55 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> On Mon, Jul 31, 2023 at 1:39 PM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Mon, Jul 31, 2023 at 12:35:20PM +0200, Rafael J. Wysocki wrote:
> >
> > > > So I agree with 1.
> > > >
> > > > I do not agree with 2. Disabling the tick is costly, doubly so with the
> > > > timer-pull thing, but even today. Simply disabling it because we picked
> > > > the deepest idle state, irrespective of the expected duration, is wrong
> > > > as it will incur this significant cost.
> > > >
> > > > With 3 there is the question of how we get the expected sleep duration;
> > > > this is especially important with timer-pull, where we have this
> > > > chicken-and-egg thing.
> > > >
> > > > Notably: tick_nohz_get_sleep_length() wants to know if the tick gets
> > > > disabled
> > >
> > > Well, it shouldn't.  Or at least it didn't before.
> >
> > Correct, this is new in the timer-pull thing.
> >
> > > It is expected to produce two values, one with the tick stopped (this
> > > is the return value of the function) and the other with the tick
> > > ticking (this is the one written to the address passed as the arg).
> > > This cannot depend on whether or not the tick will be stopped.  Both
> > > are good to know.
> > >
> > > Now, I understand that getting these two values may be costly, so
> > > there is an incentive to avoid calling it, but then the governor needs
> > > to figure this out from its crystal ball and so care needs to be taken
> > > to limit the possible damage in case the crystal ball is not right.
> >
> > If we can get the governor to decide the tick state up-front we can
> > avoid a lot of the expensive parts.
>
> I claim that in the vast majority of cases this is the same as
> deciding the state.
>
> The case when it is not is when the target residency of the deepest
> idle state is below the tick period length and the governor is about
> to select that state.  According to the data I've seen so far this is
> a tiny fraction of all the cases.
>
> > > > and cpuidle wants to use tick_nohz_get_sleep_length() to
> > > > determine if to disable the tick. This cycle needs to be broken for
> > > > timer-pull.
> > > >
> > > > Hence my proposal to introduce the extra tick state, which allows fixing
> > > > both 2 and 3.
> > >
> > > I'm not sure about 3 TBH.
> > >
> > > Say there are 2 idle states, one shallow (say its target residency is
> > > 10 us) and one deep (say its target residency is T = 2 * TICK_NSEC).
> >
> > This is the easy case and that actually 'works' today.
>
> But this is my case 3 which you said you didn't agree with.  I don't
> see why it needs to be fixed.
>
> > The interesting case is where your deepest state has a target residency that
> > is below the tick (because for HZ=100, we have a 10ms tick and pretty
> > much all idle states are below that).
>
> What about HZ=1000, though?
>
> > In that case you cannot tell the difference between "I'm good to use this
> > state" and "I'm good to disable the tick and still use this state".
>
> No, you don't, but is it really worth the fuss?
>
> The state is high-latency anyway, and tick_nohz_get_sleep_length()
> needs to be called in that case in order to determine if a
> shallower state wouldn't be better.  At this point the statistics have
> already told the governor otherwise and a misprediction would be a
> double loss.
>
> So yes, you can gain performance by not calling
> tick_nohz_get_sleep_length(), but then you can also lose it by
> selecting a state that is too deep (and that can be determined exactly
> with respect to timers).

BTW, note that teo records timers as "hits", because it has an idea
about when the next timer should occur and that's because it calls
tick_nohz_get_sleep_length().

If it doesn't call that function, it will not be able to tell the
difference between a non-tick timer and any other wakeup, so the
non-tick timer wakeups will then be recorded as "intercepts", which
will skew it towards shallow states over time.  That's one of the
reasons why I would prefer to only avoid calling
tick_nohz_get_sleep_length() when the candidate idle state is really
shallow.
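
For reference, a minimal sketch of the two-value interface described
above (the local names and the helper are illustrative; in the kernel
this is only callable from the governor's idle-selection path):

#include <linux/ktime.h>
#include <linux/tick.h>

static s64 sketch_sleep_lengths(void)
{
	ktime_t delta_tick;

	/*
	 * Return value: time until the next timer event assuming the
	 * tick is stopped.  delta_tick: time until the next event
	 * (possibly the tick itself) with the tick left running.
	 */
	s64 duration_ns = tick_nohz_get_sleep_length(&delta_tick);

	/*
	 * duration_ns >= delta_tick always holds; knowing both is what
	 * lets teo classify timer wakeups as "hits" rather than
	 * "intercepts".  The difference is the extra sleep made
	 * available by stopping the tick.
	 */
	return duration_ns - delta_tick;
}
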
Peter Zijlstra Aug. 2, 2023, 10:34 a.m. UTC | #4
On Mon, Jul 31, 2023 at 06:55:35PM +0200, Rafael J. Wysocki wrote:

> > In that case you cannot tell the difference between "I'm good to use this
> > state" and "I'm good to disable the tick and still use this state".
> 
> No, you don't, but is it really worth the fuss?

My somewhat aged IVB-EP sits around 25 us for restarting the tick.

Depending on the C state, that is a significant chunk of exit latency,
and depending on how often you do the whole NOHZ dance, this can add up
to significant lost runtime too.

And these are all machines that have a usable TSC, these numbers all go
up significantly when you somehow end up on the HPET or similar wreckage.

Stopping the tick is slightly more expensive, but of the same order:
I get around 30 us on the IVB, vs 25 us for restarting it. Reprogramming
the timer (LAPIC/TSC-DEADLINE) is the main chunk of it, I suspect.

So overall that's 55 us extra latency for the full idle path, which can
definitely hurt.

So yeah, I would say this is all worth it.

My ADL is somewhat better, but also much higher clocked, and gets around
10 us for a big core and 16 us for a little core for restarting the
tick.

Rafael J. Wysocki Aug. 2, 2023, 12:44 p.m. UTC | #5
On Wed, Aug 2, 2023 at 12:34 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Jul 31, 2023 at 06:55:35PM +0200, Rafael J. Wysocki wrote:
>
> > > In that case you cannot tell the difference between "I'm good to use this
> > > state" and "I'm good to disable the tick and still use this state".
> >
> > No, you don't, but is it really worth the fuss?
>
> My somewhat aged IVB-EP sits around 25 us for restarting the tick.
>
> Depending on the C state, that is a significant chunk of exit latency,
> and depending on how often you do the whole NOHZ dance, this can add up
> to significant lost runtime too.
>
> And these are all machines that have a usable TSC, these numbers all go
> up significantly when you somehow end up on the HPET or similar wreckage.
>
> Stopping the tick is slightly more expensive, but of the same order:
> I get around 30 us on the IVB, vs 25 us for restarting it. Reprogramming
> the timer (LAPIC/TSC-DEADLINE) is the main chunk of it, I suspect.
>
> So overall that's 55 us extra latency for the full idle path, which can
> definitely hurt.
>
> So yeah, I would say this is all worth it.

I agree that, in general, it is good to avoid stopping the tick when
it is not necessary to stop it.

> My ADL is somewhat better, but also much higher clocked, and gets around
> 10 us for a big core and 16 us for a little core for restarting the
> tick.

But my overall point is different.

An additional bin would possibly help if the deepest state has been
selected and its target residency is below the tick, and the closest
timer (other than the tick) is beyond the tick.  So how much of a
difference would be made by making this particular case more accurate?

Peter Zijlstra Aug. 2, 2023, 1:23 p.m. UTC | #6
On Wed, Aug 02, 2023 at 02:44:33PM +0200, Rafael J. Wysocki wrote:
> On Wed, Aug 2, 2023 at 12:34 PM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Mon, Jul 31, 2023 at 06:55:35PM +0200, Rafael J. Wysocki wrote:
> >
> > > > In that case you cannot tell the difference between "I'm good to use this
> > > > state" and "I'm good to disable the tick and still use this state".
> > >
> > > No, you don't, but is it really worth the fuss?
> >
> > My somewhat aged IVB-EP sits around 25 us for restarting the tick.
> >
> > Depending on the C state, that is a significant chunk of exit latency,
> > and depending on how often you do the whole NOHZ dance, this can add up
> > to significant lost runtime too.
> >
> > And these are all machines that have a usable TSC, these numbers all go
> > up significantly when you somehow end up on the HPET or similar wreckage.
> >
> > Stopping the tick is slightly more expensive, but of the same order:
> > I get around 30 us on the IVB, vs 25 us for restarting it. Reprogramming
> > the timer (LAPIC/TSC-DEADLINE) is the main chunk of it, I suspect.
> >
> > So overall that's 55 us extra latency for the full idle path, which can
> > definitely hurt.
> >
> > So yeah, I would say this is all worth it.
> 
> I agree that, in general, it is good to avoid stopping the tick when
> it is not necessary to stop it.
> 
> > My ADL is somewhat better, but also much higher clocked, and gets around
> > 10 us for a big core and 16 us for a little core for restarting the
> > tick.
> 
> But my overall point is different.
> 
> An additional bin would possibly help if the deepest state has been
> selected and its target residency is below the tick, and the closest
> timer (other than the tick) is beyond the tick.  So how much of a
> difference would be made by making this particular case more accurate?

Many of the server parts have a deepest idle state around 600us, and
distros have HZ=250, i.e. a 4000us tick. So every idle of duration
600us < x < 4000us would unnecessarily disable the tick.

How often this happens is of course workload dependent, but if unlucky
it could be a lot. It also adds the above-mentioned latency to the idle
state, which for those parts is a significant chunk of extra exit
latency.

The fix is 'trivial'; why not do it?

Anyway, let me post my latest hackery :-)
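
For concreteness, the arithmetic behind that window, with illustrative
numbers (the HZ value and the deepest-state residency are examples; the
SHORT_TICK_NSEC boundary is the one from the patch below):

#include <linux/jiffies.h>	/* TICK_NSEC */
#include <linux/time64.h>	/* NSEC_PER_USEC */

/* Example: a server part whose deepest state has a 600us residency. */
#define DEEPEST_RESIDENCY_NS	(600 * NSEC_PER_USEC)

/* HZ=250 -> TICK_NSEC = 4,000,000ns, so SHORT_TICK_NSEC ~= 3,875,000ns. */
#define SHORT_TICK_NSEC		(TICK_NSEC - TICK_NSEC/32)

/*
 * Any predicted idle duration x with 600us < x < 4000us selects the
 * deepest state.  Without a tick bucket, a "stop the tick whenever the
 * deepest state is selected" policy stops the tick even though the tick
 * would fire before x elapses, paying the ~55us stop/restart cost from
 * the IVB-EP numbers above for nothing.
 */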

Patch

--- a/drivers/cpuidle/cpuidle.h
+++ b/drivers/cpuidle/cpuidle.h
@@ -72,4 +72,6 @@  static inline void cpuidle_coupled_unreg
 }
 #endif
 
+#define SHORT_TICK_NSEC (TICK_NSEC - TICK_NSEC/32)
+
 #endif /* __DRIVER_CPUIDLE_H */
--- a/drivers/cpuidle/driver.c
+++ b/drivers/cpuidle/driver.c
@@ -147,13 +147,37 @@  static void cpuidle_setup_broadcast_time
 		tick_broadcast_disable();
 }
 
+static int tick_enter(struct cpuidle_device *dev,
+		      struct cpuidle_driver *drv,
+		      int index)
+{
+	return -ENODEV;
+}
+
+static void __cpuidle_state_init_tick(struct cpuidle_state *s)
+{
+	strcpy(s->name, "TICK");
+	strcpy(s->desc, "(no-op)");
+
+	s->target_residency_ns = SHORT_TICK_NSEC;
+	s->target_residency = div_u64(SHORT_TICK_NSEC, NSEC_PER_USEC);
+
+	s->exit_latency_ns = 0;
+	s->exit_latency = 0;
+
+	s->flags |= CPUIDLE_FLAG_UNUSABLE;
+
+	s->enter = tick_enter;
+	s->enter_s2idle = tick_enter;
+}
+
 /**
  * __cpuidle_driver_init - initialize the driver's internal data
  * @drv: a valid pointer to a struct cpuidle_driver
  */
 static void __cpuidle_driver_init(struct cpuidle_driver *drv)
 {
-	int i;
+	int tick = 0, i;
 
 	/*
 	 * Use all possible CPUs as the default, because if the kernel boots
@@ -163,6 +187,9 @@  static void __cpuidle_driver_init(struct
 	if (!drv->cpumask)
 		drv->cpumask = (struct cpumask *)cpu_possible_mask;
 
+	if (WARN_ON_ONCE(drv->state_count >= CPUIDLE_STATE_MAX-2))
+		tick = 1;
+
 	for (i = 0; i < drv->state_count; i++) {
 		struct cpuidle_state *s = &drv->states[i];
 
@@ -192,6 +219,25 @@  static void __cpuidle_driver_init(struct
 			s->exit_latency_ns =  0;
 		else
 			s->exit_latency = div_u64(s->exit_latency_ns, NSEC_PER_USEC);
+
+		if (!tick && s->target_residency_ns >= SHORT_TICK_NSEC) {
+			tick = 1;
+
+			if (s->target_residency_ns == SHORT_TICK_NSEC)
+				continue;
+
+			memmove(&drv->states[i+1], &drv->states[i],
+				sizeof(struct cpuidle_state) * (CPUIDLE_STATE_MAX - i - 1));
+			__cpuidle_state_init_tick(s);
+			drv->state_count++;
+			i++;
+		}
+	}
+
+	if (!tick) {
+		struct cpuidle_state *s = &drv->states[i];
+		__cpuidle_state_init_tick(s);
+		drv->state_count++;
 	}
 }
 
--- a/include/linux/cpuidle.h
+++ b/include/linux/cpuidle.h
@@ -16,7 +16,7 @@ 
 #include <linux/hrtimer.h>
 #include <linux/context_tracking.h>
 
-#define CPUIDLE_STATE_MAX	10
+#define CPUIDLE_STATE_MAX	16
 #define CPUIDLE_NAME_LEN	16
 #define CPUIDLE_DESC_LEN	32