[RESEND,v1,1/2] cpuidle: menu: Correct the criteria for stopping tick

Message ID 1533835203-5789-2-git-send-email-leo.yan@linaro.org
State New
Headers show
Series
  • Optimization CPU idle state impacted by tick
Related show

Commit Message

Leo Yan Aug. 9, 2018, 5:20 p.m.
The criteria for keeping tick running is the prediction duration is less
than TICK_USEC, the mainline kernel configures HZ=250 so TICK_USEC equals
to 4000us, so any prediction is less than 4000us will not stop tick and
the idle state will be fixed up to one shallow state.  On the other hand,
let's use 96boards Hikey (CA53 octa-CPUs) as an example, the platform has
the deepest C-state is cluster off state which its 'target_residency' is
2700us, if the 'menu' governor predicts the next idle duration is any
value fallen into the range [2700us, 4000us), then the 'menu' governor
will keep sched tick running and and roll back to a shallow CPU off state
rather than cluster off state.  Finally we can see the CPU has much less
chance to run into deepest state when a task repeatedly running on it
with 5000us period and 40% duty cycle (so the task runs for 2000us and
then sleep for 3000us in every period).  In theory, we should permit the
CPU to stay in cluster off state due the CPU sleeping time 3000us is
over its 'target_residency' 2700us.

This issue is caused by the 'menu' governor's criteria for decision if
need to enable tick and roll back to shallow state, the criteria is:
'expected_interval < TICK_USEC'.  This criteria is only considering from
tick aspect, but it doesn't consider idle state residency so misses
better choice for deeper idle state; e.g., the deepest idle state
'target_residency' is less than TICK_USEC, which is quite common on Arm
platforms.

To fix this issue, this patch is to add one extra variable
'stop_tick_point' to help decision if need to stop tick or not.  If
prediction is longer than 'stop_tick_point' then we can stop tick,
otherwise it will keep tick running.

For 'stop_tick_point', except we need to compare prediction period with
TICK_USEC, we also need consider from the perspective of deepest idle
state 'target_residency'.  Finally, 'stop_tick_point' is coming from the
minimum value within the deepest idle state 'target_residency' and
TICK_USEC.

Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Leo Yan <leo.yan@linaro.org>

---
 drivers/cpuidle/governors/menu.c | 41 ++++++++++++++++++++++++++++++++++++++--
 1 file changed, 39 insertions(+), 2 deletions(-)

-- 
2.7.4

Comments

Rafael J. Wysocki Aug. 9, 2018, 8:47 p.m. | #1
On Thu, Aug 9, 2018 at 7:20 PM, Leo Yan <leo.yan@linaro.org> wrote:
> The criteria for keeping tick running is the prediction duration is less

> than TICK_USEC,


Yes, because if the predicted idle duration is less than the tick
period, stopping the tick is pointless overhead (if the governor
predicts a CPU wakeup within the tick period length range, it may as
well let the tick run, because the CPU will be woken up relatively
shortly anyway).

> the mainline kernel configures HZ=250 so TICK_USEC equals


To be precise, other values of HZ may be used too, depending on how
the kernel is configured.

> to 4000us, so any prediction is less than 4000us will not stop tick and

> the idle state will be fixed up to one shallow state.  On the other hand,

> let's use 96boards Hikey (CA53 octa-CPUs) as an example, the platform has

> the deepest C-state is cluster off state which its 'target_residency' is

> 2700us, if the 'menu' governor predicts the next idle duration is any

> value fallen into the range [2700us, 4000us), then the 'menu' governor

> will keep sched tick running and and roll back to a shallow CPU off state

> rather than cluster off state.


Which is as expected.

> Finally we can see the CPU has much less

> chance to run into deepest state when a task repeatedly running on it

> with 5000us period and 40% duty cycle (so the task runs for 2000us and

> then sleep for 3000us in every period).  In theory, we should permit the

> CPU to stay in cluster off state due the CPU sleeping time 3000us is

> over its 'target_residency' 2700us.


For every particular choice of the criteria you can find a particular
case in which it will be suboptimal.

> This issue is caused by the 'menu' governor's criteria for decision if

> need to enable tick and roll back to shallow state, the criteria is:

> 'expected_interval < TICK_USEC'.  This criteria is only considering from

> tick aspect, but it doesn't consider idle state residency so misses

> better choice for deeper idle state; e.g., the deepest idle state

> 'target_residency' is less than TICK_USEC, which is quite common on Arm

> platforms.

>

> To fix this issue, this patch is to add one extra variable

> 'stop_tick_point' to help decision if need to stop tick or not.  If

> prediction is longer than 'stop_tick_point' then we can stop tick,

> otherwise it will keep tick running.


Opinions may differ on whether or not it is an issue that needs to be fixed.

> For 'stop_tick_point', except we need to compare prediction period with

> TICK_USEC, we also need consider from the perspective of deepest idle

> state 'target_residency'.  Finally, 'stop_tick_point' is coming from the

> minimum value within the deepest idle state 'target_residency' and

> TICK_USEC.

>

> Cc: Daniel Lezcano <daniel.lezcano@linaro.org>

> Cc: Vincent Guittot <vincent.guittot@linaro.org>

> Signed-off-by: Leo Yan <leo.yan@linaro.org>

> ---

>  drivers/cpuidle/governors/menu.c | 41 ++++++++++++++++++++++++++++++++++++++--

>  1 file changed, 39 insertions(+), 2 deletions(-)

>

> diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c

> index 30ab759..2ce4068 100644

> --- a/drivers/cpuidle/governors/menu.c

> +++ b/drivers/cpuidle/governors/menu.c

> @@ -294,6 +294,7 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,

>         unsigned int expected_interval;

>         unsigned long nr_iowaiters, cpu_load;

>         ktime_t delta_next;

> +       unsigned int stop_tick_point;

>

>         if (data->needs_update) {

>                 menu_update(drv, dev);

> @@ -406,11 +407,47 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,

>                 idx = 0; /* No states enabled. Must use 0. */

>

>         /*

> +        * Decide the time point for tick stopping, if the prediction is before

> +        * this time point it's better to keep the tick enabled and after the

> +        * time point it means the CPU can stay in idle state for enough long

> +        * time so should stop the tick.  This point needs to consider two

> +        * factors: the first one is tick period and the another factor is the

> +        * maximum target residency.

> +        *

> +        * We can divide into below cases:

> +        *

> +        * The first case is the prediction is shorter than the maximum target

> +        * residency and also shorter than tick period, this means the

> +        * prediction isn't to use deepest idle state and it's suppose the CPU

> +        * will be waken up within tick period, for this case we should keep

> +        * the tick to be enabled;

> +        *

> +        * The second case is the prediction is shorter than the maximum target

> +        * residency and longer than tick period, for this case the idle state

> +        * selection has already based on the prediction for shallow state and

> +        * we will expect some events can arrive later than tick to wake up the

> +        * CPU; another thinking for this case is the CPU is likely to stay in

> +        * the expected idle state for long while (which should be longer than

> +        * tick period), so it's reasonable to stop the tick.

> +        *

> +        * The third case is the prediction is longer than the maximum target

> +        * residency, but weather it's longer or shorter than tick period; for

> +        * this case we have selected the deepest idle state so it's pointless

> +        * to enable tick to wake up CPU from deepest state.

> +        *

> +        * To summary upper cases, we use the value of min(TICK_USEC,

> +        * maximum_target_residency) as the critical point to decide if need to

> +        * stop tick.

> +        */

> +       stop_tick_point = min_t(unsigned int, TICK_USEC,

> +                       drv->states[drv->state_count-1].target_residency);

> +

> +       /*

>          * Don't stop the tick if the selected state is a polling one or if the

> -        * expected idle duration is shorter than the tick period length.

> +        * expected idle duration is shorter than the estimated stop tick point.

>          */

>         if ((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) ||

> -           expected_interval < TICK_USEC) {

> +           expected_interval < stop_tick_point) {


And that will cause the tick to be stopped unnecessarily in certain
situations, so why is this better?

>                 unsigned int delta_next_us = ktime_to_us(delta_next);

>

>                 *stop_tick = false;

> --
Rafael J. Wysocki Aug. 10, 2018, 7:22 a.m. | #2
On Fri, Aug 10, 2018 at 9:13 AM,  <leo.yan@linaro.org> wrote:
> On Thu, Aug 09, 2018 at 10:47:17PM +0200, Rafael J. Wysocki wrote:

>> On Thu, Aug 9, 2018 at 7:20 PM, Leo Yan <leo.yan@linaro.org> wrote:


[cut]

>> And that will cause the tick to be stopped unnecessarily in certain

>> situations, so why is this better?

>

> Let's see below two cases, the first one case we configure

> TICK_USEC=1000 (1ms) and the second case we configure TICK_USEC=4000

> (4ms).

>

> Let's assume we do the testing one the same platform and have two runs,

> in the Case 1 we configure HZ=1000 so TICK_USEC=1ms, expected_interval

> is 3ms and deepest idle state target residency is 2ms, finally the idle

> governor will choose the deepest state and skip to calibrate to shallow

> state caused by 'expected_interval' > TICK_USEC;

>

> In the Case 2 we configure HZ=250 so TICK_USE=4ms, expected_interval

> (3ms) and deepest idle state target residency (2ms) are same with the

> Case 1; but because expected_interval < TICK_USEC so the idle governor

> will do calibration to select a shallower state.  If we image on one

> platform, the deepest idle state's target residency is smaller value,

> then it has bigger gap with TICK_USEC, the deepest idle state is harder

> to be selected due 'expected_interval' can be easily hit the range

> [Deepest target residency..TICK_USEC).

>

> This patch has no any change for Case 1 and it wants to optimize for

> Case 2 so Case 2 has chance to stay in deepest idle state.  I

> understand from the performance pespective, we need to avoid to stop

> tick for shallow states; on the other hand we cannot prevent CPU run

> into deepest idle state just only we want to keep the tick running,

> especially the expected interval is longer than the deepest state

> target residency.

>

> Case 1:

>       Deepest idle state's target residency=2ms

>                      |

>                      V

> |--------------------------------------------------------> time (ms)

>       ^                              ^

>       |                              |

> TICK_USEC=1ms           expected_interval=3ms

>

>

> Case 2:

>       Deepest idle state's target residency = 2ms

>                      |

>                      V

> |--------------------------------------------------------> time (ms)

>                                      ^                  ^

>                                      |                  |

>                           expected_interval = 3ms   TICK_USEC = 4ms

>

>

>

>> >                 unsigned int delta_next_us = ktime_to_us(delta_next);

>> >

>> >                 *stop_tick = false;

>> > --


Well, I don't quite agree with the approach here, then.

As I said in the previous reply, IMO restarting the stopped tick
before leaving the loop in do_idle() is pointless overhead.  It is not
necessary to do that to avoid leaving CPUs in shallow idle states for
too long (I'll send an alternative patch to fix this issue shortly).

While you may think that pointless overhead is not a problem, I don't
quite agree with that.
Leo Yan Aug. 10, 2018, 9:03 a.m. | #3
On Fri, Aug 10, 2018 at 04:49:06PM +0800, Leo Yan wrote:
> On Fri, Aug 10, 2018 at 09:22:10AM +0200, Rafael J. Wysocki wrote:

> > On Fri, Aug 10, 2018 at 9:13 AM,  <leo.yan@linaro.org> wrote:

> > > On Thu, Aug 09, 2018 at 10:47:17PM +0200, Rafael J. Wysocki wrote:

> > >> On Thu, Aug 9, 2018 at 7:20 PM, Leo Yan <leo.yan@linaro.org> wrote:

> > 

> > [cut]

> > 

> > >> And that will cause the tick to be stopped unnecessarily in certain

> > >> situations, so why is this better?

> > >

> > > Let's see below two cases, the first one case we configure

> > > TICK_USEC=1000 (1ms) and the second case we configure TICK_USEC=4000

> > > (4ms).

> > >

> > > Let's assume we do the testing one the same platform and have two runs,

> > > in the Case 1 we configure HZ=1000 so TICK_USEC=1ms, expected_interval

> > > is 3ms and deepest idle state target residency is 2ms, finally the idle

> > > governor will choose the deepest state and skip to calibrate to shallow

> > > state caused by 'expected_interval' > TICK_USEC;

> > >

> > > In the Case 2 we configure HZ=250 so TICK_USE=4ms, expected_interval

> > > (3ms) and deepest idle state target residency (2ms) are same with the

> > > Case 1; but because expected_interval < TICK_USEC so the idle governor

> > > will do calibration to select a shallower state.  If we image on one

> > > platform, the deepest idle state's target residency is smaller value,

> > > then it has bigger gap with TICK_USEC, the deepest idle state is harder

> > > to be selected due 'expected_interval' can be easily hit the range

> > > [Deepest target residency..TICK_USEC).

> > >

> > > This patch has no any change for Case 1 and it wants to optimize for

> > > Case 2 so Case 2 has chance to stay in deepest idle state.  I

> > > understand from the performance pespective, we need to avoid to stop

> > > tick for shallow states; on the other hand we cannot prevent CPU run

> > > into deepest idle state just only we want to keep the tick running,

> > > especially the expected interval is longer than the deepest state

> > > target residency.

> > >

> > > Case 1:

> > >       Deepest idle state's target residency=2ms

> > >                      |

> > >                      V

> > > |--------------------------------------------------------> time (ms)

> > >       ^                              ^

> > >       |                              |

> > > TICK_USEC=1ms           expected_interval=3ms

> > >

> > >

> > > Case 2:

> > >       Deepest idle state's target residency = 2ms

> > >                      |

> > >                      V

> > > |--------------------------------------------------------> time (ms)

> > >                                      ^                  ^

> > >                                      |                  |

> > >                           expected_interval = 3ms   TICK_USEC = 4ms

> > >

> > >

> > >

> > >> >                 unsigned int delta_next_us = ktime_to_us(delta_next);

> > >> >

> > >> >                 *stop_tick = false;

> > >> > --

> > 

> > Well, I don't quite agree with the approach here, then.

> > 

> > As I said in the previous reply, IMO restarting the stopped tick

> > before leaving the loop in do_idle() is pointless overhead.  It is not

> > necessary to do that to avoid leaving CPUs in shallow idle states for

> > too long (I'll send an alternative patch to fix this issue shortly).

> > 

> > While you may think that pointless overhead is not a problem, I don't

> > quite agree with that.

> 

> I disagree this patch will introduce any extra overhead.

> 

> Firstly, the idle loop doesn't support restarting tick even this patch

> tells idle loop to restart the tick; secondly this patch is mainly to

> resolve issue for the CPU cannot stay in deepest state in Case 2, as a

> side effect it also can tell idle loop to restart the tick for case 3

> in below, actually IMHO this makes sense to tell the idle loop to

> enable the tick but idle loop can ignore this info.

> 

> Furthermore, we have another thread for the patch to always stop

> tick after the the tick has been stopped in the idle loop.

> 

> So this patch is still valid.


Correct for Case 3 as below, actually this case will disappear if we
force to set expected_interval=ktime_to_us(delta_next) in another
proposaled patch.  If so, this patch will have no any chance to
introduce extra ticks.

 expected_interval                             Deepest idle state's
 = min(TICK_USEC,ktime_to_us(delta_next))      target residency = 2ms
 = TICK_USEC = 1ms                                    |
      |                                               |
      V                                               V
|--------------------------------------------------------> time (ms)
      ^
      |
TICK_USEC=1ms
Leo Yan Aug. 12, 2018, 4:07 p.m. | #4
On Sun, Aug 12, 2018 at 01:12:41PM +0200, Rafael J. Wysocki wrote:
> On Fri, Aug 10, 2018 at 11:03 AM <leo.yan@linaro.org> wrote:

> >

> > On Fri, Aug 10, 2018 at 04:49:06PM +0800, Leo Yan wrote:

> > > On Fri, Aug 10, 2018 at 09:22:10AM +0200, Rafael J. Wysocki wrote:

> > > > On Fri, Aug 10, 2018 at 9:13 AM,  <leo.yan@linaro.org> wrote:

> > > > > On Thu, Aug 09, 2018 at 10:47:17PM +0200, Rafael J. Wysocki wrote:

> > > > >> On Thu, Aug 9, 2018 at 7:20 PM, Leo Yan <leo.yan@linaro.org> wrote:

> > > >

> > > > [cut]

> > > >

> > > > >> And that will cause the tick to be stopped unnecessarily in certain

> > > > >> situations, so why is this better?

> > > > >

> > > > > Let's see below two cases, the first one case we configure

> > > > > TICK_USEC=1000 (1ms) and the second case we configure TICK_USEC=4000

> > > > > (4ms).

> > > > >

> > > > > Let's assume we do the testing one the same platform and have two runs,

> > > > > in the Case 1 we configure HZ=1000 so TICK_USEC=1ms, expected_interval

> > > > > is 3ms and deepest idle state target residency is 2ms, finally the idle

> > > > > governor will choose the deepest state and skip to calibrate to shallow

> > > > > state caused by 'expected_interval' > TICK_USEC;

> > > > >

> > > > > In the Case 2 we configure HZ=250 so TICK_USE=4ms, expected_interval

> > > > > (3ms) and deepest idle state target residency (2ms) are same with the

> > > > > Case 1; but because expected_interval < TICK_USEC so the idle governor

> > > > > will do calibration to select a shallower state.  If we image on one

> > > > > platform, the deepest idle state's target residency is smaller value,

> > > > > then it has bigger gap with TICK_USEC, the deepest idle state is harder

> > > > > to be selected due 'expected_interval' can be easily hit the range

> > > > > [Deepest target residency..TICK_USEC).

> > > > >

> > > > > This patch has no any change for Case 1 and it wants to optimize for

> > > > > Case 2 so Case 2 has chance to stay in deepest idle state.  I

> > > > > understand from the performance pespective, we need to avoid to stop

> > > > > tick for shallow states; on the other hand we cannot prevent CPU run

> > > > > into deepest idle state just only we want to keep the tick running,

> > > > > especially the expected interval is longer than the deepest state

> > > > > target residency.

> > > > >

> > > > > Case 1:

> > > > >       Deepest idle state's target residency=2ms

> > > > >                      |

> > > > >                      V

> > > > > |--------------------------------------------------------> time (ms)

> > > > >       ^                              ^

> > > > >       |                              |

> > > > > TICK_USEC=1ms           expected_interval=3ms

> > > > >

> > > > >

> > > > > Case 2:

> > > > >       Deepest idle state's target residency = 2ms

> > > > >                      |

> > > > >                      V

> > > > > |--------------------------------------------------------> time (ms)

> > > > >                                      ^                  ^

> > > > >                                      |                  |

> > > > >                           expected_interval = 3ms   TICK_USEC = 4ms

> > > > >

> > > > >

> > > > >

> > > > >> >                 unsigned int delta_next_us = ktime_to_us(delta_next);

> > > > >> >

> > > > >> >                 *stop_tick = false;

> > > > >> > --

> > > >

> > > > Well, I don't quite agree with the approach here, then.

> > > >

> > > > As I said in the previous reply, IMO restarting the stopped tick

> > > > before leaving the loop in do_idle() is pointless overhead.  It is not

> > > > necessary to do that to avoid leaving CPUs in shallow idle states for

> > > > too long (I'll send an alternative patch to fix this issue shortly).

> > > >

> > > > While you may think that pointless overhead is not a problem, I don't

> > > > quite agree with that.

> > >

> > > I disagree this patch will introduce any extra overhead.

> > >

> > > Firstly, the idle loop doesn't support restarting tick even this patch

> > > tells idle loop to restart the tick;

> 

> I'm not talking about restarting the tick, but about stopping it more

> often on average.


Ah, yes, I agree.

> > > secondly this patch is mainly to

> > > resolve issue for the CPU cannot stay in deepest state in Case 2,

> 

> I understand what you are trying to achieve here, but I don't agree with it.


I agree we need find more general method for fixing.

> The condition modified by this patch is not about how much time the

> CPU can potentially be idle, but about when it is expected to wake up.

> The "expected" part is really key here.

> 

> The governor has gone through the effort of making an idle duration

> prediction and it now it has a certain expectation regarding when the

> CPU will wake up.  If the governor's prediction is any good at all and

> this expectation is in the tick range, the CPU will be woken up by

> something close enough to the tick in the majority of cases, so there

> is no need to stop the tick.  Not because the CPU cannot be idle

> longer, but because it is expected to wake up early enough anyway (and

> yes, you can argue that 2 times the tick range may still be "early

> enough" and so on, but then I'd like to see numbers in support of

> that).


Thanks for explaination; I also think this is good methodology, but
just want to improve a bit based on this.  For example, the governor
is always to use 'expectation' to compare with TICK_USEC, TICK_USEC is a
predefined interval as a boundary, but in reality the tick incoming time
is in the range of [0..TICK_USEC]; so currently method we cannot make
decision according to the tick's delta in realtime.

I'd like take this issue as 'how to improve the decision for stopping
tick?', if we can make better decision for stopping tick, then it's
possible to resolve Case 2 and without stopping tick more offten, e.g.
the CPU even can run into deepest idle state without stopping the tick
if the prediction is less than the tick.

I will send out a new patch set based on these ideas for reviewing.

> Now, if the governor is junk and its predictions are useless, the

> above will not be the case any more, but then I'm not sure what the

> benefit from using that governor at all is. :-)


I really don't think the governor and predictions are useless :)

Just want to remind the side topic, after introducing tick in idle loop,
the tick also can impact the predictions (e.g. it have some impactions
on correction factors but need more time for modeling on this).

> > > as a side effect it also can tell idle loop to restart the tick for case 3

> > > in below, actually IMHO this makes sense to tell the idle loop to

> > > enable the tick but idle loop can ignore this info.

> > >

> > > Furthermore, we have another thread for the patch to always stop

> > > tick after the the tick has been stopped in the idle loop.

> > >

> > > So this patch is still valid.

> >

> > Correct for Case 3 as below, actually this case will disappear if we

> > force to set expected_interval=ktime_to_us(delta_next) in another

> > proposaled patch.  If so, this patch will have no any chance to

> > introduce extra ticks.

> 

> Yes, it will or at least it may.

> 

> Assuming shot noise wakeups, if

> drv->states[drv->state_count-1].target_residency is less than

> TICK_USEC, the tick will be stopped for CPUs more often on average

> with the patch applied (simply because the idle duration range for

> which it will not be stopped is narrower then).


Yes, good point, so in the new approach I try to change the code
to compare with next tick delta rather than TICK_USEC, it can keeps
running tick for the tick with long expire time (similiar with
comparing with TICK_USEC) but we also can stop tick if the tick is likely
to break idle residency.

Thanks,
Leo Yan
Leo Yan Aug. 13, 2018, 9:58 a.m. | #5
On Mon, Aug 13, 2018 at 10:01:20AM +0200, Rafael J. Wysocki wrote:
> On Sun, Aug 12, 2018 at 6:07 PM <leo.yan@linaro.org> wrote:

> >

> > On Sun, Aug 12, 2018 at 01:12:41PM +0200, Rafael J. Wysocki wrote:

> > > On Fri, Aug 10, 2018 at 11:03 AM <leo.yan@linaro.org> wrote:

> 

> [cut]

> 

> > >

> > > Assuming shot noise wakeups, if

> > > drv->states[drv->state_count-1].target_residency is less than

> > > TICK_USEC, the tick will be stopped for CPUs more often on average

> > > with the patch applied (simply because the idle duration range for

> > > which it will not be stopped is narrower then).

> >

> > Yes, good point, so in the new approach I try to change the code

> > to compare with next tick delta rather than TICK_USEC, it can keeps

> > running tick for the tick with long expire time (similiar with

> > comparing with TICK_USEC) but we also can stop tick if the tick is likely

> > to break idle residency.

> 

> We tried something similar and the results were not encouraging.

> Please see https://lore.kernel.org/lkml/079e16b6-2e04-2518-0553-66becab226a6@tu-dresden.de/


I reviewed the result, in the shared page, it said to use next tick
delta and results in the power data increasing, I think this can be
explained.

If we use prediction to compare with next tick delta rather than
TICK_USEC, Thomas gave the expectation is 'This works as a I expect in
the sense of stopping the tick more often and allowing deeper sleep
states in some cases.'; but we also need to expect the change will give
more chance for powernightmares to occurring, if the expect_interval
just falls into the range [delta_next_us..TICK_USEC) then idle
governor will stop tick after comparing with the next tick delta, but
at the meantime the idle governor is very likely to select one shallow
state for expect_interval is a small value.  So though the change
gives more chance for staying deeper state but it also give more
chance for staying in shallow state for longer time.

From the testing report, I don't find it do statistics for idle state
duration time.  The email said 'No more sched tick, no more C1E
requests, but increased power.', so this is just for statistics for
idle state requests (entering/exiting times), but not for duration
statistics.  So this isn't clear for me how the difference for idle
duration.

Because the change gives more chance for powernightmares, so we need
use extra method to avoid it.  This is why I add one extra patch for
this [1], it checks for shallow state with long expire timer, for this
case we should not stop the tick.

Actually the powernightmares issue is not completely resolved so it
still impact the power data; after we really get rid of the impaction
of powernightmares, I think we may have chance to see the benefits of
comparing with next tick delta.

Based on these ideas, I worked out the patch set 'Improvement stopping
tick decision making in 'menu' idle governor' [2], the testing result
supports the idle duration improvement, which shared in the cover letter.

Please help review and let me know if it's doable or not.  Thanks for
the suggestion.

Thanks,
Leo Yan

[1] https://lkml.org/lkml/2018/8/12/84
[2] https://lkml.org/lkml/2018/8/12/82

Patch

diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
index 30ab759..2ce4068 100644
--- a/drivers/cpuidle/governors/menu.c
+++ b/drivers/cpuidle/governors/menu.c
@@ -294,6 +294,7 @@  static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
 	unsigned int expected_interval;
 	unsigned long nr_iowaiters, cpu_load;
 	ktime_t delta_next;
+	unsigned int stop_tick_point;
 
 	if (data->needs_update) {
 		menu_update(drv, dev);
@@ -406,11 +407,47 @@  static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
 		idx = 0; /* No states enabled. Must use 0. */
 
 	/*
+	 * Decide the time point for tick stopping, if the prediction is before
+	 * this time point it's better to keep the tick enabled and after the
+	 * time point it means the CPU can stay in idle state for enough long
+	 * time so should stop the tick.  This point needs to consider two
+	 * factors: the first one is tick period and the another factor is the
+	 * maximum target residency.
+	 *
+	 * We can divide into below cases:
+	 *
+	 * The first case is the prediction is shorter than the maximum target
+	 * residency and also shorter than tick period, this means the
+	 * prediction isn't to use deepest idle state and it's suppose the CPU
+	 * will be waken up within tick period, for this case we should keep
+	 * the tick to be enabled;
+	 *
+	 * The second case is the prediction is shorter than the maximum target
+	 * residency and longer than tick period, for this case the idle state
+	 * selection has already based on the prediction for shallow state and
+	 * we will expect some events can arrive later than tick to wake up the
+	 * CPU; another thinking for this case is the CPU is likely to stay in
+	 * the expected idle state for long while (which should be longer than
+	 * tick period), so it's reasonable to stop the tick.
+	 *
+	 * The third case is the prediction is longer than the maximum target
+	 * residency, but weather it's longer or shorter than tick period; for
+	 * this case we have selected the deepest idle state so it's pointless
+	 * to enable tick to wake up CPU from deepest state.
+	 *
+	 * To summary upper cases, we use the value of min(TICK_USEC,
+	 * maximum_target_residency) as the critical point to decide if need to
+	 * stop tick.
+	 */
+	stop_tick_point = min_t(unsigned int, TICK_USEC,
+			drv->states[drv->state_count-1].target_residency);
+
+	/*
 	 * Don't stop the tick if the selected state is a polling one or if the
-	 * expected idle duration is shorter than the tick period length.
+	 * expected idle duration is shorter than the estimated stop tick point.
 	 */
 	if ((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) ||
-	    expected_interval < TICK_USEC) {
+	    expected_interval < stop_tick_point) {
 		unsigned int delta_next_us = ktime_to_us(delta_next);
 
 		*stop_tick = false;