Message ID: 3851791.kQq0lBPeGt@rjwysocki.net
Series: cpuidle: teo: Cleanups and very frequent wakeups handling update
On 1/13/25 18:48, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>

Title has a typo: s/is/if

> Commit 6da8f9ba5a87 ("cpuidle: teo: Skip tick_nohz_get_sleep_length() call in some cases") attempted to reduce the governor overhead in some cases by making it avoid obtaining the sleep length (the time till the next timer event) which may be costly.
>
> Among other things, after the above commit, tick_nohz_get_sleep_length() was not called any more when idle state 0 was to be returned, which turned out to be problematic and the previous behavior in that respect was restored by commit 4b20b07ce72f ("cpuidle: teo: Don't count non-existent intercepts").
>
> However, commit 6da8f9ba5a87 also caused the governor to avoid calling tick_nohz_get_sleep_length() on systems where idle state 0 is a "polling" one (that is, it is not really an idle state, but a loop continuously executed by the CPU) when the target residency of the idle state to be returned was low enough, so there was no practical need to refine the idle state selection in any way. This change was not removed by the other commit, so now on systems where idle state 0 is a "polling" one, tick_nohz_get_sleep_length() is called when idle state 0 is to be returned, but it is not called when a deeper idle state with sufficiently low target residency is to be returned. That is arguably confusing and inconsistent.
>
> Moreover, there is no specific reason why the behavior in question should depend on whether or not idle state 0 is a "polling" one.
>
> One way to address this would be to make the governor always call tick_nohz_get_sleep_length() to obtain the sleep length, but that would effectively mean reverting commit 6da8f9ba5a87 and restoring the latency issue that was the reason for doing it. This approach is thus not particularly attractive.
>
> To address it differently, notice that if a CPU is woken up very often, this is not likely to be caused by timers in the first place (user space has a default timer slack of 50 us and there are relatively few timers with a deadline shorter than several microseconds in the kernel) and even if it were the case, the potential benefit from using a deep idle state would then be questionable for latency reasons. Therefore, if the majority of CPU wakeups occur within several microseconds, it can be assumed that all wakeups in that range are non-timer and the sleep length need not be determined.
>
> Accordingly, introduce a new metric for counting wakeups with the measured idle duration below RESIDENCY_THRESHOLD_NS and modify the idle state selection to skip the tick_nohz_get_sleep_length() invocation if idle state 0 has been selected or the target residency of the candidate idle state is below RESIDENCY_THRESHOLD_NS and the value of the new metric is at least 1/2 of the total event count.
>
> Since the above requires the measured idle duration to be determined every time, except for the cases when one of the safety nets has triggered in which the wakeup is counted as a hit in the deepest idle state idle residency range, update the handling of those cases to avoid skipping the idle duration computation when the CPU wakeup is "genuine".
>
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
>  drivers/cpuidle/governors/teo.c | 58 ++++++++++++++++++++++++----------------
>  1 file changed, 36 insertions(+), 22 deletions(-)
>
> --- a/drivers/cpuidle/governors/teo.c
> +++ b/drivers/cpuidle/governors/teo.c
> @@ -129,6 +129,7 @@
>   * @state_bins: Idle state data bins for this CPU.
>   * @total: Grand total of the "intercepts" and "hits" metrics for all bins.
>   * @tick_intercepts: "Intercepts" before TICK_NSEC.
> + * @short_idle: Wakeups after short idle periods.
>   */
>  struct teo_cpu {
>          s64 time_span_ns;
> @@ -136,6 +137,7 @@
>          struct teo_bin state_bins[CPUIDLE_STATE_MAX];
>          unsigned int total;
>          unsigned int tick_intercepts;
> +        unsigned int short_idle;

Maybe call these short_idles?

>  };
>
>  static DEFINE_PER_CPU(struct teo_cpu, teo_cpus);
>
> @@ -152,12 +154,12 @@
>          s64 target_residency_ns;
>          u64 measured_ns;
>
> -        if (cpu_data->time_span_ns >= cpu_data->sleep_length_ns) {
> +        cpu_data->short_idle -= cpu_data->short_idle >> DECAY_SHIFT;
> +
> +        if (cpu_data->time_span_ns < 0) {
>                  /*
> -                 * This causes the wakeup to be counted as a hit regardless of
> -                 * regardless of the real idle duration which doesn't need to be
> -                 * computed because the wakeup has been close enough to an
> -                 * anticipated timer.
> +                 * If one of the safety nets has triggered, assume that this
> +                 * might have been a long sleep.
>                  */
>                  measured_ns = U64_MAX;
>          } else {
> @@ -177,10 +179,14 @@
>                   * time, so take 1/2 of the exit latency as a very rough
>                   * approximation of the average of it.
>                   */
> -                if (measured_ns >= lat_ns)
> +                if (measured_ns >= lat_ns) {
>                          measured_ns -= lat_ns / 2;
> -                else
> +                        if (measured_ns < RESIDENCY_THRESHOLD_NS)
> +                                cpu_data->short_idle += PULSE;
> +                } else {
>                          measured_ns /= 2;
> +                        cpu_data->short_idle += PULSE;
> +                }
>          }
>
>          cpu_data->total = 0;
> @@ -419,27 +425,35 @@
>          if (idx > constraint_idx)
>                  idx = constraint_idx;
>
> -        if (!idx) {
> -                /*
> -                 * Query the sleep length to be able to count the wakeup as a
> -                 * hit if it is caused by a timer.
> -                 */
> -                cpu_data->sleep_length_ns = tick_nohz_get_sleep_length(&delta_tick);
> -                goto out_tick;
> -        }
> -
>          /*
> -         * If state 0 is a polling one, check if the target residency of
> -         * the current candidate state is low enough and skip the timers
> -         * check in that case too.
> +         * If either the candidate state is state 0 or its target residency is
> +         * low enough, there is basically nothing more to do, but if the sleep
> +         * length is not updated, the subsequent wakeup will be counted as an
> +         * "intercept" which may be problematic in the cases when timer wakeups
> +         * are dominant. Namely, it may effectively prevent deeper idle states
> +         * from being selected at one point even if no imminent timers are
> +         * scheduled.
> +         *
> +         * However, frequent timers in the RESIDENCY_THRESHOLD_NS range on one
> +         * CPU are unlikely (user space has a default 50 us slack value for
> +         * hrtimers and there are relatively few timers with a lower deadline
> +         * value in the kernel), and even if they did happene, the potential

s/happene/happen

> +         * benefit from using a deep idle state in that case would be
> +         * questionable anyway for latency reasons. Thus if the measured idle
> +         * duration falls into that range in the majority of cases, assume
> +         * non-timer wakeups to be dominant and skip updating the sleep length
> +         * to reduce latency.
>          */
> -        if ((drv->states[0].flags & CPUIDLE_FLAG_POLLING) &&
> -            drv->states[idx].target_residency_ns < RESIDENCY_THRESHOLD_NS)
> +        if ((!idx || drv->states[idx].target_residency_ns < RESIDENCY_THRESHOLD_NS) &&
> +            2 * cpu_data->short_idle >= cpu_data->total)
>                  goto out_tick;
>
>          duration_ns = tick_nohz_get_sleep_length(&delta_tick);
>          cpu_data->sleep_length_ns = duration_ns;
>
> +        if (!idx)
> +                goto out_tick;
> +
>          /*
>           * If the closest expected timer is before the target residency of the
>           * candidate state, a shallower one needs to be found.
> @@ -501,7 +515,7 @@
>          if (dev->poll_time_limit ||
>              (tick_nohz_idle_got_tick() && cpu_data->sleep_length_ns > TICK_NSEC)) {
>                  dev->poll_time_limit = false;
> -                cpu_data->time_span_ns = cpu_data->sleep_length_ns;
> +                cpu_data->time_span_ns = KTIME_MIN;
>          } else {
>                  cpu_data->time_span_ns = local_clock() - cpu_data->time_span_ns;
>          }
>

Thanks, I like this approach.
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
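The decaying "short idle" metric described in the changelog can be sketched outside the kernel. This is a simplified, illustrative model only: `toy_cpu`, `toy_update` and `toy_skip_sleep_length` are made-up names, the constant values are stand-ins, and `total` is decayed directly here, whereas the governor actually rebuilds it from its per-state bins on every update.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-ins for the governor's fixed-point constants. */
#define DECAY_SHIFT 3                   /* each update decays counters by 1/8 */
#define PULSE 1024                      /* weight credited to the current event */
#define RESIDENCY_THRESHOLD_NS 15000ULL /* example "short idle" cutoff (~15 us) */

struct toy_cpu {
        unsigned int total;       /* decayed grand total of all wakeups */
        unsigned int short_idles; /* decayed count of short-idle wakeups */
};

/* Record one wakeup: decay both counters, then credit the current event. */
static void toy_update(struct toy_cpu *c, uint64_t measured_ns)
{
        c->short_idles -= c->short_idles >> DECAY_SHIFT;
        c->total -= c->total >> DECAY_SHIFT;

        c->total += PULSE;
        if (measured_ns < RESIDENCY_THRESHOLD_NS)
                c->short_idles += PULSE;
}

/*
 * The majority test from the patch: skip querying the sleep length when
 * at least half of the (decayed) wakeups were short-idle ones.
 */
static bool toy_skip_sleep_length(const struct toy_cpu *c)
{
        return 2 * c->short_idles >= c->total;
}
```

Because both counters decay geometrically, a burst of very frequent wakeups quickly tips the majority test, and a stretch of long sleeps quickly tips it back.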
On Mon, Jan 20, 2025 at 1:08 PM Christian Loehle <christian.loehle@arm.com> wrote:
> On 1/13/25 18:48, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> Title has a typo: s/is/if

Fixed when applying the patch.

> > +        unsigned int short_idle;
>
> Maybe call these short_idles?

I renamed it accordingly when applying the patch.

> > +         * value in the kernel), and even if they did happene, the potential
>
> s/happene/happen

Fixed when applying the patch.

> Thanks, I like this approach.
> Reviewed-by: Christian Loehle <christian.loehle@arm.com>

Thank you!
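The `KTIME_MIN` marking at the end of the patch pairs with the `time_span_ns < 0` test earlier in it: a negative span can only come from the safety-net path, so it flags wakeups whose real idle duration is unknown. A standalone illustration of that sentinel pattern (toy names, not the kernel's code; `TOY_KTIME_MIN` plays the role of `KTIME_MIN`):

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the kernel's KTIME_MIN (the most negative s64). */
#define TOY_KTIME_MIN INT64_MIN

/*
 * Return the measured idle duration, or UINT64_MAX when a safety net
 * fired: the wakeup path stored TOY_KTIME_MIN in the time span, so any
 * negative span means "the real idle duration is unknown, assume it
 * might have been a long sleep".
 */
static uint64_t toy_measured_ns(int64_t time_span_ns)
{
        if (time_span_ns < 0)
                return UINT64_MAX;
        return (uint64_t)time_span_ns;
}
```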