Message ID | 2239639.irdbgypaU6@rjwysocki.net
---|---
State | New
Series | cpuidle: teo: Refine handling of short idle intervals
On 4/16/25 16:28, Rafael J. Wysocki wrote:
> On Wed, Apr 16, 2025 at 5:00 PM Christian Loehle
> <christian.loehle@arm.com> wrote:
>>
>> On 4/3/25 20:18, Rafael J. Wysocki wrote:
>>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>>
>>> Make teo take all recent wakeups (both timer and non-timer) into
>>> account when looking for a new candidate idle state in the cases
>>> when the majority of recent idle intervals are within the
>>> LATENCY_THRESHOLD_NS range or the latency limit is within the
>>> LATENCY_THRESHOLD_NS range.
>>>
>>> Since the tick_nohz_get_sleep_length() invocation is likely to be
>>> skipped in those cases, timer wakeups should arguably be taken into
>>> account somehow in case they are significant while the current code
>>> mostly looks at non-timer wakeups under the assumption that frequent
>>> timer wakeups are unlikely in the given idle duration range which
>>> may or may not be accurate.
>>>
>>> The most natural way to do that is to add the "hits" metric to the
>>> sums used during the new candidate idle state lookup which effectively
>>> means the above.
>>>
>>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>
>> Hi Rafael,
>> I might be missing something so bare with me.
>> Quoting the cover-letter too:
>> "In those cases, timer wakeups are not taken into account when they are
>> within the LATENCY_THRESHOLD_NS range and the idle state selection may
>> be based entirely on non-timer wakeups which may be rare. This causes
>> the prediction accuracy to be low and too much energy may be used as
>> a result.
>>
>> The first patch is preparatory and it is not expected to make any
>> functional difference.
>>
>> The second patch causes teo to take timer wakeups into account if it
>> is about to skip the tick_nohz_get_sleep_length() invocation, so they
>> get a chance to influence the idle state selection."
>>
>> If the timer wakeups are < LATENCY_THRESHOLD_NS we will not do
>>
>> cpu_data->sleep_length_ns = tick_nohz_get_sleep_length(&delta_tick);
>>
>> but
>>
>> cpu_data->sleep_length_ns = KTIME_MAX;
>>
>> therefore
>> idx_timer = drv->state_count - 1
>> idx_duration = some state with residency < LATENCY_THRESHOLD_NS
>>
>> For any reasonable system therefore idx_timer != idx_duration
>> (i.e. there's an idle state deeper than LATENCY_THRESHOLD_NS).
>> So hits will never be incremented?
>
> Why never?
>
> First of all, you need to get into the "2 * cpu_data->short_idles >=
> cpu_data->total" case somehow and this may be through timer wakeups.

Okay, maybe I had a too static scenario in mind here.
Let me think it through one more time.

>
>> How would adding hits then help this case?
>
> They may be dominant when this condition triggers for the first time.

I see.

Anything in particular this would help a lot with?
There's no noticeable behavior change in my usual tests, which is
expected, given we have only WFI in LATENCY_THRESHOLD_NS.
I did fake a WFI2 with residency=5 latency=1, teo-m is mainline, teo
is with series applied:

device   gov    iter  iops   idles    idle_misses  idle_miss_ratio  belows  aboves  WFI     WFI2
-------  -----  ----- ------ -------- ------------ ---------------- ------- ------- ------- --------
nvme0n1  teo    0     80223  8601862  1079609      0.126            918363  161246  205096  4080894
nvme0n1  teo    1     78522  8488322  1054171      0.124            890420  163751  208664  4020130
nvme0n1  teo    2     77901  8375258  1031275      0.123            878083  153192  194500  3977655
nvme0n1  teo    3     77517  8344681  1023423      0.123            869548  153875  195262  3961675
nvme0n1  teo    4     77934  8356760  1027556      0.123            876438  151118  191848  3971578
nvme0n1  teo    5     77864  8371566  1033686      0.123            877745  155941  197903  3972844
nvme0n1  teo    6     78057  8417326  1040512      0.124            881420  159092  201922  3991785
nvme0n1  teo    7     78214  8490292  1050379      0.124            884528  165851  210860  4019102
nvme0n1  teo    8     78100  8357664  1034487      0.124            882781  151706  192728  3971505
nvme0n1  teo    9     76895  8316098  1014695      0.122            861950  152745  193680  3948573
nvme0n1  teo-m  0     76729  8261670  1032158      0.125            845247  186911  237147  3877992
nvme0n1  teo-m  1     77763  8344526  1053266      0.126            867094  186172  237526  3919320
nvme0n1  teo-m  2     76717  8285070  1034706      0.125            848385  186321  236956  3889534
nvme0n1  teo-m  3     76920  8270834  1030223      0.125            847490  182733  232081  3887525
nvme0n1  teo-m  4     77198  8329578  1044724      0.125            855438  189286  240947  3908194
nvme0n1  teo-m  5     77361  8338772  1046903      0.126            857291  189612  241577  3912576
nvme0n1  teo-m  6     76827  8346204  1037520      0.124            846008  191512  243167  3914194
nvme0n1  teo-m  7     77931  8367212  1053337      0.126            866549  186788  237852  3930510
nvme0n1  teo-m  8     77870  8358306  1056011      0.126            867167  188844  240602  3923417
nvme0n1  teo-m  9     77405  8338356  1046012      0.125            856605  189407  240694  3913012

The difference is small, but it's there even though this isn't
a timer-heavy workload at all.
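For the idx_timer/idx_duration argument quoted above, the effect is easy to
reproduce with a toy model of the binning rule (a "hit" when the measured
duration and the expected sleep length fall into the same bin, an "intercept"
otherwise). Everything below is invented for illustration (the residencies,
the bin_index()/update() helpers, the numbers), and it deliberately leaves out
the decaying counters and tick handling of the real teo.c:

/*
 * Toy user-space model of the metric bookkeeping discussed above (invented
 * residencies and helpers, no decay, no tick handling; not the teo.c code):
 * a wakeup counts as a "hit" only when the measured idle duration falls into
 * the same bin as the expected sleep length.  With the sleep length pinned at
 * "infinity" (the KTIME_MAX case), idx_timer is always the deepest state, so
 * every short wakeup lands in "intercepts" and the "hits" of the shallow bins
 * stay at zero.
 */
#include <stdint.h>
#include <stdio.h>

#define NR_STATES 3

/* made-up target residencies in ns for a three-state driver */
static const uint64_t residency_ns[NR_STATES] = { 0, 2000, 200000 };

static struct {
	unsigned int intercepts;
	unsigned int hits;
} bins[NR_STATES];

/* deepest state whose target residency does not exceed the given duration */
static int bin_index(uint64_t duration_ns)
{
	int i;

	for (i = NR_STATES - 1; i > 0; i--) {
		if (residency_ns[i] <= duration_ns)
			return i;
	}
	return 0;
}

static void update(uint64_t sleep_length_ns, uint64_t measured_ns)
{
	int idx_timer = bin_index(sleep_length_ns);
	int idx_duration = bin_index(measured_ns);

	if (idx_timer == idx_duration)
		bins[idx_timer].hits++;
	else
		bins[idx_duration].intercepts++;
}

int main(void)
{
	int i;

	/* sleep length unknown (KTIME_MAX stand-in), measured idle 1.5 us */
	for (i = 0; i < 10; i++)
		update(UINT64_MAX, 1500);

	for (i = 0; i < NR_STATES; i++)
		printf("state%d: intercepts=%u hits=%u\n",
		       i, bins[i].intercepts, bins[i].hits);

	return 0;
}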
On Thu, Apr 17, 2025 at 1:58 PM Christian Loehle
<christian.loehle@arm.com> wrote:
>
> On 4/16/25 16:28, Rafael J. Wysocki wrote:
> > On Wed, Apr 16, 2025 at 5:00 PM Christian Loehle
> > <christian.loehle@arm.com> wrote:
> >>
> >> On 4/3/25 20:18, Rafael J. Wysocki wrote:
> >>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >>>
> >>> Make teo take all recent wakeups (both timer and non-timer) into
> >>> account when looking for a new candidate idle state in the cases
> >>> when the majority of recent idle intervals are within the
> >>> LATENCY_THRESHOLD_NS range or the latency limit is within the
> >>> LATENCY_THRESHOLD_NS range.
> >>>
> >>> Since the tick_nohz_get_sleep_length() invocation is likely to be
> >>> skipped in those cases, timer wakeups should arguably be taken into
> >>> account somehow in case they are significant while the current code
> >>> mostly looks at non-timer wakeups under the assumption that frequent
> >>> timer wakeups are unlikely in the given idle duration range which
> >>> may or may not be accurate.
> >>>
> >>> The most natural way to do that is to add the "hits" metric to the
> >>> sums used during the new candidate idle state lookup which effectively
> >>> means the above.
> >>>
> >>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >>
> >> Hi Rafael,
> >> I might be missing something so bare with me.
> >> Quoting the cover-letter too:
> >> "In those cases, timer wakeups are not taken into account when they are
> >> within the LATENCY_THRESHOLD_NS range and the idle state selection may
> >> be based entirely on non-timer wakeups which may be rare. This causes
> >> the prediction accuracy to be low and too much energy may be used as
> >> a result.
> >>
> >> The first patch is preparatory and it is not expected to make any
> >> functional difference.
> >>
> >> The second patch causes teo to take timer wakeups into account if it
> >> is about to skip the tick_nohz_get_sleep_length() invocation, so they
> >> get a chance to influence the idle state selection."
> >>
> >> If the timer wakeups are < LATENCY_THRESHOLD_NS we will not do
> >>
> >> cpu_data->sleep_length_ns = tick_nohz_get_sleep_length(&delta_tick);
> >>
> >> but
> >>
> >> cpu_data->sleep_length_ns = KTIME_MAX;
> >>
> >> therefore
> >> idx_timer = drv->state_count - 1
> >> idx_duration = some state with residency < LATENCY_THRESHOLD_NS
> >>
> >> For any reasonable system therefore idx_timer != idx_duration
> >> (i.e. there's an idle state deeper than LATENCY_THRESHOLD_NS).
> >> So hits will never be incremented?
> >
> > Why never?
> >
> > First of all, you need to get into the "2 * cpu_data->short_idles >=
> > cpu_data->total" case somehow and this may be through timer wakeups.
>
> Okay, maybe I had a too static scenario in mind here.
> Let me think it through one more time.

Well, this is subtle and your question is actually a good one.

> >
> >> How would adding hits then help this case?
> >
> > They may be dominant when this condition triggers for the first time.
>
> I see.
>
> Anything in particular this would help a lot with?

So I've been trying to reproduce my own results using essentially the
linux-next branch of mine (6.15-rc2 with some material on top) as the
baseline and so far I've been unable to do that. There's no
significant difference from these patches or at least they don't help
as much as I thought they would.

> There's no noticeable behavior change in my usual tests, which is
> expected, given we have only WFI in LATENCY_THRESHOLD_NS.
>
> I did fake a WFI2 with residency=5 latency=1, teo-m is mainline, teo
> is with series applied:
>
> device   gov    iter  iops   idles    idle_misses  idle_miss_ratio  belows  aboves  WFI     WFI2
> -------  -----  ----- ------ -------- ------------ ---------------- ------- ------- ------- --------
> nvme0n1  teo    0     80223  8601862  1079609      0.126            918363  161246  205096  4080894
> nvme0n1  teo    1     78522  8488322  1054171      0.124            890420  163751  208664  4020130
> nvme0n1  teo    2     77901  8375258  1031275      0.123            878083  153192  194500  3977655
> nvme0n1  teo    3     77517  8344681  1023423      0.123            869548  153875  195262  3961675
> nvme0n1  teo    4     77934  8356760  1027556      0.123            876438  151118  191848  3971578
> nvme0n1  teo    5     77864  8371566  1033686      0.123            877745  155941  197903  3972844
> nvme0n1  teo    6     78057  8417326  1040512      0.124            881420  159092  201922  3991785
> nvme0n1  teo    7     78214  8490292  1050379      0.124            884528  165851  210860  4019102
> nvme0n1  teo    8     78100  8357664  1034487      0.124            882781  151706  192728  3971505
> nvme0n1  teo    9     76895  8316098  1014695      0.122            861950  152745  193680  3948573
> nvme0n1  teo-m  0     76729  8261670  1032158      0.125            845247  186911  237147  3877992
> nvme0n1  teo-m  1     77763  8344526  1053266      0.126            867094  186172  237526  3919320
> nvme0n1  teo-m  2     76717  8285070  1034706      0.125            848385  186321  236956  3889534
> nvme0n1  teo-m  3     76920  8270834  1030223      0.125            847490  182733  232081  3887525
> nvme0n1  teo-m  4     77198  8329578  1044724      0.125            855438  189286  240947  3908194
> nvme0n1  teo-m  5     77361  8338772  1046903      0.126            857291  189612  241577  3912576
> nvme0n1  teo-m  6     76827  8346204  1037520      0.124            846008  191512  243167  3914194
> nvme0n1  teo-m  7     77931  8367212  1053337      0.126            866549  186788  237852  3930510
> nvme0n1  teo-m  8     77870  8358306  1056011      0.126            867167  188844  240602  3923417
> nvme0n1  teo-m  9     77405  8338356  1046012      0.125            856605  189407  240694  3913012
>
> The difference is small, but it's there even though this isn't
> a timer-heavy workload at all.

This is interesting, so thanks for doing it, but the goal really was
to help with the polling state usage on x86 and that doesn't appear to
be happening, so I'm going to drop these patches at least for now.
On Thu, Apr 17, 2025 at 7:18 PM Christian Loehle
<christian.loehle@arm.com> wrote:
>
> On 4/17/25 16:21, Rafael J. Wysocki wrote:
> > On Thu, Apr 17, 2025 at 1:58 PM Christian Loehle
> > <christian.loehle@arm.com> wrote:
> >>
> >> On 4/16/25 16:28, Rafael J. Wysocki wrote:
> >>> On Wed, Apr 16, 2025 at 5:00 PM Christian Loehle
> >>> <christian.loehle@arm.com> wrote:
> >>>>
> >>>> On 4/3/25 20:18, Rafael J. Wysocki wrote:
> >>>>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >>>>>
> >>>>> Make teo take all recent wakeups (both timer and non-timer) into
> >>>>> account when looking for a new candidate idle state in the cases
> >>>>> when the majority of recent idle intervals are within the
> >>>>> LATENCY_THRESHOLD_NS range or the latency limit is within the
> >>>>> LATENCY_THRESHOLD_NS range.
> >>>>>
> >>>>> Since the tick_nohz_get_sleep_length() invocation is likely to be
> >>>>> skipped in those cases, timer wakeups should arguably be taken into
> >>>>> account somehow in case they are significant while the current code
> >>>>> mostly looks at non-timer wakeups under the assumption that frequent
> >>>>> timer wakeups are unlikely in the given idle duration range which
> >>>>> may or may not be accurate.
> >>>>>
> >>>>> The most natural way to do that is to add the "hits" metric to the
> >>>>> sums used during the new candidate idle state lookup which effectively
> >>>>> means the above.
> >>>>>
> >>>>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >>>>
> >>>> Hi Rafael,
> >>>> I might be missing something so bare with me.
> >>>> Quoting the cover-letter too:
> >>>> "In those cases, timer wakeups are not taken into account when they are
> >>>> within the LATENCY_THRESHOLD_NS range and the idle state selection may
> >>>> be based entirely on non-timer wakeups which may be rare. This causes
> >>>> the prediction accuracy to be low and too much energy may be used as
> >>>> a result.
> >>>>
> >>>> The first patch is preparatory and it is not expected to make any
> >>>> functional difference.
> >>>>
> >>>> The second patch causes teo to take timer wakeups into account if it
> >>>> is about to skip the tick_nohz_get_sleep_length() invocation, so they
> >>>> get a chance to influence the idle state selection."
> >>>>
> >>>> If the timer wakeups are < LATENCY_THRESHOLD_NS we will not do
> >>>>
> >>>> cpu_data->sleep_length_ns = tick_nohz_get_sleep_length(&delta_tick);
> >>>>
> >>>> but
> >>>>
> >>>> cpu_data->sleep_length_ns = KTIME_MAX;
> >>>>
> >>>> therefore
> >>>> idx_timer = drv->state_count - 1
> >>>> idx_duration = some state with residency < LATENCY_THRESHOLD_NS
> >>>>
> >>>> For any reasonable system therefore idx_timer != idx_duration
> >>>> (i.e. there's an idle state deeper than LATENCY_THRESHOLD_NS).
> >>>> So hits will never be incremented?
> >>>
> >>> Why never?
> >>>
> >>> First of all, you need to get into the "2 * cpu_data->short_idles >=
> >>> cpu_data->total" case somehow and this may be through timer wakeups.
> >>
> >> Okay, maybe I had a too static scenario in mind here.
> >> Let me think it through one more time.
> >
> > Well, this is subtle and your question is actually a good one.
> >
> >>>
> >>>> How would adding hits then help this case?
> >>>
> >>> They may be dominant when this condition triggers for the first time.
> >>
> >> I see.
> >>
> >> Anything in particular this would help a lot with?
> >
> > So I've been trying to reproduce my own results using essentially the
> > linux-next branch of mine (6.15-rc2 with some material on top) as the
> > baseline and so far I've been unable to do that. There's no
> > significant difference from these patches or at least they don't help
> > as much as I thought they would.
> >
> >> There's no noticeable behavior change in my usual tests, which is
> >> expected, given we have only WFI in LATENCY_THRESHOLD_NS.
> >>
> >> I did fake a WFI2 with residency=5 latency=1, teo-m is mainline, teo
> >> is with series applied:
> >>
> >> device   gov    iter  iops   idles    idle_misses  idle_miss_ratio  belows  aboves  WFI     WFI2
> >> -------  -----  ----- ------ -------- ------------ ---------------- ------- ------- ------- --------
> >> nvme0n1  teo    0     80223  8601862  1079609      0.126            918363  161246  205096  4080894
> >> nvme0n1  teo    1     78522  8488322  1054171      0.124            890420  163751  208664  4020130
> >> nvme0n1  teo    2     77901  8375258  1031275      0.123            878083  153192  194500  3977655
> >> nvme0n1  teo    3     77517  8344681  1023423      0.123            869548  153875  195262  3961675
> >> nvme0n1  teo    4     77934  8356760  1027556      0.123            876438  151118  191848  3971578
> >> nvme0n1  teo    5     77864  8371566  1033686      0.123            877745  155941  197903  3972844
> >> nvme0n1  teo    6     78057  8417326  1040512      0.124            881420  159092  201922  3991785
> >> nvme0n1  teo    7     78214  8490292  1050379      0.124            884528  165851  210860  4019102
> >> nvme0n1  teo    8     78100  8357664  1034487      0.124            882781  151706  192728  3971505
> >> nvme0n1  teo    9     76895  8316098  1014695      0.122            861950  152745  193680  3948573
> >> nvme0n1  teo-m  0     76729  8261670  1032158      0.125            845247  186911  237147  3877992
> >> nvme0n1  teo-m  1     77763  8344526  1053266      0.126            867094  186172  237526  3919320
> >> nvme0n1  teo-m  2     76717  8285070  1034706      0.125            848385  186321  236956  3889534
> >> nvme0n1  teo-m  3     76920  8270834  1030223      0.125            847490  182733  232081  3887525
> >> nvme0n1  teo-m  4     77198  8329578  1044724      0.125            855438  189286  240947  3908194
> >> nvme0n1  teo-m  5     77361  8338772  1046903      0.126            857291  189612  241577  3912576
> >> nvme0n1  teo-m  6     76827  8346204  1037520      0.124            846008  191512  243167  3914194
> >> nvme0n1  teo-m  7     77931  8367212  1053337      0.126            866549  186788  237852  3930510
> >> nvme0n1  teo-m  8     77870  8358306  1056011      0.126            867167  188844  240602  3923417
> >> nvme0n1  teo-m  9     77405  8338356  1046012      0.125            856605  189407  240694  3913012
> >>
> >> The difference is small, but it's there even though this isn't
> >> a timer-heavy workload at all.
> >
> > This is interesting, so thanks for doing it, but the goal really was
> > to help with the polling state usage on x86 and that doesn't appear to
> > be happening, so I'm going to drop these patches at least for now.
>
> Alright, well my testing on x86 is limited, but I assume you are
> referring to systems were we do have
> state0 latency=0 residency=0 polling
> state1 latency=1 residency=1
> in theory teo shouldn't be super aggressive on state0 then with the
> intercept logic, unless the idle durations are recorded as <1us.
> I wonder what goes wrong, any traces or workloads you recommend looking
> at?

I've observed state0 being selected too often and being too shallow 90%
or so of the time. I don't have anything showing this specifically in a
dramatic fashion and yes, state1 is usually selected much more often
than state0.
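One way to put numbers on "state0 selected too often and too shallow" without
setting up tracing is to read the per-state "usage", "above" and "below"
counters that cpuidle exports in sysfs (presumably the same kind of accounting
behind the "belows"/"aboves" columns in the table quoted above). The reader
below is only a sketch: cpu0 is hard-coded, error handling is minimal, and the
helpers are invented for this example.

/*
 * Minimal user-space sketch: dump the per-state "usage", "above" and "below"
 * counters from cpuidle sysfs for cpu0.  "above" counts wakeups for which the
 * selected state was too deep, "below" counts those for which it was too
 * shallow, so below/usage for state0 gives a rough measure of how often the
 * polling state was a too-shallow choice.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CPUIDLE_DIR "/sys/devices/system/cpu/cpu0/cpuidle"

static int read_str(int state, const char *file, char *buf, int len)
{
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path), CPUIDLE_DIR "/state%d/%s", state, file);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (!fgets(buf, len, f))
		buf[0] = '\0';
	fclose(f);
	buf[strcspn(buf, "\n")] = '\0';
	return 0;
}

static unsigned long long read_ull(int state, const char *file)
{
	char buf[64];

	if (read_str(state, file, buf, sizeof(buf)))
		return 0;
	return strtoull(buf, NULL, 10);
}

int main(void)
{
	char name[64];
	int i;

	/* iterate over state0, state1, ... until the directory runs out */
	for (i = 0; !read_str(i, "name", name, sizeof(name)); i++)
		printf("state%d (%s): usage=%llu above=%llu below=%llu\n",
		       i, name, read_ull(i, "usage"),
		       read_ull(i, "above"), read_ull(i, "below"));

	return 0;
}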
--- a/drivers/cpuidle/governors/teo.c
+++ b/drivers/cpuidle/governors/teo.c
@@ -261,11 +261,12 @@
 
 static int teo_get_candidate(struct cpuidle_driver *drv,
 			     struct cpuidle_device *dev,
-			     struct teo_cpu *cpu_data,
-			     int idx, unsigned int idx_intercepts)
+			     struct teo_cpu *cpu_data, int constraint_idx,
+			     int idx, unsigned int idx_events,
+			     bool count_all_events)
 {
 	int first_suitable_idx = idx;
-	unsigned int intercepts = 0;
+	unsigned int events = 0;
 	int i;
 
 	/*
@@ -277,8 +278,11 @@
 	 * has been stopped already into account.
 	 */
 	for (i = idx - 1; i >= 0; i--) {
-		intercepts += cpu_data->state_bins[i].intercepts;
-		if (2 * intercepts > idx_intercepts) {
+		events += cpu_data->state_bins[i].intercepts;
+		if (count_all_events)
+			events += cpu_data->state_bins[i].hits;
+
+		if (2 * events > idx_events) {
 			/*
 			 * Use the current state unless it is too
 			 * shallow or disabled, in which case take the
@@ -316,6 +320,12 @@
 		if (first_suitable_idx == idx)
 			break;
 	}
+	/*
+	 * If there is a latency constraint, it may be necessary to select an
+	 * idle state shallower than the current candidate one.
+	 */
+	if (idx > constraint_idx)
+		return constraint_idx;
 
 	return idx;
 }
@@ -410,49 +420,50 @@
 	}
 
 	/*
-	 * If the sum of the intercepts metric for all of the idle states
-	 * shallower than the current candidate one (idx) is greater than the
-	 * sum of the intercepts and hits metrics for the candidate state and
-	 * all of the deeper states, a shallower idle state is likely to be a
-	 * better choice.
-	 */
-	if (2 * idx_intercept_sum > cpu_data->total - idx_hit_sum)
-		idx = teo_get_candidate(drv, dev, cpu_data, idx, idx_intercept_sum);
-
-	/*
-	 * If there is a latency constraint, it may be necessary to select an
-	 * idle state shallower than the current candidate one.
-	 */
-	if (idx > constraint_idx)
-		idx = constraint_idx;
-
-	/*
-	 * If either the candidate state is state 0 or its target residency is
-	 * low enough, there is basically nothing more to do, but if the sleep
-	 * length is not updated, the subsequent wakeup will be counted as an
-	 * "intercept" which may be problematic in the cases when timer wakeups
-	 * are dominant. Namely, it may effectively prevent deeper idle states
-	 * from being selected at one point even if no imminent timers are
-	 * scheduled.
-	 *
-	 * However, frequent timers in the RESIDENCY_THRESHOLD_NS range on one
-	 * CPU are unlikely (user space has a default 50 us slack value for
-	 * hrtimers and there are relatively few timers with a lower deadline
-	 * value in the kernel), and even if they did happen, the potential
-	 * benefit from using a deep idle state in that case would be
-	 * questionable anyway for latency reasons. Thus if the measured idle
-	 * duration falls into that range in the majority of cases, assume
-	 * non-timer wakeups to be dominant and skip updating the sleep length
-	 * to reduce latency.
+	 * If the measured idle duration has fallen into the
+	 * RESIDENCY_THRESHOLD_NS range in the majority of recent cases, it is
+	 * likely to fall into that range next time, so it is better to avoid
+	 * adding latency to the idle state selection path. Accordingly, aim
+	 * for skipping the sleep length update in that case.
 	 *
 	 * Also, if the latency constraint is sufficiently low, it will force
 	 * shallow idle states regardless of the wakeup type, so the sleep
-	 * length need not be known in that case.
+	 * length need not be known in that case either.
 	 */
-	if ((!idx || drv->states[idx].target_residency_ns < RESIDENCY_THRESHOLD_NS) &&
-	    (2 * cpu_data->short_idles >= cpu_data->total ||
-	     latency_req < LATENCY_THRESHOLD_NS))
-		goto out_tick;
+	if (2 * cpu_data->short_idles >= cpu_data->total ||
+	    latency_req < LATENCY_THRESHOLD_NS) {
+		/*
+		 * Look for a new candidate idle state and use all events (both
+		 * "intercepts" and "hits") because the sleep length update is
+		 * likely to be skipped and timer wakeups need to be taken into
+		 * account in a different way in case they are significant.
+		 */
+		idx = teo_get_candidate(drv, dev, cpu_data, idx, constraint_idx,
+					idx_intercept_sum + idx_hit_sum, true);
+		/*
+		 * If the new candidate state is state 0 or its target residency
+		 * is low enough, return it right away without stopping the
+		 * scheduler tick.
+		 */
+		if (!idx || drv->states[idx].target_residency_ns < RESIDENCY_THRESHOLD_NS)
+			goto out_tick;
+	} else if (2 * idx_intercept_sum > cpu_data->total - idx_hit_sum) {
+		/*
+		 * Look for a new candidate state because the current one is
+		 * likely too deep, but use the "intercepts" metric only because
+		 * the sleep length is going to be determined later and for now
+		 * it is only necessary to find a state that will be suitable
+		 * in case the CPU is "intercepted".
+		 */
+		idx = teo_get_candidate(drv, dev, cpu_data, idx, constraint_idx,
					idx_intercept_sum, false);
+	} else if (idx > constraint_idx) {
+		/*
+		 * The current candidate state is too deep for the latency
+		 * constraint at hand, so change it to a suitable one.
+		 */
+		idx = constraint_idx;
+	}
 
 	duration_ns = tick_nohz_get_sleep_length(&delta_tick);
 	cpu_data->sleep_length_ns = duration_ns;
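To see what the count_all_events switch changes in practice, the lookup can be
reduced to a stand-alone sketch: only the accumulate-and-compare loop is kept,
the disabled-state and constraint_idx handling is dropped, and the bin contents
are invented. With timer wakeups ("hits") concentrated in the shallowest bin,
summing the hits as well pulls the candidate much further down than the
intercepts-only walk:

/*
 * Stand-alone simplification of the candidate lookup in the patch above
 * (no disabled-state handling, no latency constraint, made-up numbers).
 */
#include <stdbool.h>
#include <stdio.h>

#define NR_STATES 4

struct bin {
	unsigned int intercepts;
	unsigned int hits;
};

static int get_candidate(const struct bin *bins, int idx,
			 unsigned int idx_events, bool count_all_events)
{
	unsigned int events = 0;
	int i;

	for (i = idx - 1; i >= 0; i--) {
		events += bins[i].intercepts;
		if (count_all_events)
			events += bins[i].hits;

		if (2 * events > idx_events)
			return i;
	}
	return idx;
}

int main(void)
{
	/* mostly very short timer wakeups (hits in bin 0), a few intercepts near state 2 */
	static const struct bin bins[NR_STATES] = {
		{ .intercepts =  2, .hits = 60 },
		{ .intercepts =  3, .hits =  5 },
		{ .intercepts = 20, .hits =  5 },
		{ .intercepts =  0, .hits =  0 },
	};
	unsigned int intercept_sum = 0, hit_sum = 0;
	int idx = NR_STATES - 1, i;

	for (i = 0; i < idx; i++) {
		intercept_sum += bins[i].intercepts;
		hit_sum += bins[i].hits;
	}

	/* prints 2 for the intercepts-only walk, 0 once "hits" are included */
	printf("intercepts only   -> candidate %d\n",
	       get_candidate(bins, idx, intercept_sum, false));
	printf("intercepts + hits -> candidate %d\n",
	       get_candidate(bins, idx, intercept_sum + hit_sum, true));
	return 0;
}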