mbox series

[0/4] sched: cpufreq: Remove uclamp max-aggregation

Message ID 20231208015242.385103-1-qyousef@layalina.io
Headers show
Series sched: cpufreq: Remove uclamp max-aggregation | expand

Message

Qais Yousef Dec. 8, 2023, 1:52 a.m. UTC
One of the practical issues that has risen up when trying to deploy uclamp in
practice, was making uclamp_max more effective. Due to max-aggregation at rq,
the effective uclamp_max value for a capped task can easily be lifted by any
task that gets enqueued. Which can be often enough in practice to cause it to
be ineffective and leaving much to be desired.

One of the solutions attempted out of tree was to filter out the short running
tasks based on sched_slice(), and this proved to be effective enough

	https://android-review.googlesource.com/c/kernel/common/+/1914696

But the solution is not upstream friendly as it introduces more magic
thresholds and not sure how it would work after EEVDF changes got merged in.

In principle, the max-aggregation is required to address the frequency hint
part of uclamp. We don't need this at wake up path as the per-task value should
let us know with the task fits a CPU or not on HMP systems.

As pointed out in remove magic hardcoded margins series [1], the limitation is
actually in the governor/hardware where we are constrained to changing
frequencies at a high rate, which uclamp usage can induce.

To address the problems, we can move the problem to be a cpufreq governor issue
to deal with whatever limitation it has to respond to task's perf requirement
hints. This means we need to tell the governor that we need a frequency update
to cater for a task perf hints, hence adding a new SCHED_CPUFREQ_PERF_HINTS
flag.

With the new flag, we can then send special updates to cpufreq governor on
context switch *if* it has perf requirements that aren't met already.

The governor can then choose to honour or ignore these request based on
whatever criteria it sees fit. For schedutil I made use of the new
approximate_util_avg() to create a threshold based on rate_limit_us to ignore
perf requirements for tasks that their util_avg tells us they were historically
running shorter than hardware's ability to change frequency. Which means they
will actually go back to sleep before they see the freq change, so honouring
their request is pointless, hence we ignore it.

Instead of being exact, I made an assumption that the task has to run for at
least 500us above rate_limit_us which is magical but what I've seen in practice
as a meaningful runtime where task can actually do meaningful work that is
worthwhile. But this exact definition of the threshold is subject for debate.
We can make it 1.5 rate_limit_us for example. I preferred the absolute number
as even in lower end hardware; I think this is a meaningful unit of time for
producing meaningful results that can make can impact. There's the hidden
assumption that most modern hardware already has fast enough DVFS. Absolute
number fails for super fast hardware though..

This allows us to remove uclamp max-aggregation logic completely and moves the
problem to be a cpufreq governor problem instead. Which I think is the right
thing to do as the scheduler was overcompensating for what is in reality
a cpufreq governor limitation and policy. We just need to improve the
interface with the governor.

Implementing different policies/strategies to deal with the problem would be
easier now that the problem space has moved to the governor. And it keeps
scheduler code clean and focus on things that matter from scheduling point of
view only.

For example max-aggregation can be implemented in the governor by adding new
flag when sending cpufreq_update_util() at enqueue/dequeue_task(). Not that
I think it's a good idea, but the possibility is there. Especially if platforms
like x86 has a lot of intelligence in firmware and they'd like to implement
something smarter at that level. They'll just need to improve the interface
with the governor.

===

This patch is based on remove margins series [1] and data is collected it
against it as a baseline.

Testing on pixel 6 with mainline(ish) kernel

==

Speedometer browser benchmark

       | baseline  | 1.25 headroom |   patch   | patch + 1.25 headroom
-------+-----------+---------------+-----------+---------------------
score  |  108.03   |     135.72    |   108.09  |    137.47
-------+-----------+---------------+-----------+---------------------
power  |  1204.75  |    1451.79    |  1216.17  |    1484.73
-------+-----------+---------------+-----------+---------------------

No regressions.

===

UiBench

       | baseline  | 1.25 headroom |   patch   | patch + 1.25 headroom
-------+-----------+---------------+-----------+---------------------
jank   |    68.0   |      56.0     |    65.6   |    57.3
-------+-----------+---------------+-----------+---------------------
power  |   146.54  |     164.49    |   144.91  |    167.57
-------+-----------+---------------+-----------+---------------------

No regressions.

===

Spawning 8 busyloop threads each pinned to a CPU with uclamp_max set to low-ish
OPP

```
adb shell "uclampset -M 90 taskset -a 01 cat /dev/zero > /dev/null &" &
adb shell "uclampset -M 90 taskset -a 02 cat /dev/zero > /dev/null &" &
adb shell "uclampset -M 90 taskset -a 04 cat /dev/zero > /dev/null &" &
adb shell "uclampset -M 90 taskset -a 08 cat /dev/zero > /dev/null &" &
adb shell "uclampset -M 270 taskset -a 10 cat /dev/zero > /dev/null &" &
adb shell "uclampset -M 270 taskset -a 20 cat /dev/zero > /dev/null &" &
adb shell "uclampset -M 670 taskset -a 40 cat /dev/zero > /dev/null &" &
adb shell "uclampset -M 670 taskset -a 80 cat /dev/zero > /dev/null &" &
```

And running speedometer for a single iteration

       | baseline  |   patch   |
-------+-----------+-----------+
score  |   73.44   |   75.62   |
-------+-----------+-----------+
power  |   520.46  |   489.49  |
-------+-----------+-----------+

Similar score at lower power.

Little's Freq Residency:

         | baseline  |   patch   |
---------+-----------+-----------+
OPP@90   |   29.59%  |   49.25%  |
---------+-----------+-----------+
OPP@max  |   40.02%  |   12.31%  |
---------+-----------+-----------+

Mid's Freq Residency:

         | baseline  |   patch   |
---------+-----------+-----------+
OPP@270  |   50.02%  |   77.53%  |
---------+-----------+-----------+
OPP@max  |   49.17%  |   22.46%  |
---------+-----------+-----------+

Big's Freq Residency:

         | baseline  |   patch   |
---------+-----------+-----------+
OPP@670  |   46.43%  |   54.44%  |
---------+-----------+-----------+
OPP@max  |   1.76%   |   4.57%   |
---------+-----------+-----------+

As you can see the residency at the capped frequency has increased considerably
for all clusters. The time spent running at max frequency is reduced
significantly for little and mid. For big there's a slight increase. Both
numbers are suspiciously low. With the busy loops in background; these cores
are more subject to throttling. So the higher number indicates we've been
throttled less.

---

Patch 1 clean ups the code that sends cpufreq_update_util() to be more
intentional and less noisy.

Patch 2 removes uclamp-max aggregation and implements sending
cpufreq_update_util() updates at context switch instead.

Patch 3 implements the logic to filter short running tasks compared to
rate_limit_us and add perf hint flag at task_tick_fair().

Patch 4 updates uclamp docs to reflect the new changes.

[1] https://lore.kernel.org/lkml/20231208002342.367117-1-qyousef@layalina.io/

Thanks!

--
Qais Yousef

Qais Yousef (4):
  sched/fair: Be less aggressive in calling cpufreq_update_util()
  sched/uclamp: Remove rq max aggregation
  sched/schedutil: Ignore update requests for short running tasks
  sched/documentation: Remove reference to max aggregation

 Documentation/scheduler/sched-util-clamp.rst | 239 ++---------
 include/linux/sched.h                        |  16 -
 include/linux/sched/cpufreq.h                |   3 +-
 init/Kconfig                                 |  31 --
 kernel/sched/core.c                          | 426 +++----------------
 kernel/sched/cpufreq_schedutil.c             |  61 ++-
 kernel/sched/fair.c                          | 157 ++-----
 kernel/sched/rt.c                            |   4 -
 kernel/sched/sched.h                         | 150 ++-----
 9 files changed, 244 insertions(+), 843 deletions(-)

Comments

Qais Yousef Dec. 17, 2023, 9:23 p.m. UTC | #1
On 12/18/23 09:19, Dietmar Eggemann wrote:
> On 08/12/2023 02:52, Qais Yousef wrote:
> 
> [...]
> 
> > ===
> > 
> > This patch is based on remove margins series [1] and data is collected it
> > against it as a baseline.
> > 
> > Testing on pixel 6 with mainline(ish) kernel
> 
> How is the Pixel6 configured in terms of per-policy rate_limit_us and
> response_time_ms ? Is this the now default 2ms and whatever the systems
> calculates for response_time_ms ?

Yes.

> 
> Pixel6 is still a slow switching device, rigth?
> 
> root           297     2 1 08:58:01 ?     00:00:13 [sugov:0]
> root           298     2 0 08:58:01 ?     00:00:03 [sugov:4]
> root           299     2 1 08:58:01 ?     00:00:05 [sugov:6]

Yes.

> 
> > ==
> > 
> > Speedometer browser benchmark
> > 
> >        | baseline  | 1.25 headroom |   patch   | patch + 1.25 headroom
> > -------+-----------+---------------+-----------+---------------------
> > score  |  108.03   |     135.72    |   108.09  |    137.47
> > -------+-----------+---------------+-----------+---------------------
> > power  |  1204.75  |    1451.79    |  1216.17  |    1484.73
> > -------+-----------+---------------+-----------+---------------------
> 
> What's the difference between baseline & 1.25 headroom. IMHO, we have:

Baseline is the remove-margins patches as specified in the quoted text above

	> > This patch is based on remove margins series [1] and data is collected it
	> > against it as a baseline.

The series were stacked on top of each others. Results from this run should be
compared to remove-margins[1] tables too.

> 
>  static inline unsigned long map_util_perf(unsigned long util)
>  {
>    return util + (util >> 2);
>  }
> 
> on baseline?

This is not baseline. See above.

> 
> By patch you refer to the whole patch-set + [1]?
> 
> And I assume 'patch + 1.25 headroom' is 'response_time_ms' tuned to
> reach 1.25 ?

Yes. Which is done by multiplying the response_time_ms with 0.8.


Cheers

--
Qais Yousef
Dietmar Eggemann Dec. 18, 2023, 8:19 a.m. UTC | #2
On 08/12/2023 02:52, Qais Yousef wrote:

[...]

> ===
> 
> This patch is based on remove margins series [1] and data is collected it
> against it as a baseline.
> 
> Testing on pixel 6 with mainline(ish) kernel

How is the Pixel6 configured in terms of per-policy rate_limit_us and
response_time_ms ? Is this the now default 2ms and whatever the systems
calculates for response_time_ms ?

Pixel6 is still a slow switching device, rigth?

root           297     2 1 08:58:01 ?     00:00:13 [sugov:0]
root           298     2 0 08:58:01 ?     00:00:03 [sugov:4]
root           299     2 1 08:58:01 ?     00:00:05 [sugov:6]

> ==
> 
> Speedometer browser benchmark
> 
>        | baseline  | 1.25 headroom |   patch   | patch + 1.25 headroom
> -------+-----------+---------------+-----------+---------------------
> score  |  108.03   |     135.72    |   108.09  |    137.47
> -------+-----------+---------------+-----------+---------------------
> power  |  1204.75  |    1451.79    |  1216.17  |    1484.73
> -------+-----------+---------------+-----------+---------------------

What's the difference between baseline & 1.25 headroom. IMHO, we have:

 static inline unsigned long map_util_perf(unsigned long util)
 {
   return util + (util >> 2);
 }

on baseline?

By patch you refer to the whole patch-set + [1]?

And I assume 'patch + 1.25 headroom' is 'response_time_ms' tuned to
reach 1.25 ?

[...]