mbox series

[0/4] Fix dvfs_headroom escaping uclamp constraints

Message ID 20230820210640.585311-1-qyousef@layalina.io
Headers show
Series Fix dvfs_headroom escaping uclamp constraints | expand

Message

Qais Yousef Aug. 20, 2023, 9:06 p.m. UTC
DVFS headroom, or map_util_perf(), is applied after uclamp restrictions are
applied in effective_cpu_util(). This will lead to two problems for uclamp:

	1. If util < uclamp_min, we'll run faster than uclamp_min. For example
	   util = 50, uclamp_min = 100. map_util_perf() = 125 instead of 100.

	2. If util > uclamp_max, we'll run faster than uclamp_max. For example
	   util = 900, uclamp_max = 800, map_util_perf() = 1000.

First patch rename the function to apply_dvfs_headroom() to reflect what it
really does. It is not really mapping util, but provides some headroom for the
util to grow. Provide a documentation for the function too.

Second patch is the actual fix.

Third patch moves apply_dvfs_headroom() to sched.h as there are no longer
users outside the scheduler.

Fourth patch is an RFC to redefine what the headroom means for RT, DL and IRQ
pressures.

Thanks!

--
Qais Yousef

Qais Yousef (4):
  sched: cpufreq: Rename map_util_perf to apply_dvfs_headroom
  sched: cpufreq: Fix apply_dvfs_headroom() escaping uclamp constraints
  sched: cpufreq: Move apply_dvfs_headroom() to sched.h
  sched: cpufreq: Apply DVFS headroom to CFS only

 include/linux/energy_model.h     |  1 -
 include/linux/sched/cpufreq.h    |  5 -----
 kernel/sched/core.c              |  7 +++++--
 kernel/sched/cpufreq_schedutil.c |  5 ++---
 kernel/sched/sched.h             | 24 ++++++++++++++++++++++++
 5 files changed, 31 insertions(+), 11 deletions(-)

Comments

Rafael J. Wysocki Aug. 21, 2023, 10:34 a.m. UTC | #1
On Sun, Aug 20, 2023 at 11:08 PM Qais Yousef <qyousef@layalina.io> wrote:
>
> DVFS headroom, or map_util_perf(), is applied after uclamp restrictions are
> applied in effective_cpu_util(). This will lead to two problems for uclamp:
>
>         1. If util < uclamp_min, we'll run faster than uclamp_min. For example
>            util = 50, uclamp_min = 100. map_util_perf() = 125 instead of 100.
>
>         2. If util > uclamp_max, we'll run faster than uclamp_max. For example
>            util = 900, uclamp_max = 800, map_util_perf() = 1000.
>
> First patch rename the function to apply_dvfs_headroom() to reflect what it
> really does. It is not really mapping util, but provides some headroom for the
> util to grow. Provide a documentation for the function too.
>
> Second patch is the actual fix.
>
> Third patch moves apply_dvfs_headroom() to sched.h as there are no longer
> users outside the scheduler.
>
> Fourth patch is an RFC to redefine what the headroom means for RT, DL and IRQ
> pressures.
>
> Thanks!

For the first 3 patches in the series

Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Dietmar Eggemann Aug. 21, 2023, 4:39 p.m. UTC | #2
On 20/08/2023 23:06, Qais Yousef wrote:
> DVFS headroom is applied after we calculate the effective_cpu_util()
> which is where we honour uclamp constraints. It makes more sense to
> apply the headroom there once and let all users naturally get the right
> thing without having to sprinkle the call around in various places.

uclamp is applied in effective_cpu_util(..., FREQUENCY_UTIL, ...) which
IMHO currently has 2 power callers: (1) schedutil: sugov_get_util() and
(2) EAS: eenv_pd_max_util()

> Before this fix running
> 
> 	uclampset -M 800 cat /dev/zero > /dev/null
> 
> Will cause the test system to run at max freq of 2.8GHz. After the fix
> it runs at 2.2GHz instead which is the correct value that matches the
> capacity of 800.

IMHO, a system at util = 800 (w/o uclamp) would also run at 2.8Ghz since
we would call map_util_to_perf() on 800, no matter from where we call it.

> Note that similar problem exist for uclamp_min. If util was 50, and
> uclamp_min is 100. Since we apply_dvfs_headroom() after apply uclamp
> constraints, we'll end up with util of 125 instead of 100. IOW, we get
> boosted twice, first time by uclamp_min, and second time by dvfs
> headroom.

I see what you want to change here but:

So far we have `util -> uclamp -> map_util_to_perf()`

which is fine when we see uclamp as an entity which constrains util, not
the util after being mapped to a capacity constraint.

[...]
Qais Yousef Aug. 26, 2023, 7:17 p.m. UTC | #3
On 08/21/23 12:34, Rafael J. Wysocki wrote:
> On Sun, Aug 20, 2023 at 11:08 PM Qais Yousef <qyousef@layalina.io> wrote:
> >
> > DVFS headroom, or map_util_perf(), is applied after uclamp restrictions are
> > applied in effective_cpu_util(). This will lead to two problems for uclamp:
> >
> >         1. If util < uclamp_min, we'll run faster than uclamp_min. For example
> >            util = 50, uclamp_min = 100. map_util_perf() = 125 instead of 100.
> >
> >         2. If util > uclamp_max, we'll run faster than uclamp_max. For example
> >            util = 900, uclamp_max = 800, map_util_perf() = 1000.
> >
> > First patch rename the function to apply_dvfs_headroom() to reflect what it
> > really does. It is not really mapping util, but provides some headroom for the
> > util to grow. Provide a documentation for the function too.
> >
> > Second patch is the actual fix.
> >
> > Third patch moves apply_dvfs_headroom() to sched.h as there are no longer
> > users outside the scheduler.
> >
> > Fourth patch is an RFC to redefine what the headroom means for RT, DL and IRQ
> > pressures.
> >
> > Thanks!
> 
> For the first 3 patches in the series
> 
> Acked-by: Rafael J. Wysocki <rafael@kernel.org>

Thanks for having a look!


Cheers

--
Qais Yousef
Qais Yousef Aug. 26, 2023, 8:08 p.m. UTC | #4
On 08/21/23 18:39, Dietmar Eggemann wrote:
> On 20/08/2023 23:06, Qais Yousef wrote:
> > DVFS headroom is applied after we calculate the effective_cpu_util()
> > which is where we honour uclamp constraints. It makes more sense to
> > apply the headroom there once and let all users naturally get the right
> > thing without having to sprinkle the call around in various places.
> 
> uclamp is applied in effective_cpu_util(..., FREQUENCY_UTIL, ...) which
> IMHO currently has 2 power callers: (1) schedutil: sugov_get_util() and
> (2) EAS: eenv_pd_max_util()
> 
> > Before this fix running
> > 
> > 	uclampset -M 800 cat /dev/zero > /dev/null
> > 
> > Will cause the test system to run at max freq of 2.8GHz. After the fix
> > it runs at 2.2GHz instead which is the correct value that matches the
> > capacity of 800.
> 
> IMHO, a system at util = 800 (w/o uclamp) would also run at 2.8Ghz since
> we would call map_util_to_perf() on 800, no matter from where we call it.

Sorry, I would very strongly disagree here. What you're saying the effective
range of uclamp_max is 800 and anything above that will always go to max. How
can this be acceptable?

> 
> > Note that similar problem exist for uclamp_min. If util was 50, and
> > uclamp_min is 100. Since we apply_dvfs_headroom() after apply uclamp
> > constraints, we'll end up with util of 125 instead of 100. IOW, we get
> > boosted twice, first time by uclamp_min, and second time by dvfs
> > headroom.
> 
> I see what you want to change here but:
> 
> So far we have `util -> uclamp -> map_util_to_perf()`

:-O

So when I set the system uclamp_max to 800 it will still run at max; and this
is normal?!!

> 
> which is fine when we see uclamp as an entity which constrains util, not
> the util after being mapped to a capacity constraint.

-ENOPARSE.


Cheers

--
Qais Yousef
Dietmar Eggemann Sept. 12, 2023, 2:34 p.m. UTC | #5
On 26/08/2023 22:08, Qais Yousef wrote:
> On 08/21/23 18:39, Dietmar Eggemann wrote:
>> On 20/08/2023 23:06, Qais Yousef wrote:
>>> DVFS headroom is applied after we calculate the effective_cpu_util()
>>> which is where we honour uclamp constraints. It makes more sense to
>>> apply the headroom there once and let all users naturally get the right
>>> thing without having to sprinkle the call around in various places.
>>
>> uclamp is applied in effective_cpu_util(..., FREQUENCY_UTIL, ...) which
>> IMHO currently has 2 power callers: (1) schedutil: sugov_get_util() and
>> (2) EAS: eenv_pd_max_util()
>>
>>> Before this fix running
>>>
>>> 	uclampset -M 800 cat /dev/zero > /dev/null
>>>
>>> Will cause the test system to run at max freq of 2.8GHz. After the fix
>>> it runs at 2.2GHz instead which is the correct value that matches the
>>> capacity of 800.
>>
>> IMHO, a system at util = 800 (w/o uclamp) would also run at 2.8Ghz since
>> we would call map_util_to_perf() on 800, no matter from where we call it.
> 
> Sorry, I would very strongly disagree here. What you're saying the effective
> range of uclamp_max is 800 and anything above that will always go to max. How
> can this be acceptable?

No that's not what I wanted to say here.

I wanted to highlight the different treatment of `(1) a task with
(natural) util = 800` and `(2) a task with uclamp_max = 800`.

(1) util = 800

util = (1.25 * 800 * (1024 - irq) / 1024 + ...

        <-      ->
        uclamped(cfs+rt+headroom(cfs+rt))


(2) uclamp_max = 800

util = (800 * (1024 - irq) / 1024 + ...

        <->
        uclamped(cfs+rt+headroom(cfs+rt))

So for (1) the scheduler would ask for more than in (2).

That's essentially the same question which was raised here:

https://lkml.kernel.org/r/CAKfTPtDY48jpO+b-2KXawzxh-ty+FMKX6YUXioNR7kpgO=ua6Q@mail.gmail.com

>>> Note that similar problem exist for uclamp_min. If util was 50, and
>>> uclamp_min is 100. Since we apply_dvfs_headroom() after apply uclamp
>>> constraints, we'll end up with util of 125 instead of 100. IOW, we get
>>> boosted twice, first time by uclamp_min, and second time by dvfs
>>> headroom.
>>
>> I see what you want to change here but:
>>
>> So far we have `util -> uclamp -> map_util_to_perf()`
> 
> :-O
> 
> So when I set the system uclamp_max to 800 it will still run at max; and this
> is normal?!!

No that's an issue (A) as well. But the diff between (1) and (2) is IMHO a
new issue introduced by this patch-set.

>> which is fine when we see uclamp as an entity which constrains util, not
>> the util after being mapped to a capacity constraint.
> 
> -ENOPARSE.

What I meant is 'clamping the util' before scheduler hands over to
'cpufreq' is fine:

    util -> uclamp -> map_util_to_perf()
               
    scheduler    -->|<-- cpufreq

I do understand that you guys are already discussing a new
cpufreq_give_me_freq_for_this_utilization_ctx() between EM and CPUfreq
in the other thread of this patch to maybe sort out (A).

[...]
Qais Yousef Sept. 16, 2023, 8:30 p.m. UTC | #6
On 09/12/23 16:34, Dietmar Eggemann wrote:

> No that's not what I wanted to say here.
> 
> I wanted to highlight the different treatment of `(1) a task with
> (natural) util = 800` and `(2) a task with uclamp_max = 800`.
> 
> (1) util = 800
> 
> util = (1.25 * 800 * (1024 - irq) / 1024 + ...
> 
>         <-      ->
>         uclamped(cfs+rt+headroom(cfs+rt))
> 
> 
> (2) uclamp_max = 800
> 
> util = (800 * (1024 - irq) / 1024 + ...
> 
>         <->
>         uclamped(cfs+rt+headroom(cfs+rt))
> 
> So for (1) the scheduler would ask for more than in (2).

Yes, which is the intention IMHO. If they behave the same, then uclamp_max is
not doing anything. And really this is the critical region in the system as
this is where the expensive OPPs lie.

> 
> That's essentially the same question which was raised here:
> 
> https://lkml.kernel.org/r/CAKfTPtDY48jpO+b-2KXawzxh-ty+FMKX6YUXioNR7kpgO=ua6Q@mail.gmail.com
> 
> >>> Note that similar problem exist for uclamp_min. If util was 50, and
> >>> uclamp_min is 100. Since we apply_dvfs_headroom() after apply uclamp
> >>> constraints, we'll end up with util of 125 instead of 100. IOW, we get
> >>> boosted twice, first time by uclamp_min, and second time by dvfs
> >>> headroom.
> >>
> >> I see what you want to change here but:
> >>
> >> So far we have `util -> uclamp -> map_util_to_perf()`
> > 
> > :-O
> > 
> > So when I set the system uclamp_max to 800 it will still run at max; and this
> > is normal?!!
> 
> No that's an issue (A) as well. But the diff between (1) and (2) is IMHO a
> new issue introduced by this patch-set.

It is also the problem to fix :-)

> 
> >> which is fine when we see uclamp as an entity which constrains util, not
> >> the util after being mapped to a capacity constraint.
> > 
> > -ENOPARSE.
> 
> What I meant is 'clamping the util' before scheduler hands over to
> 'cpufreq' is fine:
> 
>     util -> uclamp -> map_util_to_perf()
>                
>     scheduler    -->|<-- cpufreq
> 
> I do understand that you guys are already discussing a new
> cpufreq_give_me_freq_for_this_utilization_ctx() between EM and CPUfreq
> in the other thread of this patch to maybe sort out (A).

This will help with abstraction. But behavior wise, I still think that if
a task (or a cgroup, the system) has uclamp_max to 800, then we should not run
faster than this because of dvfs headroom.

If we don't do this then max performance is effectively mapped at anything at
800 and above. And likely less than that.

For example if

	util@OPP[Fmax-1] = 950

then any util value at

	util = 950 * 0.8 = 760

will saturate the system and run at Fmax, because all values from above 950
will map to Fmax, and because of the headroom this becomes all values at 760 or
above.

This mean from user's perspective uclamp_min and uclamp_max is from 0-760 on
that system. But the abstraction we're providing is that the performance range
is from 0-1024.

If uclamp_min is set to 512 and the task has a util of 100, why should we run
at 512*1.25?

Similarly if uclamp_max wants to cap the task/cgroup/system from consuming the
top resources, why it needs the 1.25 headroom? If there are two tasks running
on the CPU and one of them is not capped, then yeah the headroom will be
applied. But if the rq->uclamp[UCLAMP_MAX] = 800, then the vote from freq from
this CPU should not exceed this. I don't see what the headroom is for in this
case.

As another example, do you expect a task that has a util of 1024 but uclamp_max
= 800 to run at max or at an OPP equivalent to 800 which would be 2 or 3 below
Fmax usually? The latter is the expectation from the interface. Users shouldn't
need to know there's a dvfs headroom and cater for that when they set
uclamp_max or uclamp_min.


Thanks!

--
Qais Yousef