[RFC,v3,0/5] Add capacity capping support to the CPU controller

Message ID: 1488292722-19410-1-git-send-email-patrick.bellasi@arm.com

Message

Patrick Bellasi Feb. 28, 2017, 2:38 p.m. UTC
Was: SchedTune: central, scheduler-driven, power-performance control

This series presents a possible alternative design for what has been presented
in the past as SchedTune. This redesign addresses the main concerns and
comments collected in the LKML discussion [1] as well as at the last LPC [2].
The aim of this posting is to present a working prototype which implements
what has been discussed [2] with people like PeterZ, PaulT and TejunH.

The main differences with respect to the previous proposal [1] are:
 1. Task boosting/capping is now implemented as an extension on top of
    the existing CGroup CPU controller.
 2. The previous boosting strategy, based on inflating the CPU's
    utilization, has now been replaced by a simpler yet effective set
    of capacity constraints.

The proposed approach makes it possible to constrain the minimum and
maximum capacity of a CPU depending on the set of tasks currently
RUNNABLE on that CPU. The set of active constraints is tracked by the
core scheduler, thus the constraints apply across all scheduling
classes. The values of the constraints are used to clamp the CPU
utilization when the schedutil CPUFreq governor selects a frequency
for that CPU.
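
As a rough sketch of the mechanism (the helper names below are made up
for illustration; the actual integration lives in
kernel/sched/cpufreq_schedutil.c in this series), the clamping boils
down to:

  /*
   * Sketch: clamp the utilization seen by schedutil with the CPU's
   * currently active capacity constraints, before computing next_freq.
   * rq_capacity_min()/rq_capacity_max() are hypothetical accessors for
   * the constraints tracked by the core scheduler.
   */
  static unsigned long sugov_clamp_util(struct rq *rq, unsigned long util)
  {
          unsigned long cap_min = rq_capacity_min(rq);
          unsigned long cap_max = rq_capacity_max(rq);

          return clamp(util, cap_min, cap_max);
  }

The clamped value then feeds the usual utilization-to-frequency
mapping, so a capacity_min floor keeps the selected OPP from dropping
below it while the constrained tasks are RUNNABLE.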

This means that the newly proposed approach extends the concept of
task classification to frequency selection, thus allowing informed
run-times (e.g. Android, ChromeOS, etc.) to efficiently implement
different optimization policies such as:
 a) Boosting of important tasks, by enforcing a minimum capacity in the
    CPUs where they are enqueued for execution.
 b) Capping of background tasks, by enforcing a maximum capacity.
 c) Containment of OPPs for RT tasks which cannot easily be switched
    to the DL class, but still don't need to run at the maximum
    frequency.

The new approach has also been designed to be compliant with CGroups
v2 principles, such as support for a single hierarchy and the "limit"
resource distribution model (described in Documentation/cgroup-v2.txt).
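
To make the interface concrete, here is a minimal user-space sketch of
how an informed run-time could configure two groups; the attribute
names and the [0..1024] capacity scale are assumptions based on the
description above:

  #include <stdio.h>

  /* Sketch: write one capacity clamp attribute of a CPU controller
   * group; paths and attribute names are illustrative assumptions. */
  static void set_clamp(const char *group, const char *attr, int value)
  {
          char path[256];
          FILE *f;

          snprintf(path, sizeof(path), "/sys/fs/cgroup/cpu/%s/%s",
                   group, attr);
          f = fopen(path, "w");
          if (!f)
                  return;
          fprintf(f, "%d\n", value);
          fclose(f);
  }

  int main(void)
  {
          set_clamp("top-app", "cpu.capacity_min", 512);    /* boost */
          set_clamp("background", "cpu.capacity_max", 256); /* cap   */
          return 0;
  }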

A further extension of this idea is under development and will make it
possible to exploit the same capacity capping attributes, in
conjunction with the recently merged capacity awareness bits [3], to
achieve a more complete task boosting/capping strategy which is
completely scheduler driven and based on user-space defined task
classification.

The first three patches of this series introduce capacity_{min,max} tracking
in the core scheduler, as an extension of the CPU controller.
The fourth patch integrates capacity capping with schedutil for FAIR tasks,
while the last patch does the same for RT/DL tasks.

This series is based on top of today's tip/sched/core and is publicly available
from this repository:

  git://www.linux-arm.com/linux-pb eas/stune/rfcv3

Cheers Patrick

.:: References
[1] https://lkml.org/lkml/2016/10/27/503
[2] https://lkml.org/lkml/2016/11/25/342
[3] https://lkml.org/lkml/2016/10/14/312

Patrick Bellasi (5):
  sched/core: add capacity constraints to CPU controller
  sched/core: track CPU's capacity_{min,max}
  sched/core: sync capacity_{min,max} between slow and fast paths
  sched/{core,cpufreq_schedutil}: add capacity clamping for FAIR tasks
  sched/{core,cpufreq_schedutil}: add capacity clamping for RT/DL tasks

 include/linux/sched.h            |   3 +
 init/Kconfig                     |  17 ++
 kernel/sched/core.c              | 352 +++++++++++++++++++++++++++++++++++++++
 kernel/sched/cpufreq_schedutil.c |  83 ++++++++-
 kernel/sched/sched.h             |  31 ++++
 5 files changed, 478 insertions(+), 8 deletions(-)

-- 
2.7.4

Comments

Patrick Bellasi March 15, 2017, 11:40 a.m. UTC | #1
On 13-Mar 03:08, Joel Fernandes (Google) wrote:
> Hi Patrick,

> 

> On Tue, Feb 28, 2017 at 6:38 AM, Patrick Bellasi

> <patrick.bellasi@arm.com> wrote:

> > Currently schedutil enforce a maximum OPP when RT/DL tasks are RUNNABLE.

> > Such a mandatory policy can be made more tunable from userspace thus

> > allowing for example to define a reasonable max capacity (i.e.

> > frequency) which is required for the execution of a specific RT/DL

> > workload. This will contribute to make the RT class more "friendly" for

> > power/energy sensible applications.

> >

> > This patch extends the usage of capacity_{min,max} to the RT/DL classes.

> > Whenever a task in these classes is RUNNABLE, the capacity required is

> > defined by the constraints of the control group that task belongs to.

> >

> 

> We briefly discussed this at Linaro Connect that this works well for

> sporadic RT tasks that run briefly and then sleep for long periods of

> time - so certainly this patch is good, but its only a partial

> solution to the problem of frequent and short-sleepers and something

> is required to keep the boost active for short non-RUNNABLE as well.

> The behavior with many periodic RT tasks is that they will sleep for

> short intervals and run for short intervals periodically. In this case

> removing the clamp (or the boost as in schedtune v2) on a dequeue will

> essentially mean during a narrow window cpufreq can drop the frequency

> and only to make it go back up again.

> 

> Currently for schedtune v2, I am working on prototyping something like

> the following for Android:

> - if RT task is enqueue, introduce the boost.

> - When task is dequeued, start a timer for a  "minimum deboost delay

> time" before taking out the boost.

> - If task is enqueued again before the timer fires, then cancel the timer.

> 

> I don't think any "fix" to this particular issue should be to the

> schedutil governor and should be sorted before going to cpufreq itself

> (that is before making the request). What do you think about this?
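
For reference, the timer-based policy sketched in the quote above could
look roughly like this (set_rt_boost()/clear_rt_boost() and the delay
value are hypothetical, and the timer uses the timer_setup() style API):

  #include <linux/timer.h>
  #include <linux/jiffies.h>

  #define DEBOOST_DELAY_MS 10             /* hypothetical tunable */

  static struct timer_list deboost_timer;

  /* timer fired: the task stayed dequeued long enough, drop the boost */
  static void deboost_timer_fn(struct timer_list *t)
  {
          clear_rt_boost();               /* hypothetical helper */
  }

  static void rt_task_enqueued(void)
  {
          del_timer(&deboost_timer);      /* back in time: cancel deboost */
          set_rt_boost();                 /* hypothetical helper */
  }

  static void rt_task_dequeued(void)
  {
          /* hold the boost for a minimum delay after dequeue */
          mod_timer(&deboost_timer,
                    jiffies + msecs_to_jiffies(DEBOOST_DELAY_MS));
  }

  /* init: timer_setup(&deboost_timer, deboost_timer_fn, 0); */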


My short observations are:

1) for certain RT tasks, which have quite a "predictable" activation
   pattern, we should definitely try to use DEADLINE... which will
   factor out all "boosting potential races" since the bandwidth
   requirements are well defined at task description time.

2) CPU boosting is, at least for the time being, a best-effort feature
   which is introduced mainly for FAIR tasks.

3) Tracking the boost at enqueue/dequeue time matches the design
   of tracking features/properties of the currently RUNNABLE tasks,
   while avoiding the addition of yet another signal to track CPU
   utilization.

4) The previous point is about "separation of concerns"; thus IMHO any
   policy defining how to consume the CPU utilization signal
   (whether it is boosted or not) should be the responsibility of
   schedutil, which eventually does not exclude useful input from the
   scheduler.

5) I understand the usefulness of a scale-down threshold for schedutil
   to reduce the current OPP, while I don't get the point of a scale-up
   threshold. If the system is demanding more capacity and there
   are no HW constraints (e.g. pending changes) then we should go up
   as soon as possible.

Finally, I think we can improve the boosting issues you are having
with RT tasks quite a lot by further refining the schedutil thresholds
implementation.

We already have some patches pending review:
   https://lkml.org/lkml/2017/3/2/385
which fix some schedutil issues, and we will follow up with others
trying to improve the rate limiting so as not to compromise
responsiveness.


> Thanks,

> Joel


Cheers Patrick

-- 
#include <best/regards.h>

Patrick Bellasi
Rafael J. Wysocki March 15, 2017, 11:41 a.m. UTC | #2
On Tuesday, February 28, 2017 02:38:37 PM Patrick Bellasi wrote:
> Was: SchedTune: central, scheduler-driven, power-performance control

> 

> This series presents a possible alternative design for what has been presented

> in the past as SchedTune. This redesign has been defined to address the main

> concerns and comments collected in the LKML discussion [1] as well at the last

> LPC [2].

> The aim of this posting is to present a working prototype which implements

> what has been discussed [2] with people like PeterZ, PaulT and TejunH.

> 

> The main differences with respect to the previous proposal [1] are:

>  1. Task boosting/capping is now implemented as an extension on top of

>     the existing CGroup CPU controller.

>  2. The previous boosting strategy, based on the inflation of the CPU's

>     utilization, has been now replaced by a more simple yet effective set

>     of capacity constraints.

> 

> The proposed approach allows to constrain the minimum and maximum capacity

> of a CPU depending on the set of tasks currently RUNNABLE on that CPU.

> The set of active constraints are tracked by the core scheduler, thus they

> apply across all the scheduling classes. The value of the constraints are

> used to clamp the CPU utilization when the schedutil CPUFreq's governor

> selects a frequency for that CPU.

> 

> This means that the new proposed approach allows to extend the concept of

> tasks classification to frequencies selection, thus allowing informed

> run-times (e.g. Android, ChromeOS, etc.) to efficiently implement different

> optimization policies such as:

>  a) Boosting of important tasks, by enforcing a minimum capacity in the

>     CPUs where they are enqueued for execution.

>  b) Capping of background tasks, by enforcing a maximum capacity.

>  c) Containment of OPPs for RT tasks which cannot easily be switched to

>     the usage of the DL class, but still don't need to run at the maximum

>     frequency.


Do you have any practical examples of that, like for example what exactly
Android is going to use this for?

I gather that there is some experience with the current EAS implementation
there, so I wonder how this work is related to that.

Thanks,
Rafael
Joel Fernandes March 15, 2017, 12:59 p.m. UTC | #3
On Wed, Mar 15, 2017 at 4:40 AM, Patrick Bellasi
<patrick.bellasi@arm.com> wrote:
> On 13-Mar 03:08, Joel Fernandes (Google) wrote:

>> Hi Patrick,

>>

>> On Tue, Feb 28, 2017 at 6:38 AM, Patrick Bellasi

>> <patrick.bellasi@arm.com> wrote:

>> > Currently schedutil enforce a maximum OPP when RT/DL tasks are RUNNABLE.

>> > Such a mandatory policy can be made more tunable from userspace thus

>> > allowing for example to define a reasonable max capacity (i.e.

>> > frequency) which is required for the execution of a specific RT/DL

>> > workload. This will contribute to make the RT class more "friendly" for

>> > power/energy sensible applications.

>> >

>> > This patch extends the usage of capacity_{min,max} to the RT/DL classes.

>> > Whenever a task in these classes is RUNNABLE, the capacity required is

>> > defined by the constraints of the control group that task belongs to.

>> >

>>

>> We briefly discussed this at Linaro Connect that this works well for

>> sporadic RT tasks that run briefly and then sleep for long periods of

>> time - so certainly this patch is good, but its only a partial

>> solution to the problem of frequent and short-sleepers and something

>> is required to keep the boost active for short non-RUNNABLE as well.

>> The behavior with many periodic RT tasks is that they will sleep for

>> short intervals and run for short intervals periodically. In this case

>> removing the clamp (or the boost as in schedtune v2) on a dequeue will

>> essentially mean during a narrow window cpufreq can drop the frequency

>> and only to make it go back up again.

>>

>> Currently for schedtune v2, I am working on prototyping something like

>> the following for Android:

>> - if RT task is enqueue, introduce the boost.

>> - When task is dequeued, start a timer for a  "minimum deboost delay

>> time" before taking out the boost.

>> - If task is enqueued again before the timer fires, then cancel the timer.

>>

>> I don't think any "fix" to this particular issue should be to the

>> schedutil governor and should be sorted before going to cpufreq itself

>> (that is before making the request). What do you think about this?

>

> My short observations are:

>

> 1) for certain RT tasks, which have a quite "predictable" activation

>    pattern, we should definitively try to use DEADLINE... which will

>    factor out all "boosting potential races" since the bandwidth

>    requirements are well defined at task description time.


I don't immediately see how DEADLINE can fix this: when a task is
dequeued after the end of its current runtime, its bandwidth will be
subtracted from the active running bandwidth. This is what drives the
DL part of the capacity request. In this case, we run into the same
issue as with the boost removal on dequeue, don't we?

> 4) Previous point is about "separation of concerns", thus IMHO any

>    policy defining how to consume the CPU utilization signal

>    (whether it is boosted or not) should be responsibility of

>    schedutil, which eventually does not exclude useful input from the

>    scheduler.

>

> 5) I understand the usefulness of a scale down threshold for schedutil

>    to reduce the current OPP, while I don't get the point for a scale

>    up threshold. If the system is demanding more capacity and there

>    are not HW constrains (e.g. pending changes) then we should go up

>    as soon as possible.

>

> Finally, I think we can improve quite a lot the boosting issues you

> are having with RT tasks by better refining the schedutil thresholds

> implementation.

>

> We already have some patches pending for review:

>    https://lkml.org/lkml/2017/3/2/385

> which fixes some schedutil issue and we will follow up with others

> trying to improve the rate-limiting to not compromise responsiveness.


I agree we can try to explore fixing schedutil to do the right thing.

J.
Patrick Bellasi March 15, 2017, 12:59 p.m. UTC | #4
On 15-Mar 12:41, Rafael J. Wysocki wrote:
> On Tuesday, February 28, 2017 02:38:37 PM Patrick Bellasi wrote:

> > Was: SchedTune: central, scheduler-driven, power-performance control

> > 

> > This series presents a possible alternative design for what has been presented

> > in the past as SchedTune. This redesign has been defined to address the main

> > concerns and comments collected in the LKML discussion [1] as well at the last

> > LPC [2].

> > The aim of this posting is to present a working prototype which implements

> > what has been discussed [2] with people like PeterZ, PaulT and TejunH.

> > 

> > The main differences with respect to the previous proposal [1] are:

> >  1. Task boosting/capping is now implemented as an extension on top of

> >     the existing CGroup CPU controller.

> >  2. The previous boosting strategy, based on the inflation of the CPU's

> >     utilization, has been now replaced by a more simple yet effective set

> >     of capacity constraints.

> > 

> > The proposed approach allows to constrain the minimum and maximum capacity

> > of a CPU depending on the set of tasks currently RUNNABLE on that CPU.

> > The set of active constraints are tracked by the core scheduler, thus they

> > apply across all the scheduling classes. The value of the constraints are

> > used to clamp the CPU utilization when the schedutil CPUFreq's governor

> > selects a frequency for that CPU.

> > 

> > This means that the new proposed approach allows to extend the concept of

> > tasks classification to frequencies selection, thus allowing informed

> > run-times (e.g. Android, ChromeOS, etc.) to efficiently implement different

> > optimization policies such as:

> >  a) Boosting of important tasks, by enforcing a minimum capacity in the

> >     CPUs where they are enqueued for execution.

> >  b) Capping of background tasks, by enforcing a maximum capacity.

> >  c) Containment of OPPs for RT tasks which cannot easily be switched to

> >     the usage of the DL class, but still don't need to run at the maximum

> >     frequency.

> 

> Do you have any practical examples of that, like for example what exactly

> Android is going to use this for?


In general, every "informed run-time" usually knows quite a lot about
task requirements and how they impact the user experience.

In Android for example tasks are classified depending on their _current_
role. We can distinguish for example between:

- TOP_APP:    which are tasks currently affecting the UI, i.e. part of
              the app currently in foreground
- BACKGROUND: which are tasks not directly impacting the user
              experience

Given this information, it can make sense to adopt different
service/optimization policies for different tasks.
For example, we may be interested in giving maximum responsiveness to
TOP_APP tasks, while we still want to be able to save as much energy
as possible for the BACKGROUND tasks.

That's where the proposal in this series (partially) comes in handy.

What we propose is a "standard" interface to collect sensible
information from "informed run-times" which can be used to:

a) classify tasks according to the main optimization goals:
   performance boosting vs energy saving

b) support a more dynamic tuning of kernel side behaviors, mainly
   OPPs selection and tasks placement

Regarding this last point, this series specifically represents a
proposal for the integration with schedutil. The main usages we are
looking for in Android are:

a) Boosting the OPP selected for certain critical tasks, with the goal
   of speeding up their completion regardless of (potential) energy
   impacts. A kind-of "race-to-idle" policy for certain tasks.

b) Capping the OPP selection for certain non-critical tasks, which is
   a major concern especially for RT tasks in the mobile context, but
   it also applies to FAIR tasks representing background activities.

> I gather that there is some experience with the current EAS implementation

> there, so I wonder how this work is related to that.


You're right. We started developing a task boosting strategy a couple
of years ago. The first implementation we did is what is currently in
use by the EAS version running on Pixel smartphones.

Since the beginning our attitude has always been "mainline first".
However, we found it extremely valuable to prove both the interface's
design and the feature's benefits on real devices. That's why we keep
backporting these bits to different Android kernels.

Google, whose primary representatives are in CC, is also quite focused
on using mainline solutions for its current and future products.
That's why, after the release of the Pixel devices at the end of last
year, we refreshed and posted the proposal on LKML [1] and collected a
first round of valuable feedback at the last LPC [2].

This posting is an expression of the feedback collected so far, and
the main goals for us are to:
1) validate once more the soundness of a scheduler-driven run-time
   power-performance control which is based on information collected
   from informed run-times
2) get an agreement on whether the current interface can be considered
   sufficiently "mainline friendly" to have a chance of being merged
3) rework/refactor what is required if point 2 is not (yet) satisfied

It's worth noticing that these bits are completely independent of EAS.
OPP biasing (i.e. capping/boosting) is a feature which stands by itself
and can be quite useful in many different scenarios where EAS is not
used at all. A simple example is making schedutil behave concurrently
like the powersave governor for certain tasks and like the performance
governor for other tasks.

As a final remark, this series is going to be a discussion topic at
the upcoming OSPM summit [3]. It would be nice if we could get there
with sufficient shared knowledge of the main goals and the current
status. However, please let's keep discussing all the possible
concerns that can be raised about this proposal here.

> Thanks,

> Rafael


Cheers Patrick

[1] https://lkml.org/lkml/2016/10/27/503
[2] https://lkml.org/lkml/2016/11/25/342
[3] http://retis.sssup.it/ospm-summit/

-- 
#include <best/regards.h>

Patrick Bellasi
Juri Lelli March 15, 2017, 2:44 p.m. UTC | #5
Hi Joel,

On 15/03/17 05:59, Joel Fernandes wrote:
> On Wed, Mar 15, 2017 at 4:40 AM, Patrick Bellasi

> <patrick.bellasi@arm.com> wrote:

> > On 13-Mar 03:08, Joel Fernandes (Google) wrote:

> >> Hi Patrick,

> >>

> >> On Tue, Feb 28, 2017 at 6:38 AM, Patrick Bellasi

> >> <patrick.bellasi@arm.com> wrote:

> >> > Currently schedutil enforce a maximum OPP when RT/DL tasks are RUNNABLE.

> >> > Such a mandatory policy can be made more tunable from userspace thus

> >> > allowing for example to define a reasonable max capacity (i.e.

> >> > frequency) which is required for the execution of a specific RT/DL

> >> > workload. This will contribute to make the RT class more "friendly" for

> >> > power/energy sensible applications.

> >> >

> >> > This patch extends the usage of capacity_{min,max} to the RT/DL classes.

> >> > Whenever a task in these classes is RUNNABLE, the capacity required is

> >> > defined by the constraints of the control group that task belongs to.

> >> >

> >>

> >> We briefly discussed this at Linaro Connect that this works well for

> >> sporadic RT tasks that run briefly and then sleep for long periods of

> >> time - so certainly this patch is good, but its only a partial

> >> solution to the problem of frequent and short-sleepers and something

> >> is required to keep the boost active for short non-RUNNABLE as well.

> >> The behavior with many periodic RT tasks is that they will sleep for

> >> short intervals and run for short intervals periodically. In this case

> >> removing the clamp (or the boost as in schedtune v2) on a dequeue will

> >> essentially mean during a narrow window cpufreq can drop the frequency

> >> and only to make it go back up again.

> >>

> >> Currently for schedtune v2, I am working on prototyping something like

> >> the following for Android:

> >> - if RT task is enqueue, introduce the boost.

> >> - When task is dequeued, start a timer for a  "minimum deboost delay

> >> time" before taking out the boost.

> >> - If task is enqueued again before the timer fires, then cancel the timer.

> >>

> >> I don't think any "fix" to this particular issue should be to the

> >> schedutil governor and should be sorted before going to cpufreq itself

> >> (that is before making the request). What do you think about this?

> >

> > My short observations are:

> >

> > 1) for certain RT tasks, which have a quite "predictable" activation

> >    pattern, we should definitively try to use DEADLINE... which will

> >    factor out all "boosting potential races" since the bandwidth

> >    requirements are well defined at task description time.

> 

> I don't immediately see how deadline can fix this, when a task is

> dequeued after end of its current runtime, its bandwidth will be

> subtracted from the active running bandwidth. This is what drives the

> DL part of the capacity request. In this case, we run into the same

> issue as with the boost-removal on dequeue. Isn't it?

> 


Unfortunately, I still have to post the set of patches (based on Luca's
reclaiming set) that introduce driving of the clock frequency from
DEADLINE, so I guess any discussion about how DEADLINE might help here
may be difficult to follow. :(

I should definitely fix that.

However, trying to quickly summarize how that would work (for those
already somewhat familiar with the reclaiming bits):

 - a task utilization contribution is accounted for (at rq level) as
   soon as it wakes up for the first time in a new period
 - its contribution is then removed after the 0lag time (or when the
   task gets throttled)
 - frequency transitions are triggered accordingly

So, I don't see why triggering a go-down request after the 0lag time
has expired, and quickly reacting to tasks waking up, would create
problems in your case?
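
For reference, based on the CBS lag definition, the 0lag time mentioned
above can be sketched as:

  t_0lag = deadline - remaining_runtime / U,  with U = dl_runtime / dl_period

i.e. the instant at which the task's remaining runtime, served at its
reserved bandwidth U, would be exhausted; the exact expression used by
the reclaiming patches may differ.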

Thanks,

- Juri
Joel Fernandes March 15, 2017, 4:13 p.m. UTC | #6
On Wed, Mar 15, 2017 at 7:44 AM, Juri Lelli <juri.lelli@arm.com> wrote:
> Hi Joel,

>

> On 15/03/17 05:59, Joel Fernandes wrote:

>> On Wed, Mar 15, 2017 at 4:40 AM, Patrick Bellasi

>> <patrick.bellasi@arm.com> wrote:

>> > On 13-Mar 03:08, Joel Fernandes (Google) wrote:

>> >> Hi Patrick,

>> >>

>> >> On Tue, Feb 28, 2017 at 6:38 AM, Patrick Bellasi

>> >> <patrick.bellasi@arm.com> wrote:

>> >> > Currently schedutil enforce a maximum OPP when RT/DL tasks are RUNNABLE.

>> >> > Such a mandatory policy can be made more tunable from userspace thus

>> >> > allowing for example to define a reasonable max capacity (i.e.

>> >> > frequency) which is required for the execution of a specific RT/DL

>> >> > workload. This will contribute to make the RT class more "friendly" for

>> >> > power/energy sensible applications.

>> >> >

>> >> > This patch extends the usage of capacity_{min,max} to the RT/DL classes.

>> >> > Whenever a task in these classes is RUNNABLE, the capacity required is

>> >> > defined by the constraints of the control group that task belongs to.

>> >> >

>> >>

>> >> We briefly discussed this at Linaro Connect that this works well for

>> >> sporadic RT tasks that run briefly and then sleep for long periods of

>> >> time - so certainly this patch is good, but its only a partial

>> >> solution to the problem of frequent and short-sleepers and something

>> >> is required to keep the boost active for short non-RUNNABLE as well.

>> >> The behavior with many periodic RT tasks is that they will sleep for

>> >> short intervals and run for short intervals periodically. In this case

>> >> removing the clamp (or the boost as in schedtune v2) on a dequeue will

>> >> essentially mean during a narrow window cpufreq can drop the frequency

>> >> and only to make it go back up again.

>> >>

>> >> Currently for schedtune v2, I am working on prototyping something like

>> >> the following for Android:

>> >> - if RT task is enqueue, introduce the boost.

>> >> - When task is dequeued, start a timer for a  "minimum deboost delay

>> >> time" before taking out the boost.

>> >> - If task is enqueued again before the timer fires, then cancel the timer.

>> >>

>> >> I don't think any "fix" to this particular issue should be to the

>> >> schedutil governor and should be sorted before going to cpufreq itself

>> >> (that is before making the request). What do you think about this?

>> >

>> > My short observations are:

>> >

>> > 1) for certain RT tasks, which have a quite "predictable" activation

>> >    pattern, we should definitively try to use DEADLINE... which will

>> >    factor out all "boosting potential races" since the bandwidth

>> >    requirements are well defined at task description time.

>>

>> I don't immediately see how deadline can fix this, when a task is

>> dequeued after end of its current runtime, its bandwidth will be

>> subtracted from the active running bandwidth. This is what drives the

>> DL part of the capacity request. In this case, we run into the same

>> issue as with the boost-removal on dequeue. Isn't it?

>>

>

> Unfortunately, I still have to post the set of patches (based on Luca's

> reclaiming set) that introduces driving of clock frequency from

> DEADLINE, so I guess everything we can discuss about how DEADLINE might

> help here might be difficult to understand. :(

>

> I should definitely fix that.


I fully understand. Sorry to be discussing this too soon here...

> However, trying to quickly summarize how that would work (for who is

> already somewhat familiar with reclaiming bits):

>

>  - a task utilization contribution is accounted for (at rq level) as

>    soon as it wakes up for the first time in a new period

>  - its contribution is then removed after the 0lag time (or when the

>    task gets throttled)

>  - frequency transitions are triggered accordingly

>

> So, I don't see why triggering a go down request after the 0lag time

> expired and quickly reacting to tasks waking up would have create

> problems in your case?


In my experience, the 'reacting to tasks' bit doesn't work very well.
For short-running periodic tasks, we need to set the frequency to
something and not ramp it down too quickly (for ex., runtime 1.5ms and
period 3ms). In this case the 0-lag time would be < 3ms. I guess if
we're going to use the 0-lag time, then we'd need to set the runtime
and period to be higher than exactly matching the task's? So would we
be assigning the same bandwidth but for R/T instead of r/t (where r, R
are the runtimes and t, T are the periods, and R > r and T > t)?
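
Working those numbers through the 0-lag sketch above (illustrative
values only): r = 1.5ms, t = 3ms gives U = 0.5, and a task that blocks
right after exhausting its runtime has remaining_runtime = 0, so its
contribution is dropped at its current deadline, at most 3ms after
activation. Scaling to R = 15ms, T = 30ms keeps U = 0.5 but stretches
that horizon tenfold, which is one way to read the R/T suggestion.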

Thanks,
Joel
Joel Fernandes March 15, 2017, 11:40 p.m. UTC | #7
On Wed, Mar 15, 2017 at 9:24 AM, Juri Lelli <juri.lelli@arm.com> wrote:
[..]
>

>> > However, trying to quickly summarize how that would work (for who is

>> > already somewhat familiar with reclaiming bits):

>> >

>> >  - a task utilization contribution is accounted for (at rq level) as

>> >    soon as it wakes up for the first time in a new period

>> >  - its contribution is then removed after the 0lag time (or when the

>> >    task gets throttled)

>> >  - frequency transitions are triggered accordingly

>> >

>> > So, I don't see why triggering a go down request after the 0lag time

>> > expired and quickly reacting to tasks waking up would have create

>> > problems in your case?

>>

>> In my experience, the 'reacting to tasks' bit doesn't work very well.

>

> Humm.. but in this case we won't be 'reacting', we will be

> 'anticipating' tasks' needs, right?


Are you saying we will start ramping frequency before the next
activation so that we're ready for it?

If not, it sounds like it will only make the frequency request on the
next activation, when the active bandwidth increases due to the task
waking up. By then the task has already started to run, right?

Thanks,
Joel
Rafael J. Wysocki March 16, 2017, 1:04 a.m. UTC | #8
On Wed, Mar 15, 2017 at 1:59 PM, Patrick Bellasi
<patrick.bellasi@arm.com> wrote:
> On 15-Mar 12:41, Rafael J. Wysocki wrote:

>> On Tuesday, February 28, 2017 02:38:37 PM Patrick Bellasi wrote:

>> > Was: SchedTune: central, scheduler-driven, power-performance control

>> >

>> > This series presents a possible alternative design for what has been presented

>> > in the past as SchedTune. This redesign has been defined to address the main

>> > concerns and comments collected in the LKML discussion [1] as well at the last

>> > LPC [2].

>> > The aim of this posting is to present a working prototype which implements

>> > what has been discussed [2] with people like PeterZ, PaulT and TejunH.

>> >

>> > The main differences with respect to the previous proposal [1] are:

>> >  1. Task boosting/capping is now implemented as an extension on top of

>> >     the existing CGroup CPU controller.

>> >  2. The previous boosting strategy, based on the inflation of the CPU's

>> >     utilization, has been now replaced by a more simple yet effective set

>> >     of capacity constraints.

>> >

>> > The proposed approach allows to constrain the minimum and maximum capacity

>> > of a CPU depending on the set of tasks currently RUNNABLE on that CPU.

>> > The set of active constraints are tracked by the core scheduler, thus they

>> > apply across all the scheduling classes. The value of the constraints are

>> > used to clamp the CPU utilization when the schedutil CPUFreq's governor

>> > selects a frequency for that CPU.

>> >

>> > This means that the new proposed approach allows to extend the concept of

>> > tasks classification to frequencies selection, thus allowing informed

>> > run-times (e.g. Android, ChromeOS, etc.) to efficiently implement different

>> > optimization policies such as:

>> >  a) Boosting of important tasks, by enforcing a minimum capacity in the

>> >     CPUs where they are enqueued for execution.

>> >  b) Capping of background tasks, by enforcing a maximum capacity.

>> >  c) Containment of OPPs for RT tasks which cannot easily be switched to

>> >     the usage of the DL class, but still don't need to run at the maximum

>> >     frequency.

>>

>> Do you have any practical examples of that, like for example what exactly

>> Android is going to use this for?

>

> In general, every "informed run-time" usually know quite a lot about

> tasks requirements and how they impact the user experience.

>

> In Android for example tasks are classified depending on their _current_

> role. We can distinguish for example between:

>

> - TOP_APP:    which are tasks currently affecting the UI, i.e. part of

>               the app currently in foreground

> - BACKGROUND: which are tasks not directly impacting the user

>               experience

>

> Given these information it could make sense to adopt different

> service/optimization policy for different tasks.

> For example, we can be interested in

> giving maximum responsiveness to TOP_APP tasks while we still want to

> be able to save as much energy as possible for the BACKGROUND tasks.

>

> That's where the proposal in this series (partially) comes on hand.


A question: Does "responsiveness" translate directly to "capacity" somehow?

Moreover, how exactly is "responsiveness" defined?

> What we propose is a "standard" interface to collect sensible

> information from "informed run-times" which can be used to:

>

> a) classify tasks according to the main optimization goals:

>    performance boosting vs energy saving

>

> b) support a more dynamic tuning of kernel side behaviors, mainly

>    OPPs selection and tasks placement

>

> Regarding this last point, this series specifically represents a

> proposal for the integration with schedutil. The main usages we are

> looking for in Android are:

>

> a) Boosting the OPP selected for certain critical tasks, with the goal

>    to speed-up their completion regardless of (potential) energy impacts.

>    A kind-of "race-to-idle" policy for certain tasks.


It looks like this could be addressed by adding a "this task should
race to idle" flag too.

> b) Capping the OPP selection for certain non critical tasks, which is

>    a major concerns especially for RT tasks in mobile context, but

>    it also apply to FAIR tasks representing background activities.


Well, is the information on how much CPU capacity to assign to those
tasks really there in user space?  What's the source of it if so?

>> I gather that there is some experience with the current EAS implementation

>> there, so I wonder how this work is related to that.

>

> You right. We started developing a task boosting strategy a couple of

> years ago. The first implementation we did is what is currently in use

> by the EAS version in used on Pixel smartphones.

>

> Since the beginning our attitude has always been "mainline first".

> However, we found it extremely valuable to proof both interface's

> design and feature's benefits on real devices. That's why we keep

> backporting these bits on different Android kernels.

>

> Google, which primary representatives are in CC, is also quite focused

> on using mainline solutions for their current and future solutions.

> That's why, after the release of the Pixel devices end of last year,

> we refreshed and posted the proposal on LKML [1] and collected a first

> run of valuable feedbacks at LCP [2].


Thanks for the info, but my question was more about how it was related
from the technical angle.  IOW, there surely is some experience
related to how user space can deal with energy problems and I would
expect that experience to be an important factor in designing a kernel
interface for that user space, so I wonder if any particular needs of
the Android user space are addressed here.

I'm not intimately familiar with Android, so I guess I would like to
be educated somewhat on that. :-)

> This posting is an expression of the feedbacks collected so far and

> the main goal for us are:

> 1) validate once more the soundness of a scheduler-driven run-time

>    power-performance control which is based on information collected

>    from informed run-time

> 2) get an agreement on whether the current interface can be considered

>    sufficiently "mainline friendly" to have a chance to get merged

> 3) rework/refactor what is required if point 2 is not (yet) satisfied


My definition of "mainline friendly" may be different from someone
else's, but I usually want to know two things:
 1. What problem exactly is at hand.
 2. What alternative ways of addressing it have been considered and
why the particular one proposed has been chosen over the other ones.

At the moment I don't feel like I have enough information in both aspects.

For example, if you said "Android wants to do XYZ because of ABC and
that's how we want to make that possible, and it also could be done in
the other GHJ ways, but they are not attractive and here's why etc"
that would help quite a bit from my POV.

> It's worth to notice that these bits are completely independent from

> EAS. OPP biasing (i.e. capping/boosting) is a feature which stand by

> itself and it can be quite useful in many different scenarios where

> EAS is not used at all. A simple example is making schedutil to behave

> concurrently like the powersave governor for certain tasks and the

> performance governor for other tasks.


That's fine in theory, but honestly an interface like this will be a
maintenance burden, and adding it just because it may be useful to
somebody doesn't sound serious enough.

IOW, I'd like to be able to say "This is going to be used by user
space X to do A and that's how etc" if somebody asks me about that,
which honestly I can't at this point.

>

> As a final remark, this series is going to be a discussion topic in

> the upcoming OSPM summit [3]. It would be nice if we can get there

> with a sufficient knowledge of the main goals and the current status.


I'm not sure what you mean here, sorry.

> However, please let's keep discussing here about all the possible

> concerns which can be raised about this proposal.


OK

Thanks,
Rafael
Joel Fernandes March 16, 2017, 3:15 a.m. UTC | #9
Hi Rafael,

On Wed, Mar 15, 2017 at 6:04 PM, Rafael J. Wysocki <rafael@kernel.org> wrote:
> On Wed, Mar 15, 2017 at 1:59 PM, Patrick Bellasi

>>> Do you have any practical examples of that, like for example what exactly

>>> Android is going to use this for?

>>

>> In general, every "informed run-time" usually know quite a lot about

>> tasks requirements and how they impact the user experience.

>>

>> In Android for example tasks are classified depending on their _current_

>> role. We can distinguish for example between:

>>

>> - TOP_APP:    which are tasks currently affecting the UI, i.e. part of

>>               the app currently in foreground

>> - BACKGROUND: which are tasks not directly impacting the user

>>               experience

>>

>> Given these information it could make sense to adopt different

>> service/optimization policy for different tasks.

>> For example, we can be interested in

>> giving maximum responsiveness to TOP_APP tasks while we still want to

>> be able to save as much energy as possible for the BACKGROUND tasks.

>>

>> That's where the proposal in this series (partially) comes on hand.

>

> A question: Does "responsiveness" translate directly to "capacity" somehow?

>

> Moreover, how exactly is "responsiveness" defined?


Responsiveness is basically how quickly the UI responds to user
interaction after doing its computation, application logic and
rendering. Android apps have two important threads: the main thread (or
UI thread), which does all the work and computation for the app, and a
Render thread, which does the rendering and submission of frames to the
display pipeline for further composition and display.

We wish to bias towards performance rather than energy for this work,
since it is front-facing to the user and we don't care much about
energy for these tasks at this point; what's most critical is
completion as quickly as possible so the user experience doesn't
suffer from a noticeable performance issue.

One metric to define this is "jank", where we drop frames and aren't
able to render on time. One of the reasons this can happen is that the
main thread (UI thread) took longer than expected for some
computation. Whatever the interface, we'd just like to bias the
scheduling and frequency guidance to be more concerned with
performance and less with energy, and use this information for both
frequency selection and task placement. 'What we need' is also app
dependent, since every app has its own main thread and is free to
compute whatever it needs. So Android can't estimate this, but we do
know that the app is user facing, so in broad terms the interface is
used to say "please don't sacrifice performance for these top-apps",
without accurately defining what these performance needs really are,
because we don't know them.
For the YouTube app, for example, the complexity of the video decoding
and the frame rate are very variable, depending on the encoding scheme
and the video being played. The flushing of the frames through the
display pipeline is also variable (the frame rate depends on the video
being decoded), so this work is variable and we can't say for sure, in
definitive terms, how much capacity we need.

What we can do with Patrick's work is take the worst case based on
measurements and specify, say, that we need at least this much capacity
regardless of what load tracking thinks we need, and then we can scale
frequency accordingly. This is the use case for the minimum capacity in
his clamping patch. This is still not perfect in terms of defining
something accurately, because we don't even know how much we need, but
at least in broad terms we have some way of telling the governor to
maintain at least X capacity.
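
As a rough worked example (assuming capacities on the kernel's
[0..1024] SCHED_CAPACITY_SCALE and a roughly linear capacity to
frequency mapping): capacity_min = 717 on a CPU whose top OPP is
2.0 GHz would keep schedutil from selecting below about
717/1024 * 2.0 GHz ~= 1.4 GHz, no matter how small the tracked
utilization gets.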

For the clamping of maximum capacity, there are use cases like the
background tasks Patrick mentioned, but also use cases where we don't
want to run at max frequency even though load tracking thinks we need
to. For example, for foreground camera tasks we want to provide
sustainable performance without entering thermal throttling, so the
capping will help there.

>> What we propose is a "standard" interface to collect sensible

>> information from "informed run-times" which can be used to:

>>

>> a) classify tasks according to the main optimization goals:

>>    performance boosting vs energy saving

>>

>> b) support a more dynamic tuning of kernel side behaviors, mainly

>>    OPPs selection and tasks placement

>>

>> Regarding this last point, this series specifically represents a

>> proposal for the integration with schedutil. The main usages we are

>> looking for in Android are:

>>

>> a) Boosting the OPP selected for certain critical tasks, with the goal

>>    to speed-up their completion regardless of (potential) energy impacts.

>>    A kind-of "race-to-idle" policy for certain tasks.

>

> It looks like this could be addressed by adding a "this task should

> race to idle" flag too.


But he said 'kind-of' race-to-idle. Racing to idle all the time, for
example at max frequency, would be wasteful of energy; so although we
don't care about energy much for top-apps, we do care a bit.

>

>> b) Capping the OPP selection for certain non critical tasks, which is

>>    a major concerns especially for RT tasks in mobile context, but

>>    it also apply to FAIR tasks representing background activities.

>

> Well, is the information on how much CPU capacity assign to those

> tasks really there in user space?  What's the source of it if so?


I believe this is just a matter of tuning and modeling what is needed:
for example, to prevent thermal throttling as I mentioned, and also to
ensure background activities aren't running at the highest frequency
and consuming excessive energy. Racing to idle at a higher frequency is
more expensive in energy than running slower to idle, since we run at
higher voltages at higher frequencies and the slope of the perf/W curve
is steeper: p = c * V^2 * F. The V component being higher just drains
more power, quadratically, which is of no use to background tasks; in
fact, in some tests we're just as happy setting them at much lower
frequencies than what load tracking thinks is needed.
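
Plugging illustrative numbers into p = c * V^2 * F: moving from 1 GHz
at 0.8 V to 2 GHz at 1.0 V scales power by 2 * (1.0/0.8)^2 ~= 3.1x
while throughput only doubles, so energy per unit of work grows by
roughly 56%; hence running background work at the lower OPP and idling
longer can still come out ahead.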

>>> I gather that there is some experience with the current EAS implementation

>>> there, so I wonder how this work is related to that.

>>

>> You right. We started developing a task boosting strategy a couple of

>> years ago. The first implementation we did is what is currently in use

>> by the EAS version in used on Pixel smartphones.

>>

>> Since the beginning our attitude has always been "mainline first".

>> However, we found it extremely valuable to proof both interface's

>> design and feature's benefits on real devices. That's why we keep

>> backporting these bits on different Android kernels.

>>

>> Google, which primary representatives are in CC, is also quite focused

>> on using mainline solutions for their current and future solutions.

>> That's why, after the release of the Pixel devices end of last year,

>> we refreshed and posted the proposal on LKML [1] and collected a first

>> run of valuable feedbacks at LCP [2].

>

> Thanks for the info, but my question was more about how it was related

> from the technical angle.  IOW, there surely is some experience

> related to how user space can deal with energy problems and I would

> expect that experience to be an important factor in designing a kernel

> interface for that user space, so I wonder if any particular needs of

> the Android user space are addressed here.

>

> I'm not intimately familiar with Android, so I guess I would like to

> be educated somewhat on that. :-)


Hope this sheds some light on the Android side of things.

Regards,
Joel
Juri Lelli March 16, 2017, 11:16 a.m. UTC | #10
On 15/03/17 16:40, Joel Fernandes wrote:
> On Wed, Mar 15, 2017 at 9:24 AM, Juri Lelli <juri.lelli@arm.com> wrote:

> [..]

> >

> >> > However, trying to quickly summarize how that would work (for who is

> >> > already somewhat familiar with reclaiming bits):

> >> >

> >> >  - a task utilization contribution is accounted for (at rq level) as

> >> >    soon as it wakes up for the first time in a new period

> >> >  - its contribution is then removed after the 0lag time (or when the

> >> >    task gets throttled)

> >> >  - frequency transitions are triggered accordingly

> >> >

> >> > So, I don't see why triggering a go down request after the 0lag time

> >> > expired and quickly reacting to tasks waking up would have create

> >> > problems in your case?

> >>

> >> In my experience, the 'reacting to tasks' bit doesn't work very well.

> >

> > Humm.. but in this case we won't be 'reacting', we will be

> > 'anticipating' tasks' needs, right?

> 

> Are you saying we will start ramping frequency before the next

> activation so that we're ready for it?

> 


I'm saying that there is no need to ramp; simply select the frequency
that is needed for a task (or a set of them).

> If not, it sounds like it will only make the frequency request on the

> next activation when the Active bandwidth increases due to the task

> waking up. By then task has already started to run, right?

> 


When the task is enqueued back we select the frequency considering its
bandwidth request (and the bandwidth/utilization of the others). So,
when it actually starts running it will already have enough capacity to
finish in time.
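
In other words, a sketch of the selection rule (the exact scaling in
the pending patches may differ):

  f_req >= f_max * sum_i(runtime_i / period_i)

computed over the currently active DEADLINE tasks, so enough capacity
is in place by the time the woken task actually runs.
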
Patrick Bellasi March 16, 2017, 12:27 p.m. UTC | #11
On 16-Mar 11:16, Juri Lelli wrote:
> On 15/03/17 16:40, Joel Fernandes wrote:

> > On Wed, Mar 15, 2017 at 9:24 AM, Juri Lelli <juri.lelli@arm.com> wrote:

> > [..]

> > >

> > >> > However, trying to quickly summarize how that would work (for who is

> > >> > already somewhat familiar with reclaiming bits):

> > >> >

> > >> >  - a task utilization contribution is accounted for (at rq level) as

> > >> >    soon as it wakes up for the first time in a new period

> > >> >  - its contribution is then removed after the 0lag time (or when the

> > >> >    task gets throttled)

> > >> >  - frequency transitions are triggered accordingly

> > >> >

> > >> > So, I don't see why triggering a go down request after the 0lag time

> > >> > expired and quickly reacting to tasks waking up would have create

> > >> > problems in your case?

> > >>

> > >> In my experience, the 'reacting to tasks' bit doesn't work very well.

> > >

> > > Humm.. but in this case we won't be 'reacting', we will be

> > > 'anticipating' tasks' needs, right?

> > 

> > Are you saying we will start ramping frequency before the next

> > activation so that we're ready for it?

> > 

> 

> I'm saying that there is no need to ramp, simply select the frequency

> that is needed for a task (or a set of them).

> 

> > If not, it sounds like it will only make the frequency request on the

> > next activation when the Active bandwidth increases due to the task

> > waking up. By then task has already started to run, right?

> > 

> 

> When the task is enqueued back we select the frequency considering its

> bandwidth request (and the bandwidth/utilization of the others). So,

> when it actually starts running it will already have enough capacity to

> finish in time.


Here we are factoring out the time required to actually switch to the
required OPP. I think Joel was referring to this time.

That time cannot really be eliminated other than by having faster OPP
switching HW support. Still, jumping straight to the "optimal" OPP
instead of ramping up is a big improvement.


-- 
#include <best/regards.h>

Patrick Bellasi
Juri Lelli March 16, 2017, 12:44 p.m. UTC | #12
On 16/03/17 12:27, Patrick Bellasi wrote:
> On 16-Mar 11:16, Juri Lelli wrote:

> > On 15/03/17 16:40, Joel Fernandes wrote:

> > > On Wed, Mar 15, 2017 at 9:24 AM, Juri Lelli <juri.lelli@arm.com> wrote:

> > > [..]

> > > >

> > > >> > However, trying to quickly summarize how that would work (for who is

> > > >> > already somewhat familiar with reclaiming bits):

> > > >> >

> > > >> >  - a task utilization contribution is accounted for (at rq level) as

> > > >> >    soon as it wakes up for the first time in a new period

> > > >> >  - its contribution is then removed after the 0lag time (or when the

> > > >> >    task gets throttled)

> > > >> >  - frequency transitions are triggered accordingly

> > > >> >

> > > >> > So, I don't see why triggering a go down request after the 0lag time

> > > >> > expired and quickly reacting to tasks waking up would have create

> > > >> > problems in your case?

> > > >>

> > > >> In my experience, the 'reacting to tasks' bit doesn't work very well.

> > > >

> > > > Humm.. but in this case we won't be 'reacting', we will be

> > > > 'anticipating' tasks' needs, right?

> > > 

> > > Are you saying we will start ramping frequency before the next

> > > activation so that we're ready for it?

> > > 

> > 

> > I'm saying that there is no need to ramp, simply select the frequency

> > that is needed for a task (or a set of them).

> > 

> > > If not, it sounds like it will only make the frequency request on the

> > > next activation when the Active bandwidth increases due to the task

> > > waking up. By then task has already started to run, right?

> > > 

> > 

> > When the task is enqueued back we select the frequency considering its

> > bandwidth request (and the bandwidth/utilization of the others). So,

> > when it actually starts running it will already have enough capacity to

> > finish in time.

> 

> Here we are factoring out the time required to actually switch to the

> required OPP. I think Joel was referring to this time.

> 


Right. But this is a HW limitation. It seems a problem that every
scheduler-driven decision will have to take into account. So, doesn't
it make more sense to let the driver (or the governor shim layer)
introduce some sort of hysteresis into frequency changes, if needed?

> That time cannot really be eliminated but from having faster OOP

> swiching HW support. Still, jumping strating to the "optimal" OPP

> instead of rumping up is a big improvement.

> 

> 

> -- 

> #include <best/regards.h>

> 

> Patrick Bellasi
Tejun Heo March 20, 2017, 2:51 p.m. UTC | #13
Hello, Patrick.

On Tue, Feb 28, 2017 at 02:38:37PM +0000, Patrick Bellasi wrote:
>  a) Boosting of important tasks, by enforcing a minimum capacity in the

>     CPUs where they are enqueued for execution.

>  b) Capping of background tasks, by enforcing a maximum capacity.

>  c) Containment of OPPs for RT tasks which cannot easily be switched to

>     the usage of the DL class, but still don't need to run at the maximum

>     frequency.


As this is something completely new, I think it'd be a great idea to
give a couple of concrete examples in the head message to help people
understand what it's for.

Thanks.

-- 
tejun
Peter Zijlstra April 12, 2017, 12:48 p.m. UTC | #14
On Tue, Apr 11, 2017 at 06:58:33PM +0100, Patrick Bellasi wrote:
> >     illustrated per your above points in that it affects both, while in

> >     fact it actually modifies another metric, namely util_avg.

> 

> I don't see it modifying in any direct way util_avg.


The point is that clamps called 'capacity' are applied to util. So while
you don't modify util directly, you do modify the util signal (for one
consumer).
Patrick Bellasi April 12, 2017, 1:24 p.m. UTC | #15
On 12-Apr 14:22, Peter Zijlstra wrote:
> On Tue, Apr 11, 2017 at 06:58:33PM +0100, Patrick Bellasi wrote:

> > Sorry, I don't get instead what are the "confusing nesting properties"

> > you are referring to?

> 

> If a parent group sets min=.2 and max=.8, what are the constraints on

> its child groups for setting their resp min and max?


Currently the logic I'm proposing enforces this:

a) capacity_max can only be reduced,
   because we accept that a child can be further constrained;
   for example:
   - a resource manager allocates a max capacity to an application
   - the application itself knows that some of its children are
     background tasks and they can be further constrained

b) capacity_min can only be increased,
   because we want to inhibit children from affecting overall
   performance; for example:
   - a resource manager allocates a minimum capacity to an application
   - the application itself cannot slow down some of its children
     without risking affecting other (unknown) external entities
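
A minimal sketch of the check these two rules imply (hypothetical
function and parameter names):

  /* Sketch: a child's clamps must stay within the parent's range;
   * rule a) caps child_max, rule b) floors child_min. Note that
   * child_min may still be raised up to parent_max. */
  static int validate_capacity_clamps(unsigned long parent_min,
                                      unsigned long parent_max,
                                      unsigned long child_min,
                                      unsigned long child_max)
  {
          if (child_min > child_max)
                  return -EINVAL;         /* malformed range */
          if (child_max > parent_max)
                  return -EINVAL;         /* a) max can only be reduced */
          if (child_min < parent_min)
                  return -EINVAL;         /* b) min can only be increased */
          return 0;
  }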

> I can't immediately gives rules that would make sense.


The second rule is trickier, but I see it better matching an overall
decomposition schema where a single resource manager is allocating a
capacity_min to two different entities (A and B) which are independent
but (only the manager knows) also cooperating.

Let's think about the Android run-time, which allocates resources to a
system service (entity A) which it knows has to interact with
a certain app (entity B).

The cooperation dependency can be resolved only by the resource
manager, by assigning capacity_min at entity-level CGroups.
Thus, an entity's subgroups should not be allowed to further reduce
this constraint without risking impacting an (unknown to them)
external entity.

> For instance, allowing a child to lower min would violate the parent

> constraint,


Quite likely we don't want this.

> while allowing a child to increase min would grant the child

> more resources than the parent.


But still within the capacity_max enforced by the parent.

We should always consider the pair (min, max); once a parent has
defined this range, it seems ok to me that a child can freely play
within that range.

Why should a child group not be allowed to set:

   capacity_min_child = capacity_max_parent

?


> Neither seem like a good thing.


-- 
#include <best/regards.h>

Patrick Bellasi
Peter Zijlstra April 12, 2017, 2:34 p.m. UTC | #16
On Wed, Apr 12, 2017 at 02:27:41PM +0100, Patrick Bellasi wrote:
> On 12-Apr 14:48, Peter Zijlstra wrote:

> > On Tue, Apr 11, 2017 at 06:58:33PM +0100, Patrick Bellasi wrote:

> > > >     illustrated per your above points in that it affects both, while in

> > > >     fact it actually modifies another metric, namely util_avg.

> > > 

> > > I don't see it modifying in any direct way util_avg.

> > 

> > The point is that clamps called 'capacity' are applied to util. So while

> > you don't modify util directly, you do modify the util signal (for one

> > consumer).

> 

> Right, but this consumer (i.e. schedutil) it's already translating

> the util_avg into a next_freq (which ultimately it's a capacity).

> 

> Thus, I don't see a big misfit in that code path to "filter" this

> translation with a capacity clamp.


Still strikes me as odd though.
Patrick Bellasi April 12, 2017, 2:43 p.m. UTC | #17
On 12-Apr 16:34, Peter Zijlstra wrote:
> On Wed, Apr 12, 2017 at 02:27:41PM +0100, Patrick Bellasi wrote:

> > On 12-Apr 14:48, Peter Zijlstra wrote:

> > > On Tue, Apr 11, 2017 at 06:58:33PM +0100, Patrick Bellasi wrote:

> > > > >     illustrated per your above points in that it affects both, while in

> > > > >     fact it actually modifies another metric, namely util_avg.

> > > > 

> > > > I don't see it modifying in any direct way util_avg.

> > > 

> > > The point is that clamps called 'capacity' are applied to util. So while

> > > you don't modify util directly, you do modify the util signal (for one

> > > consumer).

> > 

> > Right, but this consumer (i.e. schedutil) it's already translating

> > the util_avg into a next_freq (which ultimately it's a capacity).

> > 

> > Thus, I don't see a big misfit in that code path to "filter" this

> > translation with a capacity clamp.

> 

> Still strikes me as odd though.


Can you better elaborate on the why?

-- 
#include <best/regards.h>

Patrick Bellasi