[RFC,0/7] Introduce thermal pressure

Message ID	1539102302-9057-1-git-send-email-thara.gopinath@linaro.org
Headers	show Delivered-To: patch@linaro.org Received-SPF: pass (google.com: best guess record for domain of linux-pm-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; From: Thara Gopinath <thara.gopinath@linaro.org> To: linux-kernel@vger.kernel.org, mingo@redhat.com, peterz@infradead.org, rui.zhang@intel.com Cc: gregkh@linuxfoundation.org, rafael@kernel.org, amit.kachhap@gmail.com, viresh.kumar@linaro.org, javi.merino@kernel.org, edubezval@gmail.com, daniel.lezcano@linaro.org, linux-pm@vger.kernel.org, quentin.perret@arm.com, ionela.voinescu@arm.com, vincent.guittot@linaro.org Subject: [RFC PATCH 0/7] Introduce thermal pressure Date: Tue, 9 Oct 2018 12:24:55 -0400 Message-Id: <1539102302-9057-1-git-send-email-thara.gopinath@linaro.org> Sender: linux-pm-owner@vger.kernel.org Precedence: bulk
Series	Introduce thermal pressure \| expand [RFC,0/7] Introduce thermal pressure [RFC,2/7] sched/pelt.c: Add support to track thermal pressure [RFC,3/7] sched: Add infrastructure to store and update instantaneous thermal pressure [RFC,4/7] sched: Initialize per cpu thermal pressure structure [RFC,7/7] thermal/cpu-cooling: Update thermal pressure in case of a maximum frequency capping

Thara Gopinath Oct. 9, 2018, 4:24 p.m. UTC

Thermal governors can respond to an overheat event for a cpu by
capping the cpu's maximum possible frequency. This in turn
means that the maximum available compute capacity of the
cpu is restricted. But today in linux kernel, in event of maximum
frequency capping of a cpu, the maximum available compute
capacity of the cpu is not adjusted at all. In other words, scheduler
is unware maximum cpu capacity restrictions placed due to thermal
activity. This patch series attempts to address this issue.
The benefits identified are better task placement among available
cpus in event of overheating which in turn leads to better
performance numbers.

The delta between the maximum possible capacity of a cpu and
maximum available capacity of a cpu due to thermal event can
be considered as thermal pressure. Instantaneous thermal pressure
is hard to record and can sometime be erroneous as there can be mismatch
between the actual capping of capacity and scheduler recording it.
Thus solution is to have a weighted average per cpu value for thermal
pressure over time. The weight reflects the amount of time the cpu has
spent at a capped maximum frequency. To accumulate, average and
appropriately decay thermal pressure, this patch series uses pelt
signals and reuses the available framework that does a similar
bookkeeping of rt/dl task utilization.

Regarding testing, basic build, boot and sanity testing have been
performed on hikey960 mainline kernel with debian file system.
Further aobench (An occlusion renderer for benchmarking realworld
floating point performance) showed the following results on hikey960
with debain.

                                        Result          Standard        Standard
                                        (Time secs)     Error           Deviation
Hikey 960 - no thermal pressure applied 138.67          6.52            11.52%
Hikey 960 -  thermal pressure applied   122.37          5.78            11.57%

Thara Gopinath (7):
  sched/pelt: Add option to make load and util calculations frequency
    invariant
  sched/pelt.c: Add support to track thermal pressure
  sched: Add infrastructure to store and update instantaneous thermal
    pressure
  sched: Initialize per cpu thermal pressure structure
  sched/fair: Enable CFS periodic tick to update thermal pressure
  sched/fair: update cpu_capcity to reflect thermal pressure
  thermal/cpu-cooling: Update thermal pressure in case of a maximum
    frequency capping

 drivers/base/arch_topology.c  |  1 +
 drivers/thermal/cpu_cooling.c | 20 ++++++++++++-
 include/linux/sched.h         | 14 +++++++++
 kernel/sched/Makefile         |  2 +-
 kernel/sched/core.c           |  2 ++
 kernel/sched/fair.c           |  4 +++
 kernel/sched/pelt.c           | 40 ++++++++++++++++++--------
 kernel/sched/pelt.h           |  7 +++++
 kernel/sched/sched.h          |  1 +
 kernel/sched/thermal.c        | 66 +++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/thermal.h        | 13 +++++++++
 11 files changed, 157 insertions(+), 13 deletions(-)
 create mode 100644 kernel/sched/thermal.c
 create mode 100644 kernel/sched/thermal.h

-- 
2.1.4

Javi Merino Oct. 10, 2018, 5:44 a.m. UTC | #1

On Tue, Oct 09, 2018 at 12:24:55PM -0400, Thara Gopinath wrote:
> Thermal governors can respond to an overheat event for a cpu by

> capping the cpu's maximum possible frequency. This in turn

> means that the maximum available compute capacity of the

> cpu is restricted. But today in linux kernel, in event of maximum

> frequency capping of a cpu, the maximum available compute

> capacity of the cpu is not adjusted at all. In other words, scheduler

> is unware maximum cpu capacity restrictions placed due to thermal

> activity.


Interesting, I would have sworn that I tested this years ago by
lowering the maximum frequency of a cpufreq domain, and the scheduler
reacted accordingly to the new maximum capacities of the cpus.

>           This patch series attempts to address this issue.

> The benefits identified are better task placement among available

> cpus in event of overheating which in turn leads to better

> performance numbers.

> 

> The delta between the maximum possible capacity of a cpu and

> maximum available capacity of a cpu due to thermal event can

> be considered as thermal pressure. Instantaneous thermal pressure

> is hard to record and can sometime be erroneous as there can be mismatch

> between the actual capping of capacity and scheduler recording it.

> Thus solution is to have a weighted average per cpu value for thermal

> pressure over time. The weight reflects the amount of time the cpu has

> spent at a capped maximum frequency. To accumulate, average and

> appropriately decay thermal pressure, this patch series uses pelt

> signals and reuses the available framework that does a similar

> bookkeeping of rt/dl task utilization.

> 

> Regarding testing, basic build, boot and sanity testing have been

> performed on hikey960 mainline kernel with debian file system.

> Further aobench (An occlusion renderer for benchmarking realworld

> floating point performance) showed the following results on hikey960

> with debain.

> 

>                                         Result          Standard        Standard

>                                         (Time secs)     Error           Deviation

> Hikey 960 - no thermal pressure applied 138.67          6.52            11.52%

> Hikey 960 -  thermal pressure applied   122.37          5.78            11.57%

> 

> Thara Gopinath (7):

>   sched/pelt: Add option to make load and util calculations frequency

>     invariant

>   sched/pelt.c: Add support to track thermal pressure

>   sched: Add infrastructure to store and update instantaneous thermal

>     pressure

>   sched: Initialize per cpu thermal pressure structure

>   sched/fair: Enable CFS periodic tick to update thermal pressure

>   sched/fair: update cpu_capcity to reflect thermal pressure

>   thermal/cpu-cooling: Update thermal pressure in case of a maximum

>     frequency capping

> 

>  drivers/base/arch_topology.c  |  1 +

>  drivers/thermal/cpu_cooling.c | 20 ++++++++++++-


thermal?  There are other ways in which the maximum frequency of a cpu
can be limited, for example from userspace via scaling_max_freq.

When something (anything) changes the maximum frequency of a cpufreq
policy, the scheduler should be notified.  I think this change should
be done in cpufreq instead to make it generic and not particular to
a given maximum frequency "capper".

Cheers,
Javi

Javi Merino Oct. 10, 2018, 5:57 a.m. UTC | #2

On Tue, Oct 09, 2018 at 12:25:01PM -0400, Thara Gopinath wrote:
> cpu_capacity relflects the maximum available capacity of a cpu. Thermal

> pressure on a cpu means this maximum available capacity is reduced. This

> patch reduces the average thermal pressure for a cpu from its maximum

> available capacity so that cpu_capacity reflects the actual

> available capacity.

> 

> Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>

> ---

>  kernel/sched/fair.c | 1 +

>  1 file changed, 1 insertion(+)

> 

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c

> index 7deb1d0..8651e55 100644

> --- a/kernel/sched/fair.c

> +++ b/kernel/sched/fair.c

> @@ -7497,6 +7497,7 @@ static unsigned long scale_rt_capacity(int cpu)

>  

>  	used = READ_ONCE(rq->avg_rt.util_avg);

>  	used += READ_ONCE(rq->avg_dl.util_avg);

> +	used += READ_ONCE(rq->avg_thermal.load_avg);

IIUIC, you are treating thermal pressure as an artificial load on the
cpu.  If so, this sounds like a hard to maintain hack.  Thermal
pressure have different characteristics to utilization.  What happens
if thermal sets the cpu cooling state back to 0 because there is
thermal headroom again?  Do we keep adding this artificial load to the
cpu just because there was thermal pressure in the past and let it
decay as if it was cpu load?

Cheers,
Javi

Ingo Molnar Oct. 10, 2018, 6:17 a.m. UTC | #3

* Thara Gopinath <thara.gopinath@linaro.org> wrote:

> Thermal governors can respond to an overheat event for a cpu by

> capping the cpu's maximum possible frequency. This in turn

> means that the maximum available compute capacity of the

> cpu is restricted. But today in linux kernel, in event of maximum

> frequency capping of a cpu, the maximum available compute

> capacity of the cpu is not adjusted at all. In other words, scheduler

> is unware maximum cpu capacity restrictions placed due to thermal

> activity. This patch series attempts to address this issue.

> The benefits identified are better task placement among available

> cpus in event of overheating which in turn leads to better

> performance numbers.

> 

> The delta between the maximum possible capacity of a cpu and

> maximum available capacity of a cpu due to thermal event can

> be considered as thermal pressure. Instantaneous thermal pressure

> is hard to record and can sometime be erroneous as there can be mismatch

> between the actual capping of capacity and scheduler recording it.

> Thus solution is to have a weighted average per cpu value for thermal

> pressure over time. The weight reflects the amount of time the cpu has

> spent at a capped maximum frequency. To accumulate, average and

> appropriately decay thermal pressure, this patch series uses pelt

> signals and reuses the available framework that does a similar

> bookkeeping of rt/dl task utilization.

> 

> Regarding testing, basic build, boot and sanity testing have been

> performed on hikey960 mainline kernel with debian file system.

> Further aobench (An occlusion renderer for benchmarking realworld

> floating point performance) showed the following results on hikey960

> with debain.

> 

>                                         Result          Standard        Standard

>                                         (Time secs)     Error           Deviation

> Hikey 960 - no thermal pressure applied 138.67          6.52            11.52%

> Hikey 960 -  thermal pressure applied   122.37          5.78            11.57%


Wow, +13% speedup, impressive! We definitely want this outcome.

I'm wondering what happens if we do not track and decay the thermal load at all at the PELT 
level, but instantaneously decrease/increase effective CPU capacity in reaction to thermal 
events we receive from the CPU.

You describe the averaging as:

> Instantaneous thermal pressure is hard to record and can sometime be erroneous as there can 

> be mismatch between the actual capping of capacity and scheduler recording it.


Not sure I follow the argument here: are there bogus thermal throttling events? If so then
they are hopefully not frequent enough and should average out over time even if we follow
it instantly.

I.e. what is 'can sometimes be erroneous', exactly?

Thanks,

	Ingo

Vincent Guittot Oct. 10, 2018, 8:50 a.m. UTC | #4

On Wed, 10 Oct 2018 at 10:29, Quentin Perret <quentin.perret@arm.com> wrote:
>

> Hi Thara,

>

> On Wednesday 10 Oct 2018 at 08:17:51 (+0200), Ingo Molnar wrote:

> >

> > * Thara Gopinath <thara.gopinath@linaro.org> wrote:

> >

> > > Thermal governors can respond to an overheat event for a cpu by

> > > capping the cpu's maximum possible frequency. This in turn

> > > means that the maximum available compute capacity of the

> > > cpu is restricted. But today in linux kernel, in event of maximum

> > > frequency capping of a cpu, the maximum available compute

> > > capacity of the cpu is not adjusted at all. In other words, scheduler

> > > is unware maximum cpu capacity restrictions placed due to thermal

> > > activity. This patch series attempts to address this issue.

> > > The benefits identified are better task placement among available

> > > cpus in event of overheating which in turn leads to better

> > > performance numbers.

> > >

> > > The delta between the maximum possible capacity of a cpu and

> > > maximum available capacity of a cpu due to thermal event can

> > > be considered as thermal pressure. Instantaneous thermal pressure

> > > is hard to record and can sometime be erroneous as there can be mismatch

> > > between the actual capping of capacity and scheduler recording it.

> > > Thus solution is to have a weighted average per cpu value for thermal

> > > pressure over time. The weight reflects the amount of time the cpu has

> > > spent at a capped maximum frequency. To accumulate, average and

> > > appropriately decay thermal pressure, this patch series uses pelt

> > > signals and reuses the available framework that does a similar

> > > bookkeeping of rt/dl task utilization.

> > >

> > > Regarding testing, basic build, boot and sanity testing have been

> > > performed on hikey960 mainline kernel with debian file system.

> > > Further aobench (An occlusion renderer for benchmarking realworld

> > > floating point performance) showed the following results on hikey960

> > > with debain.

> > >

> > >                                         Result          Standard        Standard

> > >                                         (Time secs)     Error           Deviation

> > > Hikey 960 - no thermal pressure applied 138.67          6.52            11.52%

> > > Hikey 960 -  thermal pressure applied   122.37          5.78            11.57%

> >

> > Wow, +13% speedup, impressive! We definitely want this outcome.

> >

> > I'm wondering what happens if we do not track and decay the thermal load at all at the PELT

> > level, but instantaneously decrease/increase effective CPU capacity in reaction to thermal

> > events we receive from the CPU.

>

> +1, it's not that obvious (to me at least) that averaging the thermal

> pressure over time is necessarily what we want. Say the thermal governor

> caps a CPU and 'removes' 70% of its capacity, it will take forever for

> the PELT signal to ramp-up to that level before the scheduler can react.

> And the other way around, if you release the cap, it'll take a while

> before we actually start using the newly available capacity. I can also

> imagine how reacting too fast can be counter-productive, but I guess

> having numbers and/or use-cases to show that would be great :-)


The problem with reflecting directly the capping is that it happens
far more often than the pace at which cpu_capacity_orig is updated in
the scheduler. This means that at the moment when scheduler uses the
value, it might not be correct anymore. Then, this value are also used
when building the sched_domain and setting max_cpu_capacity which
would also implies the rebuilt the sched_domain topology ...

The pace of changing the capping is to fast to reflect that in the
whole scheduler topology

>

> Thara, have you tried to experiment with a simpler implementation as

> suggested by Ingo ?

>

> Also, assuming that we do want to average things, do we actually want to

> tie the thermal ramp-up time to the PELT half life ? That provides

> nice maths properties wrt the other signals, but it's not obvious to me

> that this thermal 'constant' should be the same on all platforms. Or

> maybe it should ?


The main interest of using PELT signal is that thermal pressure will
evolve at the same pace as other signals used in the scheduler. With
thermal pressure, we have the exact same problem as with RT tasks. The
thermal will cap the max frequency which will cap the utilization of
the tasks running on the CPU

>

> Thanks,

> Quentin

>

> >

> > You describe the averaging as:

> >

> > > Instantaneous thermal pressure is hard to record and can sometime be erroneous as there can

> > > be mismatch between the actual capping of capacity and scheduler recording it.

> >

> > Not sure I follow the argument here: are there bogus thermal throttling events? If so then

> > they are hopefully not frequent enough and should average out over time even if we follow

> > it instantly.

> >

> > I.e. what is 'can sometimes be erroneous', exactly?

> >

> > Thanks,

> >

> >       Ingo

Quentin Perret Oct. 10, 2018, 9:55 a.m. UTC | #5

Hi Vincent,

On Wednesday 10 Oct 2018 at 10:50:05 (+0200), Vincent Guittot wrote:
> The problem with reflecting directly the capping is that it happens

> far more often than the pace at which cpu_capacity_orig is updated in

> the scheduler.

Hmm, how can you be so sure ? That most likely depends on the workload,
the platform and the thermal governor. Some platforms heat up slowly,
some quickly. The pace at which the thermal governor will change things
should depend on that I assume.

> This means that at the moment when scheduler uses the

> value, it might not be correct anymore.

And OTOH, when you remove a cap for example, it will take time before
the scheduler can see the newly available capacity if you need to wait
for the signal to decay. So you are using a wrong information too in
that scenario.

> Then, this value are also used

> when building the sched_domain and setting max_cpu_capacity which

> would also implies the rebuilt the sched_domain topology ...

Wait what ? I thought the thermal cap was reflected in capacity_of, not
capacity_orig_of ... You need to rebuild the sched_domain in case of
thermal pressure ?

Hmm, let me have a closer look at the patches, I must have missed
something ...

> The pace of changing the capping is to fast to reflect that in the

> whole scheduler topology

That's probably true in some cases, but it'd be cool to have numbers to
back up that statement, I think.

Now, if you do need to rebuild the sched domain topology every time you
update the thermal pressure, I think the PELT HL is _way_ too short for
that ... You can't rebuild the whole thing every 32ms or so. Or am I
misunderstanding something ?

> > Thara, have you tried to experiment with a simpler implementation as

> > suggested by Ingo ?

> >

> > Also, assuming that we do want to average things, do we actually want to

> > tie the thermal ramp-up time to the PELT half life ? That provides

> > nice maths properties wrt the other signals, but it's not obvious to me

> > that this thermal 'constant' should be the same on all platforms. Or

> > maybe it should ?

> 

> The main interest of using PELT signal is that thermal pressure will

> evolve at the same pace as other signals used in the scheduler.

Right, I think this is a nice property too (assuming that we actually
want to average things out).

> With

> thermal pressure, we have the exact same problem as with RT tasks. The

> thermal will cap the max frequency which will cap the utilization of

> the tasks running on the CPU

Well, the nature of the signal is slightly different IMO. Yes it's
capacity, but you're no actually measuring time spent on the CPU. All
other PELT signals are based on time, this thermal thing isn't, so it is
kinda different in a way. And I'm still wondering if it could be helpful
to be able to have a different HL for that thermal signal. That would
'break' the nice maths properties we have, yes, but is it a problem or is
it actually helpful to cope with the thermal characteristics of
different platforms ?

Thanks,
Quentin

Vincent Guittot Oct. 10, 2018, 10:14 a.m. UTC | #6

On Wed, 10 Oct 2018 at 11:55, Quentin Perret <quentin.perret@arm.com> wrote:
>

> Hi Vincent,

>

> On Wednesday 10 Oct 2018 at 10:50:05 (+0200), Vincent Guittot wrote:

> > The problem with reflecting directly the capping is that it happens

> > far more often than the pace at which cpu_capacity_orig is updated in

> > the scheduler.

>

> Hmm, how can you be so sure ? That most likely depends on the workload,

> the platform and the thermal governor. Some platforms heat up slowly,

> some quickly. The pace at which the thermal governor will change things

> should depend on that I assume.

>

> > This means that at the moment when scheduler uses the

> > value, it might not be correct anymore.

>

> And OTOH, when you remove a cap for example, it will take time before

> the scheduler can see the newly available capacity if you need to wait

> for the signal to decay. So you are using a wrong information too in

> that scenario.


But we stay consistant with all other metrics. The same happen when a
task decide to stay idle for a long time after some activity... You
will have to wait for the signal to decay

>

> > Then, this value are also used

> > when building the sched_domain and setting max_cpu_capacity which

> > would also implies the rebuilt the sched_domain topology ...

>

> Wait what ? I thought the thermal cap was reflected in capacity_of, not

> capacity_orig_of ... You need to rebuild the sched_domain in case of

> thermal pressure ?

>

> Hmm, let me have a closer look at the patches, I must have missed

> something ...

>

> > The pace of changing the capping is to fast to reflect that in the

> > whole scheduler topology

>

> That's probably true in some cases, but it'd be cool to have numbers to

> back up that statement, I think.

>

> Now, if you do need to rebuild the sched domain topology every time you

> update the thermal pressure, I think the PELT HL is _way_ too short for

> that ... You can't rebuild the whole thing every 32ms or so. Or am I

> misunderstanding something ?

>

> > > Thara, have you tried to experiment with a simpler implementation as

> > > suggested by Ingo ?

> > >

> > > Also, assuming that we do want to average things, do we actually want to

> > > tie the thermal ramp-up time to the PELT half life ? That provides

> > > nice maths properties wrt the other signals, but it's not obvious to me

> > > that this thermal 'constant' should be the same on all platforms. Or

> > > maybe it should ?

> >

> > The main interest of using PELT signal is that thermal pressure will

> > evolve at the same pace as other signals used in the scheduler.

>

> Right, I think this is a nice property too (assuming that we actually

> want to average things out).

>

> > With

> > thermal pressure, we have the exact same problem as with RT tasks. The

> > thermal will cap the max frequency which will cap the utilization of

> > the tasks running on the CPU

>

> Well, the nature of the signal is slightly different IMO. Yes it's

> capacity, but you're no actually measuring time spent on the CPU. All

> other PELT signals are based on time, this thermal thing isn't, so it is

> kinda different in a way. And I'm still wondering if it could be helpful


hmmm ... it is based on time too.
Both signals (current ones and thermal one) are really close. The main
difference with other utilization signal is that instead of providing
a running/not running boolean that is then weighted by the current
capacity, the signal uses direclty the capped max capacity to reflect
the amount of cycle that is stolen by thermal mitigation.

> to be able to have a different HL for that thermal signal. That would

> 'break' the nice maths properties we have, yes, but is it a problem or is

> it actually helpful to cope with the thermal characteristics of

> different platforms ?


If you don't use the sign kind of signal with the same responsiveness,
you will start to see some OPP toggles as an example when the thermal
state change because one metrics will change faster than the other one
and you can't have a correct view of the system. Same problem was
happening with rt task.

I take the example of RT task because it quite similar in the effect
except that RT task steal all cycles when it runs whereas thermal
mitigation steal only some by capping the frequency
>

> Thanks,

> Quentin

Vincent Guittot Oct. 10, 2018, 12:04 p.m. UTC | #7

On Wed, 10 Oct 2018 at 12:36, Quentin Perret <quentin.perret@arm.com> wrote:
>

> On Wednesday 10 Oct 2018 at 12:14:32 (+0200), Vincent Guittot wrote:

> > On Wed, 10 Oct 2018 at 11:55, Quentin Perret <quentin.perret@arm.com> wrote:

> > >

> > > Hi Vincent,

> > >

> > > On Wednesday 10 Oct 2018 at 10:50:05 (+0200), Vincent Guittot wrote:

> > > > The problem with reflecting directly the capping is that it happens

> > > > far more often than the pace at which cpu_capacity_orig is updated in

> > > > the scheduler.

> > >

> > > Hmm, how can you be so sure ? That most likely depends on the workload,

> > > the platform and the thermal governor. Some platforms heat up slowly,

> > > some quickly. The pace at which the thermal governor will change things

> > > should depend on that I assume.

> > >

> > > > This means that at the moment when scheduler uses the

> > > > value, it might not be correct anymore.

> > >

> > > And OTOH, when you remove a cap for example, it will take time before

> > > the scheduler can see the newly available capacity if you need to wait

> > > for the signal to decay. So you are using a wrong information too in

> > > that scenario.

> >

> > But we stay consistant with all other metrics. The same happen when a

> > task decide to stay idle for a long time after some activity... You

> > will have to wait for the signal to decay

>

> Because you see that as a task going idle. You could also say that

> removing the cap from a CPU is equivalent to migrating a task off that

> CPU. In this case you should see the newly available cap pretty much

> immediately.

>

> Also, I feel like the whole thermal capping story could interest the

> Intel folks who, IIRC, were discussing how to represent their 'boosted'

> OPPs in the scheduler some time ago. You can see the Intel boost thing

> as a cap I think -- some OPPs can be inaccessible in some case. I'd be

> interested to see what's their take on this.

>

> > > > Then, this value are also used

> > > > when building the sched_domain and setting max_cpu_capacity which

> > > > would also implies the rebuilt the sched_domain topology ...

> > >

> > > Wait what ? I thought the thermal cap was reflected in capacity_of, not

> > > capacity_orig_of ... You need to rebuild the sched_domain in case of

> > > thermal pressure ?

>

> I can't locate where you need to do this in the series. Do you actually

> need to rebuild the sd hierarchy ?


This patchset doesn't touch cpu_capacity_orig and doesn't need to  as
it assume that the max capacity is unchanged but some capacity is
momentary stolen by thermal.
 If you want to reflect immediately all thermal capping change, you
have to update this field and all related fields and struct around

>

> > > Hmm, let me have a closer look at the patches, I must have missed

> > > something ...

> > >

> > > > The pace of changing the capping is to fast to reflect that in the

> > > > whole scheduler topology

> > >

> > > That's probably true in some cases, but it'd be cool to have numbers to

> > > back up that statement, I think.

> > >

> > > Now, if you do need to rebuild the sched domain topology every time you

> > > update the thermal pressure, I think the PELT HL is _way_ too short for

> > > that ... You can't rebuild the whole thing every 32ms or so. Or am I

> > > misunderstanding something ?

> > >

> > > > > Thara, have you tried to experiment with a simpler implementation as

> > > > > suggested by Ingo ?

> > > > >

> > > > > Also, assuming that we do want to average things, do we actually want to

> > > > > tie the thermal ramp-up time to the PELT half life ? That provides

> > > > > nice maths properties wrt the other signals, but it's not obvious to me

> > > > > that this thermal 'constant' should be the same on all platforms. Or

> > > > > maybe it should ?

> > > >

> > > > The main interest of using PELT signal is that thermal pressure will

> > > > evolve at the same pace as other signals used in the scheduler.

> > >

> > > Right, I think this is a nice property too (assuming that we actually

> > > want to average things out).

> > >

> > > > With

> > > > thermal pressure, we have the exact same problem as with RT tasks. The

> > > > thermal will cap the max frequency which will cap the utilization of

> > > > the tasks running on the CPU

> > >

> > > Well, the nature of the signal is slightly different IMO. Yes it's

> > > capacity, but you're no actually measuring time spent on the CPU. All

> > > other PELT signals are based on time, this thermal thing isn't, so it is

> > > kinda different in a way. And I'm still wondering if it could be helpful

> >

> > hmmm ... it is based on time too.

>

> You're not actually measuring the time spent on the CPU by the 'thermal

> task'. There is no such thing as a 'thermal task'. You're just trying to

> model things like that, but the thermal stuff isn't actually

> interrupting running tasks to eat CPU cycles. It just makes thing run

> slower, which isn't exactly the same thing IMO.

>

> But maybe that's a detail.

>

> > Both signals (current ones and thermal one) are really close. The main

> > difference with other utilization signal is that instead of providing

> > a running/not running boolean that is then weighted by the current

> > capacity, the signal uses direclty the capped max capacity to reflect

> > the amount of cycle that is stolen by thermal mitigation.

> >

> > > to be able to have a different HL for that thermal signal. That would

> > > 'break' the nice maths properties we have, yes, but is it a problem or is

> > > it actually helpful to cope with the thermal characteristics of

> > > different platforms ?

> >

> > If you don't use the sign kind of signal with the same responsiveness,

> > you will start to see some OPP toggles as an example when the thermal

> > state change because one metrics will change faster than the other one

> > and you can't have a correct view of the system. Same problem was

> > happening with rt task.

>

> Well, that wasn't the problem with rt tasks. The problem with RT tasks

> was that the time they spend on the CPU wasn't accounted _at all_ when

> selecting frequency for CFS, not that the accounting was at a different

> pace ...


The problem was the same with RT, the cfs utilization was lower than
reality because RT steals soem cycle to CFS
So schedutil was selecting a lower frequency when cfs was running
whereas the CPU was fully used.
The same can happen with thermal:
cap the max freq because of thermal
the utilization with decrease.
remove the cap
the utilization is still low and you will select a low OPP because you
don't take into account cycle stolen by thermal like with RT

Regards,
Vincent
>

> Thanks,

> Quentin

Juri Lelli Oct. 10, 2018, 12:23 p.m. UTC | #8

On 10/10/18 14:04, Vincent Guittot wrote:

[...]

> The problem was the same with RT, the cfs utilization was lower than

> reality because RT steals soem cycle to CFS

> So schedutil was selecting a lower frequency when cfs was running

> whereas the CPU was fully used.

> The same can happen with thermal:

> cap the max freq because of thermal

> the utilization with decrease.

> remove the cap

> the utilization is still low and you will select a low OPP because you

> don't take into account cycle stolen by thermal like with RT


What if we scale frequency component considering the capped temporary
max?

Juri Lelli Oct. 10, 2018, 12:50 p.m. UTC | #9

On 10/10/18 14:34, Vincent Guittot wrote:
> Hi  Juri,

> 

> On Wed, 10 Oct 2018 at 14:23, Juri Lelli <juri.lelli@gmail.com> wrote:

> >

> > On 10/10/18 14:04, Vincent Guittot wrote:

> >

> > [...]

> >

> > > The problem was the same with RT, the cfs utilization was lower than

> > > reality because RT steals soem cycle to CFS

> > > So schedutil was selecting a lower frequency when cfs was running

> > > whereas the CPU was fully used.

> > > The same can happen with thermal:

> > > cap the max freq because of thermal

> > > the utilization with decrease.

> > > remove the cap

> > > the utilization is still low and you will select a low OPP because you

> > > don't take into account cycle stolen by thermal like with RT

> >

> > What if we scale frequency component considering the capped temporary

> > max?

> 

> Do you mean using a kind of scale_thermal_capacity in accumulate_sum

> when computing utilization ?


Yeah, something like that I guess. So that we account for temporary
"fake" 1024..

Quentin Perret Oct. 10, 2018, 1:05 p.m. UTC | #10

On Wednesday 10 Oct 2018 at 14:04:40 (+0200), Vincent Guittot wrote:
> This patchset doesn't touch cpu_capacity_orig and doesn't need to  as

> it assume that the max capacity is unchanged but some capacity is

> momentary stolen by thermal.

>  If you want to reflect immediately all thermal capping change, you

> have to update this field and all related fields and struct around


I don't follow you here. I never said I wanted to change
cpu_capacity_orig. I don't think we should do that actually. Changing
capacity_of (which is updated during LB	IIRC) is just fine. The question
is about what you want to do there: reflect an averaged value or the
instantaneous one.

It's not obvious (to me) that the complex one (the averaged value) is
better than the other, simpler, one. All I'm saying from the beginning
is that it'd be nice to have numbers and use cases to discuss the pros
and cons of both approaches.

> > > > Hmm, let me have a closer look at the patches, I must have missed

> > > > something ...

> > > >

> > > > > The pace of changing the capping is to fast to reflect that in the

> > > > > whole scheduler topology

> > > >

> > > > That's probably true in some cases, but it'd be cool to have numbers to

> > > > back up that statement, I think.

> > > >

> > > > Now, if you do need to rebuild the sched domain topology every time you

> > > > update the thermal pressure, I think the PELT HL is _way_ too short for

> > > > that ... You can't rebuild the whole thing every 32ms or so. Or am I

> > > > misunderstanding something ?

> > > >

> > > > > > Thara, have you tried to experiment with a simpler implementation as

> > > > > > suggested by Ingo ?

> > > > > >

> > > > > > Also, assuming that we do want to average things, do we actually want to

> > > > > > tie the thermal ramp-up time to the PELT half life ? That provides

> > > > > > nice maths properties wrt the other signals, but it's not obvious to me

> > > > > > that this thermal 'constant' should be the same on all platforms. Or

> > > > > > maybe it should ?

> > > > >

> > > > > The main interest of using PELT signal is that thermal pressure will

> > > > > evolve at the same pace as other signals used in the scheduler.

> > > >

> > > > Right, I think this is a nice property too (assuming that we actually

> > > > want to average things out).

> > > >

> > > > > With

> > > > > thermal pressure, we have the exact same problem as with RT tasks. The

> > > > > thermal will cap the max frequency which will cap the utilization of

> > > > > the tasks running on the CPU

> > > >

> > > > Well, the nature of the signal is slightly different IMO. Yes it's

> > > > capacity, but you're no actually measuring time spent on the CPU. All

> > > > other PELT signals are based on time, this thermal thing isn't, so it is

> > > > kinda different in a way. And I'm still wondering if it could be helpful

> > >

> > > hmmm ... it is based on time too.

> >

> > You're not actually measuring the time spent on the CPU by the 'thermal

> > task'. There is no such thing as a 'thermal task'. You're just trying to

> > model things like that, but the thermal stuff isn't actually

> > interrupting running tasks to eat CPU cycles. It just makes thing run

> > slower, which isn't exactly the same thing IMO.

> >

> > But maybe that's a detail.

> >

> > > Both signals (current ones and thermal one) are really close. The main

> > > difference with other utilization signal is that instead of providing

> > > a running/not running boolean that is then weighted by the current

> > > capacity, the signal uses direclty the capped max capacity to reflect

> > > the amount of cycle that is stolen by thermal mitigation.

> > >

> > > > to be able to have a different HL for that thermal signal. That would

> > > > 'break' the nice maths properties we have, yes, but is it a problem or is

> > > > it actually helpful to cope with the thermal characteristics of

> > > > different platforms ?

> > >

> > > If you don't use the sign kind of signal with the same responsiveness,

> > > you will start to see some OPP toggles as an example when the thermal

> > > state change because one metrics will change faster than the other one

> > > and you can't have a correct view of the system. Same problem was

> > > happening with rt task.

> >

> > Well, that wasn't the problem with rt tasks. The problem with RT tasks

> > was that the time they spend on the CPU wasn't accounted _at all_ when

> > selecting frequency for CFS, not that the accounting was at a different

> > pace ...

> 

> The problem was the same with RT, the cfs utilization was lower than

> reality because RT steals soem cycle to CFS

> So schedutil was selecting a lower frequency when cfs was running

> whereas the CPU was fully used.

> The same can happen with thermal:

> cap the max freq because of thermal

> the utilization with decrease.

> remove the cap

> the utilization is still low and you will select a low OPP because you

> don't take into account cycle stolen by thermal like with RT


I'm not arguing with the fact that we need to reflect the thermal
pressure in the scheduler to see that a CPU is fully busy. I agree with
that.

I'm saying you don't necessarily have to update the thermal stuff and
the existing PELT signals *at the same pace*, because different
platforms have different thermal characteristics.

Thanks,*
Quentin

Vincent Guittot Oct. 10, 2018, 1:08 p.m. UTC | #11

On Wed, 10 Oct 2018 at 14:50, Juri Lelli <juri.lelli@gmail.com> wrote:
>

> On 10/10/18 14:34, Vincent Guittot wrote:

> > Hi  Juri,

> >

> > On Wed, 10 Oct 2018 at 14:23, Juri Lelli <juri.lelli@gmail.com> wrote:

> > >

> > > On 10/10/18 14:04, Vincent Guittot wrote:

> > >

> > > [...]

> > >

> > > > The problem was the same with RT, the cfs utilization was lower than

> > > > reality because RT steals soem cycle to CFS

> > > > So schedutil was selecting a lower frequency when cfs was running

> > > > whereas the CPU was fully used.

> > > > The same can happen with thermal:

> > > > cap the max freq because of thermal

> > > > the utilization with decrease.

> > > > remove the cap

> > > > the utilization is still low and you will select a low OPP because you

> > > > don't take into account cycle stolen by thermal like with RT

> > >

> > > What if we scale frequency component considering the capped temporary

> > > max?

> >

> > Do you mean using a kind of scale_thermal_capacity in accumulate_sum

> > when computing utilization ?

>

> Yeah, something like that I guess. So that we account for temporary

> "fake" 1024..


But the utilization will not be invariant anymore across the system

Quentin Perret Oct. 10, 2018, 1:11 p.m. UTC | #12

On Wednesday 10 Oct 2018 at 14:50:33 (+0200), Juri Lelli wrote:
> On 10/10/18 14:34, Vincent Guittot wrote:

> > Hi  Juri,

> > 

> > On Wed, 10 Oct 2018 at 14:23, Juri Lelli <juri.lelli@gmail.com> wrote:

> > >

> > > On 10/10/18 14:04, Vincent Guittot wrote:

> > >

> > > [...]

> > >

> > > > The problem was the same with RT, the cfs utilization was lower than

> > > > reality because RT steals soem cycle to CFS

> > > > So schedutil was selecting a lower frequency when cfs was running

> > > > whereas the CPU was fully used.

> > > > The same can happen with thermal:

> > > > cap the max freq because of thermal

> > > > the utilization with decrease.

> > > > remove the cap

> > > > the utilization is still low and you will select a low OPP because you

> > > > don't take into account cycle stolen by thermal like with RT

> > >

> > > What if we scale frequency component considering the capped temporary

> > > max?

> > 

> > Do you mean using a kind of scale_thermal_capacity in accumulate_sum

> > when computing utilization ?

> 

> Yeah, something like that I guess. So that we account for temporary

> "fake" 1024..


But wouldn't that break frequency invariance ? A task would look bigger
on a capped CPU than a non-capped one no ?

Vincent Guittot Oct. 10, 2018, 1:27 p.m. UTC | #13

On Wed, 10 Oct 2018 at 15:05, Quentin Perret <quentin.perret@arm.com> wrote:
>

> On Wednesday 10 Oct 2018 at 14:04:40 (+0200), Vincent Guittot wrote:

> > This patchset doesn't touch cpu_capacity_orig and doesn't need to  as

> > it assume that the max capacity is unchanged but some capacity is

> > momentary stolen by thermal.

> >  If you want to reflect immediately all thermal capping change, you

> > have to update this field and all related fields and struct around

>

> I don't follow you here. I never said I wanted to change

> cpu_capacity_orig. I don't think we should do that actually. Changing

> capacity_of (which is updated during LB IIRC) is just fine. The question

> is about what you want to do there: reflect an averaged value or the

> instantaneous one.


Sorry I though your were speaking about updating this cpu_capacity_orig.
With using instantaneous max  value in capacity_of(), we are back to
the problem raised by Thara that  the value will most probably not
reflect the current capping value when it is used in LB, because LB
period can quite long on busy CPU (default max value is 32*sd_weight
ms)

>

> It's not obvious (to me) that the complex one (the averaged value) is

> better than the other, simpler, one. All I'm saying from the beginning

> is that it'd be nice to have numbers and use cases to discuss the pros

> and cons of both approaches.

>

> > > > > Hmm, let me have a closer look at the patches, I must have missed

> > > > > something ...

> > > > >

> > > > > > The pace of changing the capping is to fast to reflect that in the

> > > > > > whole scheduler topology

> > > > >


[snip]

> > >

> > > Well, that wasn't the problem with rt tasks. The problem with RT tasks

> > > was that the time they spend on the CPU wasn't accounted _at all_ when

> > > selecting frequency for CFS, not that the accounting was at a different

> > > pace ...

> >

> > The problem was the same with RT, the cfs utilization was lower than

> > reality because RT steals soem cycle to CFS

> > So schedutil was selecting a lower frequency when cfs was running

> > whereas the CPU was fully used.

> > The same can happen with thermal:

> > cap the max freq because of thermal

> > the utilization with decrease.

> > remove the cap

> > the utilization is still low and you will select a low OPP because you

> > don't take into account cycle stolen by thermal like with RT

>

> I'm not arguing with the fact that we need to reflect the thermal

> pressure in the scheduler to see that a CPU is fully busy. I agree with

> that.

>

> I'm saying you don't necessarily have to update the thermal stuff and

> the existing PELT signals *at the same pace*, because different

> platforms have different thermal characteristics.


But you also need to take into account how fast other metrics in the
scheduler are updated otherwise a metric will reflect a change not
already reflected in the other metrics and you might take wrong
decision as my example above where utilization is still low but
thermal pressure is nul and you assume that you have lot of spare
capacity
Having metrics that use same responsiveness and are synced, help to
get a consolidated view of the system.

Vincent
>

> Thanks,*

> Quentin

Juri Lelli Oct. 10, 2018, 1:34 p.m. UTC | #14

On 10/10/18 15:08, Vincent Guittot wrote:
> On Wed, 10 Oct 2018 at 14:50, Juri Lelli <juri.lelli@gmail.com> wrote:

> >

> > On 10/10/18 14:34, Vincent Guittot wrote:

> > > Hi  Juri,

> > >

> > > On Wed, 10 Oct 2018 at 14:23, Juri Lelli <juri.lelli@gmail.com> wrote:

> > > >

> > > > On 10/10/18 14:04, Vincent Guittot wrote:

> > > >

> > > > [...]

> > > >

> > > > > The problem was the same with RT, the cfs utilization was lower than

> > > > > reality because RT steals soem cycle to CFS

> > > > > So schedutil was selecting a lower frequency when cfs was running

> > > > > whereas the CPU was fully used.

> > > > > The same can happen with thermal:

> > > > > cap the max freq because of thermal

> > > > > the utilization with decrease.

> > > > > remove the cap

> > > > > the utilization is still low and you will select a low OPP because you

> > > > > don't take into account cycle stolen by thermal like with RT

> > > >

> > > > What if we scale frequency component considering the capped temporary

> > > > max?

> > >

> > > Do you mean using a kind of scale_thermal_capacity in accumulate_sum

> > > when computing utilization ?

> >

> > Yeah, something like that I guess. So that we account for temporary

> > "fake" 1024..

> 

> But the utilization will not be invariant anymore across the system


Mmm, I guess I might be wrong, but I was thinking we should be able to
deal with this similarly to what we do with cpus with different max
capacities. So, another factor?  Because then, how do we handle other
ways in which max freq can be restricted (e.g. from userspace as Javi
was also mentioning)?

Quentin Perret Oct. 10, 2018, 1:47 p.m. UTC | #15

On Wednesday 10 Oct 2018 at 15:27:57 (+0200), Vincent Guittot wrote:
> On Wed, 10 Oct 2018 at 15:05, Quentin Perret <quentin.perret@arm.com> wrote:

> >

> > On Wednesday 10 Oct 2018 at 14:04:40 (+0200), Vincent Guittot wrote:

> > > This patchset doesn't touch cpu_capacity_orig and doesn't need to  as

> > > it assume that the max capacity is unchanged but some capacity is

> > > momentary stolen by thermal.

> > >  If you want to reflect immediately all thermal capping change, you

> > > have to update this field and all related fields and struct around

> >

> > I don't follow you here. I never said I wanted to change

> > cpu_capacity_orig. I don't think we should do that actually. Changing

> > capacity_of (which is updated during LB IIRC) is just fine. The question

> > is about what you want to do there: reflect an averaged value or the

> > instantaneous one.

> 

> Sorry I though your were speaking about updating this cpu_capacity_orig.

N/p, communication via email can easily become confusing :-)

> With using instantaneous max  value in capacity_of(), we are back to

> the problem raised by Thara that  the value will most probably not

> reflect the current capping value when it is used in LB, because LB

> period can quite long on busy CPU (default max value is 32*sd_weight

> ms)

But averaging the capping value over time doesn't make LB happen more
often ... That will help you account for capping that happened in the
past, but it's not obvious this is actually a good thing. Probably not
all the time anyway.

Say a CPU was capped at 50% of it's capacity, then the cap is removed.
At that point it'll take 100ms+ for the thermal signal to decay and let
the scheduler know about the newly available capacity. That can probably
be a performance hit in some use cases ... And the other way around, it
can also take forever for the scheduler to notice that a CPU has a
reduced capacity before reacting to it.

If you want to filter out very short transient capping events to avoid
over-reacting in the scheduler (is this actually happening ?), then
maybe the average should be done on the cooling device side or something
like that ?

Thanks,
Quentin

Thara Gopinath Oct. 10, 2018, 2:15 p.m. UTC | #16

Hello Javi,

Thanks for the interest.

On 10/10/2018 01:44 AM, Javi Merino wrote:
> On Tue, Oct 09, 2018 at 12:24:55PM -0400, Thara Gopinath wrote:

>> Thermal governors can respond to an overheat event for a cpu by

>> capping the cpu's maximum possible frequency. This in turn

>> means that the maximum available compute capacity of the

>> cpu is restricted. But today in linux kernel, in event of maximum

>> frequency capping of a cpu, the maximum available compute

>> capacity of the cpu is not adjusted at all. In other words, scheduler

>> is unware maximum cpu capacity restrictions placed due to thermal

>> activity.

> 

> Interesting, I would have sworn that I tested this years ago by

> lowering the maximum frequency of a cpufreq domain, and the scheduler

> reacted accordingly to the new maximum capacities of the cpus.

> 

>>           This patch series attempts to address this issue.

>> The benefits identified are better task placement among available

>> cpus in event of overheating which in turn leads to better

>> performance numbers.

>>

>> The delta between the maximum possible capacity of a cpu and

>> maximum available capacity of a cpu due to thermal event can

>> be considered as thermal pressure. Instantaneous thermal pressure

>> is hard to record and can sometime be erroneous as there can be mismatch

>> between the actual capping of capacity and scheduler recording it.

>> Thus solution is to have a weighted average per cpu value for thermal

>> pressure over time. The weight reflects the amount of time the cpu has

>> spent at a capped maximum frequency. To accumulate, average and

>> appropriately decay thermal pressure, this patch series uses pelt

>> signals and reuses the available framework that does a similar

>> bookkeeping of rt/dl task utilization.

>>

>> Regarding testing, basic build, boot and sanity testing have been

>> performed on hikey960 mainline kernel with debian file system.

>> Further aobench (An occlusion renderer for benchmarking realworld

>> floating point performance) showed the following results on hikey960

>> with debain.

>>

>>                                         Result          Standard        Standard

>>                                         (Time secs)     Error           Deviation

>> Hikey 960 - no thermal pressure applied 138.67          6.52            11.52%

>> Hikey 960 -  thermal pressure applied   122.37          5.78            11.57%

>>

>> Thara Gopinath (7):

>>   sched/pelt: Add option to make load and util calculations frequency

>>     invariant

>>   sched/pelt.c: Add support to track thermal pressure

>>   sched: Add infrastructure to store and update instantaneous thermal

>>     pressure

>>   sched: Initialize per cpu thermal pressure structure

>>   sched/fair: Enable CFS periodic tick to update thermal pressure

>>   sched/fair: update cpu_capcity to reflect thermal pressure

>>   thermal/cpu-cooling: Update thermal pressure in case of a maximum

>>     frequency capping

>>

>>  drivers/base/arch_topology.c  |  1 +

>>  drivers/thermal/cpu_cooling.c | 20 ++++++++++++-

> 

> thermal?  There are other ways in which the maximum frequency of a cpu

> can be limited, for example from userspace via scaling_max_freq.


Yes there are other ways in which maximum frequency of a cpu can be
capped. The difference probably is that in case of a user-space cap, the
time duration the cpu remains in the capped state is significantly
higher than capping due to a thermal event. So may be the response of
the scheduler should be different in that scenario (like rebuilding the
capacities of all cpus etc).

This patch series does not rebuild the scheduler structures. This just
tells the scheduler that some cpu capacity is stolen.
> 

> When something (anything) changes the maximum frequency of a cpufreq

> policy, the scheduler should be notified.  I think this change should

> be done in cpufreq instead to make it generic and not particular to

> a given maximum frequency "capper".


I am glad to hear you say this!  So far, all I have heard whenever I
bring up this topic is issues with such an update from cpufreq(delays,
lost updates etc). Personally, I have not seen these issues enough to
comment on them. I will really like to hear more about these issues for
an update from cpufreq here on the list.

For me, the patch series is a mechanism to let scheduler know that a
thermal event has stolen some cpu capacity. The update itself can happen
from any framework which can track the event and we all mutually agree on.

Regards
Thara

> 

> Cheers,

> Javi

> 



-- 
Regards
Thara

Vincent Guittot Oct. 10, 2018, 3:19 p.m. UTC | #17

On Wed, 10 Oct 2018 at 15:48, Quentin Perret <quentin.perret@arm.com> wrote:
>

> On Wednesday 10 Oct 2018 at 15:27:57 (+0200), Vincent Guittot wrote:

> > On Wed, 10 Oct 2018 at 15:05, Quentin Perret <quentin.perret@arm.com> wrote:

> > >

> > > On Wednesday 10 Oct 2018 at 14:04:40 (+0200), Vincent Guittot wrote:

> > > > This patchset doesn't touch cpu_capacity_orig and doesn't need to  as

> > > > it assume that the max capacity is unchanged but some capacity is

> > > > momentary stolen by thermal.

> > > >  If you want to reflect immediately all thermal capping change, you

> > > > have to update this field and all related fields and struct around

> > >

> > > I don't follow you here. I never said I wanted to change

> > > cpu_capacity_orig. I don't think we should do that actually. Changing

> > > capacity_of (which is updated during LB IIRC) is just fine. The question

> > > is about what you want to do there: reflect an averaged value or the

> > > instantaneous one.

> >

> > Sorry I though your were speaking about updating this cpu_capacity_orig.

>

> N/p, communication via email can easily become confusing :-)

>

> > With using instantaneous max  value in capacity_of(), we are back to

> > the problem raised by Thara that  the value will most probably not

> > reflect the current capping value when it is used in LB, because LB

> > period can quite long on busy CPU (default max value is 32*sd_weight

> > ms)

>

> But averaging the capping value over time doesn't make LB happen more

> often ... That will help you account for capping that happened in the


But you know what happens in average between 2 LB

> past, but it's not obvious this is actually a good thing. Probably not

> all the time anyway.

>

> Say a CPU was capped at 50% of it's capacity, then the cap is removed.

> At that point it'll take 100ms+ for the thermal signal to decay and let

> the scheduler know about the newly available capacity. That can probably


But the point is that you don't know:
- if the capping will not happen soon. If the pressure has reached the
50%,  it means that it already happened quite often in the past 100ms.
- if there is really available capacity as the current sum of
utilization reflects what was available for tasks and not what the
tasks really wants to use


> be a performance hit in some use cases ... And the other way around, it

> can also take forever for the scheduler to notice that a CPU has a


What do you mean by forever ?

> reduced capacity before reacting to it.

>

> If you want to filter out very short transient capping events to avoid

> over-reacting in the scheduler (is this actually happening ?), then

> maybe the average should be done on the cooling device side or something

> like that ?

>

> Thanks,

> Quentin

Lukasz Luba Oct. 10, 2018, 3:35 p.m. UTC | #18

Hi Thara,

I have run it on Exynos5433 mainline.
When it is enabled with step_wise thermal governor,
some of my tests are showing ~30-50% regression (i.e. hackbench),
dhrystone ~10%.

Could you tell me which thermal governor was used in your case?
Please also share the name of that benchmark, i will give it a try.
Is it single threaded compute-intensive?

Regards,
Lukasz

On 10/09/2018 06:24 PM, Thara Gopinath wrote:
> Thermal governors can respond to an overheat event for a cpu by

> capping the cpu's maximum possible frequency. This in turn

> means that the maximum available compute capacity of the

> cpu is restricted. But today in linux kernel, in event of maximum

> frequency capping of a cpu, the maximum available compute

> capacity of the cpu is not adjusted at all. In other words, scheduler

> is unware maximum cpu capacity restrictions placed due to thermal

> activity. This patch series attempts to address this issue.

> The benefits identified are better task placement among available

> cpus in event of overheating which in turn leads to better

> performance numbers.

> 

> The delta between the maximum possible capacity of a cpu and

> maximum available capacity of a cpu due to thermal event can

> be considered as thermal pressure. Instantaneous thermal pressure

> is hard to record and can sometime be erroneous as there can be mismatch

> between the actual capping of capacity and scheduler recording it.

> Thus solution is to have a weighted average per cpu value for thermal

> pressure over time. The weight reflects the amount of time the cpu has

> spent at a capped maximum frequency. To accumulate, average and

> appropriately decay thermal pressure, this patch series uses pelt

> signals and reuses the available framework that does a similar

> bookkeeping of rt/dl task utilization.

> 

> Regarding testing, basic build, boot and sanity testing have been

> performed on hikey960 mainline kernel with debian file system.

> Further aobench (An occlusion renderer for benchmarking realworld

> floating point performance) showed the following results on hikey960

> with debain.

> 

>                                          Result          Standard        Standard

>                                          (Time secs)     Error           Deviation

> Hikey 960 - no thermal pressure applied 138.67          6.52            11.52%

> Hikey 960 -  thermal pressure applied   122.37          5.78            11.57%

> 

> Thara Gopinath (7):

>    sched/pelt: Add option to make load and util calculations frequency

>      invariant

>    sched/pelt.c: Add support to track thermal pressure

>    sched: Add infrastructure to store and update instantaneous thermal

>      pressure

>    sched: Initialize per cpu thermal pressure structure

>    sched/fair: Enable CFS periodic tick to update thermal pressure

>    sched/fair: update cpu_capcity to reflect thermal pressure

>    thermal/cpu-cooling: Update thermal pressure in case of a maximum

>      frequency capping

> 

>   drivers/base/arch_topology.c  |  1 +

>   drivers/thermal/cpu_cooling.c | 20 ++++++++++++-

>   include/linux/sched.h         | 14 +++++++++

>   kernel/sched/Makefile         |  2 +-

>   kernel/sched/core.c           |  2 ++

>   kernel/sched/fair.c           |  4 +++

>   kernel/sched/pelt.c           | 40 ++++++++++++++++++--------

>   kernel/sched/pelt.h           |  7 +++++

>   kernel/sched/sched.h          |  1 +

>   kernel/sched/thermal.c        | 66 +++++++++++++++++++++++++++++++++++++++++++

>   kernel/sched/thermal.h        | 13 +++++++++

>   11 files changed, 157 insertions(+), 13 deletions(-)

>   create mode 100644 kernel/sched/thermal.c

>   create mode 100644 kernel/sched/thermal.h

>

Thara Gopinath Oct. 10, 2018, 3:43 p.m. UTC | #19

Hello Ingo,
Thank you for the review.

On 10/10/2018 02:17 AM, Ingo Molnar wrote:
> 

> * Thara Gopinath <thara.gopinath@linaro.org> wrote:

> 

>> Thermal governors can respond to an overheat event for a cpu by

>> capping the cpu's maximum possible frequency. This in turn

>> means that the maximum available compute capacity of the

>> cpu is restricted. But today in linux kernel, in event of maximum

>> frequency capping of a cpu, the maximum available compute

>> capacity of the cpu is not adjusted at all. In other words, scheduler

>> is unware maximum cpu capacity restrictions placed due to thermal

>> activity. This patch series attempts to address this issue.

>> The benefits identif

ied are better task placement among available
>> cpus in event of overheating which in turn leads to better

>> performance numbers.

>>

>> The delta between the maximum possible capacity of a cpu and

>> maximum available capacity of a cpu due to thermal event can

>> be considered as thermal pressure. Instantaneous thermal pressure

>> is hard to record and can sometime be erroneous as there can be mismatch

>> between the actual capping of capacity and scheduler recording it.

>> Thus solution is to have a weighted average per cpu value for thermal

>> pressure over time. The weight reflects the amount of time the cpu has

>> spent at a capped maximum frequency. To accumulate, average and

>> appropriately decay thermal pressure, this patch series uses pelt

>> signals and reuses the available framework that does a similar

>> bookkeeping of rt/dl task utilization.

>>

>> Regarding testing, basic build, boot and sanity testing have been

>> performed on hikey960 mainline kernel with debian file system.

>> Further aobench (An occlusion renderer for benchmarking realworld

>> floating point performance) showed the following results on hikey960

>> with debain.

>>

>>                                         Result          Standard        Standard

>>                                         (Time secs)     Error           Deviation

>> Hikey 960 - no thermal pressure applied 138.67          6.52            11.52%

>> Hikey 960 -  thermal pressure applied   122.37          5.78            11.57%

> 

> Wow, +13% speedup, impressive! We definitely want this outcome.

> 

> I'm wondering what happens if we do not track and decay the thermal load at all at the PELT 

> level, but instantaneously decrease/increase effective CPU capacity in reaction to thermal 

> events we receive from the CPU.

The problem with instantaneous update is that sometimes thermal events
happen at a much faster pace than cpu_capacity is updated in
the scheduler. This means that at the moment when scheduler uses the
value, it might not be correct anymore.

Having said that, today Android common kernel has a solution which
instantaneously updates cpu_capacity in case of a thermal event.
To give a bit of background on the evolution of the solution I have
proposed, below is a time line of analysis I have done.

1.  I started this activity by analyzing the existing framework on
android common kernel. I ran android benchmark tests (Jankbench,
Vellamo, Geekbench) with and without the existing instantaneous update
mechanism. I found that there is no real performance difference to be
observed with an instantaneous updated of cpu_capacity at least in my
test scenarios.
2. Then I developed an algorithm to track, accumulate and decay the
capacity capping i.e an algorithm without using the pelt signals(this
was prior to the new pelt framework in mainline). With this android
benchmarks showed performance improvements. At this point I also ported
the solution to mainline kernel and ran the aobench analysis which again
showed a performance improvement.
3. Finally with the new pelt framework in place, I replaced my algorithm
with the one used for rt and dl utilization tracking which is the
current patch series. I have not been able to run tests with this on
Android yet.

All tests were performed on hikey960.
I have a Google spreadsheet, documenting results at various stages of
analysis. I am not sure how to share it with the group here.

> 

> You describe the averaging as:

> 

>> Instantaneous thermal pressure is hard to record and can sometime be erroneous as there can 

>> be mismatch between the actual capping of capacity and scheduler recording it.

> 

> Not sure I follow the argument here: are there bogus thermal throttling events? If so then

> they are hopefully not frequent enough and should average out over time even if we follow

> it instantly.

No bogus events. It is more like sometimes capping happens at a much
faster rate than cpu_capacity is updated and the scheduler looses these
events.

> 

> I.e. what is 'can sometimes be erroneous', exactly?

> 

> Thanks,

> 

> 	Ingo

> 

-- 
Regards
Thara

Ionela Voinescu Oct. 10, 2018, 4:15 p.m. UTC | #20

Hi guys,

On 10/10/18 14:47, Quentin Perret wrote:
> On Wednesday 10 Oct 2018 at 15:27:57 (+0200), Vincent Guittot wrote:

>> On Wed, 10 Oct 2018 at 15:05, Quentin Perret <quentin.perret@arm.com> wrote:

>>>

>>> On Wednesday 10 Oct 2018 at 14:04:40 (+0200), Vincent Guittot wrote:

>>>> This patchset doesn't touch cpu_capacity_orig and doesn't need to  as

>>>> it assume that the max capacity is unchanged but some capacity is

>>>> momentary stolen by thermal.

>>>>  If you want to reflect immediately all thermal capping change, you

>>>> have to update this field and all related fields and struct around

>>>

>>> I don't follow you here. I never said I wanted to change

>>> cpu_capacity_orig. I don't think we should do that actually. Changing

>>> capacity_of (which is updated during LB IIRC) is just fine. The question

>>> is about what you want to do there: reflect an averaged value or the

>>> instantaneous one.

>>

>> Sorry I though your were speaking about updating this cpu_capacity_orig.

> 

> N/p, communication via email can easily become confusing :-)

> 

>> With using instantaneous max  value in capacity_of(), we are back to

>> the problem raised by Thara that  the value will most probably not

>> reflect the current capping value when it is used in LB, because LB

>> period can quite long on busy CPU (default max value is 32*sd_weight

>> ms)

> 

> But averaging the capping value over time doesn't make LB happen more

> often ... That will help you account for capping that happened in the

> past, but it's not obvious this is actually a good thing. Probably not

> all the time anyway.

> 

> Say a CPU was capped at 50% of it's capacity, then the cap is removed.

> At that point it'll take 100ms+ for the thermal signal to decay and let

> the scheduler know about the newly available capacity. That can probably

> be a performance hit in some use cases ... And the other way around, it

> can also take forever for the scheduler to notice that a CPU has a

> reduced capacity before reacting to it.

> 

> If you want to filter out very short transient capping events to avoid

> over-reacting in the scheduler (is this actually happening ?), then

> maybe the average should be done on the cooling device side or something

> like that ?

> 

I think there isn't just the issue of the *occasional* overreaction of a
thermal governor due to noise in the temperature sensors or some spike
in environmental temperature that determines a delayed reaction in the
scheduler due to when capacity is updated.

I'm seeing a bigger issue for *sustained* high temperatures that are not
treated effectively by governors. Depending on the platform, heat can
be dissipated over longer or shorter periods of time. This can determine
a seesaw effect on the maximum frequency: it determines the temperature
is over a threshold and it starts capping, but heat is not dissipated
quickly enough for that to reflect in the value of the temperature sensor,
so it continues to cap; when the temperature gets to normal, capping
is lifted, which in turn results access to higher OPPs and a return to
high temperatures, etc.

What will happen is that, *depending on platform* and the moment when
capacity is updated, you can see either a CPU with a capacity of 1024, or
let's say 800, or (on hikey960 :)) around 500, and back and forth
between them.

Because of these I tend to think that a regulated (averaged) value of
thermal pressure is better than an instantaneous one. Thermal mitigation
measures are there for the well-being and safety of a device, not for
optimizations so it can and should be allowed to overreact, or have a
delayed reaction. But ping-pong-ing tasks back and forth between CPUs
due to changes in CPU capacity is harmful for performance. What would be
awesome to achieve with this is (close to) optimal use of restricted
capacities of CPUs, and I tend to believe instantaneous and most probably
out of date capacity values would not lead to this.

But this is almost a gut feeling and of course it should be validated on
devices with different thermal characteristics. Given the high variation
between devices with regards to this I'd be reluctant to tie it to the
PELT half life.

Regards,
Ionela.

> Thanks,

> Quentin

>

Daniel Lezcano Oct. 10, 2018, 4:54 p.m. UTC | #21

On 10/10/2018 17:35, Lukasz Luba wrote:
> Hi Thara,

> 

> I have run it on Exynos5433 mainline.

> When it is enabled with step_wise thermal governor,

> some of my tests are showing ~30-50% regression (i.e. hackbench),

> dhrystone ~10%.

> 

> Could you tell me which thermal governor was used in your case?

> Please also share the name of that benchmark, i will give it a try.

> Is it single threaded compute-intensive?


aobench AFAICT

It would be interesting if you can share the thermal profile of your board.


> On 10/09/2018 06:24 PM, Thara Gopinath wrote:

>> Thermal governors can respond to an overheat event for a cpu by

>> capping the cpu's maximum possible frequency. This in turn

>> means that the maximum available compute capacity of the

>> cpu is restricted. But today in linux kernel, in event of maximum

>> frequency capping of a cpu, the maximum available compute

>> capacity of the cpu is not adjusted at all. In other words, scheduler

>> is unware maximum cpu capacity restrictions placed due to thermal

>> activity. This patch series attempts to address this issue.

>> The benefits identified are better task placement among available

>> cpus in event of overheating which in turn leads to better

>> performance numbers.

>>

>> The delta between the maximum possible capacity of a cpu and

>> maximum available capacity of a cpu due to thermal event can

>> be considered as thermal pressure. Instantaneous thermal pressure

>> is hard to record and can sometime be erroneous as there can be mismatch

>> between the actual capping of capacity and scheduler recording it.

>> Thus solution is to have a weighted average per cpu value for thermal

>> pressure over time. The weight reflects the amount of time the cpu has

>> spent at a capped maximum frequency. To accumulate, average and

>> appropriately decay thermal pressure, this patch series uses pelt

>> signals and reuses the available framework that does a similar

>> bookkeeping of rt/dl task utilization.

>>

>> Regarding testing, basic build, boot and sanity testing have been

>> performed on hikey960 mainline kernel with debian file system.

>> Further aobench (An occlusion renderer for benchmarking realworld

>> floating point performance) showed the following results on hikey960

>> with debain.

>>

>>                                          Result          Standard        Standard

>>                                          (Time secs)     Error           Deviation

>> Hikey 960 - no thermal pressure applied 138.67          6.52            11.52%

>> Hikey 960 -  thermal pressure applied   122.37          5.78            11.57%

>>

>> Thara Gopinath (7):

>>    sched/pelt: Add option to make load and util calculations frequency

>>      invariant

>>    sched/pelt.c: Add support to track thermal pressure

>>    sched: Add infrastructure to store and update instantaneous thermal

>>      pressure

>>    sched: Initialize per cpu thermal pressure structure

>>    sched/fair: Enable CFS periodic tick to update thermal pressure

>>    sched/fair: update cpu_capcity to reflect thermal pressure

>>    thermal/cpu-cooling: Update thermal pressure in case of a maximum

>>      frequency capping

>>

>>   drivers/base/arch_topology.c  |  1 +

>>   drivers/thermal/cpu_cooling.c | 20 ++++++++++++-

>>   include/linux/sched.h         | 14 +++++++++

>>   kernel/sched/Makefile         |  2 +-

>>   kernel/sched/core.c           |  2 ++

>>   kernel/sched/fair.c           |  4 +++

>>   kernel/sched/pelt.c           | 40 ++++++++++++++++++--------

>>   kernel/sched/pelt.h           |  7 +++++

>>   kernel/sched/sched.h          |  1 +

>>   kernel/sched/thermal.c        | 66 +++++++++++++++++++++++++++++++++++++++++++

>>   kernel/sched/thermal.h        | 13 +++++++++

>>   11 files changed, 157 insertions(+), 13 deletions(-)

>>   create mode 100644 kernel/sched/thermal.c

>>   create mode 100644 kernel/sched/thermal.h

>>



-- 
 <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog

Thara Gopinath Oct. 10, 2018, 5:03 p.m. UTC | #22

Hello Quentin,
On 10/10/2018 05:55 AM, Quentin Perret wrote:
> Hi Vincent,

> 

> On Wednesday 10 Oct 2018 at 10:50:05 (+0200), Vincent Guittot wrote:

>> The problem with reflecting directly the capping is that it happens

>> far more often than the pace at which cpu_capacity_orig is updated in

>> the scheduler.

> 

> Hmm, how can you be so sure ? That most likely depends on the workload,

> the platform and the thermal governor. Some platforms heat up slowly,

> some quickly. The pace at which the thermal governor will change things

> should depend on that I assume.


Yes I agree. How often a thermal event occurs is indeed dependent on
workload, platform and governors. For e.g. hikey960  the same workload
displays different behavior with and without a fan. What we want is a
generic solution that will work across "spiky" events and more spread
out events. An instantaneous update of cpu_capacity most certainly does
not capture spiky events. Also it can lead to cpu_capacity swinging
wildly from the original value to a much lower value and back again.
Averaging of thermal capacity limitation (Across any pre-determined
duration i.e using Pelt signals or without) takes care of smoothening
the effect of thermal capping and making sure that capping events are
not entirely missed.  For me it is a much more elegant solution than an
instantaneous solution.
Having said that, like you mentioned below, when the thermal capping
goes away, it is not reflected as an instantaneous availability of spare
capacity. It is slowly increased. But it is not necessarily wrong
information. All it tells the scheduler is that during the last
"pre-determined" duration on an average "X" was the capacity available
for a CFS task on this cpu.

> 

>> This means that at the moment when scheduler uses the

>> value, it might not be correct anymore.

> 

> And OTOH, when you remove a cap for example, it will take time before

> the scheduler can see the newly available capacity if you need to wait

> for the signal to decay. So you are using a wrong information too in

> that scenario.

> 

>> Then, this value are also used

>> when building the sched_domain and setting max_cpu_capacity which

>> would also implies the rebuilt the sched_domain topology ...

> 

> Wait what ? I thought the thermal cap was reflected in capacity_of, not

> capacity_orig_of ... You need to rebuild the sched_domain in case of

> thermal pressure ?

> 

> Hmm, let me have a closer look at the patches, I must have missed

> something ...

> 

>> The pace of changing the capping is to fast to reflect that in the

>> whole scheduler topology

> 

> That's probably true in some cases, but it'd be cool to have numbers to

> back up that statement, I think.

> 

> Now, if you do need to rebuild the sched domain topology every time you

> update the thermal pressure, I think the PELT HL is _way_ too short for

> that ... You can't rebuild the whole thing every 32ms or so. Or am I

> misunderstanding something ?

> 

>>> Thara, have you tried to experiment with a simpler implementation as

>>> suggested by Ingo ?

>>>

>>> Also, assuming that we do want to average things, do we actually want to

>>> tie the thermal ramp-up time to the PELT half life ? That provides

>>> nice maths properties wrt the other signals, but it's not obvious to me

>>> that this thermal 'constant' should be the same on all platforms. Or

>>> maybe it should ?

>>

>> The main interest of using PELT signal is that thermal pressure will

>> evolve at the same pace as other signals used in the scheduler.

> 

> Right, I think this is a nice property too (assuming that we actually

> want to average things out).

> 

>> With

>> thermal pressure, we have the exact same problem as with RT tasks. The

>> thermal will cap the max frequency which will cap the utilization of

>> the tasks running on the CPU

> 

> Well, the nature of the signal is slightly different IMO. Yes it's

> capacity, but you're no actually measuring time spent on the CPU. All

> other PELT signals are based on time, this thermal thing isn't, so it is

> kinda different in a way. And I'm still wondering if it could be helpful

> to be able to have a different HL for that thermal signal. That would

> 'break' the nice maths properties we have, yes, but is it a problem or is

> it actually helpful to cope with the thermal characteristics of

> different platforms ?



> 

> Thanks,

> Quentin

> 



-- 
Regards
Thara

Lukasz Luba Oct. 11, 2018, 7:35 a.m. UTC | #23

Hi Daniel,

On 10/10/2018 06:54 PM, Daniel Lezcano wrote:
> On 10/10/2018 17:35, Lukasz Luba wrote:

>> Hi Thara,

>>

>> I have run it on Exynos5433 mainline.

>> When it is enabled with step_wise thermal governor,

>> some of my tests are showing ~30-50% regression (i.e. hackbench),

>> dhrystone ~10%.

>>

>> Could you tell me which thermal governor was used in your case?

>> Please also share the name of that benchmark, i will give it a try.

>> Is it single threaded compute-intensive?

> 

> aobench AFAICT

> 

> It would be interesting if you can share the thermal profile of your board.

> 

Thanks for the benchmark name.
It was tested on Samsung TM2 device with Exynos 5433 with debian.
Thermal stuff you can find in mainline:
arch/arm64/boot/dts/exynos/exynos5433-tmu.dtsi

Regards,
Lukasz

> 

>> On 10/09/2018 06:24 PM, Thara Gopinath wrote:

>>> Thermal governors can respond to an overheat event for a cpu by

>>> capping the cpu's maximum possible frequency. This in turn

>>> means that the maximum available compute capacity of the

>>> cpu is restricted. But today in linux kernel, in event of maximum

>>> frequency capping of a cpu, the maximum available compute

>>> capacity of the cpu is not adjusted at all. In other words, scheduler

>>> is unware maximum cpu capacity restrictions placed due to thermal

>>> activity. This patch series attempts to address this issue.

>>> The benefits identified are better task placement among available

>>> cpus in event of overheating which in turn leads to better

>>> performance numbers.

>>>

>>> The delta between the maximum possible capacity of a cpu and

>>> maximum available capacity of a cpu due to thermal event can

>>> be considered as thermal pressure. Instantaneous thermal pressure

>>> is hard to record and can sometime be erroneous as there can be mismatch

>>> between the actual capping of capacity and scheduler recording it.

>>> Thus solution is to have a weighted average per cpu value for thermal

>>> pressure over time. The weight reflects the amount of time the cpu has

>>> spent at a capped maximum frequency. To accumulate, average and

>>> appropriately decay thermal pressure, this patch series uses pelt

>>> signals and reuses the available framework that does a similar

>>> bookkeeping of rt/dl task utilization.

>>>

>>> Regarding testing, basic build, boot and sanity testing have been

>>> performed on hikey960 mainline kernel with debian file system.

>>> Further aobench (An occlusion renderer for benchmarking realworld

>>> floating point performance) showed the following results on hikey960

>>> with debain.

>>>

>>>                                           Result          Standard        Standard

>>>                                           (Time secs)     Error           Deviation

>>> Hikey 960 - no thermal pressure applied 138.67          6.52            11.52%

>>> Hikey 960 -  thermal pressure applied   122.37          5.78            11.57%

>>>

>>> Thara Gopinath (7):

>>>     sched/pelt: Add option to make load and util calculations frequency

>>>       invariant

>>>     sched/pelt.c: Add support to track thermal pressure

>>>     sched: Add infrastructure to store and update instantaneous thermal

>>>       pressure

>>>     sched: Initialize per cpu thermal pressure structure

>>>     sched/fair: Enable CFS periodic tick to update thermal pressure

>>>     sched/fair: update cpu_capcity to reflect thermal pressure

>>>     thermal/cpu-cooling: Update thermal pressure in case of a maximum

>>>       frequency capping

>>>

>>>    drivers/base/arch_topology.c  |  1 +

>>>    drivers/thermal/cpu_cooling.c | 20 ++++++++++++-

>>>    include/linux/sched.h         | 14 +++++++++

>>>    kernel/sched/Makefile         |  2 +-

>>>    kernel/sched/core.c           |  2 ++

>>>    kernel/sched/fair.c           |  4 +++

>>>    kernel/sched/pelt.c           | 40 ++++++++++++++++++--------

>>>    kernel/sched/pelt.h           |  7 +++++

>>>    kernel/sched/sched.h          |  1 +

>>>    kernel/sched/thermal.c        | 66 +++++++++++++++++++++++++++++++++++++++++++

>>>    kernel/sched/thermal.h        | 13 +++++++++

>>>    11 files changed, 157 insertions(+), 13 deletions(-)

>>>    create mode 100644 kernel/sched/thermal.c

>>>    create mode 100644 kernel/sched/thermal.h

>>>

> 

>

Lukasz Luba Oct. 11, 2018, 11:10 a.m. UTC | #24

On 10/10/2018 07:30 PM, Thara Gopinath wrote:
> Hello Lukasz,

> 

> On 10/10/2018 11:35 AM, Lukasz Luba wrote:

>> Hi Thara,

>>

>> I have run it on Exynos5433 mainline.

>> When it is enabled with step_wise thermal governor,

>> some of my tests are showing ~30-50% regression (i.e. hackbench),

>> dhrystone ~10%.

> 

> That is interesting. If I understand correctly, dhrystone spawns 16

> threads or so and floods the system. In "theory", such a test should not

> see any performance improvement and degradation. What is the thermal

> activity like in your system? I will try running one of these tests on

> hikey960.

I use this dhrystone implementation:
https://github.com/Keith-S-Thompson/dhrystone/blob/master/v2.2/dry.c
It does not span new threads/processes and I pinned it to a single cpu.

My thermal setup is probably different than yours.
You have (on hikey960) probably 1 sensor for whole SoC and one thermal
zone (if it is this mainline file:
arch/arm64/boot/dts/hisilicon/hi3660.dtsi).
This thermal zone has two cooling devices - two clusters with dvfs.
Your temperature signal read out from that sensor is probably much
smoother. When you have sensor inside cluster, the rising factor
can be even 20deg/s (for big cores).
In my case, there are 4 thermal zones, each cluster has it's private
sensor and thermal zone. There is no 'SoC sensor' or 'PCB sensor',
which is recommended for IPA.
>>

>> Could you tell me which thermal governor was used in your case?

>> Please also share the name of that benchmark, i will give it a try.

>> Is it single threaded compute-intensive?

> 

> Step-wise governor.

> I use aobench which is part of phoronix-test-suite.

> 

> Regards

> Thara

> 

I have built this aobench and run it pinned to single big cpu:
time taskset -c 4 ./aobench
The results:
3min-5:30min [mainline]
5:15min-5:50min [+patchset]

The idea is definitely worth to investigate further.

Regards,
Lukasz

Lukasz Luba Oct. 12, 2018, 9:37 a.m. UTC | #25

On 10/11/2018 10:23 AM, Daniel Lezcano wrote:
> On 11/10/2018 09:35, Lukasz Luba wrote:

>> Hi Daniel,

>>

>> On 10/10/2018 06:54 PM, Daniel Lezcano wrote:

>>> On 10/10/2018 17:35, Lukasz Luba wrote:

>>>> Hi Thara,

>>>>

>>>> I have run it on Exynos5433 mainline.

>>>> When it is enabled with step_wise thermal governor,

>>>> some of my tests are showing ~30-50% regression (i.e. hackbench),

>>>> dhrystone ~10%.

>>>>

>>>> Could you tell me which thermal governor was used in your case?

>>>> Please also share the name of that benchmark, i will give it a try.

>>>> Is it single threaded compute-intensive?

>>>

>>> aobench AFAICT

>>>

>>> It would be interesting if you can share the thermal profile of your board.

>>>

>> Thanks for the benchmark name.

>> It was tested on Samsung TM2 device with Exynos 5433 with debian.

>> Thermal stuff you can find in mainline:

>> arch/arm64/boot/dts/exynos/exynos5433-tmu.dtsi

> 

> By thermal profile, I was meaning a figure with the temperature

> regulation (temperature idle, then workload, then temperature increase,

> mitigated temperature, end of workload, temperature decrease).

Currently, I cannot share these data.
> 

> The thermal description looks wrong in the DT. I suggest to experiment

> the series with the DT fixed.


Could you tell more what looks wrong or maybe send a draft/patch?
> eg. from hi6220.dtsi

> 

> 

>                  thermal-zones {

> 

>                          cls0: cls0 {

>                                  polling-delay = <1000>;

>                                  polling-delay-passive = <100>;

>                                  sustainable-power = <3326>;

> 

>                                  /* sensor ID */

>                                  thermal-sensors = <&tsensor 2>;

> 

>                                  trips {

>                                          threshold: trip-point@0 {

>                                                  temperature = <65000>;

>                                                  hysteresis = <0>;

>                                                  type = "passive";

>                                          };

> 

>                                          target: trip-point@1 {

>                                                  temperature = <75000>;

>                                                  hysteresis = <0>;

>                                                  type = "passive";

>                                          };

>                                  };

> 

>                                  cooling-maps {

>                                          map0 {

>                                                  trip = <&target>;

>                                                  cooling-device = <&cpu0

> THERMAL_NO_LIMIT THERMAL_NO_LIMIT>;

>                                          };

>                                  };

>                          };

>                  };

> 

> Note the cooling devices are *passive*

> 

>

For me this DT looks like it is copied from ARM Juno board for IPA,
where AFAIR there were no interrupts for the temperature sensor
(due to some psci/scpi/firmware lack of support).
The cooling map is also short. I am not sure if it would work efficient
with step-wise, with IPA it will work.

The DT configuration for Exynos has been working OK for years.
Exynos5433 supports 8 trip points, Exynos5422 has 4.
So if you have more trip points you need to start 'polling'.
Therefore, for Exynos there is no need to run queued work
in thermal framework every 1s or 100ms just to
read the current temperature.
The exynos-tmu driver gets irq and calls thermal framework
when such trip point is crossed. Then there is 'monitoring
window' for the trip point...
Long story short: 'active' is used for that reason.

Regards,
Lukasz

Ingo Molnar Oct. 16, 2018, 7:33 a.m. UTC | #26

* Thara Gopinath <thara.gopinath@linaro.org> wrote:

> >> Regarding testing, basic build, boot and sanity testing have been

> >> performed on hikey960 mainline kernel with debian file system.

> >> Further aobench (An occlusion renderer for benchmarking realworld

> >> floating point performance) showed the following results on hikey960

> >> with debain.

> >>

> >>                                         Result          Standard        Standard

> >>                                         (Time secs)     Error           Deviation

> >> Hikey 960 - no thermal pressure applied 138.67          6.52            11.52%

> >> Hikey 960 -  thermal pressure applied   122.37          5.78            11.57%

> > 

> > Wow, +13% speedup, impressive! We definitely want this outcome.

> > 

> > I'm wondering what happens if we do not track and decay the thermal 

> > load at all at the PELT level, but instantaneously decrease/increase 

> > effective CPU capacity in reaction to thermal events we receive from 

> > the CPU.

> 

> The problem with instantaneous update is that sometimes thermal events 

> happen at a much faster pace than cpu_capacity is updated in the 

> scheduler. This means that at the moment when scheduler uses the 

> value, it might not be correct anymore.

Let me offer a different interpretation: if we average throttling events 
then we create a 'smooth' average of 'true CPU capacity' that doesn't 
fluctuate much. This allows more stable yet asymmetric task placement if 
the thermal characteristics of the different cores is different 
(asymmetric). This, compared to instantaneous updates, would reduce 
unnecessary task migrations between cores.

Is that accurate?

If the thermal characteristics of the cores is roughly symmetric and the 
measured CPU-intense load itself is symmetric as well, then I have 
trouble seeing why reacting to thermal events should make any difference 
at all.

Are there any inherent asymmetries in the thermal properties of the 
cores, or in the benchmarked workload itself?

Thanks,

	Ingo

Vincent Guittot Oct. 16, 2018, 5:11 p.m. UTC | #27

Hi Lukasz,

On Thu, 11 Oct 2018 at 13:10, Lukasz Luba <l.luba@partner.samsung.com> wrote:
>

>

>

> On 10/10/2018 07:30 PM, Thara Gopinath wrote:

> > Hello Lukasz,

> >

> > On 10/10/2018 11:35 AM, Lukasz Luba wrote:

> >> Hi Thara,

> >>

> >> I have run it on Exynos5433 mainline.

> >> When it is enabled with step_wise thermal governor,

> >> some of my tests are showing ~30-50% regression (i.e. hackbench),

> >> dhrystone ~10%.

> >

> > That is interesting. If I understand correctly, dhrystone spawns 16

> > threads or so and floods the system. In "theory", such a test should not

> > see any performance improvement and degradation. What is the thermal

> > activity like in your system? I will try running one of these tests on

> > hikey960.

> I use this dhrystone implementation:

> https://github.com/Keith-S-Thompson/dhrystone/blob/master/v2.2/dry.c

> It does not span new threads/processes and I pinned it to a single cpu.

>

> My thermal setup is probably different than yours.

> You have (on hikey960) probably 1 sensor for whole SoC and one thermal

> zone (if it is this mainline file:

> arch/arm64/boot/dts/hisilicon/hi3660.dtsi).

> This thermal zone has two cooling devices - two clusters with dvfs.

> Your temperature signal read out from that sensor is probably much

> smoother. When you have sensor inside cluster, the rising factor

> can be even 20deg/s (for big cores).

> In my case, there are 4 thermal zones, each cluster has it's private

> sensor and thermal zone. There is no 'SoC sensor' or 'PCB sensor',

> which is recommended for IPA.

> >>

> >> Could you tell me which thermal governor was used in your case?

> >> Please also share the name of that benchmark, i will give it a try.

> >> Is it single threaded compute-intensive?

> >

> > Step-wise governor.

> > I use aobench which is part of phoronix-test-suite.

> >

> > Regards

> > Thara

> >

> I have built this aobench and run it pinned to single big cpu:

> time taskset -c 4 ./aobench


Why have you pinned the test only on CPU4 ?
Goal of thermal pressure is to inform the scheduler of reduced compute
capacity and help the scheduler to take better decision in tasks
placement.
So I would not expect perf impact on your test as the bench will stay
pinned on the cpu4
That being said you obviously have perf impact as shown below in your results
And other tasks on the system are not pinned and might come and
disturb your bench

> The results:

> 3min-5:30min [mainline]

> 5:15min-5:50min [+patchset]

>

> The idea is definitely worth to investigate further.


Yes I agree

Vincent
>

> Regards,

> Lukasz

>

>

>

Thara Gopinath Oct. 17, 2018, 4:21 p.m. UTC | #28

On 10/16/2018 03:33 AM, Ingo Molnar wrote:
> 

> * Thara Gopinath <thara.gopinath@linaro.org> wrote:

> 

>>>> Regarding testing, basic build, boot and sanity testing have been

>>>> performed on hikey960 mainline kernel with debian file system.

>>>> Further aobench (An occlusion renderer for benchmarking realworld

>>>> floating point performance) showed the following results on hikey960

>>>> with debain.

>>>>

>>>>                                         Result          Standard        Standard

>>>>                                         (Time secs)     Error           Deviation

>>>> Hikey 960 - no thermal pressure applied 138.67          6.52            11.52%

>>>> Hikey 960 -  thermal pressure applied   122.37          5.78            11.57%

>>>

>>> Wow, +13% speedup, impressive! We definitely want this outcome.

>>>

>>> I'm wondering what happens if we do not track and decay the thermal 

>>> load at all at the PELT level, but instantaneously decrease/increase 

>>> effective CPU capacity in reaction to thermal events we receive from 

>>> the CPU.

>>

>> The problem with instantaneous update is that sometimes thermal events 

>> happen at a much faster pace than cpu_capacity is updated in the 

>> scheduler. This means that at the moment when scheduler uses the 

>> value, it might not be correct anymore.

> 

> Let me offer a different interpretation: if we average throttling events 

> then we create a 'smooth' average of 'true CPU capacity' that doesn't 

> fluctuate much. This allows more stable yet asymmetric task placement if 

> the thermal characteristics of the different cores is different 

> (asymmetric). This, compared to instantaneous updates, would reduce 

> unnecessary task migrations between cores.

> 

> Is that accurate?


Yes. I think it is accurate. I will also add that if we don't average
throttling events, we will miss the events that occur in between load
balancing(LB) period.

> 

> If the thermal characteristics of the cores is roughly symmetric and the 

> measured CPU-intense load itself is symmetric as well, then I have 

> trouble seeing why reacting to thermal events should make any difference 

> at all.

In this scenario, i agree that scheduler reaction to thermal events
should not make any difference in fact we should not observe any
improvement or degradation in performance.

> 

> Are there any inherent asymmetries in the thermal properties of the 

> cores, or in the benchmarked workload itself?


The benchmarked workload , meaning aobench? I don't think there arre any
asymmetries there. On Hikey960, there are two clusters with different
frequency domains. So yes I will say there is asymmetry there. Asides,
IMHO, any other tasks running on the system can create an inherent
asymmetry as cpu utilizations can vary.

Regards
Thara
> 

> Thanks,

> 

> 	Ingo

> 



-- 
Regards
Thara

Thara Gopinath Oct. 17, 2018, 4:24 p.m. UTC | #29

On 10/16/2018 01:11 PM, Vincent Guittot wrote:
> Hi Lukasz,

> 

> On Thu, 11 Oct 2018 at 13:10, Lukasz Luba <l.luba@partner.samsung.com> wrote:

>>

>>

>>

>> On 10/10/2018 07:30 PM, Thara Gopinath wrote:

>>> Hello Lukasz,

>>>

>>> On 10/10/2018 11:35 AM, Lukasz Luba wrote:

>>>> Hi Thara,

>>>>

>>>> I have run it on Exynos5433 mainline.

>>>> When it is enabled with step_wise thermal governor,

>>>> some of my tests are showing ~30-50% regression (i.e. hackbench),

>>>> dhrystone ~10%.

>>>

>>> That is interesting. If I understand correctly, dhrystone spawns 16

>>> threads or so and floods the system. In "theory", such a test should not

>>> see any performance improvement and degradation. What is the thermal

>>> activity like in your system? I will try running one of these tests on

>>> hikey960.

>> I use this dhrystone implementation:

>> https://github.com/Keith-S-Thompson/dhrystone/blob/master/v2.2/dry.c

>> It does not span new threads/processes and I pinned it to a single cpu.

>>

>> My thermal setup is probably different than yours.

>> You have (on hikey960) probably 1 sensor for whole SoC and one thermal

>> zone (if it is this mainline file:

>> arch/arm64/boot/dts/hisilicon/hi3660.dtsi).

>> This thermal zone has two cooling devices - two clusters with dvfs.

>> Your temperature signal read out from that sensor is probably much

>> smoother. When you have sensor inside cluster, the rising factor

>> can be even 20deg/s (for big cores).

>> In my case, there are 4 thermal zones, each cluster has it's private

>> sensor and thermal zone. There is no 'SoC sensor' or 'PCB sensor',

>> which is recommended for IPA.

>>>>

>>>> Could you tell me which thermal governor was used in your case?

>>>> Please also share the name of that benchmark, i will give it a try.

>>>> Is it single threaded compute-intensive?

>>>

>>> Step-wise governor.

>>> I use aobench which is part of phoronix-test-suite.

>>>

>>> Regards

>>> Thara

>>>

>> I have built this aobench and run it pinned to single big cpu:

>> time taskset -c 4 ./aobench

> 

> Why have you pinned the test only on CPU4 ?

> Goal of thermal pressure is to inform the scheduler of reduced compute

> capacity and help the scheduler to take better decision in tasks

> placement.


Hi Lukasz,

I agree with Vincent's observation. I had not seen this earlier. Pinning
a task to a cpu will obviously prevent migration. The performance
regression could be due to as Vincent mentioned below other tasks in the
system. On another note, which cpufreq governor are you using? Is the
core being bumped up to highest possible OPP during the test?

Regards
Thara
> So I would not expect perf impact on your test as the bench will stay

> pinned on the cpu4

> That being said you obviously have perf impact as shown below in your results

> And other tasks on the system are not pinned and might come and

> disturb your bench

> 

>> The results:

>> 3min-5:30min [mainline]

>> 5:15min-5:50min [+patchset]

>>

>> The idea is definitely worth to investigate further.

> 

> Yes I agree

> 

> Vincent

>>

>> Regards,

>> Lukasz

>>

>>

>>



-- 
Regards
Thara

Rafael J. Wysocki Oct. 18, 2018, 7:08 a.m. UTC | #30

On Thu, Oct 18, 2018 at 8:48 AM Ingo Molnar <mingo@kernel.org> wrote:
>

>

> * Thara Gopinath <thara.gopinath@linaro.org> wrote:

>

> > On 10/16/2018 03:33 AM, Ingo Molnar wrote:

> > >

> > > * Thara Gopinath <thara.gopinath@linaro.org> wrote:

> > >

> > >>>> Regarding testing, basic build, boot and sanity testing have been

> > >>>> performed on hikey960 mainline kernel with debian file system.

> > >>>> Further aobench (An occlusion renderer for benchmarking realworld

> > >>>> floating point performance) showed the following results on hikey960

> > >>>> with debain.

> > >>>>

> > >>>>                                         Result          Standard        Standard

> > >>>>                                         (Time secs)     Error           Deviation

> > >>>> Hikey 960 - no thermal pressure applied 138.67          6.52            11.52%

> > >>>> Hikey 960 -  thermal pressure applied   122.37          5.78            11.57%

> > >>>

> > >>> Wow, +13% speedup, impressive! We definitely want this outcome.

> > >>>

> > >>> I'm wondering what happens if we do not track and decay the thermal

> > >>> load at all at the PELT level, but instantaneously decrease/increase

> > >>> effective CPU capacity in reaction to thermal events we receive from

> > >>> the CPU.

> > >>

> > >> The problem with instantaneous update is that sometimes thermal events

> > >> happen at a much faster pace than cpu_capacity is updated in the

> > >> scheduler. This means that at the moment when scheduler uses the

> > >> value, it might not be correct anymore.

> > >

> > > Let me offer a different interpretation: if we average throttling events

> > > then we create a 'smooth' average of 'true CPU capacity' that doesn't

> > > fluctuate much. This allows more stable yet asymmetric task placement if

> > > the thermal characteristics of the different cores is different

> > > (asymmetric). This, compared to instantaneous updates, would reduce

> > > unnecessary task migrations between cores.

> > >

> > > Is that accurate?

> >

> > Yes. I think it is accurate. I will also add that if we don't average

> > throttling events, we will miss the events that occur in between load

> > balancing(LB) period.

>

> Yeah, so I'd definitely suggest to not integrate this averaging into

> pelt.c in the fashion presented, because:

>

>  - This couples your thermal throttling averaging to the PELT decay

>    half-time AFAICS, which would break the other user every time the

>    decay is changed/tuned.

>

>  - The boolean flag that changes behavior in pelt.c is not particularly

>    clean either and complicates the code.

>

>  - Instead maybe factor out a decaying average library into

>    kernel/sched/avg.h perhaps (if this truly improves the code), and use

>    those methods both in pelt.c and any future thermal.c - and maybe

>    other places where we do decaying averages.

>

>  - But simple decaying averages are not that complex either, so I think

>    your original solution of open coding it is probably fine as well.

>

> Furthermore, any logic introduced by thermal.c and the resulting change

> to load-balancing behavior would have to be in perfect sync with cpufreq

> governor actions - one mechanism should not work against the other.


Right, that really is required.

> The only long term maintainable solution is to move all high level

> cpufreq logic and policy handling code into kernel/sched/cpufreq*.c,

> which has been done to a fair degree already in the past ~2 years - but

> it's unclear to me to what extent this is true for thermal throttling

> policy currently: there might be more governor surgery and code

> reshuffling required?


It doesn't cover thermal management directly ATM.

The EAS work kind of hopes to make a connection in there by adding a
common energy model to underlie both the performance scaling and
thermal management, but it doesn't change the thermal decision making
part AFAICS.

So it is fair to say that additional governor surgery and code
reshuffling will be required IMO.

Thanks,
Rafael

Ingo Molnar Oct. 18, 2018, 7:50 a.m. UTC | #31

* Rafael J. Wysocki <rafael@kernel.org> wrote:

> > The only long term maintainable solution is to move all high level

> > cpufreq logic and policy handling code into kernel/sched/cpufreq*.c,

> > which has been done to a fair degree already in the past ~2 years - but

> > it's unclear to me to what extent this is true for thermal throttling

> > policy currently: there might be more governor surgery and code

> > reshuffling required?

> 

> It doesn't cover thermal management directly ATM.

> 

> The EAS work kind of hopes to make a connection in there by adding a

> common energy model to underlie both the performance scaling and

> thermal management, but it doesn't change the thermal decision making

> part AFAICS.

> 

> So it is fair to say that additional governor surgery and code

> reshuffling will be required IMO.


BTW., when factoring out high level thermal management code it might make 
sense to increase the prominence of the cpufreq code within the scheduler 
and organize it a bit better, by introducing its own 
kernel/sched/cpufreq/ directory and renaming things the following way:

  kernel/sched/cpufreq.c		=> kernel/sched/cpufreq/core.c
  kernel/sched/cpufreq_schedutil.c	=> kernel/sched/cpufreq/metrics.c
  kernel/sched/thermal.c		=> kernel/sched/cpufreq/thermal.c

... or so?

With no change to functionality, this is just a re-organization and 
expansion/preparation for the bright future. =B-)

Thanks,

	Ingo

Lukasz Luba Oct. 18, 2018, 8 a.m. UTC | #32

On 10/17/2018 06:24 PM, Thara Gopinath wrote:
> On 10/16/2018 01:11 PM, Vincent Guittot wrote:

>> Hi Lukasz,

>>

>> On Thu, 11 Oct 2018 at 13:10, Lukasz Luba <l.luba@partner.samsung.com> wrote:

>>>

>>>

>>>

>>> On 10/10/2018 07:30 PM, Thara Gopinath wrote:

>>>> Hello Lukasz,

>>>>

>>>> On 10/10/2018 11:35 AM, Lukasz Luba wrote:

>>>>> Hi Thara,

>>>>>

>>>>> I have run it on Exynos5433 mainline.

>>>>> When it is enabled with step_wise thermal governor,

>>>>> some of my tests are showing ~30-50% regression (i.e. hackbench),

>>>>> dhrystone ~10%.

>>>>

>>>> That is interesting. If I understand correctly, dhrystone spawns 16

>>>> threads or so and floods the system. In "theory", such a test should not

>>>> see any performance improvement and degradation. What is the thermal

>>>> activity like in your system? I will try running one of these tests on

>>>> hikey960.

>>> I use this dhrystone implementation:

>>> https://github.com/Keith-S-Thompson/dhrystone/blob/master/v2.2/dry.c

>>> It does not span new threads/processes and I pinned it to a single cpu.

>>>

>>> My thermal setup is probably different than yours.

>>> You have (on hikey960) probably 1 sensor for whole SoC and one thermal

>>> zone (if it is this mainline file:

>>> arch/arm64/boot/dts/hisilicon/hi3660.dtsi).

>>> This thermal zone has two cooling devices - two clusters with dvfs.

>>> Your temperature signal read out from that sensor is probably much

>>> smoother. When you have sensor inside cluster, the rising factor

>>> can be even 20deg/s (for big cores).

>>> In my case, there are 4 thermal zones, each cluster has it's private

>>> sensor and thermal zone. There is no 'SoC sensor' or 'PCB sensor',

>>> which is recommended for IPA.

>>>>>

>>>>> Could you tell me which thermal governor was used in your case?

>>>>> Please also share the name of that benchmark, i will give it a try.

>>>>> Is it single threaded compute-intensive?

>>>>

>>>> Step-wise governor.

>>>> I use aobench which is part of phoronix-test-suite.

>>>>

>>>> Regards

>>>> Thara

>>>>

>>> I have built this aobench and run it pinned to single big cpu:

>>> time taskset -c 4 ./aobench

>>

>> Why have you pinned the test only on CPU4 ?

>> Goal of thermal pressure is to inform the scheduler of reduced compute

>> capacity and help the scheduler to take better decision in tasks

>> placement.

> 

> Hi Lukasz,

> 

> I agree with Vincent's observation. I had not seen this earlier. Pinning

> a task to a cpu will obviously prevent migration. The performance

> regression could be due to as Vincent mentioned below other tasks in the

> system. On another note, which cpufreq governor are you using? Is the

> core being bumped up to highest possible OPP during the test?

The governor is ondemand. No, it is not at highest OPP.
Could you send me the needed test configuration and condition?
We would then align in this area.

Regards,
Lukasz
> 

> Regards

> Thara

>> So I would not expect perf impact on your test as the bench will stay

>> pinned on the cpu4

>> That being said you obviously have perf impact as shown below in your results

>> And other tasks on the system are not pinned and might come and

>> disturb your bench

>>

>>> The results:

>>> 3min-5:30min [mainline]

>>> 5:15min-5:50min [+patchset]

>>>

>>> The idea is definitely worth to investigate further.

>>

>> Yes I agree

>>

>> Vincent

>>>

>>> Regards,

>>> Lukasz

>>>

>>>

>>>

> 

>

Rafael J. Wysocki Oct. 18, 2018, 8:14 a.m. UTC | #33

On Thu, Oct 18, 2018 at 9:50 AM Ingo Molnar <mingo@kernel.org> wrote:
>

>

> * Rafael J. Wysocki <rafael@kernel.org> wrote:

>

> > > The only long term maintainable solution is to move all high level

> > > cpufreq logic and policy handling code into kernel/sched/cpufreq*.c,

> > > which has been done to a fair degree already in the past ~2 years - but

> > > it's unclear to me to what extent this is true for thermal throttling

> > > policy currently: there might be more governor surgery and code

> > > reshuffling required?

> >

> > It doesn't cover thermal management directly ATM.

> >

> > The EAS work kind of hopes to make a connection in there by adding a

> > common energy model to underlie both the performance scaling and

> > thermal management, but it doesn't change the thermal decision making

> > part AFAICS.

> >

> > So it is fair to say that additional governor surgery and code

> > reshuffling will be required IMO.

>

> BTW., when factoring out high level thermal management code it might make

> sense to increase the prominence of the cpufreq code within the scheduler

> and organize it a bit better, by introducing its own

> kernel/sched/cpufreq/ directory and renaming things the following way:

>

>   kernel/sched/cpufreq.c                => kernel/sched/cpufreq/core.c

>   kernel/sched/cpufreq_schedutil.c      => kernel/sched/cpufreq/metrics.c

>   kernel/sched/thermal.c                => kernel/sched/cpufreq/thermal.c

>

> ... or so?

>

> With no change to functionality, this is just a re-organization and

> expansion/preparation for the bright future. =B-)


No disagreement here. :-)

Cheers,
Rafael

Thara Gopinath Oct. 18, 2018, 4:17 p.m. UTC | #34

On 10/18/2018 02:48 AM, Ingo Molnar wrote:
> 

> * Thara Gopinath <thara.gopinath@linaro.org> wrote:

> 

>> On 10/16/2018 03:33 AM, Ingo Molnar wrote:

>>>

>>> * Thara Gopinath <thara.gopinath@linaro.org> wrote:

>>>

>>>>>> Regarding testing, basic build, boot and sanity testing have been

>>>>>> performed on hikey960 mainline kernel with debian file system.

>>>>>> Further aobench (An occlusion renderer for benchmarking realworld

>>>>>> floating point performance) showed the following results on hikey960

>>>>>> with debain.

>>>>>>

>>>>>>                                         Result          Standard        Standard

>>>>>>                                         (Time secs)     Error           Deviation

>>>>>> Hikey 960 - no thermal pressure applied 138.67          6.52            11.52%

>>>>>> Hikey 960 -  thermal pressure applied   122.37          5.78            11.57%

>>>>>

>>>>> Wow, +13% speedup, impressive! We definitely want this outcome.

>>>>>

>>>>> I'm wondering what happens if we do not track and decay the thermal 

>>>>> load at all at the PELT level, but instantaneously decrease/increase 

>>>>> effective CPU capacity in reaction to thermal events we receive from 

>>>>> the CPU.

>>>>

>>>> The problem with instantaneous update is that sometimes thermal events 

>>>> happen at a much faster pace than cpu_capacity is updated in the 

>>>> scheduler. This means that at the moment when scheduler uses the 

>>>> value, it might not be correct anymore.

>>>

>>> Let me offer a different interpretation: if we average throttling events 

>>> then we create a 'smooth' average of 'true CPU capacity' that doesn't 

>>> fluctuate much. This allows more stable yet asymmetric task placement if 

>>> the thermal characteristics of the different cores is different 

>>> (asymmetric). This, compared to instantaneous updates, would reduce 

>>> unnecessary task migrations between cores.

>>>

>>> Is that accurate?

>>

>> Yes. I think it is accurate. I will also add that if we don't average

>> throttling events, we will miss the events that occur in between load

>> balancing(LB) period.

> 

> Yeah, so I'd definitely suggest to not integrate this averaging into 

> pelt.c in the fashion presented, because:

> 

>  - This couples your thermal throttling averaging to the PELT decay

>    half-time AFAICS, which would break the other user every time the

>    decay is changed/tuned.

Let me pose the question in this manner. Today rt utilization, dl
utilization etc is tracked via PELT. The inherent idea is that, a cpu
has some capacity stolen by let us say rt task and so subtract the
capacity utilized by the rt task from cpu when calculating the remaining
capacity for a CFS task. Now, the idea behind thermal pressure is that,
the maximum available capacity of a cpu is limited due to a thermal
event, so take it out of the remaining capacity of a cpu for a CFS task
(at-least to start with). If utilization for rt task, dl task etc is
calculated via PELT and the capacity constraint due to thermal event
calculated by another averaging algorithm, there can be some mismatch in
the "capacity stolen" calculations, right? Isnt't it better to track all
the events that can limit the capacity of a cpu via one algorithm ?

> 

>  - The boolean flag that changes behavior in pelt.c is not particularly

>    clean either and complicates the code.


I agree. Part of the idea behind this RFC patch set was to brainstorm if
there is a better approach to this.
> 

>  - Instead maybe factor out a decaying average library into

>    kernel/sched/avg.h perhaps (if this truly improves the code), and use

>    those methods both in pelt.c and any future thermal.c - and maybe

>    other places where we do decaying averages.

> 

>  - But simple decaying averages are not that complex either, so I think

>    your original solution of open coding it is probably fine as well. 

> 

> Furthermore, any logic introduced by thermal.c and the resulting change 

> to load-balancing behavior would have to be in perfect sync with cpufreq 

> governor actions - one mechanism should not work against the other.

I agree. I will go one step further and argue that the changes here make
best sense with sched util governor as it picks up the signals directly
from the scheduler to choose an OPP. Any other governor will be less
effective.


> 

> The only long term maintainable solution is to move all high level 

> cpufreq logic and policy handling code into kernel/sched/cpufreq*.c, 

> which has been done to a fair degree already in the past ~2 years - but 

> it's unclear to me to what extent this is true for thermal throttling 

> policy currently: there might be more governor surgery and code 

> reshuffling required?

> 

> The short term goal would be to at minimum have all the bugs lined up in 

> kernel/sched/* neatly, so that we have the chance to see and fix them in 

> a single place. ;-)


I see that Daniel has already sent some patches for this!
Regards
Thara
> 

> Thanks,

> 

> 	Ingo

> 



-- 
Regards
Thara

Ingo Molnar Oct. 19, 2018, 8:02 a.m. UTC | #35

* Thara Gopinath <thara.gopinath@linaro.org> wrote:

> > Yeah, so I'd definitely suggest to not integrate this averaging into 

> > pelt.c in the fashion presented, because:

> > 

> >  - This couples your thermal throttling averaging to the PELT decay

> >    half-time AFAICS, which would break the other user every time the

> >    decay is changed/tuned.

>

> Let me pose the question in this manner. Today rt utilization, dl 

> utilization etc is tracked via PELT. The inherent idea is that, a cpu 

> has some capacity stolen by let us say rt task and so subtract the 

> capacity utilized by the rt task from cpu when calculating the 

> remaining capacity for a CFS task. Now, the idea behind thermal 

> pressure is that, the maximum available capacity of a cpu is limited 

> due to a thermal event, so take it out of the remaining capacity of a 

> cpu for a CFS task (at-least to start with). If utilization for rt 

> task, dl task etc is calculated via PELT and the capacity constraint 

> due to thermal event calculated by another averaging algorithm, there 

> can be some mismatch in the "capacity stolen" calculations, right? 

> Isnt't it better to track all the events that can limit the capacity of 

> a cpu via one algorithm ?

So what unifies RT and DL utilization is that those are all direct task 
loads independent of external factors.

Thermal load is more of a complex physical property of the combination of 
various internal and external factors: the whole system workload running 
(not just that single task), the thermal topology of the hardware, 
external temperatures, the hardware's and the governor's policy regarding 
thermal loads, etc. etc.

So while obviously when effective capacity of a CPU is calculated then 
these will all be subtracted from the maximum capacity of the CPU, but I 
think the thermal load metric and the averaging itself is probably 
dissimilar enough to not be calculated via the PELT half-life for 
example.

For example a reasonable future property would be match the speed of 
decay in the averaging to the observed speed of decay via temperature 
sensors? Most temperature sensors do a certain amount of averaging 
themselves as well - and some platforms might not expose temperatures at 
all, only 'got thermally throttled' / 'running at full speed' kind of 
feedback.

Anyway, this doesn't really impact the concept, it's an implementational 
detail, and much of this could be resolved if the averaging code in 
pelt.c was librarized a bit - and that's really what you did there in a 
fashion, I just think it should probably be abstracted out more clearly. 
(I have no clear implementational suggestions right now, other than 'try 
and see how it works out - it might be a bad idea'.)

Thanks,

	Ingo

Valentin Schneider Oct. 19, 2018, 11:29 a.m. UTC | #36

Hi,

On 19/10/2018 09:02, Ingo Molnar wrote:
> 

> * Thara Gopinath <thara.gopinath@linaro.org> wrote:


[...]
 
> So what unifies RT and DL utilization is that those are all direct task 

> loads independent of external factors.

> 

> Thermal load is more of a complex physical property of the combination of 

> various internal and external factors: the whole system workload running 

> (not just that single task), the thermal topology of the hardware, 

> external temperatures, the hardware's and the governor's policy regarding 

> thermal loads, etc. etc.

> 

> So while obviously when effective capacity of a CPU is calculated then 

> these will all be subtracted from the maximum capacity of the CPU, but I 

> think the thermal load metric and the averaging itself is probably 

> dissimilar enough to not be calculated via the PELT half-life for 

> example.

> 

> For example a reasonable future property would be match the speed of 

> decay in the averaging to the observed speed of decay via temperature 

> sensors? Most temperature sensors do a certain amount of averaging 

> themselves as well - and some platforms might not expose temperatures at 

> all, only 'got thermally throttled' / 'running at full speed' kind of 

> feedback.


That would also open the door to having different decay speeds on
different domains, if we have the tsensors for it - big and LITTLE cores
are not going to heat up in the same way (although there's going to
be some heat propagation).

Another thermal decay speed hint I'd see would be the energy model - it
does tell us after all how much energy is going through those cores, and
with a rough estimate of how much they can take before overheating 
(sustainable-power entry in the devicetree) we might be able to deduce a
somewhat sane decay speed.

> 

> Anyway, this doesn't really impact the concept, it's an implementational 

> detail, and much of this could be resolved if the averaging code in 

> pelt.c was librarized a bit - and that's really what you did there in a 

> fashion, I just think it should probably be abstracted out more clearly. 

> (I have no clear implementational suggestions right now, other than 'try 

> and see how it works out - it might be a bad idea'.)

> 

> Thanks,

> 

> 	Ingo

> 

> 

>

Vincent Guittot Dec. 4, 2018, 3:43 p.m. UTC | #37

Hi Thara,

On Tue, 9 Oct 2018 at 18:25, Thara Gopinath <thara.gopinath@linaro.org> wrote:
>

> Introduce support in CFS periodic tick to trigger the process of

> computing average thermal pressure for a cpu.

>

> Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>

> ---

>  kernel/sched/fair.c | 3 +++

>  1 file changed, 3 insertions(+)

>

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c

> index b39fb59..7deb1d0 100644

> --- a/kernel/sched/fair.c

> +++ b/kernel/sched/fair.c

> @@ -21,6 +21,7 @@

>   *  Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra

>   */

>  #include "sched.h"

> +#include "thermal.h"

>

>  #include <trace/events/sched.h>

>

> @@ -9557,6 +9558,8 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)

>

>         if (static_branch_unlikely(&sched_numa_balancing))

>                 task_tick_numa(rq, curr);

> +

> +       update_periodic_maxcap(rq);


You have to call update_periodic_maxcap() in update_blocked_averages() too
Otherwise, the thermal pressure will not always be updated correctly
for tickless system

>  }

>

>  /*

> --

> 2.1.4

>

[RFC,0/7] Introduce thermal pressure

Message

Comments