[V2,0/3] Introduce Thermal Pressure

Message ID 1555443521-579-1-git-send-email-thara.gopinath@linaro.org

Message

Thara Gopinath April 16, 2019, 7:38 p.m.
Thermal governors can respond to an overheat event on a cpu by
capping the cpu's maximum possible frequency. This in turn
means that the maximum available compute capacity of the
cpu is restricted. But today in the kernel, the task scheduler is
not notified when the maximum frequency of a cpu is capped.
In other words, the scheduler is unaware of maximum capacity
restrictions placed on a cpu due to thermal activity.
This patch series attempts to address this issue.
The benefits identified are better task placement among available
cpus in the event of overheating, which in turn leads to better
performance numbers.

The reduction in the maximum possible capacity of a cpu due to a
thermal event can be considered as thermal pressure. Instantaneous
thermal pressure is hard to record and can sometimes be erroneous,
as there can be a mismatch between the actual capping of capacity
and the scheduler recording it. Thus the solution is to have a weighted
average per-cpu value for thermal pressure over time.
The weight reflects the amount of time the cpu has spent at a
capped maximum frequency. Since thermal pressure is recorded as
an average, it must be decayed periodically. To this end, this
patch series defines a configurable decay period.
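
As a rough illustration of the idea (not the code from this series), the
sketch below keeps a time-weighted accumulation of pressure within each
decay period and, at every period boundary, blends it with the previous
average so that old history is halved. The class name, the halving rule
and the units are assumptions made for the example only.

```python
class ThermalPressureAverage:
    """Illustrative per-cpu tracker; hypothetical, not the patch's algorithm."""

    def __init__(self, decay_period_ms):
        if decay_period_ms <= 0:
            raise ValueError("decay period must be positive")
        self.decay_period_ms = decay_period_ms
        self.average = 0.0        # decayed average pressure
        self._accumulated = 0.0   # sum of pressure * time in the current period
        self._elapsed_ms = 0      # time already consumed in the current period

    def update(self, pressure, duration_ms):
        """Record that `pressure` was in effect for `duration_ms` milliseconds."""
        remaining = duration_ms
        while remaining > 0:
            # Consume time only up to the end of the current decay period.
            step = min(remaining, self.decay_period_ms - self._elapsed_ms)
            self._accumulated += pressure * step
            self._elapsed_ms += step
            remaining -= step
            if self._elapsed_ms == self.decay_period_ms:
                period_avg = self._accumulated / self.decay_period_ms
                # Assumed decay rule: halve the history, blend in the new period.
                self.average = (self.average + period_avg) / 2
                self._accumulated = 0.0
                self._elapsed_ms = 0
        return self.average


tracker = ThermalPressureAverage(decay_period_ms=250)
after_capped = tracker.update(pressure=200, duration_ms=250)
after_uncapped = tracker.update(pressure=0, duration_ms=250)
```

With this rule a cpu that was capped for one full period and then uncapped
sees its recorded pressure fade over the following periods instead of
dropping to zero at once, which is the behaviour the decay period is meant
to control.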

Regarding testing, basic build, boot and sanity testing have been
performed on a hikey960 running a mainline kernel with a debian file system.
Further, aobench (an occlusion renderer for benchmarking real-world
floating point performance), dhrystone and hackbench tests have been
run with the thermal pressure algorithm. During testing, due to
constraints of the step wise governor in dealing with big.LITTLE systems,
cpu cooling was disabled on the little cores, the idea being that
the big cores will heat up and the cpu cooling device will throttle the
frequency of the big cores, thereby limiting the maximum available
capacity, and the scheduler will spread out tasks to the little cores as well.
Finally, this patch series has been boot tested on db410C running a v5.1-rc4
kernel.

During the course of development various methods of capturing
and reflecting thermal pressure were implemented.

The first method to be evaluated was to convert the
capped max frequency into capacity and have the scheduler use the
instantaneous value when updating cpu_capacity.
This method is referenced as "Instantaneous Thermal Pressure" in the
test results below. 
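
A minimal sketch of that conversion, assuming capacity scales linearly
with frequency and using the kernel's 1024-based capacity scale. The
function names and the frequencies are hypothetical and chosen only to
make the arithmetic easy to follow.

```python
SCHED_CAPACITY_SCALE = 1024  # fixed-point unit the scheduler uses for capacity


def capped_capacity(max_capacity, capped_freq_khz, max_freq_khz):
    """Capacity left after capping, assuming linear scaling with frequency."""
    return max_capacity * capped_freq_khz // max_freq_khz


def instantaneous_thermal_pressure(max_capacity, capped_freq_khz, max_freq_khz):
    """Capacity lost to the cap: the 'instantaneous' thermal pressure."""
    return max_capacity - capped_capacity(max_capacity, capped_freq_khz, max_freq_khz)


# Hypothetical big core: 2.4 GHz maximum, thermally capped to 1.8 GHz.
remaining = capped_capacity(SCHED_CAPACITY_SCALE, 1_800_000, 2_400_000)
pressure = instantaneous_thermal_pressure(SCHED_CAPACITY_SCALE, 1_800_000, 2_400_000)
```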

The next two methods employ different ways of averaging the
thermal pressure before applying it when updating cpu_capacity.
The first of these re-used the PELT algorithm already present
in the kernel, which does the averaging of rt and dl load and utilization.
This method is referenced as "Thermal Pressure Averaging using PELT fmwk"
in the test results below.

The final method employs an averaging algorithm that collects and
decays thermal pressure based on the decay period. In this method,
the decay period is configurable. This method is referenced as
"Thermal Pressure Averaging non-PELT Algo. Decay : XXX ms" in the
test results below.

The test results below show a 3-5% improvement in performance when
using the third solution compared to the default system today, where
the scheduler is unaware of cpu capacity limitations due to thermal events.


			Hackbench: (1 group , 30000 loops, 10 runs)
				Result            Standard Deviation
				(Time Secs)        (% of mean)

No Thermal Pressure             10.21                   7.99%

Instantaneous thermal pressure  10.16                   5.36%

Thermal Pressure Averaging
using PELT fmwk                 9.88                    3.94%

Thermal Pressure Averaging
non-PELT Algo. Decay : 500 ms   9.94                    4.59%

Thermal Pressure Averaging
non-PELT Algo. Decay : 250 ms   7.52                    5.42%

Thermal Pressure Averaging
non-PELT Algo. Decay : 125 ms   9.87                    3.94%



			Aobench: Size 2048 *  2048
				Result            Standard Deviation
				(Time Secs)        (% of mean)

No Thermal Pressure             141.58          15.85%

Instantaneous thermal pressure  141.63          15.03%

Thermal Pressure Averaging
using PELT fmwk                 134.48          13.16%

Thermal Pressure Averaging
non-PELT Algo. Decay : 500 ms   133.62          13.00%

Thermal Pressure Averaging
non-PELT Algo. Decay : 250 ms   137.22          15.30%

Thermal Pressure Averaging
non-PELT Algo. Decay : 125 ms   137.55          13.26%

Dhrystone was run 10 times, with each run spawning 20 threads of
500 MLOOPS. The idea here is to measure the total dhrystone run
time and not look at individual processor performance.

			Dhrystone Run Time
				Result            Standard Deviation
				(Time Secs)        (% of mean)

No Thermal Pressure		1.14                    10.04%

Instantaneous thermal pressure  1.15                    9%

Thermal Pressure Averaging
using PELT fmwk                 1.19                    11.60%

Thermal Pressure Averaging
non-PELT Algo. Decay : 500 ms   1.09                    7.51%

Thermal Pressure Averaging
non-PELT Algo. Decay : 250 ms   1.012                   7.02%

Thermal Pressure Averaging
non-PELT Algo. Decay : 125 ms   1.12                    9.02%

V1->V2: Removed using Pelt framework for thermal pressure accumulation
	and averaging. Instead implemented a weighted average algorithm.

Thara Gopinath (3):
  Calculate Thermal Pressure
  sched/fair: update cpu_capcity to reflect thermal pressure
  thermal/cpu-cooling: Update thermal pressure in case of a maximum
    frequency capping

 drivers/thermal/cpu_cooling.c |   4 +
 include/linux/sched/thermal.h |  11 +++
 kernel/sched/Makefile         |   2 +-
 kernel/sched/fair.c           |   4 +
 kernel/sched/thermal.c        | 220 ++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 240 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/sched/thermal.h
 create mode 100644 kernel/sched/thermal.c
-- 
2.1.4

Comments

Ingo Molnar April 17, 2019, 5:36 a.m. | #1
* Thara Gopinath <thara.gopinath@linaro.org> wrote:

> The test results below shows 3-5% improvement in performance when
> using the third solution compared to the default system today where
> scheduler is unware of cpu capacity limitations due to thermal events.

The numbers look very promising!

I've rearranged the results to make the performance properties of the 
various approaches and parameters easier to see:

                                         (seconds, lower is better)

			                 Hackbench   Aobench   Dhrystone
                                         =========   =======   =========
Vanilla kernel (No Thermal Pressure)         10.21    141.58        1.14
Instantaneous thermal pressure               10.16    141.63        1.15
Thermal Pressure Averaging:
      - PELT fmwk                             9.88    134.48        1.19
      - non-PELT Algo. Decay : 500 ms         9.94    133.62        1.09
      - non-PELT Algo. Decay : 250 ms         7.52    137.22        1.012
      - non-PELT Algo. Decay : 125 ms         9.87    137.55        1.12


Firstly, a couple of questions about the numbers:

   1)

      Is the 1.012 result for "non-PELT 250 msecs Dhrystone" really 1.012?
      You reported it as:

             non-PELT Algo. Decay : 250 ms   1.012                   7.02%

      But the formatting is significant 3 digits versus only two for all 
      the other results.

   2)

      You reported the hackbench numbers with "10 runs" - did the other 
      benchmarks use 10 runs as well? Maybe you used fewer runs for the 
      longest benchmark, Aobench?

Secondly, it appears the non-PELT decaying average is the best approach, 
but the results are a bit coarse around the ~250 msecs peak. Maybe it 
would be good to measure it in 50 msecs steps between 50 msecs and 1000 
msecs - but only if it can be scripted sanely:

A possible approach would be to add a debug sysctl for the tuning period, 
and script all these benchmark runs and the printing of the results. You 
could add another (debug) sysctl to turn the 'instant' logic on, and to 
restore vanilla kernel behavior as well - this makes it all much easier 
to script and measure with a single kernel image, without having to 
reboot the kernel. The sysctl overhead will not be measurable for 
workloads like this.
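
A sketch of how such a sweep could be scripted, assuming a debug sysctl
for the decay period exists. The sysctl name below is purely
hypothetical (no such knob is defined in this thread), and the script
only builds the command lines, pairing each setting with the perf stat
invocation quoted in this message, rather than running anything.

```python
import shlex

# Hypothetical knob name: the thread only proposes adding a debug sysctl.
DECAY_SYSCTL = "kernel.sched_thermal_decay_ms"


def decay_periods_ms(start=50, stop=1000, step=50):
    """Decay periods to test: 50 msecs steps between 50 and 1000 msecs."""
    return list(range(start, stop + 1, step))


def sweep_commands(benchmark_argv, repeat=10):
    """Return (set-knob, measure) shell command pairs, one per decay period."""
    commands = []
    for period in decay_periods_ms():
        set_knob = ["sysctl", "-w", f"{DECAY_SYSCTL}={period}"]
        measure = ["perf", "stat", "--null", "--sync",
                   "--repeat", str(repeat), "--table", *benchmark_argv]
        commands.append((shlex.join(set_knob), shlex.join(measure)))
    return commands


for set_cmd, measure_cmd in sweep_commands(["./hackbench", "20"]):
    print(set_cmd)
    print(measure_cmd)
```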

Then you can use "perf stat --null --table" to measure runtime and stddev 
easily and with a single tool, for example:

  dagon:~> perf stat --null --sync --repeat 10 --table ./hackbench 20 >benchmark.out

  Performance counter stats for './hackbench 20' (10 runs):

           # Table of individual measurements:
           0.15246 (-0.03960) ######
           0.20832 (+0.01627) ##
           0.17895 (-0.01310) ##
           0.19791 (+0.00585) #
           0.19209 (+0.00004) #
           0.19406 (+0.00201) #
           0.22484 (+0.03278) ###
           0.18695 (-0.00511) #
           0.19032 (-0.00174) #
           0.19464 (+0.00259) #

           # Final result:
           0.19205 +- 0.00592 seconds time elapsed  ( +-  3.08% )

Note how all the individual measurements can be captured this way, 
without seeing the benchmark output itself. So different benchmarks can 
be measured this way, assuming they don't have too long a setup time.
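
One detail worth checking when comparing against the tables in the cover
letter: recomputing from the ten measurements above suggests that the
"+-" figure in the final result is the standard error of the mean, not
the standard deviation of the individual runs. The sketch below redoes
that arithmetic with the values copied from the table.

```python
import math
import statistics

# Individual run times in seconds, copied from the perf stat table above.
RUNS = [0.15246, 0.20832, 0.17895, 0.19791, 0.19209,
        0.19406, 0.22484, 0.18695, 0.19032, 0.19464]

mean = statistics.mean(RUNS)
stddev = statistics.stdev(RUNS)          # sample standard deviation
sem = stddev / math.sqrt(len(RUNS))      # standard error of the mean
relative_pct = sem / mean * 100.0        # the "( +- x% )" figure

print(f"{mean:.5f} +- {sem:.5f} seconds ( +- {relative_pct:.2f}% )")
```

A "Standard Deviation (% of mean)" column such as the one in the cover
letter would correspond to stddev / mean here, which is larger than the
perf figure by a factor of sqrt(10) for ten runs.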

Thanks,

	Ingo
Ingo Molnar April 17, 2019, 5:55 a.m. | #2
* Ingo Molnar <mingo@kernel.org> wrote:

> * Thara Gopinath <thara.gopinath@linaro.org> wrote:
>
> > The test results below shows 3-5% improvement in performance when
> > using the third solution compared to the default system today where
> > scheduler is unware of cpu capacity limitations due to thermal events.
>
> The numbers look very promising!
>
> I've rearranged the results to make the performance properties of the
> various approaches and parameters easier to see:
>
>                                          (seconds, lower is better)
>
>                                          Hackbench   Aobench   Dhrystone
>                                          =========   =======   =========
> Vanilla kernel (No Thermal Pressure)         10.21    141.58        1.14
> Instantaneous thermal pressure               10.16    141.63        1.15
> Thermal Pressure Averaging:
>       - PELT fmwk                             9.88    134.48        1.19
>       - non-PELT Algo. Decay : 500 ms         9.94    133.62        1.09
>       - non-PELT Algo. Decay : 250 ms         7.52    137.22        1.012
>       - non-PELT Algo. Decay : 125 ms         9.87    137.55        1.12

So what I forgot to say is that IMO your results show robust improvements 
over the vanilla kernel of around 5%, with a relatively straightforward 
thermal pressure metric. So I suspect we could do something like this, if 
a few more measurements were done to get the best decay period 
established - the 125-250-500 msecs results seem a bit coarse and not 
entirely unambiguous.

In terms of stddev: the perf stat --pre hook could be used to add a dummy 
benchmark run, to heat up the test system, to get more reliable, less 
noisy numbers?

BTW., that big improvement in hackbench results to 7.52 at 250 msecs, is 
that real, or a fluke perhaps?

Thanks,

	Ingo
Thara Gopinath April 17, 2019, 5:18 p.m. | #3
On 04/17/2019 01:36 AM, Ingo Molnar wrote:
>
> * Thara Gopinath <thara.gopinath@linaro.org> wrote:
>
>> The test results below shows 3-5% improvement in performance when
>> using the third solution compared to the default system today where
>> scheduler is unware of cpu capacity limitations due to thermal events.
>
> The numbers look very promising!

Hello Ingo,
Thank you for the review.
>
> I've rearranged the results to make the performance properties of the
> various approaches and parameters easier to see:
>
>                                          (seconds, lower is better)
>
>                                          Hackbench   Aobench   Dhrystone
>                                          =========   =======   =========
> Vanilla kernel (No Thermal Pressure)         10.21    141.58        1.14
> Instantaneous thermal pressure               10.16    141.63        1.15
> Thermal Pressure Averaging:
>       - PELT fmwk                             9.88    134.48        1.19
>       - non-PELT Algo. Decay : 500 ms         9.94    133.62        1.09
>       - non-PELT Algo. Decay : 250 ms         7.52    137.22        1.012
>       - non-PELT Algo. Decay : 125 ms         9.87    137.55        1.12
>
>
> Firstly, a couple of questions about the numbers:
>
>    1)
>
>       Is the 1.012 result for "non-PELT 250 msecs Dhrystone" really 1.012?
>       You reported it as:
>
>              non-PELT Algo. Decay : 250 ms   1.012                   7.02%

It is indeed 1.012. So, I ran the "non-PELT Algo 250 ms" benchmarks
multiple times because of the anomalies noticed. 1.012 is a formatting
error on my part when I copy-pasted the results into a Google sheet I am
maintaining to capture the test results. Sorry about the confusion.
>
>       But the formatting is significant 3 digits versus only two for all
>       the other results.
>
>    2)
>
>       You reported the hackbench numbers with "10 runs" - did the other
>       benchmarks use 10 runs as well? Maybe you used fewer runs for the
>       longest benchmark, Aobench?

Hackbench and dhrystone are 10 runs each. Aobench is part of the phoronix
test suite, and the test suite runs it six times and gives the per-run
results, mean and stddev. On my part, I ran aobench just once per
configuration.

>
> Secondly, it appears the non-PELT decaying average is the best approach,
> but the results are a bit coarse around the ~250 msecs peak. Maybe it
> would be good to measure it in 50 msecs steps between 50 msecs and 1000
> msecs - but only if it can be scripted sanely:

non-PELT looks better overall because the test results are quite
comparable (if not better) between the two solutions and it takes care
of concerns people raised when I posted V1 using PELT-fmwk algo
regarding reuse of utilization signal to track thermal pressure.

Regarding the decay period, I agree that more testing can be done. I
like your suggestions below and I am going to try implementing them
sometime next week. Once I have some solid results, I will send them out.

My concern regarding getting hung up too much on the decay period is that I
think it could vary from SoC to SoC depending on the type and number of
cores and thermal characteristics. So I was thinking eventually the
decay period should be configurable via a config option or by any other
means. Testing on different systems will definitely help, and maybe I am
wrong and there is not much variation between systems.

Regards
Thara

>
> A possible approach would be to add a debug sysctl for the tuning period,
> and script all these benchmark runs and the printing of the results. You
> could add another (debug) sysctl to turn the 'instant' logic on, and to
> restore vanilla kernel behavior as well - this makes it all much easier
> to script and measure with a single kernel image, without having to
> reboot the kernel. The sysctl overhead will not be measurable for
> workloads like this.
>
> Then you can use "perf stat --null --table" to measure runtime and stddev
> easily and with a single tool, for example:
>
>   dagon:~> perf stat --null --sync --repeat 10 --table ./hackbench 20 >benchmark.out
>
>   Performance counter stats for './hackbench 20' (10 runs):
>
>            # Table of individual measurements:
>            0.15246 (-0.03960) ######
>            0.20832 (+0.01627) ##
>            0.17895 (-0.01310) ##
>            0.19791 (+0.00585) #
>            0.19209 (+0.00004) #
>            0.19406 (+0.00201) #
>            0.22484 (+0.03278) ###
>            0.18695 (-0.00511) #
>            0.19032 (-0.00174) #
>            0.19464 (+0.00259) #
>
>            # Final result:
>            0.19205 +- 0.00592 seconds time elapsed  ( +-  3.08% )
>
> Note how all the individual measurements can be captured this way,
> without seeing the benchmark output itself. So difference benchmarks can
> be measured this way, assuming they don't have too long setup time.
>
> Thanks,
>
> 	Ingo
>


-- 
Regards
Thara
Thara Gopinath April 17, 2019, 5:28 p.m. | #4
On 04/17/2019 01:55 AM, Ingo Molnar wrote:
>
> * Ingo Molnar <mingo@kernel.org> wrote:
>
>> * Thara Gopinath <thara.gopinath@linaro.org> wrote:
>>
>>> The test results below shows 3-5% improvement in performance when
>>> using the third solution compared to the default system today where
>>> scheduler is unware of cpu capacity limitations due to thermal events.
>>
>> The numbers look very promising!
>>
>> I've rearranged the results to make the performance properties of the
>> various approaches and parameters easier to see:
>>
>>                                          (seconds, lower is better)
>>
>>                                          Hackbench   Aobench   Dhrystone
>>                                          =========   =======   =========
>> Vanilla kernel (No Thermal Pressure)         10.21    141.58        1.14
>> Instantaneous thermal pressure               10.16    141.63        1.15
>> Thermal Pressure Averaging:
>>       - PELT fmwk                             9.88    134.48        1.19
>>       - non-PELT Algo. Decay : 500 ms         9.94    133.62        1.09
>>       - non-PELT Algo. Decay : 250 ms         7.52    137.22        1.012
>>       - non-PELT Algo. Decay : 125 ms         9.87    137.55        1.12
>
> So what I forgot to say is that IMO your results show robust improvements
> over the vanilla kernel of around 5%, with a relatively straightforward
> thermal pressure metric. So I suspect we could do something like this, if
> there was a bit more measurements done to get the best decay period
> established - the 125-250-500 msecs results seem a bit coarse and not
> entirely unambiguous.

To give you the background, I started with decay period of 500 ms. No
other reason except the previous version of rt-pressure that existed in
the scheduler employed a 500 ms decay period. Then the idea was to
decrease the decay period by half and see what happens and so on. But I
agree, that it is a bit coarse. I will probably get around to
implementing some of your suggestions to capture more granular results
in the next few weeks.
>
> In terms of stddev: the perf stat --pre hook could be used to add a dummy
> benchmark run, to heat up the test system, to get more reliable, less
> noisy numbers?
>
> BTW., that big improvement in hackbench results to 7.52 at 250 msecs, is
> that real, or a fluke perhaps?

For me, it is an anomaly. Having said that, I did rerun the tests with
this configuration at least twice (if not more) and the results were
similar. It is an anomaly because I have no explanation as to why there
is so much improvement at the 250 ms decay period.

>
> Thanks,
>
> 	Ingo
>


-- 
Regards
Thara
Ingo Molnar April 17, 2019, 6:29 p.m. | #5
* Thara Gopinath <thara.gopinath@linaro.org> wrote:

>
> On 04/17/2019 01:36 AM, Ingo Molnar wrote:
> >
> > * Thara Gopinath <thara.gopinath@linaro.org> wrote:
> >
> >> The test results below shows 3-5% improvement in performance when
> >> using the third solution compared to the default system today where
> >> scheduler is unware of cpu capacity limitations due to thermal events.
> >
> > The numbers look very promising!
>
> Hello Ingo,
> Thank you for the review.
> >
> > I've rearranged the results to make the performance properties of the
> > various approaches and parameters easier to see:
> >
> >                                          (seconds, lower is better)
> >
> >                                          Hackbench   Aobench   Dhrystone
> >                                          =========   =======   =========
> > Vanilla kernel (No Thermal Pressure)         10.21    141.58        1.14
> > Instantaneous thermal pressure               10.16    141.63        1.15
> > Thermal Pressure Averaging:
> >       - PELT fmwk                             9.88    134.48        1.19
> >       - non-PELT Algo. Decay : 500 ms         9.94    133.62        1.09
> >       - non-PELT Algo. Decay : 250 ms         7.52    137.22        1.012
> >       - non-PELT Algo. Decay : 125 ms         9.87    137.55        1.12
> >
> >
> > Firstly, a couple of questions about the numbers:
> >
> >    1)
> >
> >       Is the 1.012 result for "non-PELT 250 msecs Dhrystone" really 1.012?
> >       You reported it as:
> >
> >              non-PELT Algo. Decay : 250 ms   1.012                   7.02%
>
> It is indeed 1.012. So, I ran the "non-PELT Algo 250 ms" benchmarks
> multiple time because of the anomalies noticed.  1.012 is a formatting
> error on my part when I copy pasted the results into a google sheet I am
> maintaining to capture the test results. Sorry about the confusion.

That's actually pretty good, because it suggests a 35% and 15% 
improvement over the vanilla kernel - which is very good for such 
CPU-bound workloads.

Not that 5% is bad in itself - but 15% is better ;-)

> Regarding the decay period, I agree that more testing can be done. I
> like your suggestions below and I am going to try implementing them
> sometime next week. Once I have some solid results, I will send them
> out.

Thanks!

> My concern regarding getting hung up too much on decay period is that I
> think it could vary from SoC to SoC depending on the type and number of
> cores and thermal characteristics. So I was thinking eventually the
> decay period should be configurable via a config option or by any other
> means. Testing on different systems will definitely help and maybe I am
> wrong and there is no much variation between systems.

Absolutely, so I'd not be against keeping it a SCHED_DEBUG tunable or so, 
until there's a better understanding of how the physical properties of 
the SoC map to an ideal decay period.

Assuming PeterZ & Rafael & Quentin don't hate the whole thermal load 
tracking approach. I suppose there's some connection of this to Energy 
Aware Scheduling? Or not ...

Thanks,

	Ingo
Thara Gopinath April 18, 2019, 12:07 a.m. | #6
On 04/17/2019 02:29 PM, Ingo Molnar wrote:
>
> * Thara Gopinath <thara.gopinath@linaro.org> wrote:
>
>>
>> On 04/17/2019 01:36 AM, Ingo Molnar wrote:
>>>
>>> * Thara Gopinath <thara.gopinath@linaro.org> wrote:
>>>
>>>> The test results below shows 3-5% improvement in performance when
>>>> using the third solution compared to the default system today where
>>>> scheduler is unware of cpu capacity limitations due to thermal events.
>>>
>>> The numbers look very promising!
>>
>> Hello Ingo,
>> Thank you for the review.
>>>
>>> I've rearranged the results to make the performance properties of the
>>> various approaches and parameters easier to see:
>>>
>>>                                          (seconds, lower is better)
>>>
>>>                                          Hackbench   Aobench   Dhrystone
>>>                                          =========   =======   =========
>>> Vanilla kernel (No Thermal Pressure)         10.21    141.58        1.14
>>> Instantaneous thermal pressure               10.16    141.63        1.15
>>> Thermal Pressure Averaging:
>>>       - PELT fmwk                             9.88    134.48        1.19
>>>       - non-PELT Algo. Decay : 500 ms         9.94    133.62        1.09
>>>       - non-PELT Algo. Decay : 250 ms         7.52    137.22        1.012
>>>       - non-PELT Algo. Decay : 125 ms         9.87    137.55        1.12
>>>
>>>
>>> Firstly, a couple of questions about the numbers:
>>>
>>>    1)
>>>
>>>       Is the 1.012 result for "non-PELT 250 msecs Dhrystone" really 1.012?
>>>       You reported it as:
>>>
>>>              non-PELT Algo. Decay : 250 ms   1.012                   7.02%
>>
>> It is indeed 1.012. So, I ran the "non-PELT Algo 250 ms" benchmarks
>> multiple time because of the anomalies noticed.  1.012 is a formatting
>> error on my part when I copy pasted the results into a google sheet I am
>> maintaining to capture the test results. Sorry about the confusion.
>
> That's actually pretty good, because it suggests a 35% and 15%
> improvement over the vanilla kernel - which is very good for such
> CPU-bound workloads.
>
> Not that 5% is bad in itself - but 15% is better ;-)
>
>> Regarding the decay period, I agree that more testing can be done. I
>> like your suggestions below and I am going to try implementing them
>> sometime next week. Once I have some solid results, I will send them
>> out.
>
> Thanks!
>
>> My concern regarding getting hung up too much on decay period is that I
>> think it could vary from SoC to SoC depending on the type and number of
>> cores and thermal characteristics. So I was thinking eventually the
>> decay period should be configurable via a config option or by any other
>> means. Testing on different systems will definitely help and maybe I am
>> wrong and there is no much variation between systems.
>
> Absolutely, so I'd not be against keeping it a SCHED_DEBUG tunable or so,
> until there's a better understanding of how the physical properties of
> the SoC map to an ideal decay period.
>
> Assuming PeterZ & Rafael & Quentin doesn't hate the whole thermal load
> tracking approach. I suppose there's some connection of this to Energy
> Aware Scheduling? Or not ...

Mmm.. Not so much. This does not have much to do with EAS. The feature
itself will be really useful if there are asymmetric cpus in the system
rather than symmetric cpus. In the case of SMP, since all cores have the same
capacity and assuming any thermal mitigation will be implemented across
all the cpus, there won't be any different scheduler behavior.
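
A small numeric illustration of this point, using made-up capacity
values: when every cpu of a symmetric system is capped by the same
amount, each cpu's share of the total capacity is unchanged, so relative
placement decisions have nothing new to react to. On an asymmetric
system where only the big cores are capped, the shares shift towards the
little cores.

```python
def capacity_shares(capacities):
    """Fraction of the total system capacity contributed by each cpu."""
    total = sum(capacities)
    return [capacity / total for capacity in capacities]


# Hypothetical SMP system: four identical cpus, all capped equally.
smp_uncapped = capacity_shares([1024, 1024, 1024, 1024])
smp_capped = capacity_shares([768, 768, 768, 768])

# Hypothetical asymmetric system: two big cpus capped, two little cpus not.
asym_uncapped = capacity_shares([1024, 1024, 460, 460])
asym_capped = capacity_shares([512, 512, 460, 460])
```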

Regards
Thara
>
> Thanks,
>
> 	Ingo
>


-- 
Regards
Thara
Quentin Perret April 18, 2019, 9:22 a.m. | #7
On Wednesday 17 Apr 2019 at 20:29:32 (+0200), Ingo Molnar wrote:
>
> * Thara Gopinath <thara.gopinath@linaro.org> wrote:
>
> >
> > On 04/17/2019 01:36 AM, Ingo Molnar wrote:
> > >
> > > * Thara Gopinath <thara.gopinath@linaro.org> wrote:
> > >
> > >> The test results below shows 3-5% improvement in performance when
> > >> using the third solution compared to the default system today where
> > >> scheduler is unware of cpu capacity limitations due to thermal events.
> > >
> > > The numbers look very promising!
> >
> > Hello Ingo,
> > Thank you for the review.
> > >
> > > I've rearranged the results to make the performance properties of the
> > > various approaches and parameters easier to see:
> > >
> > >                                          (seconds, lower is better)
> > >
> > >                                          Hackbench   Aobench   Dhrystone
> > >                                          =========   =======   =========
> > > Vanilla kernel (No Thermal Pressure)         10.21    141.58        1.14
> > > Instantaneous thermal pressure               10.16    141.63        1.15
> > > Thermal Pressure Averaging:
> > >       - PELT fmwk                             9.88    134.48        1.19
> > >       - non-PELT Algo. Decay : 500 ms         9.94    133.62        1.09
> > >       - non-PELT Algo. Decay : 250 ms         7.52    137.22        1.012
> > >       - non-PELT Algo. Decay : 125 ms         9.87    137.55        1.12
> > >
> > >
> > > Firstly, a couple of questions about the numbers:
> > >
> > >    1)
> > >
> > >       Is the 1.012 result for "non-PELT 250 msecs Dhrystone" really 1.012?
> > >       You reported it as:
> > >
> > >              non-PELT Algo. Decay : 250 ms   1.012                   7.02%
> >
> > It is indeed 1.012. So, I ran the "non-PELT Algo 250 ms" benchmarks
> > multiple time because of the anomalies noticed.  1.012 is a formatting
> > error on my part when I copy pasted the results into a google sheet I am
> > maintaining to capture the test results. Sorry about the confusion.
>
> That's actually pretty good, because it suggests a 35% and 15%
> improvement over the vanilla kernel - which is very good for such
> CPU-bound workloads.
>
> Not that 5% is bad in itself - but 15% is better ;-)
>
> > Regarding the decay period, I agree that more testing can be done. I
> > like your suggestions below and I am going to try implementing them
> > sometime next week. Once I have some solid results, I will send them
> > out.
>
> Thanks!
>
> > My concern regarding getting hung up too much on decay period is that I
> > think it could vary from SoC to SoC depending on the type and number of
> > cores and thermal characteristics. So I was thinking eventually the
> > decay period should be configurable via a config option or by any other
> > means. Testing on different systems will definitely help and maybe I am
> > wrong and there is no much variation between systems.
>
> Absolutely, so I'd not be against keeping it a SCHED_DEBUG tunable or so,
> until there's a better understanding of how the physical properties of
> the SoC map to an ideal decay period.

+1, that'd be really useful to try this out on several platforms.

> Assuming PeterZ & Rafael & Quentin doesn't hate the whole thermal load
> tracking approach.

I certainly don't hate it :-) In fact we already have something in the
Android kernel to reflect thermal pressure into the CPU capacity using
the 'instantaneous' approach. I'm all in favour of replacing our
out-of-tree stuff by a mainline solution, and even more if that performs
better.

So yes, we need to discuss the implementation details and all, but I'd
personally be really happy to see something upstream in this area.

> I suppose there's some connection of this to Energy
> Aware Scheduling? Or not ...

Hmm, there isn't an immediate connection, I think. But that could
change.

FWIW I'm currently pushing a patch-set to make the thermal subsystem use
the same Energy Model as EAS ([1]) instead of its own. There are several
good reasons to do this, but one of them is to make sure the scheduler
and the thermal stuff (and the rest of the kernel) have a consistent
definition of what 'power' means. That might enable us to do smart things
in the scheduler, but that's really for later.

Thanks,
Quentin

[1] https://lore.kernel.org/lkml/20190417094301.17622-1-quentin.perret@arm.com/
Ionela Voinescu April 24, 2019, 3:57 p.m. | #8
Hi Thara,

The idea and the results look promising. I'm trying to understand better
the cause of the improvements so I've added below some questions that
would help me out with this.


> Regarding testing, basic build, boot and sanity testing have been
> performed on hikey960 mainline kernel with debian file system.
> Further, aobench (An occlusion renderer for benchmarking realworld
> floating point performance), dhrystone and hackbench test have been
> run with the thermal pressure algorithm. During testing, due to
> constraints of step wise governor in dealing with big little systems,
> cpu cooling was disabled on little core, the idea being that
> big core will heat up and cpu cooling device will throttle the
> frequency of the big cores there by limiting the maximum available
> capacity and the scheduler will spread out tasks to little cores as well.
> Finally, this patch series has been boot tested on db410C running v5.1-rc4
> kernel.
>

Did you try using IPA as well? It is better equipped to deal with
big-LITTLE systems and it's more probable IPA will be used for these
systems, where your solution will have the biggest impact as well.
The difference will be that you'll have both the big cluster and the
LITTLE cluster capped in different proportions depending on their
utilization and their efficiency.

> During the course of development various methods of capturing

> and reflecting thermal pressure were implemented.

> 

> The first method to be evaluated was to convert the

> capped max frequency into capacity and have the scheduler use the

> instantaneous value when updating cpu_capacity.

> This method is referenced as "Instantaneous Thermal Pressure" in the

> test results below. 

> 

> The next two methods employs different methods of averaging the

> thermal pressure before applying it when updating cpu_capacity.

> The first of these methods re-used the PELT algorithm already present

> in the kernel that does the averaging of rt and dl load and utilization.

> This method is referenced as "Thermal Pressure Averaging using PELT fmwk"

> in the test results below.

> 

> The final method employs an averaging algorithm that collects and

> decays thermal pressure based on the decay period. In this method,

> the decay period is configurable. This method is referenced as

> "Thermal Pressure Averaging non-PELT Algo. Decay : XXX ms" in the

> test results below.

> 

> The test results below shows 3-5% improvement in performance when

> using the third solution compared to the default system today where

> scheduler is unware of cpu capacity limitations due to thermal events.

> 


Did you happen to record the amount of capping imposed on the big cores
when these results were obtained? Did you find scenarios where the
capacity of the bigs resulted in being lower than the capacity of the
LITTLEs (capacity inversion)?
This is one case where we'll see a big impact in considering thermal
pressure.

Also, given that these are more or less sustained workloads, I'm
wondering if there is any effect on workloads running on an uncapped
system following capping. I would imagine such a test being composed of a
single threaded period (no capping) followed by a multi-threaded period
(with capping), continued in a loop. It might be interesting to have
something like this as well, as part of your test coverage.


Thanks,
Ionela.
Peter Zijlstra April 24, 2019, 4:34 p.m. | #9
On Wed, Apr 17, 2019 at 08:29:32PM +0200, Ingo Molnar wrote:
> Assuming PeterZ & Rafael & Quentin doesn't hate the whole thermal load 

> tracking approach. 


I seem to remember competing proposals, and have forgotten everything
about them; the cover letter also didn't have references to them or
mention them in any way.

As to the averaging and period, I personally prefer a PELT signal with
the windows lined up; if that really is too short a window, then a
PELT-like signal with a natural multiple of the PELT period would make
sense, such that the windows still line up nicely.

Mixing different averaging methods and non-aligned windows just makes me
uncomfortable.
Ingo Molnar April 25, 2019, 5:33 p.m. | #10
* Peter Zijlstra <peterz@infradead.org> wrote:

> On Wed, Apr 17, 2019 at 08:29:32PM +0200, Ingo Molnar wrote:

> > Assuming PeterZ & Rafael & Quentin doesn't hate the whole thermal load 

> > tracking approach. 

> 

> I seem to remember competing proposals, and have forgotten everything

> about them; the cover letter also didn't have references to them or

> mention them in any way.

> 

> As to the averaging and period, I personally prefer a PELT signal with

> the windows lined up, if that really is too short a window, then a PELT

> like signal with a natural multiple of the PELT period would make sense,

> such that the windows still line up nicely.

> 

> Mixing different averaging methods and non-aligned windows just makes me

> uncomfortable.


Yeah, so the problem with PELT is that while it nicely approximates 
variable-period decay calculations with plain additions, shifts and table 
lookups (i.e. accelerates pow()), AFAICS the most important decay 
parameter is fixed: the speed of decay, the dampening factor, which is 
fixed at 32:

  Documentation/scheduler/sched-pelt.c

  #define HALFLIFE 32

Right?

Thara's numbers suggest that there's high sensitivity to the speed of 
decay. By using PELT we'd be using whatever averaging speed there is 
within PELT.

Now we could make that parametric of course, but that would both 
complicate the PELT lookup code (one more dimension) and would negatively 
affect code generation in a number of places.
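
As a rough illustration of the decay being discussed (not kernel code), the
following floating-point sketch assumes only what HALFLIFE 32 implies: a
per-period factor y with y^32 = 0.5. The kernel uses fixed-point tables
instead, and the names decay() and accumulate() here are hypothetical helpers,
not the kernel functions.

```python
# Floating-point sketch of a PELT-style geometric decay (illustrative only).
HALFLIFE = 32  # periods, as in Documentation/scheduler/sched-pelt.c

# Per-period decay factor y, chosen so that y ** HALFLIFE == 0.5.
Y = 0.5 ** (1.0 / HALFLIFE)


def decay(value, periods):
    """Return `value` after `periods` whole periods of geometric decay."""
    if periods < 0:
        raise ValueError("periods must be non-negative")
    return value * Y ** periods


def accumulate(running_sum, contribution, periods):
    """Decay the running sum by `periods`, then add the new contribution."""
    return decay(running_sum, periods) + contribution


if __name__ == "__main__":
    for n in (1, 32, 64, 96):
        print(n, round(decay(1024.0, n), 1))
```

Because the half-life is baked into y, the only way to change the averaging
speed in this model is to change y itself or to change how long one period
lasts.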

Thanks,

	Ingo
Ingo Molnar April 25, 2019, 5:44 p.m. | #11
* Ingo Molnar <mingo@kernel.org> wrote:

> 

> * Peter Zijlstra <peterz@infradead.org> wrote:

> 

> > On Wed, Apr 17, 2019 at 08:29:32PM +0200, Ingo Molnar wrote:

> > > Assuming PeterZ & Rafael & Quentin doesn't hate the whole thermal load 

> > > tracking approach. 

> > 

> > I seem to remember competing proposals, and have forgotten everything

> > about them; the cover letter also didn't have references to them or

> > mention them in any way.

> > 

> > As to the averaging and period, I personally prefer a PELT signal with

> > the windows lined up, if that really is too short a window, then a PELT

> > like signal with a natural multiple of the PELT period would make sense,

> > such that the windows still line up nicely.

> > 

> > Mixing different averaging methods and non-aligned windows just makes me

> > uncomfortable.

> 

> Yeah, so the problem with PELT is that while it nicely approximates 

> variable-period decay calculations with plain additions, shifts and table 

> lookups (i.e. accelerates pow()), AFAICS the most important decay 

> parameter is fixed: the speed of decay, the dampening factor, which is 

> fixed at 32:

> 

>   Documentation/scheduler/sched-pelt.c

> 

>   #define HALFLIFE 32

> 

> Right?

> 

> Thara's numbers suggest that there's high sensitivity to the speed of 

> decay. By using PELT we'd be using whatever averaging speed there is 

> within PELT.

> 

> Now we could make that parametric of course, but that would both 

> complicate the PELT lookup code (one more dimension) and would negatively 

> affect code generation in a number of places.


I missed the other solution, which is what you suggested: by 
increasing/reducing the PELT window size we can effectively shift decay 
speed and use just a single lookup table.

I.e. instead of the fixed period size of 1024 in accumulate_sum(), use 
decay_load() directly but use a different (longer) window size from 1024 
usecs to calculate 'periods', and make it a multiple of 1024.

This might just work out right: with a half-life of 32 the fastest decay 
speed should be around ~20 msecs (?) - and Thara's numbers so far suggest 
that the sweet spot averaging is significantly longer, at a couple of 
hundred millisecs.
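
To put rough numbers on this, here is a small sketch that assumes the
half-life is exactly 32 windows and that one window is a multiple of 1024
usecs. The helper half_life_ms() is hypothetical and only does the arithmetic;
it does not model accumulate_sum() or decay_load().

```python
# Sketch: wall-clock half-life of a PELT-like signal for a given window size.
HALFLIFE_PERIODS = 32   # from "#define HALFLIFE 32"
BASE_WINDOW_US = 1024   # the fixed period size mentioned above


def half_life_ms(window_multiple):
    """Half-life in ms when one period lasts window_multiple * 1024 usecs."""
    if window_multiple < 1:
        raise ValueError("window_multiple must be >= 1")
    window_us = window_multiple * BASE_WINDOW_US
    return HALFLIFE_PERIODS * window_us / 1000.0


if __name__ == "__main__":
    for k in (1, 2, 4, 8, 16):
        print(f"window = {k} * 1024 us -> half-life ~ {half_life_ms(k):.1f} ms")
```

Under these assumptions a multiple of 8 gives a half-life of roughly 262 ms,
which is in the "couple of hundred millisecs" range.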

Thanks,

	Ingo
Vincent Guittot April 26, 2019, 7:08 a.m. | #12
On Thu, 25 Apr 2019 at 19:44, Ingo Molnar <mingo@kernel.org> wrote:
>

>

> * Ingo Molnar <mingo@kernel.org> wrote:

>

> >

> > * Peter Zijlstra <peterz@infradead.org> wrote:

> >

> > > On Wed, Apr 17, 2019 at 08:29:32PM +0200, Ingo Molnar wrote:

> > > > Assuming PeterZ & Rafael & Quentin doesn't hate the whole thermal load

> > > > tracking approach.

> > >

> > > I seem to remember competing proposals, and have forgotten everything

> > > about them; the cover letter also didn't have references to them or

> > > mention them in any way.

> > >

> > > As to the averaging and period, I personally prefer a PELT signal with

> > > the windows lined up, if that really is too short a window, then a PELT

> > > like signal with a natural multiple of the PELT period would make sense,

> > > such that the windows still line up nicely.

> > >

> > > Mixing different averaging methods and non-aligned windows just makes me

> > > uncomfortable.

> >

> > Yeah, so the problem with PELT is that while it nicely approximates

> > variable-period decay calculations with plain additions, shifts and table

> > lookups (i.e. accelerates pow()), AFAICS the most important decay

> > parameter is fixed: the speed of decay, the dampening factor, which is

> > fixed at 32:

> >

> >   Documentation/scheduler/sched-pelt.c

> >

> >   #define HALFLIFE 32

> >

> > Right?

> >

> > Thara's numbers suggest that there's high sensitivity to the speed of

> > decay. By using PELT we'd be using whatever averaging speed there is

> > within PELT.

> >

> > Now we could make that parametric of course, but that would both

> > complicate the PELT lookup code (one more dimension) and would negatively

> > affect code generation in a number of places.

>

> I missed the other solution, which is what you suggested: by

> increasing/reducing the PELT window size we can effectively shift decay

> speed and use just a single lookup table.

>

> I.e. instead of the fixed period size of 1024 in accumulate_sum(), use

> decay_load() directly but use a different (longer) window size from 1024

> usecs to calculate 'periods', and make it a multiple of 1024.


Can't we also scale the 'now' parameter of ___update_load_sum()?
If we right shift it before calling ___update_load_sum(), it should be
the same as using a half-life of 64, 128, 256 ms ...
The main drawback would be a loss of precision, but we are in the range
of 2, 4, 8 us compared to the 1 ms window.

This is quite similar to how we scale the utilization with frequency and uarch.
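
A minimal sketch of the arithmetic behind this suggestion, assuming a nominal
1024 us period, a 32-period half-life and a 1 us resolution for the unshifted
clock. All function names are illustrative; ___update_load_sum() itself is not
modelled.

```python
# Sketch: effect of right-shifting the clock fed to a PELT-like update.
HALFLIFE_PERIODS = 32
PERIOD_US = 1024         # nominal PELT period ("the 1 ms window")
BASE_RESOLUTION_US = 1   # assumed resolution of the unshifted clock


def scaled_now(now_us, shift):
    """Clock value that would be handed to the update after a right shift."""
    return now_us >> shift


def periods_elapsed(delta_us, shift):
    """Whole periods the update would see for a real-time delta."""
    return scaled_now(delta_us, shift) // PERIOD_US


def effective_half_life_ms(shift):
    """Real-time half-life when the clock is slowed down by 2 ** shift."""
    return HALFLIFE_PERIODS * (PERIOD_US << shift) / 1000.0


def resolution_us(shift):
    """Coarsest real-time step still visible after the shift."""
    return BASE_RESOLUTION_US << shift


if __name__ == "__main__":
    for shift in (0, 1, 2, 3):
        print(shift, effective_half_life_ms(shift), resolution_us(shift))
```

With a shift of 1, about 64 ms of real time maps to 32 periods, i.e. one
half-life, while the clock resolution drops to 2 us.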

>

> This might just work out right: with a half-life of 32 the fastest decay

> speed should be around ~20 msecs (?) - and Thara's numbers so far suggest

> that the sweet spot averaging is significantly longer, at a couple of

> hundred millisecs.

>

> Thanks,

>

>         Ingo
Ingo Molnar April 26, 2019, 8:35 a.m. | #13
* Vincent Guittot <vincent.guittot@linaro.org> wrote:

> On Thu, 25 Apr 2019 at 19:44, Ingo Molnar <mingo@kernel.org> wrote:

> >

> >

> > * Ingo Molnar <mingo@kernel.org> wrote:

> >

> > >

> > > * Peter Zijlstra <peterz@infradead.org> wrote:

> > >

> > > > On Wed, Apr 17, 2019 at 08:29:32PM +0200, Ingo Molnar wrote:

> > > > > Assuming PeterZ & Rafael & Quentin doesn't hate the whole thermal load

> > > > > tracking approach.

> > > >

> > > > I seem to remember competing proposals, and have forgotten everything

> > > > about them; the cover letter also didn't have references to them or

> > > > mention them in any way.

> > > >

> > > > As to the averaging and period, I personally prefer a PELT signal with

> > > > the windows lined up, if that really is too short a window, then a PELT

> > > > like signal with a natural multiple of the PELT period would make sense,

> > > > such that the windows still line up nicely.

> > > >

> > > > Mixing different averaging methods and non-aligned windows just makes me

> > > > uncomfortable.

> > >

> > > Yeah, so the problem with PELT is that while it nicely approximates

> > > variable-period decay calculations with plain additions, shifts and table

> > > lookups (i.e. accelerates pow()), AFAICS the most important decay

> > > parameter is fixed: the speed of decay, the dampening factor, which is

> > > fixed at 32:

> > >

> > >   Documentation/scheduler/sched-pelt.c

> > >

> > >   #define HALFLIFE 32

> > >

> > > Right?

> > >

> > > Thara's numbers suggest that there's high sensitivity to the speed of

> > > decay. By using PELT we'd be using whatever averaging speed there is

> > > within PELT.

> > >

> > > Now we could make that parametric of course, but that would both

> > > complicate the PELT lookup code (one more dimension) and would negatively

> > > affect code generation in a number of places.

> >

> > I missed the other solution, which is what you suggested: by

> > increasing/reducing the PELT window size we can effectively shift decay

> > speed and use just a single lookup table.

> >

> > I.e. instead of the fixed period size of 1024 in accumulate_sum(), use

> > decay_load() directly but use a different (longer) window size from 1024

> > usecs to calculate 'periods', and make it a multiple of 1024.

> 

> Can't we also scale the now parameter of ___update_load_sum() ?

> If we right shift it before calling ___update_load_sum, it should be

> the same as using a half period of 62, 128, 256ms ...

> The main drawback would be a lost of precision but we are in the range

> of 2, 4, 8us compared to the 1ms window

> 

> This is quite similar to how we scale the utilization with frequency and uarch


Yeah, that would work too.

Thanks,

	Ingo
Thara Gopinath April 26, 2019, 11:50 a.m. | #14
On 04/24/2019 11:57 AM, Ionela Voinescu wrote:
> Hi Thara,

> 

> The idea and the results look promising. I'm trying to understand better

> the cause of the improvements so I've added below some questions that

> would help me out with this.


Hi Ionela,

Thanks for the review.

> 

> 

>> Regarding testing, basic build, boot and sanity testing have been

>> performed on hikey960 mainline kernel with debian file system.

>> Further, aobench (An occlusion renderer for benchmarking realworld

>> floating point performance), dhrystone and hackbench test have been

>> run with the thermal pressure algorithm. During testing, due to

>> constraints of step wise governor in dealing with big little systems,

>> cpu cooling was disabled on little core, the idea being that

>> big core will heat up and cpu cooling device will throttle the

>> frequency of the big cores there by limiting the maximum available

>> capacity and the scheduler will spread out tasks to little cores as well.

>> Finally, this patch series has been boot tested on db410C running v5.1-rc4

>> kernel.

>>

> 

> Did you try using IPA as well? It is better equipped to deal with

> big-LITTLE systems and it's more probable IPA will be used for these

> systems, where your solution will have the biggest impact as well.

> The difference will be that you'll have both the big cluster and the

> LITTLE cluster capped in different proportions depending on their

> utilization and their efficiency.


No. I did not use IPA simply because it was not enabled in mainline. I
agree it is better equipped to deal with big-little systems. The idea
behind removing cpu cooling on the little cluster was to mimic this in
some (not the cleanest) manner. But I agree that IPA testing is possibly
the next step. Any help in this regard is appreciated.

> 

>> During the course of development various methods of capturing

>> and reflecting thermal pressure were implemented.

>>

>> The first method to be evaluated was to convert the

>> capped max frequency into capacity and have the scheduler use the

>> instantaneous value when updating cpu_capacity.

>> This method is referenced as "Instantaneous Thermal Pressure" in the

>> test results below. 

>>

>> The next two methods employs different methods of averaging the

>> thermal pressure before applying it when updating cpu_capacity.

>> The first of these methods re-used the PELT algorithm already present

>> in the kernel that does the averaging of rt and dl load and utilization.

>> This method is referenced as "Thermal Pressure Averaging using PELT fmwk"

>> in the test results below.

>>

>> The final method employs an averaging algorithm that collects and

>> decays thermal pressure based on the decay period. In this method,

>> the decay period is configurable. This method is referenced as

>> "Thermal Pressure Averaging non-PELT Algo. Decay : XXX ms" in the

>> test results below.

>>

>> The test results below shows 3-5% improvement in performance when

>> using the third solution compared to the default system today where

>> scheduler is unware of cpu capacity limitations due to thermal events.

>>

> 

> Did you happen to record the amount of capping imposed on the big cores

> when these results were obtained? Did you find scenarios where the

> capacity of the bigs resulted in being lower than the capacity of the

> LITTLEs (capacity inversion)?

> This is one case where we'll see a big impact in considering thermal

> pressure.


I think I saw capacity inversion in some scenarios. I did not
particularly capture them.

> 

> Also, given that these are more or less sustained workloads, I'm

> wondering if there is any effect on workloads running on an uncapped

> system following capping. I would image such a test being composed of a

> single threaded period (no capping) followed by a multi-threaded period

> (with capping), continued in a loop. It might be interesting to have

> something like this as well, as part of your test coverage


I do not understand this. There is either capping for a workload or no
capping. There is no sysctl entry to turn on or off capping.

Regards
Thara
> 

> 

> Thanks,

> Ionela.

> 



-- 
Regards
Thara
Ionela Voinescu April 26, 2019, 2:46 p.m. | #15
Hi Thara,

>>> Regarding testing, basic build, boot and sanity testing have been

>>> performed on hikey960 mainline kernel with debian file system.

>>> Further, aobench (An occlusion renderer for benchmarking realworld

>>> floating point performance), dhrystone and hackbench test have been

>>> run with the thermal pressure algorithm. During testing, due to

>>> constraints of step wise governor in dealing with big little systems,

>>> cpu cooling was disabled on little core, the idea being that

>>> big core will heat up and cpu cooling device will throttle the

>>> frequency of the big cores there by limiting the maximum available

>>> capacity and the scheduler will spread out tasks to little cores as well.

>>> Finally, this patch series has been boot tested on db410C running v5.1-rc4

>>> kernel.

>>>

>>

>> Did you try using IPA as well? It is better equipped to deal with

>> big-LITTLE systems and it's more probable IPA will be used for these

>> systems, where your solution will have the biggest impact as well.

>> The difference will be that you'll have both the big cluster and the

>> LITTLE cluster capped in different proportions depending on their

>> utilization and their efficiency.

> 

> No. I did not use IPA simply because it was not enabled in mainline. I

> agree it is better equipped  to deal with big-little systems. The idea

> to remove cpu cooling on little cluster was to in some (not the

> cleanest) manner to mimic this. But I agree that IPA testing is possibly

> the next step.Any help in this regard is appreciated.

> 


I see CONFIG_THERMAL_GOV_POWER_ALLOCATOR=y in the defconfig for arm64:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/configs/defconfig?h=v5.1-rc6#n413

You can enable the use of it or make it default in the defconfig.

Also, Hikey960 has the needed setup in DT:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/boot/dts/hisilicon/hi3660.dtsi?h=v5.1-rc6#n1093

This should work fine.
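
For switching the governor at run time instead of through the defconfig, a
hedged sketch using the thermal zone sysfs attributes 'policy' and
'available_policies' is shown below. The helper names are hypothetical, the
zone directory is passed in by the caller (the zone number on a given board
is an assumption to be checked), and writing to sysfs needs root.

```python
# Sketch: select a thermal governor for one zone through sysfs.
import os


def read_attr(zone_dir, name):
    """Read one sysfs attribute of a thermal zone as a stripped string."""
    with open(os.path.join(zone_dir, name)) as f:
        return f.read().strip()


def select_governor(zone_dir, governor="power_allocator"):
    """Write `governor` to <zone_dir>/policy if the zone lists it as available.

    Returns the policy read back afterwards.
    """
    available = read_attr(zone_dir, "available_policies").split()
    if governor not in available:
        raise ValueError(f"{governor!r} not offered by this zone: {available}")
    with open(os.path.join(zone_dir, "policy"), "w") as f:
        f.write(governor)
    return read_attr(zone_dir, "policy")


# Example call (placeholder zone path, run as root on the target board):
#   select_governor("/sys/class/thermal/thermal_zone0")
```

The check against available_policies avoids a failed write when the governor
is not built into the running kernel.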

>>

>>> During the course of development various methods of capturing

>>> and reflecting thermal pressure were implemented.

>>>

>>> The first method to be evaluated was to convert the

>>> capped max frequency into capacity and have the scheduler use the

>>> instantaneous value when updating cpu_capacity.

>>> This method is referenced as "Instantaneous Thermal Pressure" in the

>>> test results below. 

>>>

>>> The next two methods employs different methods of averaging the

>>> thermal pressure before applying it when updating cpu_capacity.

>>> The first of these methods re-used the PELT algorithm already present

>>> in the kernel that does the averaging of rt and dl load and utilization.

>>> This method is referenced as "Thermal Pressure Averaging using PELT fmwk"

>>> in the test results below.

>>>

>>> The final method employs an averaging algorithm that collects and

>>> decays thermal pressure based on the decay period. In this method,

>>> the decay period is configurable. This method is referenced as

>>> "Thermal Pressure Averaging non-PELT Algo. Decay : XXX ms" in the

>>> test results below.

>>>

>>> The test results below shows 3-5% improvement in performance when

>>> using the third solution compared to the default system today where

>>> scheduler is unware of cpu capacity limitations due to thermal events.

>>>

>>

>> Did you happen to record the amount of capping imposed on the big cores

>> when these results were obtained? Did you find scenarios where the

>> capacity of the bigs resulted in being lower than the capacity of the

>> LITTLEs (capacity inversion)?

>> This is one case where we'll see a big impact in considering thermal

>> pressure.

> 

> I think I saw capacity inversion in some scenarios. I did not

> particularly capture them.

> 


It would be good to observe this and possibly correlate the amount of
capping with resulting behavior and performance numbers. This would
give more confidence in the testing coverage.

You can create a specific testcase for capacity inversion by only
capping the big CPUs, as you've done for these tests, and by running
sysbench/dhrystone for example with at least nr_big_cpus tasks.

This assumes that the bigs, when fully utilized, would generate enough heat
and would be capped enough to end up with a capacity lower than the littles,
which I don't doubt can be obtained on Hikey960.
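
To make the inversion condition concrete, here is a small sketch that assumes
capacity scales linearly with the capped maximum frequency. The capacities and
frequencies in the example are made-up round numbers, not Hikey960 values, and
the function names are hypothetical.

```python
# Sketch: detect capacity inversion under a linear frequency/capacity model.


def capped_capacity(orig_capacity, max_freq_khz, capped_freq_khz):
    """Capacity left once the maximum frequency is capped (linear scaling)."""
    if not 0 < capped_freq_khz <= max_freq_khz:
        raise ValueError("capped frequency must be in (0, max_freq_khz]")
    return orig_capacity * capped_freq_khz // max_freq_khz


def is_capacity_inversion(big_capped_capacity, little_capacity):
    """True when a capped big CPU offers less capacity than a little CPU."""
    return big_capped_capacity < little_capacity


if __name__ == "__main__":
    # Hypothetical platform: big = 1024 at 2.0 GHz, little = 400.
    BIG_CAPACITY, BIG_MAX_KHZ, LITTLE_CAPACITY = 1024, 2_000_000, 400
    for cap_khz in (2_000_000, 1_000_000, 700_000):
        cap = capped_capacity(BIG_CAPACITY, BIG_MAX_KHZ, cap_khz)
        print(cap_khz, cap, is_capacity_inversion(cap, LITTLE_CAPACITY))
```

Logging the capped frequency alongside each benchmark run would allow this
check to be applied after the fact.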

>>

>> Also, given that these are more or less sustained workloads, I'm

>> wondering if there is any effect on workloads running on an uncapped

>> system following capping. I would image such a test being composed of a

>> single threaded period (no capping) followed by a multi-threaded period

>> (with capping), continued in a loop. It might be interesting to have

>> something like this as well, as part of your test coverage

> 

> I do not understand this. There is either capping for a workload or no

> capping. There is no sysctl entry to turn on or off capping.

> 


I was thinking of this as a secondary effect. If you have only one big
CPU even fully utilized, with the others quiet, you might not see any
capping. But when you have a multi-threaded workload, with all or at
least the bigs at a high OPP, the platform will definitely overheat and
there will be capping. 

Thanks,
Ionela.

> Regards

> Thara

>>

>>

>> Thanks,

>> Ionela.

>>

> 

>
Ionela Voinescu April 29, 2019, 1:29 p.m. | #16
Hi Thara,

> 

> 			Hackbench: (1 group , 30000 loops, 10 runs)

> 				Result            Standard Deviation

> 				(Time Secs)        (% of mean)

> 

> No Thermal Pressure             10.21                   7.99%

> 

> Instantaneous thermal pressure  10.16                   5.36%

> 

> Thermal Pressure Averaging

> using PELT fmwk                 9.88                    3.94%

> 

> Thermal Pressure Averaging

> non-PELT Algo. Decay : 500 ms   9.94                    4.59%

> 

> Thermal Pressure Averaging

> non-PELT Algo. Decay : 250 ms   7.52                    5.42%

> 

> Thermal Pressure Averaging

> non-PELT Algo. Decay : 125 ms   9.87                    3.94%

> 

> 


I'm trying your patches on my Hikey960 and I'm getting different results
than the ones here.

I'm running with the step-wise governor, enabled only on the big cores.
The decay period is set to 250ms.

The result for hackbench is:

# ./hackbench -g 1 -l 30000
Running in process mode with 1 groups using 40 file descriptors each (== 40 tasks)
Each sender will pass 30000 messages of 100 bytes
Time: 20.756

During the run I see the little cores running at maximum frequency
(1.84GHz) while the big cores run mostly at 1.8GHz, only sometimes capped
at 1.42GHz. There should not be any capacity inversion.
The temperature is kept around 75 degrees (73 to 77 degrees).

I don't have any kind of active cooling (no fans on the board), only a
heatsink on the SoC.

But as you can see, my results (~20s) are very far from the 7-10s in your
results.

Do you see anything wrong with this process? Can you give me more
details on your setup that I can use to test on my board?

Thank you,
Ionela.
Ionela Voinescu April 30, 2019, 2:39 p.m. | #17
Hi Thara,

On 29/04/2019 14:29, Ionela Voinescu wrote:
> Hi Thara,

> 

>>

>> 			Hackbench: (1 group , 30000 loops, 10 runs)

>> 				Result            Standard Deviation

>> 				(Time Secs)        (% of mean)

>>

>> No Thermal Pressure             10.21                   7.99%

>>

>> Instantaneous thermal pressure  10.16                   5.36%

>>

>> Thermal Pressure Averaging

>> using PELT fmwk                 9.88                    3.94%

>>

>> Thermal Pressure Averaging

>> non-PELT Algo. Decay : 500 ms   9.94                    4.59%

>>

>> Thermal Pressure Averaging

>> non-PELT Algo. Decay : 250 ms   7.52                    5.42%

>>

>> Thermal Pressure Averaging

>> non-PELT Algo. Decay : 125 ms   9.87                    3.94%

>>

>>

> 

> I'm trying your patches on my Hikey960 and I'm getting different results

> than the ones here.

> 

> I'm running with the step-wise governor, enabled only on the big cores.

> The decay period is set to 250ms.

> 

> The result for hackbench is:

> 

> # ./hackbench -g 1 -l 30000

> Running in process mode with 1 groups using 40 file descriptors each (== 40 tasks)

> Each sender will pass 30000 messages of 100 bytes

> Time: 20.756

> 

> During the run I see the little cores running at maximum frequency

> (1.84GHz) while the big cores run mostly at 1.8GHz, only sometimes capped

> at 1.42GHz. There should not be any capacity inversion.

> The temperature is kept around 75 degrees (73 to 77 degrees).

> 

> I don't have any kind of active cooling (no fans on the board), only a

> heatsink on the SoC.

> 

> But as you see my results(~20s) are very far from the 7-10s in your

> results.

> 

> Do you see anything wrong with this process? Can you give me more

> details on your setup that I can use to test on my board?

> 


I've found that my poor results above were due to debug options
mistakenly left enabled in the defconfig. Sorry about that!

After cleaning it up I'm getting results around 5.6s for this test case.
I've run 50 iterations for each test, with a 90s cool-down period between
them.


 			Hackbench: (1 group , 30000 loops, 50 runs)
 				Result            Standard Deviation
 				(Time Secs)        (% of mean)

 No Thermal Pressure(step_wise)  5.644                   7.760%
 No Thermal Pressure(IPA)        5.677                   9.062%

 Thermal Pressure Averaging
 non-PELT Algo. Decay : 250 ms   5.627                   5.593%
 (step-wise, bigs capped only)

 Thermal Pressure Averaging
 non-PELT Algo. Decay : 250 ms   5.690                   3.738%
 (IPA)

All of the results above are within 1.1% of each other, which is well
below the standard deviation of any individual configuration.
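
The comparison can be reproduced from the table with a few lines; this is
only a sketch of the arithmetic, using the means and standard deviations
listed above, with shortened labels of my own.

```python
# Sketch: compare the spread of the mean times with the per-run deviation.
# label -> (mean time in seconds, standard deviation in % of mean)
results = {
    "no thermal pressure (step_wise)": (5.644, 7.760),
    "no thermal pressure (IPA)": (5.677, 9.062),
    "thermal pressure 250 ms (step-wise, bigs capped only)": (5.627, 5.593),
    "thermal pressure 250 ms (IPA)": (5.690, 3.738),
}


def spread_percent(means):
    """Difference between slowest and fastest mean, in % of the fastest."""
    fastest, slowest = min(means), max(means)
    return (slowest - fastest) / fastest * 100.0


means = [mean for mean, _ in results.values()]
stddevs = [dev for _, dev in results.values()]

spread = spread_percent(means)
smallest_stddev = min(stddevs)

if __name__ == "__main__":
    print(f"spread of means: {spread:.2f}%")
    print(f"smallest standard deviation: {smallest_stddev:.3f}%")
```

Since the spread of the means is smaller than even the smallest standard
deviation, the differences between configurations cannot be separated from
run-to-run noise for this test.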

I wanted to run this initially to validate my setup and understand
if there is any conclusion we can draw from a test like this, that
floods the CPUs with tasks. Looking over the traces, the tasks are
running almost back to back, trying to use all available resources,
on all the CPUs.
Therefore, I doubt that knowing about thermal pressure would allow any
better decisions to be made for this use case.

Next I'll try a capacity inversion use case and post the results when
they are ready.

Hope it helps,
Ionela.


> Thank you,

> Ionela.

>
Thara Gopinath April 30, 2019, 3:57 p.m. | #18
On 04/29/2019 09:29 AM, Ionela Voinescu wrote:
> Hi Thara,

> 

>>

>> 			Hackbench: (1 group , 30000 loops, 10 runs)

>> 				Result            Standard Deviation

>> 				(Time Secs)        (% of mean)

>>

>> No Thermal Pressure             10.21                   7.99%

>>

>> Instantaneous thermal pressure  10.16                   5.36%

>>

>> Thermal Pressure Averaging

>> using PELT fmwk                 9.88                    3.94%

>>

>> Thermal Pressure Averaging

>> non-PELT Algo. Decay : 500 ms   9.94                    4.59%

>>

>> Thermal Pressure Averaging

>> non-PELT Algo. Decay : 250 ms   7.52                    5.42%

>>

>> Thermal Pressure Averaging

>> non-PELT Algo. Decay : 125 ms   9.87                    3.94%

>>

>>

> 

> I'm trying your patches on my Hikey960 and I'm getting different results

> than the ones here.

> 

> I'm running with the step-wise governor, enabled only on the big cores.

> The decay period is set to 250ms.

> 

> The result for hackbench is:

> 

> # ./hackbench -g 1 -l 30000

> Running in process mode with 1 groups using 40 file descriptors each (== 40 tasks)

> Each sender will pass 30000 messages of 100 bytes

> Time: 20.756

> 

> During the run I see the little cores running at maximum frequency

> (1.84GHz) while the big cores run mostly at 1.8GHz, only sometimes capped

> at 1.42GHz. There should not be any capacity inversion.

> The temperature is kept around 75 degrees (73 to 77 degrees).

> 

> I don't have any kind of active cooling (no fans on the board), only a

> heatsink on the SoC.

> 

> But as you see my results(~20s) are very far from the 7-10s in your

> results.

> 

> Do you see anything wrong with this process? Can you give me more

> details on your setup that I can use to test on my board? 


Hi Ionela,

I used the latest mainline kernel with sched/ tip merged in for my
testing. My hikey960 did not have any fan or heat sink during testing. I
disabled cpu cooling for little cores in the dts files.
Also I have to warn you that I have managed to blow up my hikey960. So I
have not had a functional board for the past two weeks or so.

I don't have my test scripts to send you, but I have some of the results
files downloaded which I can send you in a separate email.
I did run the test for 10 rounds.

Also I think 20s is too much of a variation for the test results. Like I
mentioned in my previous emails, I think the 7.52 is an anomaly, but the
results should be in the range of 8-9 s.

Regards
Thara

> 

> Thank you,

> Ionela.

> 



-- 
Regards
Thara
Thara Gopinath April 30, 2019, 4:02 p.m. | #19
On 04/30/2019 11:57 AM, Thara Gopinath wrote:
> On 04/29/2019 09:29 AM, Ionela Voinescu wrote:

>> Hi Thara,

>>

>>>

>>> 			Hackbench: (1 group , 30000 loops, 10 runs)

>>> 				Result            Standard Deviation

>>> 				(Time Secs)        (% of mean)

>>>

>>> No Thermal Pressure             10.21                   7.99%

>>>

>>> Instantaneous thermal pressure  10.16                   5.36%

>>>

>>> Thermal Pressure Averaging

>>> using PELT fmwk                 9.88                    3.94%

>>>

>>> Thermal Pressure Averaging

>>> non-PELT Algo. Decay : 500 ms   9.94                    4.59%

>>>

>>> Thermal Pressure Averaging

>>> non-PELT Algo. Decay : 250 ms   7.52                    5.42%

>>>

>>> Thermal Pressure Averaging

>>> non-PELT Algo. Decay : 125 ms   9.87                    3.94%

>>>

>>>

>>

>> I'm trying your patches on my Hikey960 and I'm getting different results

>> than the ones here.

>>

>> I'm running with the step-wise governor, enabled only on the big cores.

>> The decay period is set to 250ms.

>>

>> The result for hackbench is:

>>

>> # ./hackbench -g 1 -l 30000

>> Running in process mode with 1 groups using 40 file descriptors each (== 40 tasks)

>> Each sender will pass 30000 messages of 100 bytes

>> Time: 20.756

>>

>> During the run I see the little cores running at maximum frequency

>> (1.84GHz) while the big cores run mostly at 1.8GHz, only sometimes capped

>> at 1.42GHz. There should not be any capacity inversion.

>> The temperature is kept around 75 degrees (73 to 77 degrees).

>>

>> I don't have any kind of active cooling (no fans on the board), only a

>> heatsink on the SoC.

>>

>> But as you see my results(~20s) are very far from the 7-10s in your

>> results.

>>

>> Do you see anything wrong with this process? Can you give me more

>> details on your setup that I can use to test on my board? 

> 

> Hi Ionela,

> 

> I used the latest mainline kernel with sched/ tip merged in for my

> testing. My hikey960 did not have any fan or heat sink during testing. I

> disabled cpu cooling for little cores in the dts files.

> Also I have to warn you that I have managed to blow up my hikey960. So I

> no longer have a functional board for past two weeks or so.

> 

> I don't have my test scripts to send you, but I have some of the results

> files downloaded which I can send you in a separate email.

> I did run the test 10 rounds.


Hi Ionela,

I failed to mention that I drop the first run for averaging.

> 

> Also I think 20s is too much of variation for the test results. Like I

> mentioned in my previous emails I think the 7.52 is an anomaly but the

> results should be around the range of 8-9 s.


Also, since we are more interested in comparison than in absolute
numbers, did you run tests on a system with no thermal pressure (to see
if there are any improvements)?

Regards
Thara

> 

> Regards

> Thara

> 

>>

>> Thank you,

>> Ionela.

>>

> 

> 



-- 
Regards
Thara
Thara Gopinath April 30, 2019, 4:10 p.m. | #20
On 04/30/2019 10:39 AM, Ionela Voinescu wrote:
> Hi Thara,

> 

> On 29/04/2019 14:29, Ionela Voinescu wrote:

>> Hi Thara,

>>

>>>

>>> 			Hackbench: (1 group , 30000 loops, 10 runs)

>>> 				Result            Standard Deviation

>>> 				(Time Secs)        (% of mean)

>>>

>>> No Thermal Pressure             10.21                   7.99%

>>>

>>> Instantaneous thermal pressure  10.16                   5.36%

>>>

>>> Thermal Pressure Averaging

>>> using PELT fmwk                 9.88                    3.94%

>>>

>>> Thermal Pressure Averaging

>>> non-PELT Algo. Decay : 500 ms   9.94                    4.59%

>>>

>>> Thermal Pressure Averaging

>>> non-PELT Algo. Decay : 250 ms   7.52                    5.42%

>>>

>>> Thermal Pressure Averaging

>>> non-PELT Algo. Decay : 125 ms   9.87                    3.94%

>>>

>>>

>>

>> I'm trying your patches on my Hikey960 and I'm getting different results

>> than the ones here.

>>

>> I'm running with the step-wise governor, enabled only on the big cores.

>> The decay period is set to 250ms.

>>

>> The result for hackbench is:

>>

>> # ./hackbench -g 1 -l 30000

>> Running in process mode with 1 groups using 40 file descriptors each (== 40 tasks)

>> Each sender will pass 30000 messages of 100 bytes

>> Time: 20.756

>>

>> During the run I see the little cores running at maximum frequency

>> (1.84GHz) while the big cores run mostly at 1.8GHz, only sometimes capped

>> at 1.42GHz. There should not be any capacity inversion.

>> The temperature is kept around 75 degrees (73 to 77 degrees).

>>

>> I don't have any kind of active cooling (no fans on the board), only a

>> heatsink on the SoC.

>>

>> But as you see my results(~20s) are very far from the 7-10s in your

>> results.

>>

>> Do you see anything wrong with this process? Can you give me more

>> details on your setup that I can use to test on my board?

>>

> 

> I've found that my poor results above were due to debug options

> mistakenly left enabled in the defconfig. Sorry about that!

> 

> After cleaning it up I'm getting results around 5.6s for this test case.

> I've run 50 iterations for each test, with 90s cool down period between

> them.

> 

> 

>  			Hackbench: (1 group , 30000 loops, 50 runs)

>  				Result            Standard Deviation

>  				(Time Secs)        (% of mean)

> 

>  No Thermal Pressure(step_wise)  5.644                   7.760%

>  No Thermal Pressure(IPA)        5.677                   9.062%

> 

>  Thermal Pressure Averaging

>  non-PELT Algo. Decay : 250 ms   5.627                   5.593%

>  (step-wise, bigs capped only)

> 

>  Thermal Pressure Averaging

>  non-PELT Algo. Decay : 250 ms   5.690                   3.738%

>  (IPA)

> 

> All of the results above are within 1.1% difference with a

> significantly higher standard deviation.


Hi Ionela,

I have replied to your original emails without seeing this one. So,
interesting results. I see IPA is slightly worse off than step-wise in
both thermal pressure and non-thermal pressure scenarios. Did you try
a 500 ms decay period by any chance?

> 

> I wanted to run this initially to validate my setup and understand

> if there is any conclusion we can draw from a test like this, that

> floods the CPUs with tasks. Looking over the traces, the tasks are

> running almost back to back, trying to use all available resources,

> on all the CPUs.

> Therefore, I doubt that there could be better decisions that could be

> made, knowing about thermal pressure, for this usecase.

> 

> I'll try next some capacity inversion usecase and post the results when

> they are ready.


Sure. Let me know if I can help.

Regards
Thara

> 

> Hope it helps,

> Ionela.

> 

> 

>> Thank you,

>> Ionela.

>>



-- 
Regards
Thara
Ionela Voinescu May 2, 2019, 10:44 a.m. | #21
Hi Thara,

>> After cleaning it up I'm getting results around 5.6s for this test case.

>> I've run 50 iterations for each test, with 90s cool down period between

>> them.

>>

>>

>>  			Hackbench: (1 group , 30000 loops, 50 runs)

>>  				Result            Standard Deviation

>>  				(Time Secs)        (% of mean)

>>

>>  No Thermal Pressure(step_wise)  5.644                   7.760%

>>  No Thermal Pressure(IPA)        5.677                   9.062%

>>

>>  Thermal Pressure Averaging

>>  non-PELT Algo. Decay : 250 ms   5.627                   5.593%

>>  (step-wise, bigs capped only)

>>

>>  Thermal Pressure Averaging

>>  non-PELT Algo. Decay : 250 ms   5.690                   3.738%

>>  (IPA)

>>

>> All of the results above are within 1.1% difference with a

>> significantly higher standard deviation.

> 

> Hi Ionela,

> 

> I have replied to your original emails without seeing this one. So,

> interesting results. I see IPA is worse off (Slightly) than step wise in

> both thermal pressure and non-thermal pressure scenarios. Did you try

> 500 ms decay period by any chance?

>


I don't think we can draw a conclusion on that given how close the
results are and given the high standard deviation. Probably if I run
them again the tables will be turned :).
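
As a rough check of that statement, the spread between the four hackbench
means can be compared with the per-configuration standard deviation. The
sketch below only reuses the numbers from the quoted table; the variable
names are illustrative.

```python
# Hackbench results from the quoted table:
# (mean time in seconds, standard deviation as % of mean)
results = {
    "no TP, step_wise": (5.644, 7.760),
    "no TP, IPA": (5.677, 9.062),
    "TP 250 ms, step_wise": (5.627, 5.593),
    "TP 250 ms, IPA": (5.690, 3.738),
}

means = [mean for mean, _ in results.values()]

# Gap between the slowest and the fastest configuration.
spread = max(means) - min(means)
spread_pct = 100.0 * spread / min(means)

# Smallest standard deviation, converted from % of mean to seconds.
smallest_std = min(mean * pct / 100.0 for mean, pct in results.values())

print(f"spread between means: {spread:.3f} s ({spread_pct:.2f}%)")
print(f"smallest std dev:     {smallest_std:.3f} s")
```

The gap between the best and worst mean is several times smaller than even
the smallest standard deviation, so the ordering of the configurations is
not meaningful here.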

I did not run experiments with different decay periods yet, as I first
want to have a list of experiments that are relevant for thermal pressure
and that can later help with refining the solution, which can mean either
deciding on a decay period or possibly going with the instantaneous
thermal pressure. Please find more details below.

>>

>> I wanted to run this initially to validate my setup and understand

>> if there is any conclusion we can draw from a test like this, that

>> floods the CPUs with tasks. Looking over the traces, the tasks are

>> running almost back to back, trying to use all available resources,

>> on all the CPUs.

>> Therefore, I doubt that there could be better decisions that could be

>> made, knowing about thermal pressure, for this usecase.

>>

>> I'll try next some capacity inversion usecase and post the results when

>> they are ready.

> 


I've started looking into this, starting from the most obvious case of
capacity inversion: using the user-space thermal governor and capping
the bigs to their lowest OPP. The LITTLEs are left uncapped.

This was not enough on the Hikey960 as the bigs at their lowest OPP were
within the capacity margin of the LITTLEs at their highest OPP. That meant
that LITTLEs would not pull tasks from the bigs, even if they had higher
capacity, as the capacity was within the 25% margin. So another
change I've made was to set the capacity margin in fair.c to 10%.
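
To illustrate the effect of the margin, here is a minimal sketch of a
margin comparison. It assumes the margin is a fixed-point factor relative
to 1024, so that 1280 stands for the 25% margin and 1126 for roughly 10%.
The function name and the capacity values are hypothetical and are not
taken from fair.c or from measurements.

```python
SCALE = 1024  # assumed fixed-point scale for capacity values


def is_meaningfully_smaller(cap, ref_cap, margin):
    """Return True if cap is below ref_cap by more than the margin.

    margin is a fixed-point factor: 1280 means 1280/1024 = 1.25 (25%).
    """
    return cap * margin < ref_cap * SCALE


# Hypothetical capacities: bigs capped to their lowest OPP vs. uncapped LITTLEs.
capped_big = 400
little = 462

# With a 25% margin the capped bigs are not seen as smaller than the LITTLEs.
print(is_meaningfully_smaller(capped_big, little, 1280))  # False

# With a ~10% margin (1126/1024) the difference becomes visible.
print(is_meaningfully_smaller(capped_big, little, 1126))  # True
```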

I've run both sysbench and dhrystone. I'll put here only the results for
sysbench, interleaved, with and without considering thermal pressure (TP
and !TP).
As before, the TP solution uses averaging with a 250ms decay period.

                        Sysbench: (500000 req, 4 runs)
                                  Result          Standard Deviation
                                  (Time Secs)     (% of mean)

  !TP/4 threads                   146.46          0.063%
  TP/4 threads                    136.36          0.002%

  !TP/5 threads                   115.38          0.028%
  TP/5 threads                    110.62          0.006%

  !TP/6 threads                   95.38           0.051%
  TP/6 threads                    93.07           0.054%

  !TP/7 threads                   81.19           0.012%
  TP/7 threads                    80.32           0.028%

  !TP/8 threads                   72.58           2.295%
  TP/8 threads                    71.37           0.044%

As expected, the results are significantly improved when the scheduler
is made aware of the reduced capacity on the bigs, which results in tasks
being placed on or migrated to the LITTLEs, which are able to provide
better performance. Traces nicely confirm this.
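
For reference, the relative improvement per thread count can be derived
from the sysbench table above. This is only arithmetic on the posted
numbers; the dictionary layout is illustrative.

```python
# Sysbench times in seconds from the table above:
# thread count -> (without thermal pressure, with thermal pressure)
sysbench = {
    4: (146.46, 136.36),
    5: (115.38, 110.62),
    6: (95.38, 93.07),
    7: (81.19, 80.32),
    8: (72.58, 71.37),
}

# Percentage reduction in run time when thermal pressure is considered.
improvement = {
    threads: 100.0 * (no_tp - tp) / no_tp
    for threads, (no_tp, tp) in sysbench.items()
}

for threads, pct in improvement.items():
    print(f"{threads} threads: {pct:.1f}% faster with TP")
```

The gain is largest with 4 threads (about 6.9%) and drops to roughly 1-2%
with 7 and 8 threads, when all CPUs are busy either way.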

Note that these results only show that reflecting thermal
pressure in the capacity of the CPUs is useful and that the scheduler is
equipped to make proper use of this information.
Possibly a thing to consider is whether or not to reduce the capacity
margin, but that's for another discussion.

This does not reflect the benefits of averaging, as, with the bigs
always being capped to the lowest OPP, the thermal pressure value will
be constant over the duration of the workload. The same results would
have been obtained with instantaneous thermal pressure.
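
A small sketch of why a constant cap hides the effect of averaging. It
uses a generic exponentially weighted average as a stand-in, which is an
assumption and not the algorithm from the patches; the weight and the
pressure values are made up.

```python
def averaged_pressure(samples, weight=0.5):
    """Return the running weighted average after each sample.

    Generic exponential average, used only for illustration.
    """
    avg = 0.0
    history = []
    for sample in samples:
        avg = weight * avg + (1.0 - weight) * sample
        history.append(avg)
    return history


# Bigs always capped: the instantaneous pressure is the same every period.
constant = averaged_pressure([300] * 20)

# Capping toggles on and off: the instantaneous pressure alternates.
toggled = averaged_pressure([300, 0] * 10)

print(f"constant cap, final average: {constant[-1]:.2f} (instantaneous: 300)")
print(f"toggled cap, final average:  {toggled[-1]:.2f} (instantaneous: 0)")
```

With a constant cap the average converges to the instantaneous value, so
both approaches give the scheduler the same capacity. They only diverge
when the capping changes during the workload.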


Secondly, I've tried to use the step-wise governor, modified to only
cap the big CPUs, with the intention to obtain smaller periods of
capacity inversion for which a thermal pressure solution would show its
benefits.

Unfortunately, dhrystone was misbehaving for some reason and was
giving me a high variation between results for the same test case.
Also, sysbench, run with the same arguments as above, was not creating
enough load and thermal capping to show the benefits of considering
thermal pressure.

So my recommendation is to continue exploring more test cases like these.
I would continue with sysbench as it looks more stable, but modify
the temperature threshold to trigger periods of drastic capping of the
bigs. Once a dynamic test case and setup like this (no fixing
frequencies) is identified, it can be used to understand if averaging
is needed, to refine the decay period, and to establish a good default.

What do you think? Does this make sense as a direction for obtaining
test cases? In my opinion the previous test cases were not triggering
the right behaviors that can help prove the need for thermal pressure,
or help refine it.

I will try to continue in this direction, but I won't be able to get to
it for a few days.

You'll find more results at: 
https://docs.google.com/spreadsheets/d/1ibxDSSSLTodLzihNAw6jM36eVZABuPMMnjvV-Xh4NEo/edit?usp=sharing


> Sure. let me know if I can help.


Any test results or recommendations for test cases would be helpful.
The need for thermal pressure is obvious, but the way that thermal
pressure is reflected in the capacity of the CPUs could be supported by
more thorough testing.

Regards,
Ionela.

> 

> Regards

> Thara

> 

>>

>> Hope it helps,

>> Ionela.

>>

>>

>>> Thank you,

>>> Ionela.

>>>

> 

>