[v5,0/6] Introduce Thermal Pressure

Message ID 1572979786-20361-1-git-send-email-thara.gopinath@linaro.org

Message

Thara Gopinath Nov. 5, 2019, 6:49 p.m. UTC
Thermal governors can respond to an overheat event of a cpu by
capping the cpu's maximum possible frequency. This in turn
means that the maximum available compute capacity of the
cpu is restricted. But today in the kernel, the task scheduler is
not notified of capping of maximum frequency of a cpu.
In other words, the scheduler is unaware of maximum capacity
restrictions placed on a cpu due to thermal activity.
This patch series attempts to address this issue.
The benefits identified are better task placement among the available
cpus in the event of overheating, which in turn leads to better
performance numbers.

The reduction in the maximum possible capacity of a cpu due to a
thermal event can be considered as thermal pressure. Instantaneous
thermal pressure is hard to record and can sometimes be erroneous,
as there can be a mismatch between the actual capping of capacity
and the scheduler recording it. The solution is therefore to maintain
a weighted average per-cpu value for thermal pressure over time.
The weight reflects the amount of time the cpu has spent at a
capped maximum frequency. Since thermal pressure is recorded as
an average, it must be decayed periodically. The existing algorithm
in the kernel scheduler PELT framework is re-used to calculate
the weighted average. This patch series also defines a sysctl
interface to allow for a configurable decay period.
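
As a rough illustration of the idea (not the code in the patches, which
reuses the existing PELT implementation in kernel/sched/pelt.c), a
simplified sketch of a geometrically decayed thermal pressure average
could look like the following. The names and constants here are made up
for the example:

/*
 * Simplified sketch only: the average is roughly halved every
 * 32 windows (~32 ms with 1024 us windows), mimicking the PELT
 * half-life, and each update blends in the capacity currently
 * lost to frequency capping.
 */
#include <stdint.h>

#define WINDOW_US	1024	/* one accumulation window, in us */
#define HALF_LIFE	32	/* windows after which the average halves */

struct thermal_pressure {
	uint64_t avg;		/* decayed average, in capacity units */
	uint64_t last_update;	/* timestamp of last update, in us */
};

/* capped_delta = max capacity - currently capped capacity */
static void thermal_pressure_update(struct thermal_pressure *tp,
				    uint64_t now_us, uint64_t capped_delta)
{
	uint64_t windows = (now_us - tp->last_update) / WINDOW_US;

	/* Decay the contribution of older samples. */
	while (windows >= HALF_LIFE) {
		tp->avg >>= 1;
		windows -= HALF_LIFE;
	}

	/* Blend in the new sample, weighted by the time spent capped. */
	tp->avg += (capped_delta * windows) / HALF_LIFE;
	tp->last_update = now_us;
}

The 32 ms half-life above mirrors the default PELT decay; the sysctl
mentioned earlier is what allows the effective decay period to be tuned
(the 32/64/128/256/512 ms rows in the tables below).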

Regarding testing, basic build, boot and sanity testing have been
performed on a db845c platform with a Debian file system.
Further, dhrystone and hackbench tests have been
run with the thermal pressure algorithm. During testing, due to
constraints of the step wise governor in dealing with big.LITTLE systems,
the trip point 0 temperature was made asymmetric between the cpus in the
little cluster and the big cluster; the idea being that the
big cores will heat up and the cpu cooling device will throttle the
frequency of the big cores sooner, thereby limiting the maximum available
capacity, and the scheduler will spread out tasks to the little cores as well.

Test Results

Hackbench: 1 group , 30000 loops, 10 runs
                                               Result         SD
                                               (Secs)     (% of mean)
 No Thermal Pressure                            14.03       2.69%
 Thermal Pressure PELT Algo. Decay : 32 ms      13.29       0.56%
 Thermal Pressure PELT Algo. Decay : 64 ms      12.57       1.56%
 Thermal Pressure PELT Algo. Decay : 128 ms     12.71       1.04%
 Thermal Pressure PELT Algo. Decay : 256 ms     12.29       1.42%
 Thermal Pressure PELT Algo. Decay : 512 ms     12.42       1.15%

Dhrystone Run Time  : 20 threads, 3000 MLOOPS
                                                 Result      SD
                                                 (Secs)    (% of mean)
 No Thermal Pressure                              9.452      4.49%
 Thermal Pressure PELT Algo. Decay : 32 ms        8.793      5.30%
 Thermal Pressure PELT Algo. Decay : 64 ms        8.981      5.29%
 Thermal Pressure PELT Algo. Decay : 128 ms       8.647      6.62%
 Thermal Pressure PELT Algo. Decay : 256 ms       8.774      6.45%
 Thermal Pressure PELT Algo. Decay : 512 ms       8.603      5.41%

A Brief History

The first version of this patch series was posted reusing the
PELT algorithm to decay the thermal pressure signal. The discussions
that followed were around whether an instantaneous thermal pressure
solution is better and whether a stand-alone algorithm to accumulate
and decay thermal pressure is more appropriate than re-using the
PELT framework.
Tests on Hikey960 showed the stand-alone algorithm performing slightly
better than reusing the PELT algorithm, and v2 was posted with the
stand-alone algorithm. Test results were shared as part of that series.
Discussions were around re-using the PELT algorithm and running
further tests with a more granular decay period.

For some time after this, development was impeded due to hardware
unavailability and some other unforeseen and possibly unfortunate events.
For this version, the h/w was switched from Hikey960 to db845c.
Instantaneous thermal pressure was not tested as part of this
cycle, as it is clear that a weighted average is the better implementation.
In this round of testing, the non-PELT algorithm never gave any conclusive
results to prove that it is better than reusing the PELT algorithm.
Reusing the PELT algorithm also means thermal pressure tracks the
other utilization signals in the scheduler.

v4->v5:
        - "Patch 3/7:sched: Initialize per cpu thermal pressure structure"
           is dropped as it is no longer needed following changes in
           other patches.
        - rest of the change log is mentioned in the specific patches.

Thara Gopinath (6):
  sched/pelt.c: Add support to track thermal pressure
  sched/fair: Add infrastructure to store and update  instantaneous
    thermal pressure
  sched/fair: Enable periodic update of thermal pressure
  sched/fair: update cpu_capcity to reflect thermal pressure
  thermal/cpu-cooling: Update thermal pressure in case of a maximum
    frequency capping
  sched/fair: Enable tuning of decay period

 Documentation/admin-guide/kernel-parameters.txt |  5 ++
 drivers/thermal/cpu_cooling.c                   | 36 ++++++++++++-
 include/linux/sched.h                           |  9 ++++
 kernel/sched/fair.c                             | 69 +++++++++++++++++++++++++
 kernel/sched/pelt.c                             | 13 +++++
 kernel/sched/pelt.h                             |  7 +++
 kernel/sched/sched.h                            |  1 +
 7 files changed, 138 insertions(+), 2 deletions(-)
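
For a rough picture of how the pieces above fit together: the cpu
cooling device reports how much capacity is lost to a frequency cap,
pelt.c decays that into a per-cpu average, and fair.c subtracts the
averaged pressure from the cpu's capacity. A simplified sketch of that
last step is below; the helper cpu_thermal_pressure_avg() is a
hypothetical name for the example, not the exact interface used in the
patches:

static unsigned long effective_cpu_capacity(int cpu)
{
	unsigned long max_cap = arch_scale_cpu_capacity(cpu);
	/* Hypothetical accessor for the decayed per-cpu pressure. */
	unsigned long pressure = cpu_thermal_pressure_avg(cpu);

	if (pressure > max_cap)
		pressure = max_cap;

	/* Capacity that remains usable under the current capping. */
	return max_cap - pressure;
}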

-- 
2.1.4

Comments

Lukasz Luba Nov. 12, 2019, 11:21 a.m. UTC | #1
Hi Thara,

I am going to try your patch set on a different board.
To do that I need more information regarding your setup.
Please find my comments below. There is probably one hack
which I do not fully understand.

On 11/5/19 6:49 PM, Thara Gopinath wrote:
> Thermal governors can respond to an overheat event of a cpu by
> capping the cpu's maximum possible frequency. This in turn
> means that the maximum available compute capacity of the
> cpu is restricted. But today in the kernel, the task scheduler is
> not notified of capping of maximum frequency of a cpu.
> In other words, the scheduler is unaware of maximum capacity
> restrictions placed on a cpu due to thermal activity.
> This patch series attempts to address this issue.
> The benefits identified are better task placement among the available
> cpus in the event of overheating, which in turn leads to better
> performance numbers.
>
> The reduction in the maximum possible capacity of a cpu due to a
> thermal event can be considered as thermal pressure. Instantaneous
> thermal pressure is hard to record and can sometimes be erroneous,
> as there can be a mismatch between the actual capping of capacity
> and the scheduler recording it. The solution is therefore to maintain
> a weighted average per-cpu value for thermal pressure over time.
> The weight reflects the amount of time the cpu has spent at a
> capped maximum frequency. Since thermal pressure is recorded as
> an average, it must be decayed periodically. The existing algorithm
> in the kernel scheduler PELT framework is re-used to calculate
> the weighted average. This patch series also defines a sysctl
> interface to allow for a configurable decay period.
>
> Regarding testing, basic build, boot and sanity testing have been
> performed on a db845c platform with a Debian file system.
> Further, dhrystone and hackbench tests have been
> run with the thermal pressure algorithm. During testing, due to
> constraints of the step wise governor in dealing with big.LITTLE systems,

I don't understand this modification. Could you explain what the issue
was, and whether this modification breaks the original thermal solution
upfront? You are then comparing against this modified version and
treating it as the 'origin', am I right?

> the trip point 0 temperature was made asymmetric between the cpus in the
> little cluster and the big cluster; the idea being that the
> big cores will heat up and the cpu cooling device will throttle the
> frequency of the big cores sooner, thereby limiting the maximum available
> capacity, and the scheduler will spread out tasks to the little cores as well.
>
> Test Results
>
> Hackbench: 1 group , 30000 loops, 10 runs
>                                                 Result         SD
>                                                 (Secs)     (% of mean)
>   No Thermal Pressure                            14.03       2.69%
>   Thermal Pressure PELT Algo. Decay : 32 ms      13.29       0.56%
>   Thermal Pressure PELT Algo. Decay : 64 ms      12.57       1.56%
>   Thermal Pressure PELT Algo. Decay : 128 ms     12.71       1.04%
>   Thermal Pressure PELT Algo. Decay : 256 ms     12.29       1.42%
>   Thermal Pressure PELT Algo. Decay : 512 ms     12.42       1.15%
>
> Dhrystone Run Time  : 20 threads, 3000 MLOOPS
>                                                   Result      SD
>                                                   (Secs)    (% of mean)
>   No Thermal Pressure                              9.452      4.49%
>   Thermal Pressure PELT Algo. Decay : 32 ms        8.793      5.30%
>   Thermal Pressure PELT Algo. Decay : 64 ms        8.981      5.29%
>   Thermal Pressure PELT Algo. Decay : 128 ms       8.647      6.62%
>   Thermal Pressure PELT Algo. Decay : 256 ms       8.774      6.45%
>   Thermal Pressure PELT Algo. Decay : 512 ms       8.603      5.41%
>

What I would also like to see for these performance results is the
average temperature of the chip. Is it higher than in the 'origin'?

Regards,
Lukasz Luba

Amit Kucheria Nov. 19, 2019, 10:54 a.m. UTC | #2
On Wed, Nov 6, 2019 at 12:20 AM Thara Gopinath
<thara.gopinath@linaro.org> wrote:
> Thermal governors can respond to an overheat event of a cpu by
> capping the cpu's maximum possible frequency. This in turn
> means that the maximum available compute capacity of the
> cpu is restricted. But today in the kernel, the task scheduler is
> not notified of capping of maximum frequency of a cpu.
> In other words, the scheduler is unaware of maximum capacity
> restrictions placed on a cpu due to thermal activity.
> This patch series attempts to address this issue.
> The benefits identified are better task placement among the available
> cpus in the event of overheating, which in turn leads to better
> performance numbers.
>
> The reduction in the maximum possible capacity of a cpu due to a
> thermal event can be considered as thermal pressure. Instantaneous
> thermal pressure is hard to record and can sometimes be erroneous,
> as there can be a mismatch between the actual capping of capacity
> and the scheduler recording it. The solution is therefore to maintain
> a weighted average per-cpu value for thermal pressure over time.
> The weight reflects the amount of time the cpu has spent at a
> capped maximum frequency. Since thermal pressure is recorded as
> an average, it must be decayed periodically. The existing algorithm
> in the kernel scheduler PELT framework is re-used to calculate
> the weighted average. This patch series also defines a sysctl
> interface to allow for a configurable decay period.
>
> Regarding testing, basic build, boot and sanity testing have been
> performed on a db845c platform with a Debian file system.
> Further, dhrystone and hackbench tests have been
> run with the thermal pressure algorithm. During testing, due to
> constraints of the step wise governor in dealing with big.LITTLE systems,


What constraints?

> the trip point 0 temperature was made asymmetric between the cpus in the
> little cluster and the big cluster; the idea being that the
> big cores will heat up and the cpu cooling device will throttle the
> frequency of the big cores sooner, thereby limiting the maximum available
> capacity, and the scheduler will spread out tasks to the little cores as well.


Can you share the hack to get this behaviour as well so I can try to
reproduce on 845c?

> Test Results
>
> Hackbench: 1 group , 30000 loops, 10 runs
>                                                Result         SD
>                                                (Secs)     (% of mean)
>  No Thermal Pressure                            14.03       2.69%
>  Thermal Pressure PELT Algo. Decay : 32 ms      13.29       0.56%
>  Thermal Pressure PELT Algo. Decay : 64 ms      12.57       1.56%
>  Thermal Pressure PELT Algo. Decay : 128 ms     12.71       1.04%
>  Thermal Pressure PELT Algo. Decay : 256 ms     12.29       1.42%
>  Thermal Pressure PELT Algo. Decay : 512 ms     12.42       1.15%
>
Lukasz Luba Nov. 19, 2019, 3:12 p.m. UTC | #3
On 11/12/19 11:21 AM, Lukasz Luba wrote:
> Hi Thara,
>
> I am going to try your patch set on a different board.
> To do that I need more information regarding your setup.
> Please find my comments below. There is probably one hack
> which I do not fully understand.
>
> On 11/5/19 6:49 PM, Thara Gopinath wrote:
>> Thermal governors can respond to an overheat event of a cpu by
>> capping the cpu's maximum possible frequency. This in turn
>> means that the maximum available compute capacity of the
>> cpu is restricted. But today in the kernel, the task scheduler is
>> not notified of capping of maximum frequency of a cpu.
>> In other words, the scheduler is unaware of maximum capacity
>> restrictions placed on a cpu due to thermal activity.
>> This patch series attempts to address this issue.
>> The benefits identified are better task placement among the available
>> cpus in the event of overheating, which in turn leads to better
>> performance numbers.
>>
>> The reduction in the maximum possible capacity of a cpu due to a
>> thermal event can be considered as thermal pressure. Instantaneous
>> thermal pressure is hard to record and can sometimes be erroneous,
>> as there can be a mismatch between the actual capping of capacity
>> and the scheduler recording it. The solution is therefore to maintain
>> a weighted average per-cpu value for thermal pressure over time.
>> The weight reflects the amount of time the cpu has spent at a
>> capped maximum frequency. Since thermal pressure is recorded as
>> an average, it must be decayed periodically. The existing algorithm
>> in the kernel scheduler PELT framework is re-used to calculate
>> the weighted average. This patch series also defines a sysctl
>> interface to allow for a configurable decay period.
>>
>> Regarding testing, basic build, boot and sanity testing have been
>> performed on a db845c platform with a Debian file system.
>> Further, dhrystone and hackbench tests have been
>> run with the thermal pressure algorithm. During testing, due to
>> constraints of the step wise governor in dealing with big.LITTLE systems,
> I don't understand this modification. Could you explain what the issue
> was, and whether this modification breaks the original thermal solution
> upfront? You are then comparing against this modified version and
> treating it as the 'origin', am I right?

With Ionela's help I understood the reason for doing this hack.

For those who follow: she created a 'capacity inversion' between the
big and little cores to test whether the patches really work.
How: she starts throttling the big cores at a lower temperature, so earlier
in time, and thus power is shifted towards the little cores (which are more
energy efficient and can run at a higher frequency). The big cores
run at minimum frequency and the little cores (hopefully) at maximum frequency.

This 'capacity inversion' is a use case which might occur in the
real world. It is hard to trigger it in normal benchmarks, though.
I don't know how often this 'capacity inversion' occurs and for how
long it stays in real workloads. Based on the tests run with the default
thermal solution, where the results are almost the same, I would say that
it is not often (maybe 3% of the test period, otherwise I would get better
results because this patch set solves this issue).
I have run a few different kernels and benchmarks without this
'capacity inversion' and I don't see a regression (or benefits
from this solution), which is also a big plus in case of mainlining it.

In the case where the 'capacity inversion' is artificially introduced
into the system for 100% of the time, the stress tests show a huge
difference. Please refer to Ionela's test results [1] (~30% better).

Regards,
Lukasz Luba


[1] https://docs.google.com/spreadsheets/d/1ibxDSSSLTodLzihNAw6jM36eVZABuPMMnjvV-Xh4NEo/edit#gid=0