[0/2] arm64: dts: qcom: sm8650: rework CPU & GPU thermal zones

Message ID: 20250103-topic-sm8650-thermal-cpu-idle-v1-0-faa1f011ecd9@linaro.org

Received: from mail-wr1-f44.google.com (mail-wr1-f44.google.com [209.85.221.44])
 (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
 (No client certificate requested) by smtp.subspace.kernel.org (Postfix)
 with ESMTPS id 562851FA15D for <linux-arm-msm@vger.kernel.org>;
 Fri, 3 Jan 2025 14:38:33 +0000 (UTC)
From: Neil Armstrong <neil.armstrong@linaro.org>
Subject: [PATCH 0/2] arm64: dts: qcom: sm8650: rework CPU & GPU thermal zones
Date: Fri, 03 Jan 2025 15:38:25 +0100
Message-Id: <20250103-topic-sm8650-thermal-cpu-idle-v1-0-faa1f011ecd9@linaro.org>
Precedence: bulk
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 7bit
To: Bjorn Andersson <andersson@kernel.org>,
 Konrad Dybcio <konradybcio@kernel.org>, Rob Herring <robh@kernel.org>,
 Krzysztof Kozlowski <krzk+dt@kernel.org>,
 Conor Dooley <conor+dt@kernel.org>
Cc: linux-arm-msm@vger.kernel.org, devicetree@vger.kernel.org,
 linux-kernel@vger.kernel.org, Neil Armstrong <neil.armstrong@linaro.org>

Series: arm64: dts: qcom: sm8650: rework CPU & GPU thermal zones
 [0/2] arm64: dts: qcom: sm8650: rework CPU & GPU thermal zones
 [1/2] arm64: dts: qcom: sm8650: setup cpu thermal with idle on high temperatures
 [2/2] arm64: dts: qcom: sm8650: setup gpu thermal with higher temperatures

Message

Neil Armstrong Jan. 3, 2025, 2:38 p.m. UTC

On the SM8650 platform, the dynamic clock and voltage scaling (DCVS) for
the CPUs and GPU is handled by hardware & firmware using factory and
form-factor determined parameters in order to maximize frequency while
keeping the temperature well below the junction temperature, at which the
SoC would experience a thermal shutdown, if not permanent damage.

On the other side, the High Level Operating System (HLOS), like Linux,
is able to adjust the CPU and GPU frequency using the internal SoC
temperature sensors (here tsens) and its UP/LOW interrupts, but it
effectively does the same work twice, in a less effective manner.

Let's take the hardware & firmware action into account and design the
thermal zone trip points and cooling device mappings to use the HLOS
as a safety net in case the platform experiences a temperature surge,
to hopefully avoid a thermal shutdown and handle the scenario gracefully.

On the CPU side, the LMh hardware runs the DCVS control loop, so
let's set higher trip point temperatures, closer to the junction
and thermal shutdown temperatures, and add an idle injection cooling
device with a 100% duty cycle for each CPU, acting as an emergency
action to avoid the thermal shutdown.
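
For illustration, the CPU side could look roughly like the fragment
below. This is a sketch and not the content of patch 1/2: the cpu0,
cpu0_idle and tsens0 labels, the sensor index, the idle duration and
latency, and every temperature are assumptions chosen only to show the
shape of a thermal zone mapped to an idle injection cooling device.

```dts
/*
 * Illustrative only: labels, sensor index, timings and temperatures
 * are assumptions, not values from this series.
 */
&cpu0 {
	cpu0_idle: thermal-idle {
		#cooling-cells = <2>;
		duration-us = <800000>;		/* assumed idle duration per injection */
		exit-latency-us = <10000>;	/* assumed exit latency constraint */
	};
};

/ {
	thermal-zones {
		cpu0-thermal {
			thermal-sensors = <&tsens0 1>;

			trips {
				/* Assumed emergency trip, close to shutdown */
				cpu0_alert: trip-point0 {
					temperature = <110000>;
					hysteresis = <1000>;
					type = "passive";
				};

				cpu0-critical {
					temperature = <115000>;
					hysteresis = <0>;
					type = "critical";
				};
			};

			cooling-maps {
				map0 {
					trip = <&cpu0_alert>;
					/* min and max state both 100: full idle injection */
					cooling-device = <&cpu0_idle 100 100>;
				};
			};
		};
	};
};
```

With both cooling states set to 100, crossing the passive trip requests
the maximum idle ratio on that CPU, which matches the "emergency action"
role described above rather than a gradual frequency step-down.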

On the GPU side, the GPU Management Unit (GMU) acts as the DCVS
control loop, but since we can't perform idle injection, let's
also set higher trip point temperatures, closer to the junction
and thermal shutdown temperatures, to reduce the GPU frequency only
as an emergency action before the thermal shutdown.
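
As a comparable sketch for the GPU side (again not the content of
patch 2/2), a zone could keep the GPU as a frequency-scaling cooling
device and only raise the trip points. The tsens2 and gpu labels, the
sensor index and all temperatures below are assumptions, and the gpu
node is assumed to already provide #cooling-cells = <2>.

```dts
/*
 * Illustrative only: labels, sensor index and temperatures are
 * assumptions. THERMAL_NO_LIMIT comes from
 * <dt-bindings/thermal/thermal.h>.
 */
/ {
	thermal-zones {
		gpu0-thermal {
			polling-delay-passive = <10>;
			thermal-sensors = <&tsens2 1>;

			trips {
				/* Assumed emergency trip, close to shutdown */
				gpu0_alert: trip-point0 {
					temperature = <105000>;
					hysteresis = <1000>;
					type = "passive";
				};

				gpu0-critical {
					temperature = <115000>;
					hysteresis = <0>;
					type = "critical";
				};
			};

			cooling-maps {
				map0 {
					trip = <&gpu0_alert>;
					/* Let the HLOS pick any GPU cooling state */
					cooling-device = <&gpu THERMAL_NO_LIMIT THERMAL_NO_LIMIT>;
				};
			};
		};
	};
};
```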

Those 2 changes optimize the thermal management design by avoiding
concurrent thermal management, calculations & avoidable interrupts,
moving the HLOS management to a last-resort emergency action if the
hardware & firmware fail to avoid a thermal shutdown.

Signed-off-by: Neil Armstrong <neil.armstrong@linaro.org>
---
Neil Armstrong (2):
      arm64: dts: qcom: sm8650: setup cpu thermal with idle on high temperatures
      arm64: dts: qcom: sm8650: setup gpu thermal with higher temperatures

 arch/arm64/boot/dts/qcom/sm8650.dtsi | 322 ++++++++++++++++++++++++++---------
 1 file changed, 238 insertions(+), 84 deletions(-)
---
base-commit: 8155b4ef3466f0e289e8fcc9e6e62f3f4dceeac2
change-id: 20250103-topic-sm8650-thermal-cpu-idle-1e19181a94ed

Best regards,

Comments

Konrad Dybcio Jan. 3, 2025, 2:43 p.m. UTC | #1

On 3.01.2025 3:38 PM, Neil Armstrong wrote:
> Those 2 changes optimize the thermal management design by avoiding
> concurrent thermal management, calculations & avoidable interrupts,
> moving the HLOS management to a last-resort emergency action if the
> hardware & firmware fail to avoid a thermal shutdown.
> 
> Signed-off-by: Neil Armstrong <neil.armstrong@linaro.org>
> ---

Got any numbers to back this?

Konrad

Neil Armstrong Jan. 3, 2025, 2:49 p.m. UTC | #2

On 03/01/2025 15:43, Konrad Dybcio wrote:
> On 3.01.2025 3:38 PM, Neil Armstrong wrote:
>> Those 2 changes optimize the thermal management design by avoiding
>> concurrent thermal management, calculations & avoidable interrupts,
>> moving the HLOS management to a last-resort emergency action if the
>> hardware & firmware fail to avoid a thermal shutdown.
>>
>> Signed-off-by: Neil Armstrong <neil.armstrong@linaro.org>
>> ---
> 
> Got any numbers to back this?

To back which part? Yes, I've been running loads with different
scenarios, and indeed the hardware does a much better job, with
a more linear correction and slightly better performance because
it sets slightly higher OPPs while keeping the cores closer to
the target temperature range. Which is kind of expected.

I don't have easy numbers to share, sorry...

So yes, I consider avoiding the concurrent effort to be better, and
since we already take the firmware design into account in the whole platform
representation in DT (DSPs, SCM, GMU, ...), we should also extend this
to thermal.

Neil

> 
> Konrad