
[1/2] thermal: cooling: Check Energy Model type in cpufreq_cooling and devfreq_cooling

Message ID 20220207073036.14901-2-lukasz.luba@arm.com

Commit Message

Lukasz Luba Feb. 7, 2022, 7:30 a.m. UTC
The Energy Model supports power values either in Watts or in some abstract
scale. When the abstract scale is in use, the IPA thermal governor should
not be allowed to operate, since the relation between cooling devices is
not properly defined: a big GPU might have lower power values on an
abstract scale than a little CPU. To avoid misbehaviour of the thermal
control algorithm, simply do not register a cooling device capable of
working with IPA.

Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
---
 drivers/thermal/cpufreq_cooling.c |  2 +-
 drivers/thermal/devfreq_cooling.c | 16 +++++++++++++---
 2 files changed, 14 insertions(+), 4 deletions(-)

Comments

Lukasz Luba Feb. 8, 2022, 9:32 a.m. UTC | #1
On 2/8/22 12:50 AM, Matthias Kaehlcke wrote:
> On Mon, Feb 07, 2022 at 07:30:35AM +0000, Lukasz Luba wrote:
>> The Energy Model supports power values either in Watts or in some abstract
>> scale. When the 2nd option is in use, the thermal governor IPA should not
>> be allowed to operate, since the relation between cooling devices is not
>> properly defined. Thus, it might be possible that big GPU has lower power
>> values in abstract scale than a Little CPU. To mitigate a misbehaviour
>> of the thermal control algorithm, simply not register a cooling device
>> capable of working with IPA.
> 
> Ugh, this would break thermal throttling for existing devices that are
> currently supported in the upstream kernel.

Could you point me to those devices? I cannot find them in the mainline
DT. There are no GPU devices which register an Energy Model (EM)
upstream, neither via DT (which would imply power in mW) nor by
explicitly providing an EM get_power() callback. The EM is needed for IPA.

Please clarify which existing devices are going to be broken with this
change.

> 
> Wasn't the conclusion that it is the responsability of the device tree
> owners to ensure that cooling devices with different scales aren't used
> in the same thermal zone?

It's based on the assumption that someone owns and controls the DT. There
was also an implicit assumption that IPA would work properly on such a
platform - but it won't.

1. You cannot have two thermal zones, one with the CPUs and another with
only the GPU, both working with two instances of IPA.

2. The abstract power scale doesn't guarantee anything about the power
values, and IPA was simply designed with milli-Watts in mind. So even
operation restricted to CPUs using bogoWatts is not something we could
guarantee in IPA.

> 
> That's also what's currently specified in the power allocator
> documentation:
> 
>    Another important thing is the consistent scale of the power values
>    provided by the cooling devices. All of the cooling devices in a single
>    thermal zone should have power values reported either in milli-Watts
>    or scaled to the same 'abstract scale'.

This can change. We have removed the userspace governor from the kernel
recently. The trend is to implement thermal policy in FW. Dealing with
intermediate configurations leads to a complicated design, and supporting
the algorithm logic is also more complex.

> 
> Which was actually added by yourself:
> 
> commit 5a64f775691647c242aa40d34f3512e7b179a921
> Author: Lukasz Luba <lukasz.luba@arm.com>
> Date:   Tue Nov 3 09:05:58 2020 +0000
> 
>      PM: EM: Clarify abstract scale usage for power values in Energy Model
> 
>      The Energy Model (EM) can store power values in milli-Watts or in abstract
>      scale. This might cause issues in the subsystems which use the EM for
>          estimating the device power, such as:
> 
>       - mixing of different scales in a subsystem which uses multiple
>              (cooling) devices (e.g. thermal Intelligent Power Allocation (IPA))
> 
>       - assuming that energy [milli-Joules] can be derived from the EM power
>              values which might not be possible since the power scale doesn't have
> 	           to be in milli-Watts
> 
>      To avoid misconfiguration add the requisite documentation to the EM and
>          related subsystems: EAS and IPA.
> 
>      Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
>      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> 
> It's ugly to have the abstract scales in the first place, but that's
> unfortunately what we currently have for at least some cooling devices.

A few questions:
Do you use 'we' to mean Chrome engineers?
Could you point me to those devices, please?
Are they new, or older platforms that only need maintenance?
How does IPA work for you in such a real platform configuration?
If possible, could you share some plots of temperature, frequency and
CPU/GPU utilization?
Do you maybe know how the real power was scaled for them?

It would help me understand and judge.

> 
> IMO it would be preferable to stick to catching incompliant configurations
> in reviews, rather than breaking thermal throttling for existing devices
> with configurations that comply with the current documentation.
> 

Without access to the source code of those devices, it's hard for me to
tell whether they would be broken.

Regards,
Lukasz
Matthias Kaehlcke Feb. 8, 2022, 5:25 p.m. UTC | #2
On Tue, Feb 08, 2022 at 09:32:28AM +0000, Lukasz Luba wrote:
> 
> 
> On 2/8/22 12:50 AM, Matthias Kaehlcke wrote:
> > On Mon, Feb 07, 2022 at 07:30:35AM +0000, Lukasz Luba wrote:
> > > The Energy Model supports power values either in Watts or in some abstract
> > > scale. When the 2nd option is in use, the thermal governor IPA should not
> > > be allowed to operate, since the relation between cooling devices is not
> > > properly defined. Thus, it might be possible that big GPU has lower power
> > > values in abstract scale than a Little CPU. To mitigate a misbehaviour
> > > of the thermal control algorithm, simply not register a cooling device
> > > capable of working with IPA.
> > 
> > Ugh, this would break thermal throttling for existing devices that are
> > currently supported in the upstream kernel.
> 
> Could you point me to those devices? I cannot find them in the mainline
> DT. There are no GPU devices which register Energy Model (EM) in
> upstream, neither using DT (which would be power in mW) nor explicitly
> providing EM get_power() callback. The EM is needed to have IPA.
> 
> Please clarify which existing devices are going to be broken with this
> change.

I was thinking about arch/arm64/boot/dts/qcom/sc7180-trogdor-*, and
potentially other SC7180 boards that use IPA for the CPU thermal
zones.

Initially SC7180 used an abstract scale for the CPU energy model,
however I realized your change doesn't actually impact SC7180 CPUs
for two reasons:

1. The energy model of the CPUs is registered through

  cpufreq_register_em_with_opp
    dev_pm_opp_of_register_em
      em_dev_register_perf_domain

em_dev_register_perf_domain() is called with 'milliwatts = true',
regardless of the potentially abstract scale, so IPA would not be
disabled with your change.

2. There is a patch from Doug that adjusted the dynamic power
coefficients of the CPUs to be closer to reality:

commit 82ea7d411d43f60dce878252558e926f957109f0
Author: Douglas Anderson <dianders@chromium.org>
Date:   Thu Sep 2 14:51:37 2021 -0700

    arm64: dts: qcom: sc7180: Base dynamic CPU power coefficients in reality

> > Wasn't the conclusion that it is the responsability of the device tree
> > owners to ensure that cooling devices with different scales aren't used
> > in the same thermal zone?
> 
> It's based on assumption that someone has DT and control. There was also
> implicit assumption that IPA would work properly on such platform - but
> it won't.
> 
> 1. You cannot have 2 thermal zones: one with CPUs and other with GPU
> only and both working with two instances of IPA.

It's not clear to me why such a configuration wouldn't work. Is it also a
problem to have multiple CPU thermal zones (one for each core) that use
multiple instances of IPA? SC7180 has separate thermal zones for each core
(or thermal sensor), Chrome OS uses IPA for CPU thermal throttling.

> 2. The abstract power scale doesn't guaranty anything about power values
> and IPA was simply designed with milli-Watts in mind. So even working
> on CPUs only using bogoWatts, is not what we could guaranty in IPA.

That's bad news for SoCs that are configured with bogoWatt values, from
past discussions I had the impression that this is unfortunately not
uncommon.

> It's ugly to have the abstract scales in the first place, but that's
> unfortunately what we currently have for at least some cooling devices.

> A few questions:
>
> Do you use 'we' as Chrome engineers?

I was referring to the kernel, in particular QCOM SC7180.

> Could you point me to those devices please?

arch/arm64/boot/dts/qcom/sc7180-trogdor-*

Though as per above they shouldn't be impacted by your change, since the
CPUs always pretend to use milli-Watts.

[skipped some questions/answers since sc7180 isn't actually impacted by
 the change]

Thanks

Matthias
Lukasz Luba Feb. 9, 2022, 11:16 a.m. UTC | #3
On 2/8/22 5:25 PM, Matthias Kaehlcke wrote:
> On Tue, Feb 08, 2022 at 09:32:28AM +0000, Lukasz Luba wrote:
>>
>>
>> On 2/8/22 12:50 AM, Matthias Kaehlcke wrote:
>>> On Mon, Feb 07, 2022 at 07:30:35AM +0000, Lukasz Luba wrote:
>>>> The Energy Model supports power values either in Watts or in some abstract
>>>> scale. When the 2nd option is in use, the thermal governor IPA should not
>>>> be allowed to operate, since the relation between cooling devices is not
>>>> properly defined. Thus, it might be possible that big GPU has lower power
>>>> values in abstract scale than a Little CPU. To mitigate a misbehaviour
>>>> of the thermal control algorithm, simply not register a cooling device
>>>> capable of working with IPA.
>>>
>>> Ugh, this would break thermal throttling for existing devices that are
>>> currently supported in the upstream kernel.
>>
>> Could you point me to those devices? I cannot find them in the mainline
>> DT. There are no GPU devices which register Energy Model (EM) in
>> upstream, neither using DT (which would be power in mW) nor explicitly
>> providing EM get_power() callback. The EM is needed to have IPA.
>>
>> Please clarify which existing devices are going to be broken with this
>> change.
> 
> I was thinking about arch/arm64/boot/dts/qcom/sc7180-trogdor-*, and
> potentially other SC7180 boards that use IPA for the CPU thermal
> zones.
> 
> Initially SC7180 used an abstract scale for the CPU energy model,
> however I realized your change doesn't actually impact SC7180 CPUs
> for two reasons:
> 
> 1. The energy model of the CPUs is registered through
> 
>    cpufreq_register_em_with_opp
>      dev_pm_opp_of_register_em
>        em_dev_register_perf_domain
> 
> em_dev_register_perf_domain() is called with 'milliwatts = true',
> regardless of the potentially abstract scale, so IPA would not be
> disabled with your change.

That's a good registration path.

> 
> 2. There is a patch from Doug that adjusted the dynamic power
> coefficients of the CPUs to be closer to reality:
> 
> commit 82ea7d411d43f60dce878252558e926f957109f0
> Author: Douglas Anderson <dianders@chromium.org>
> Date:   Thu Sep 2 14:51:37 2021 -0700
> 
>      arm64: dts: qcom: sc7180: Base dynamic CPU power coefficients in reality

Thanks for the link to the commit. I'll have a look. It's good that it
doesn't break this board.

> 
>>> Wasn't the conclusion that it is the responsability of the device tree
>>> owners to ensure that cooling devices with different scales aren't used
>>> in the same thermal zone?
>>
>> It's based on assumption that someone has DT and control. There was also
>> implicit assumption that IPA would work properly on such platform - but
>> it won't.
>>
>> 1. You cannot have 2 thermal zones: one with CPUs and other with GPU
>> only and both working with two instances of IPA.
> 
> It's not clear to me why such a configuration wouldn't work. Is it also a
> problem to have multiple CPU thermal zones (one for each core) that use
> multiple instances of IPA? SC7180 has separate thermal zones for each core
> (or thermal sensor), Chrome OS uses IPA for CPU thermal throttling.

Unfortunately, the control mechanism does not work at its best in such a
setup. The main idea of IPA is to find the 'best' OPP for a cooling
device, e.g. one in the middle of a freq table. The worst scenario is
when the decision oscillates between the lowest and highest OPP over
short periods: we burn too much power at the highest freq, then throttle
too hard, then unthrottle because the temperature has dropped by some
'good enough' value. This situation can happen for a few reasons:
1. Power values from the EM do not reflect reality, e.g. are scaled too much
2. We don't have proper information about CPU load and the frequencies used
3. PID coefficients are not properly set
4. The board has little physical heat-sinking capability but the silicon
    is 'hot'
5. The workload is too dynamic
6. There is another power-hungry device (GPU, DSP, ISP) which is outside
    of our control loop, e.g. it is controlled in another thermal zone and
    impacts our temperature sensor value

A proper setup and tuning of IPA can bring quite a good benefit, because
it can pick the 'best at that moment' OPPs for the devices, instead
of throttling too hard and then unthrottling. Unfortunately, it's
tricky and needs time (experiments, understanding the system).

We have been trying to ease this entry barrier for developers. The
optimization of the PID coefficients brings very good results: an
automated mechanism figures out the best values for a given board based
on test runs. It's under review now, and there are also shared results
[1][2]. It would solve some of the issues with complex tuning.

I was going to give it a try for my old experiments with the bogoWatts
devices, but I don't expect it to be a silver bullet. Regarding
the 'two thermal zones with IPA' issues, I'm pretty sure it won't help.

> 
>> 2. The abstract power scale doesn't guaranty anything about power values
>> and IPA was simply designed with milli-Watts in mind. So even working
>> on CPUs only using bogoWatts, is not what we could guaranty in IPA.
> 
> That's bad news for SoCs that are configured with bogoWatt values, from
> past discussions I had the impression that this is unfortunately not
> uncommon.
> 
>> It's ugly to have the abstract scales in the first place, but that's
>> unfortunately what we currently have for at least some cooling devices.
> 
>> A few questions:
>>
>> Do you use 'we' as Chrome engineers?
> 
> I was referring to the kernel, in particular QCOM SC7180.

Thanks for sharing the name.

> 
>> Could you point me to those devices please?
> 
> arch/arm64/boot/dts/qcom/sc7180-trogdor-*
> 
> Though as per above they shouldn't be impacted by your change, since the
> CPUs always pretend to use milli-Watts.
> 
> [skipped some questions/answers since sc7180 isn't actually impacted by
>   the change]

Thank you Matthias. I will investigate your setup to get better
understanding.

Regards,
Lukasz

> 
> Thanks
> 
> Matthias
> 

[1] https://nbviewer.org/gist/chemis01/0e86ad81508860659a57338dae8274f9
[2] https://lore.kernel.org/all/20211217184907.2103677-1-chetan.mistry@arm.com/
Matthias Kaehlcke Feb. 9, 2022, 10:17 p.m. UTC | #4
On Wed, Feb 09, 2022 at 11:16:36AM +0000, Lukasz Luba wrote:
> 
> 
> On 2/8/22 5:25 PM, Matthias Kaehlcke wrote:
> > On Tue, Feb 08, 2022 at 09:32:28AM +0000, Lukasz Luba wrote:
> > > 
> > > 
> > > On 2/8/22 12:50 AM, Matthias Kaehlcke wrote:
> > > > On Mon, Feb 07, 2022 at 07:30:35AM +0000, Lukasz Luba wrote:
> > > > > The Energy Model supports power values either in Watts or in some abstract
> > > > > scale. When the 2nd option is in use, the thermal governor IPA should not
> > > > > be allowed to operate, since the relation between cooling devices is not
> > > > > properly defined. Thus, it might be possible that big GPU has lower power
> > > > > values in abstract scale than a Little CPU. To mitigate a misbehaviour
> > > > > of the thermal control algorithm, simply not register a cooling device
> > > > > capable of working with IPA.
> > > > 
> > > > Ugh, this would break thermal throttling for existing devices that are
> > > > currently supported in the upstream kernel.
> > > 
> > > Could you point me to those devices? I cannot find them in the mainline
> > > DT. There are no GPU devices which register Energy Model (EM) in
> > > upstream, neither using DT (which would be power in mW) nor explicitly
> > > providing EM get_power() callback. The EM is needed to have IPA.
> > > 
> > > Please clarify which existing devices are going to be broken with this
> > > change.
> > 
> > I was thinking about arch/arm64/boot/dts/qcom/sc7180-trogdor-*, and
> > potentially other SC7180 boards that use IPA for the CPU thermal
> > zones.
> > 
> > Initially SC7180 used an abstract scale for the CPU energy model,
> > however I realized your change doesn't actually impact SC7180 CPUs
> > for two reasons:
> > 
> > 1. The energy model of the CPUs is registered through
> > 
> >    cpufreq_register_em_with_opp
> >      dev_pm_opp_of_register_em
> >        em_dev_register_perf_domain
> > 
> > em_dev_register_perf_domain() is called with 'milliwatts = true',
> > regardless of the potentially abstract scale, so IPA would not be
> > disabled with your change.
> 
> That good registration.
> 
> > 
> > 2. There is a patch from Doug that adjusted the dynamic power
> > coefficients of the CPUs to be closer to reality:
> > 
> > commit 82ea7d411d43f60dce878252558e926f957109f0
> > Author: Douglas Anderson <dianders@chromium.org>
> > Date:   Thu Sep 2 14:51:37 2021 -0700
> > 
> >      arm64: dts: qcom: sc7180: Base dynamic CPU power coefficients in reality
> 
> Thanks for the link to the commit. I'll have a look. It's good that it
> doesn't break this board.
> 
> > 
> > > > Wasn't the conclusion that it is the responsability of the device tree
> > > > owners to ensure that cooling devices with different scales aren't used
> > > > in the same thermal zone?
> > > 
> > > It's based on assumption that someone has DT and control. There was also
> > > implicit assumption that IPA would work properly on such platform - but
> > > it won't.
> > > 
> > > 1. You cannot have 2 thermal zones: one with CPUs and other with GPU
> > > only and both working with two instances of IPA.
> > 
> > It's not clear to me why such a configuration wouldn't work. Is it also a
> > problem to have multiple CPU thermal zones (one for each core) that use
> > multiple instances of IPA? SC7180 has separate thermal zones for each core
> > (or thermal sensor), Chrome OS uses IPA for CPU thermal throttling.
> 
> Unfortunately, the control mechanism is not working best in such setup.
> Main idea of IPA is to find 'best' OPP for a cooling device, e.g.
> one in the middle of a freq table. Worst scenario is when we change
> our decision between lowest and highest OPP, in short periods.
> It's burning too much power at highest freq, then we throttle too much,
> then we unthrottle because temp is dropping by some 'good enough' value.
> This situation can happen due to a few reasons:
> 1. Power values from EM are not reflecting reality, e.g. scaled too much
> 2. We don't have proper information about CPU load and frequencies used
> 3. PID coefficients are not properly set
> 4. board has small physical thermal sink potential but silicon if 'hot'
> 5. the workload is too dynamic
> 6. there is another power hungry device (GPU, DSP, ISP) which is outside
>    of our control loop, e.g. is controlled in other thermal zone and has
>    impact on our temp sensor value

Thanks for the details!

> Proper setup and tuning of IPA could bring quite good benefit, because
> it could pick the 'best at that moment' OPPs for the devices, instead
> of throttling too hard and then unthrottling. Unfortunately, it's
> tricky and needs time (experiments, understanding the system).

On some sc7180 ChromeOS boards we observed a behavior like the one you
describe; we mitigated it by tuning the 'sustainable-power' device tree
property and the trip point temperatures.

> We have been trying to easy this entry barrier for developers. Very good
> results brings the optimization of the PID coefficients. The automated
> mechanism would figure out best values for the given board based on the
> tests run. It's under review now, there are also shared results [1][2].
> It would solve a some of the issues with complex tuning.

Good to know, thanks for the pointer!

> I was going to give it a try for my old experiments with the bogoWatts
> devices, but I don't expect that this is a silver bullet. Regarding
> the 'two thermal zones with IPA issues' I'm prity sure it won't help.
> 
> > 
> > > 2. The abstract power scale doesn't guaranty anything about power values
> > > and IPA was simply designed with milli-Watts in mind. So even working
> > > on CPUs only using bogoWatts, is not what we could guaranty in IPA.
> > 
> > That's bad news for SoCs that are configured with bogoWatt values, from
> > past discussions I had the impression that this is unfortunately not
> > uncommon.
> > 
> > > It's ugly to have the abstract scales in the first place, but that's
> > > unfortunately what we currently have for at least some cooling devices.
> > 
> > > A few questions:
> > > 
> > > Do you use 'we' as Chrome engineers?
> > 
> > I was referring to the kernel, in particular QCOM SC7180.
> 
> Thanks for sharing the name.
> 
> > 
> > > Could you point me to those devices please?
> > 
> > arch/arm64/boot/dts/qcom/sc7180-trogdor-*
> > 
> > Though as per above they shouldn't be impacted by your change, since the
> > CPUs always pretend to use milli-Watts.
> > 
> > [skipped some questions/answers since sc7180 isn't actually impacted by
> >   the change]
> 
> Thank you Matthias. I will investigate your setup to get better
> understanding.

Thanks!
Lukasz Luba Feb. 16, 2022, 3:35 p.m. UTC | #5
Hi Matthias,

On 2/9/22 10:17 PM, Matthias Kaehlcke wrote:
> On Wed, Feb 09, 2022 at 11:16:36AM +0000, Lukasz Luba wrote:
>>
>>
>> On 2/8/22 5:25 PM, Matthias Kaehlcke wrote:
>>> On Tue, Feb 08, 2022 at 09:32:28AM +0000, Lukasz Luba wrote:
>>>>
>>>>

[snip]

>>>> Could you point me to those devices please?
>>>
>>> arch/arm64/boot/dts/qcom/sc7180-trogdor-*
>>>
>>> Though as per above they shouldn't be impacted by your change, since the
>>> CPUs always pretend to use milli-Watts.
>>>
>>> [skipped some questions/answers since sc7180 isn't actually impacted by
>>>    the change]
>>
>> Thank you Matthias. I will investigate your setup to get better
>> understanding.
> 
> Thanks!
> 

I've checked those DT files and the related code.
As you already said, this patch is safe for them.
So we can apply it, IMO.


-------------Off-topic------------------
Not in $subject comments:

AFAICS, based on the two files which define thermal zones:
sc7180-trogdor-homestar.dtsi
sc7180-trogdor-coachz.dtsi

only the 'big' cores are used as cooling devices in
'skin_temp_thermal' - CPU6 and CPU7.

I assume you don't want to model the power usage of the little
cluster at all (which is quite big: 6 CPUs), do you?
I can see that the little CPUs have a small dyn-power-coeff,
~30% of the big cores', and a lower max freq, but it still might
be worth adding them to IPA. You might give them more 'weight', to
make sure they receive more power during the power split.

You also don't have a GPU cooling device in that thermal zone.
Based on my experience, if your GPU is a power-hungry one,
e.g. 2-4 Watts, you might get better results when you model
this 'hot' device (which impacts your temp sensor's reported value).

Regards,
Lukasz
Doug Anderson Feb. 16, 2022, 5:21 p.m. UTC | #6
Hi,

On Tue, Feb 8, 2022 at 1:32 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
>
> >    Another important thing is the consistent scale of the power values
> >    provided by the cooling devices. All of the cooling devices in a single
> >    thermal zone should have power values reported either in milli-Watts
> >    or scaled to the same 'abstract scale'.
>
> This can change. We have removed the userspace governor from kernel
> recently. The trend is to implement thermal policy in FW. Dealing with
> some intermediate configurations are causing complicated design, support
> of the algorithm logic is also more complex.

One thing that didn't get addressed is the whole "The trend is to
implement thermal policy in FW". I'm not sure I can get on board with
that trend. IMO "moving to FW" isn't a super great trend. FW is harder
to update than the kernel, and trying to keep it in sync with the kernel
isn't wonderful. Unless something _has_ to be in FW, I personally
prefer it to be in the kernel.

...although now that I re-read this, I'm not sure which firmware you
might be talking about. Is this the AP firmware, or some companion
chip / coprocessor? Even so, I'd still rather see things done in the
kernel when possible...

-Doug
Doug Anderson Feb. 16, 2022, 5:33 p.m. UTC | #7
Hi,

On Wed, Feb 16, 2022 at 7:35 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
>
> Hi Matthias,
>
> On 2/9/22 10:17 PM, Matthias Kaehlcke wrote:
> > On Wed, Feb 09, 2022 at 11:16:36AM +0000, Lukasz Luba wrote:
> >>
> >>
> >> On 2/8/22 5:25 PM, Matthias Kaehlcke wrote:
> >>> On Tue, Feb 08, 2022 at 09:32:28AM +0000, Lukasz Luba wrote:
> >>>>
> >>>>
>
> [snip]
>
> >>>> Could you point me to those devices please?
> >>>
> >>> arch/arm64/boot/dts/qcom/sc7180-trogdor-*
> >>>
> >>> Though as per above they shouldn't be impacted by your change, since the
> >>> CPUs always pretend to use milli-Watts.
> >>>
> >>> [skipped some questions/answers since sc7180 isn't actually impacted by
> >>>    the change]
> >>
> >> Thank you Matthias. I will investigate your setup to get better
> >> understanding.
> >
> > Thanks!
> >
>
> I've checked those DT files and related code.
> As you already said, this patch is safe for them.
> So we can apply it IMO.
>
>
> -------------Off-topic------------------
> Not in $subject comments:
>
> AFAICS based on two files which define thermal zones:
> sc7180-trogdor-homestar.dtsi
> sc7180-trogdor-coachz.dtsi
>
> only the 'big' cores are used as cooling devices in the
> 'skin_temp_thermal' - the CPU6 and CPU7.
>
> I assume you don't want to model at all the power usage
> from the Little cluster (which is quite big: 6 CPUs), do you?
> I can see that the Little CPUs have small dyn-power-coeff
> ~30% of the big and lower max freq, but still might be worth
> to add them to IPA. You might give them more 'weight', to
> make sure they receive more power during power split.
>
> You also don't have GPU cooling device in that thermal zone.
> Based on my experience if your GPU is a power hungry one,
> e.g. 2-4Watts, you might get better results when you model
> this 'hot' device (which impacts your temp sensor reported value).

I think the two boards you point at (homestar and coachz) are just the
two that override the default defined in the SoC dtsi file. If you
look in sc7180.dtsi you'll see 'gpuss1-thermal' which has a cooling
map. You can also see the cooling maps for the littles.

I guess we don't have a `dynamic-power-coefficient` for the GPU,
though? Seems like we should, but I haven't dug through all the code
here...

-Doug
Matthias Kaehlcke Feb. 16, 2022, 10:13 p.m. UTC | #8
On Wed, Feb 16, 2022 at 09:33:50AM -0800, Doug Anderson wrote:
> Hi,
> 
> On Wed, Feb 16, 2022 at 7:35 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
> >
> > Hi Matthias,
> >
> > On 2/9/22 10:17 PM, Matthias Kaehlcke wrote:
> > > On Wed, Feb 09, 2022 at 11:16:36AM +0000, Lukasz Luba wrote:
> > >>
> > >>
> > >> On 2/8/22 5:25 PM, Matthias Kaehlcke wrote:
> > >>> On Tue, Feb 08, 2022 at 09:32:28AM +0000, Lukasz Luba wrote:
> > >>>>
> > >>>>
> >
> > [snip]
> >
> > >>>> Could you point me to those devices please?
> > >>>
> > >>> arch/arm64/boot/dts/qcom/sc7180-trogdor-*
> > >>>
> > >>> Though as per above they shouldn't be impacted by your change, since the
> > >>> CPUs always pretend to use milli-Watts.
> > >>>
> > >>> [skipped some questions/answers since sc7180 isn't actually impacted by
> > >>>    the change]
> > >>
> > >> Thank you Matthias. I will investigate your setup to get better
> > >> understanding.
> > >
> > > Thanks!
> > >
> >
> > I've checked those DT files and related code.
> > As you already said, this patch is safe for them.
> > So we can apply it IMO.
> >
> >
> > -------------Off-topic------------------
> > Not in $subject comments:
> >
> > AFAICS based on two files which define thermal zones:
> > sc7180-trogdor-homestar.dtsi
> > sc7180-trogdor-coachz.dtsi
> >
> > only the 'big' cores are used as cooling devices in the
> > 'skin_temp_thermal' - the CPU6 and CPU7.
> >
> > I assume you don't want to model at all the power usage
> > from the Little cluster (which is quite big: 6 CPUs), do you?
> > I can see that the Little CPUs have small dyn-power-coeff
> > ~30% of the big and lower max freq, but still might be worth
> > to add them to IPA. You might give them more 'weight', to
> > make sure they receive more power during power split.

In experiments we saw that including the little cores as cooling
devices for 'skin_temp_thermal' didn't have a significant impact on
thermals, so we left them out.

> > You also don't have GPU cooling device in that thermal zone.
> > Based on my experience if your GPU is a power hungry one,
> > e.g. 2-4Watts, you might get better results when you model
> > this 'hot' device (which impacts your temp sensor reported value).
> 
> I think the two boards you point at (homestar and coachz) are just the
> two that override the default defined in the SoC dtsi file. If you
> look in sc7180.dtsi you'll see 'gpuss1-thermal' which has a cooling
> map. You can also see the cooling maps for the littles.

Yep, plus thermal zones with cooling maps for the big cores.

> I guess we don't have a `dynamic-power-coefficient` for the GPU,
> though? Seems like we should, but I haven't dug through all the code
> here...

To my knowledge the SC7x80 GPU doesn't register an energy model, which is
one of the reasons the GPU wasn't included as a cooling device for
'skin_temp_thermal'.
Lukasz Luba Feb. 16, 2022, 10:43 p.m. UTC | #9
On 2/16/22 10:13 PM, Matthias Kaehlcke wrote:
> On Wed, Feb 16, 2022 at 09:33:50AM -0800, Doug Anderson wrote:
>> Hi,
>>
>> On Wed, Feb 16, 2022 at 7:35 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
>>>
>>> Hi Matthias,
>>>
>>> On 2/9/22 10:17 PM, Matthias Kaehlcke wrote:
>>>> On Wed, Feb 09, 2022 at 11:16:36AM +0000, Lukasz Luba wrote:
>>>>>
>>>>>
>>>>> On 2/8/22 5:25 PM, Matthias Kaehlcke wrote:
>>>>>> On Tue, Feb 08, 2022 at 09:32:28AM +0000, Lukasz Luba wrote:
>>>>>>>
>>>>>>>
>>>
>>> [snip]
>>>
>>>>>>> Could you point me to those devices please?
>>>>>>
>>>>>> arch/arm64/boot/dts/qcom/sc7180-trogdor-*
>>>>>>
>>>>>> Though as per above they shouldn't be impacted by your change, since the
>>>>>> CPUs always pretend to use milli-Watts.
>>>>>>
>>>>>> [skipped some questions/answers since sc7180 isn't actually impacted by
>>>>>>     the change]
>>>>>
>>>>> Thank you Matthias. I will investigate your setup to get better
>>>>> understanding.
>>>>
>>>> Thanks!
>>>>
>>>
>>> I've checked those DT files and related code.
>>> As you already said, this patch is safe for them.
>>> So we can apply it IMO.
>>>
>>>
>>> -------------Off-topic------------------
>>> Not in $subject comments:
>>>
>>> AFAICS based on two files which define thermal zones:
>>> sc7180-trogdor-homestar.dtsi
>>> sc7180-trogdor-coachz.dtsi
>>>
>>> only the 'big' cores are used as cooling devices in the
>>> 'skin_temp_thermal' - the CPU6 and CPU7.
>>>
>>> I assume you don't want to model at all the power usage
>>> from the Little cluster (which is quite big: 6 CPUs), do you?
>>> I can see that the Little CPUs have small dyn-power-coeff
>>> ~30% of the big and lower max freq, but still might be worth
>>> to add them to IPA. You might give them more 'weight', to
>>> make sure they receive more power during power split.
> 
> In experiments we saw that including the little cores as cooling
> devices for 'skin_temp_thermal' didn't have a significant impact on
> thermals, so we left them out.
> 
>>> You also don't have GPU cooling device in that thermal zone.
>>> Based on my experience if your GPU is a power hungry one,
>>> e.g. 2-4Watts, you might get better results when you model
>>> this 'hot' device (which impacts your temp sensor reported value).
>>
>> I think the two boards you point at (homestar and coachz) are just the
>> two that override the default defined in the SoC dtsi file. If you
>> look in sc7180.dtsi you'll see 'gpuss1-thermal' which has a cooling
>> map. You can also see the cooling maps for the littles.
> 
> Yep, plus thermal zones with cooling maps for the big cores.
> 
>> I guess we don't have a `dynamic-power-coefficient` for the GPU,
>> though? Seems like we should, but I haven't dug through all the code
>> here...
> 
> To my knowledge the SC7x80 GPU doesn't register an energy model, which is
> one of the reasons the GPU wasn't included as cooling device for
> 'skin_temp_thermal'.
> 

You can give it a try by editing the DT and adding the
'dynamic-power-coefficient' property to the GPU node, plus probably
a small modification in the driver code.

If the GPU driver registered the cooling device in the new way, you
would also get the EM registered, thanks to the new devfreq cooling
code (commit: 84e0d87c9944eb36ae6037a).

You can check an example from the Panfrost GPU driver [1].

I can see an upstream MSM GPU driver, but I don't know whether that is
your GPU driver. It registers devfreq cooling the 'old' way [2],
but it would be easy to change it to use the new function.
The GPU driver would use the same dev_pm_opp_of_register_em() as
your CPUs do, so the EM would be in milli-Watts (so it should be fine).


[1] https://elixir.bootlin.com/linux/v5.16/source/drivers/gpu/drm/panfrost/panfrost_devfreq.c#L153
[2] https://elixir.bootlin.com/linux/v5.16/source/drivers/gpu/drm/msm/msm_gpu_devfreq.c#L129
Lukasz Luba Feb. 16, 2022, 11:28 p.m. UTC | #10
On 2/16/22 5:21 PM, Doug Anderson wrote:
> Hi,
> 
> On Tue, Feb 8, 2022 at 1:32 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
>>
>>>     Another important thing is the consistent scale of the power values
>>>     provided by the cooling devices. All of the cooling devices in a single
>>>     thermal zone should have power values reported either in milli-Watts
>>>     or scaled to the same 'abstract scale'.
>>
>> This can change. We have removed the userspace governor from kernel
>> recently. The trend is to implement thermal policy in FW. Dealing with
>> some intermediate configurations are causing complicated design, support
>> of the algorithm logic is also more complex.
> 
> One thing that didn't get addressed is the whole "The trend is to
> implement thermal policy in FW". I'm not sure I can get on board with
> that trend. IMO "moving to FW" isn't a super great trend. FW is harder
> to update than kernel and trying to keep it in sync with the kernel
> isn't wonderful. Unless something _has_ to be in FW I personally
> prefer it to be in the kernel.

There are pros and cons to both approaches (as always).

However, there are some use cases where the kernel is not able to
react fast enough, e.g. sudden power usage changes, which can cause
the power rail to fall outside its required operating conditions.
When we are talking about tough requirements for those power & thermal
policies, the mechanism must be fast, precise and reliable.

Here you can find the Arm reference FW implementation, with an IPA
clone in it (I have been reviewing this) [1][2].

As you can see there is a new FW feature set:
"MPMM, Traffic-cop and Thermal management".

Apart from the Arm implementation, there are already known thermal
monitoring mechanisms in HW/FW, like in the new Qcom SoCs, which
use this driver code [3]. The driver receives an interrupt
about throttling conditions and simply populates the thermal pressure.

> 
> ...although now that I re-read this, I'm not sure which firmware you
> might be talking about. Is this the AP firmware, or some companion
> chip / coprocessor? Even so, I'd still rather see things done in the
> kernel when possible...

It's FW running on a dedicated microprocessor; in Arm SoCs it's usually
some Cortex-M. We communicate with it from the kernel via the SCMI
drivers (using shared memory and mailboxes). We recommend using the
SCMI protocol to send e.g. a 'performance request' to the FW via a
'fast channel', instead of implementing the PMIC and clock handling
and doing the voltage & frequency change in the kernel (using drivers
& locking). That approach avoids costly locking and lets the task
scheduler go directly via the SCMI cpufreq driver [4] and the SCMI
perf layer [5]. We don't need a dedicated 'sugov' kthread in the
Deadline policy to do that work and preempt the currently running task.

IMHO the FW approach opens new opportunities.

Regards,
Lukasz

[1] https://github.com/ARM-software/SCP-firmware/pull/588
[2] https://github.com/ARM-software/SCP-firmware/pull/588/commits/59c62ead5eb66353ae805c367bfa86192e28c410
[3] https://elixir.bootlin.com/linux/v5.17-rc4/source/drivers/cpufreq/qcom-cpufreq-hw.c#L287
[4] https://elixir.bootlin.com/linux/latest/source/drivers/cpufreq/scmi-cpufreq.c#L65
[5] https://elixir.bootlin.com/linux/v5.17-rc4/source/drivers/firmware/arm_scmi/perf.c#L465
Matthias Kaehlcke Feb. 17, 2022, 12:26 a.m. UTC | #11
On Wed, Feb 16, 2022 at 10:43:34PM +0000, Lukasz Luba wrote:
> 
> 
> On 2/16/22 10:13 PM, Matthias Kaehlcke wrote:
> > On Wed, Feb 16, 2022 at 09:33:50AM -0800, Doug Anderson wrote:
> > > Hi,
> > > 
> > > On Wed, Feb 16, 2022 at 7:35 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
> > > > 
> > > > Hi Matthias,
> > > > 
> > > > On 2/9/22 10:17 PM, Matthias Kaehlcke wrote:
> > > > > On Wed, Feb 09, 2022 at 11:16:36AM +0000, Lukasz Luba wrote:
> > > > > > 
> > > > > > 
> > > > > > On 2/8/22 5:25 PM, Matthias Kaehlcke wrote:
> > > > > > > On Tue, Feb 08, 2022 at 09:32:28AM +0000, Lukasz Luba wrote:
> > > > > > > > 
> > > > > > > > 
> > > > 
> > > > [snip]
> > > > 
> > > > > > > > Could you point me to those devices please?
> > > > > > > 
> > > > > > > arch/arm64/boot/dts/qcom/sc7180-trogdor-*
> > > > > > > 
> > > > > > > Though as per above they shouldn't be impacted by your change, since the
> > > > > > > CPUs always pretend to use milli-Watts.
> > > > > > > 
> > > > > > > [skipped some questions/answers since sc7180 isn't actually impacted by
> > > > > > >     the change]
> > > > > > 
> > > > > > Thank you Matthias. I will investigate your setup to get better
> > > > > > understanding.
> > > > > 
> > > > > Thanks!
> > > > > 
> > > > 
> > > > I've checked those DT files and related code.
> > > > As you already said, this patch is safe for them.
> > > > So we can apply it IMO.
> > > > 
> > > > 
> > > > -------------Off-topic------------------
> > > > Not in $subject comments:
> > > > 
> > > > AFAICS based on two files which define thermal zones:
> > > > sc7180-trogdor-homestar.dtsi
> > > > sc7180-trogdor-coachz.dtsi
> > > > 
> > > > only the 'big' cores are used as cooling devices in the
> > > > 'skin_temp_thermal' - the CPU6 and CPU7.
> > > > 
> > > > I assume you don't want to model at all the power usage
> > > > from the Little cluster (which is quite big: 6 CPUs), do you?
> > > > I can see that the Little CPUs have small dyn-power-coeff
> > > > ~30% of the big and lower max freq, but still might be worth
> > > > to add them to IPA. You might give them more 'weight', to
> > > > make sure they receive more power during power split.
> > 
> > In experiments we saw that including the little cores as cooling
> > devices for 'skin_temp_thermal' didn't have a significant impact on
> > thermals, so we left them out.
> > 
> > > > You also don't have GPU cooling device in that thermal zone.
> > > > Based on my experience if your GPU is a power hungry one,
> > > > e.g. 2-4Watts, you might get better results when you model
> > > > this 'hot' device (which impacts your temp sensor reported value).
> > > 
> > > I think the two boards you point at (homestar and coachz) are just the
> > > two that override the default defined in the SoC dtsi file. If you
> > > look in sc7180.dtsi you'll see 'gpuss1-thermal' which has a cooling
> > > map. You can also see the cooling maps for the littles.
> > 
> > Yep, plus thermal zones with cooling maps for the big cores.
> > 
> > > I guess we don't have a `dynamic-power-coefficient` for the GPU,
> > > though? Seems like we should, but I haven't dug through all the code
> > > here...
> > 
> > To my knowledge the SC7x80 GPU doesn't register an energy model, which is
> > one of the reasons the GPU wasn't included as cooling device for
> > 'skin_temp_thermal'.
> > 
> 
> You can give it a try by editing the DT and adding in the
> GPU node the 'dynamic-power-coefficient' + probably
> small modification in the driver code.
> 
> If the GPU driver registers the cooling device in the new way, you
> would also get EM registered thanks to the devfreq cooling new code
> (commit: 84e0d87c9944eb36ae6037a).
> 
> You can check an example from Panfrost GPU driver [1].

Ah, I missed that, thanks for the pointer!

> I can see some upstream MSM GPU driver, but I don't know if that is
> your GPU driver. It registers the 'old' way the devfreq cooling [2]
> but it would be easy to change to use the new function.
> The GPU driver would use the same dev_pm_opp_of_register_em() as
> your CPUs do, so EM would be in 'milli-Watts' (so should be fine).

Yep, that's what we are using.
Daniel Lezcano Feb. 17, 2022, 9:59 a.m. UTC | #12
On 16/02/2022 23:13, Matthias Kaehlcke wrote:
> On Wed, Feb 16, 2022 at 09:33:50AM -0800, Doug Anderson wrote:
>> Hi,
>>
>> On Wed, Feb 16, 2022 at 7:35 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
>>>
>>> Hi Matthias,
>>>
>>> On 2/9/22 10:17 PM, Matthias Kaehlcke wrote:
>>>> On Wed, Feb 09, 2022 at 11:16:36AM +0000, Lukasz Luba wrote:
>>>>>
>>>>>
>>>>> On 2/8/22 5:25 PM, Matthias Kaehlcke wrote:
>>>>>> On Tue, Feb 08, 2022 at 09:32:28AM +0000, Lukasz Luba wrote:
>>>>>>>
>>>>>>>
>>>
>>> [snip]
>>>
>>>>>>> Could you point me to those devices please?
>>>>>>
>>>>>> arch/arm64/boot/dts/qcom/sc7180-trogdor-*
>>>>>>
>>>>>> Though as per above they shouldn't be impacted by your change, since the
>>>>>> CPUs always pretend to use milli-Watts.
>>>>>>
>>>>>> [skipped some questions/answers since sc7180 isn't actually impacted by
>>>>>>     the change]
>>>>>
>>>>> Thank you Matthias. I will investigate your setup to get better
>>>>> understanding.
>>>>
>>>> Thanks!
>>>>
>>>
>>> I've checked those DT files and related code.
>>> As you already said, this patch is safe for them.
>>> So we can apply it IMO.
>>>
>>>
>>> -------------Off-topic------------------
>>> Not in $subject comments:
>>>
>>> AFAICS based on two files which define thermal zones:
>>> sc7180-trogdor-homestar.dtsi
>>> sc7180-trogdor-coachz.dtsi
>>>
>>> only the 'big' cores are used as cooling devices in the
>>> 'skin_temp_thermal' - the CPU6 and CPU7.
>>>
>>> I assume you don't want to model at all the power usage
>>> from the Little cluster (which is quite big: 6 CPUs), do you?
>>> I can see that the Little CPUs have small dyn-power-coeff
>>> ~30% of the big and lower max freq, but still might be worth
>>> to add them to IPA. You might give them more 'weight', to
>>> make sure they receive more power during power split.
> 
> In experiments we saw that including the little cores as cooling
> devices for 'skin_temp_thermal' didn't have a significant impact on
> thermals, so we left them out.

I agree, that was also my conclusion after doing some measurements.

Basically, the little cores are always cold and are victims of the big
cores' heat dissipation.

They of course contribute a bit to the heat, but capping their
performance does not change the temperature trend as a whole.

That is less true with silver-gold configurations, where it is the
same micro-arch but different frequencies.


>>> You also don't have GPU cooling device in that thermal zone.
>>> Based on my experience if your GPU is a power hungry one,
>>> e.g. 2-4Watts, you might get better results when you model
>>> this 'hot' device (which impacts your temp sensor reported value).
>>
>> I think the two boards you point at (homestar and coachz) are just the
>> two that override the default defined in the SoC dtsi file. If you
>> look in sc7180.dtsi you'll see 'gpuss1-thermal' which has a cooling
>> map. You can also see the cooling maps for the littles.
> 
> Yep, plus thermal zones with cooling maps for the big cores.
> 
>> I guess we don't have a `dynamic-power-coefficient` for the GPU,
>> though? Seems like we should, but I haven't dug through all the code
>> here...
> 
> To my knowledge the SC7x80 GPU doesn't register an energy model, which is
> one of the reasons the GPU wasn't included as cooling device for
> 'skin_temp_thermal'.
Daniel Lezcano Feb. 17, 2022, 10:10 a.m. UTC | #13
On 16/02/2022 18:33, Doug Anderson wrote:
> Hi,
> 
> On Wed, Feb 16, 2022 at 7:35 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
>>
>> Hi Matthias,
>>
>> On 2/9/22 10:17 PM, Matthias Kaehlcke wrote:
>>> On Wed, Feb 09, 2022 at 11:16:36AM +0000, Lukasz Luba wrote:
>>>>
>>>>
>>>> On 2/8/22 5:25 PM, Matthias Kaehlcke wrote:
>>>>> On Tue, Feb 08, 2022 at 09:32:28AM +0000, Lukasz Luba wrote:
>>>>>>
>>>>>>
>>
>> [snip]
>>
>>>>>> Could you point me to those devices please?
>>>>>
>>>>> arch/arm64/boot/dts/qcom/sc7180-trogdor-*
>>>>>
>>>>> Though as per above they shouldn't be impacted by your change, since the
>>>>> CPUs always pretend to use milli-Watts.
>>>>>
>>>>> [skipped some questions/answers since sc7180 isn't actually impacted by
>>>>>     the change]
>>>>
>>>> Thank you Matthias. I will investigate your setup to get better
>>>> understanding.
>>>
>>> Thanks!
>>>
>>
>> I've checked those DT files and related code.
>> As you already said, this patch is safe for them.
>> So we can apply it IMO.
>>
>>
>> -------------Off-topic------------------
>> Not in $subject comments:
>>
>> AFAICS based on two files which define thermal zones:
>> sc7180-trogdor-homestar.dtsi
>> sc7180-trogdor-coachz.dtsi
>>
>> only the 'big' cores are used as cooling devices in the
>> 'skin_temp_thermal' - the CPU6 and CPU7.
>>
>> I assume you don't want to model at all the power usage
>> from the Little cluster (which is quite big: 6 CPUs), do you?
>> I can see that the Little CPUs have small dyn-power-coeff
>> ~30% of the big and lower max freq, but still might be worth
>> to add them to IPA. You might give them more 'weight', to
>> make sure they receive more power during power split.
>>
>> You also don't have GPU cooling device in that thermal zone.
>> Based on my experience if your GPU is a power hungry one,
>> e.g. 2-4Watts, you might get better results when you model
>> this 'hot' device (which impacts your temp sensor reported value).
> 
> I think the two boards you point at (homestar and coachz) are just the
> two that override the default defined in the SoC dtsi file. If you
> look in sc7180.dtsi you'll see 'gpuss1-thermal' which has a cooling
> map. You can also see the cooling maps for the littles.
> 
> I guess we don't have a `dynamic-power-coefficient` for the GPU,
> though? Seems like we should, but I haven't dug through all the code
> here...

The dynamic-power-coefficient is available for OPPs, which covers both
cpufreq and devfreq. As the GPU is managed by devfreq, setting the
dynamic-power-coefficient makes the energy model available for it.

However, the OPPs must define both the frequency and the voltage. That
is the case for most platforms, but not on the QCom platform.

The voltage may not be specified there because the platform uses a
frequency index and the hardware does the voltage change behind our
back. The QCom cpufreq backend gets the voltage table from a register
(or wherever) and fills in the voltage values for the OPPs, thus
adding the information which is missing in the device tree. The energy
model can then initialize itself and allow the usage of the Energy
Aware Scheduler.

However, this piece of code is missing for the GPU part.
Daniel Lezcano Feb. 17, 2022, 10:15 a.m. UTC | #14
On 16/02/2022 23:43, Lukasz Luba wrote:
> 
> 
> On 2/16/22 10:13 PM, Matthias Kaehlcke wrote:
>> On Wed, Feb 16, 2022 at 09:33:50AM -0800, Doug Anderson wrote:
>>> Hi,
>>>
>>> On Wed, Feb 16, 2022 at 7:35 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
>>>>
>>>> Hi Matthias,
>>>>
>>>> On 2/9/22 10:17 PM, Matthias Kaehlcke wrote:
>>>>> On Wed, Feb 09, 2022 at 11:16:36AM +0000, Lukasz Luba wrote:
>>>>>>
>>>>>>
>>>>>> On 2/8/22 5:25 PM, Matthias Kaehlcke wrote:
>>>>>>> On Tue, Feb 08, 2022 at 09:32:28AM +0000, Lukasz Luba wrote:
>>>>>>>>
>>>>>>>>
>>>>
>>>> [snip]
>>>>
>>>>>>>> Could you point me to those devices please?
>>>>>>>
>>>>>>> arch/arm64/boot/dts/qcom/sc7180-trogdor-*
>>>>>>>
>>>>>>> Though as per above they shouldn't be impacted by your change, 
>>>>>>> since the
>>>>>>> CPUs always pretend to use milli-Watts.
>>>>>>>
>>>>>>> [skipped some questions/answers since sc7180 isn't actually 
>>>>>>> impacted by
>>>>>>>     the change]
>>>>>>
>>>>>> Thank you Matthias. I will investigate your setup to get better
>>>>>> understanding.
>>>>>
>>>>> Thanks!
>>>>>
>>>>
>>>> I've checked those DT files and related code.
>>>> As you already said, this patch is safe for them.
>>>> So we can apply it IMO.
>>>>
>>>>
>>>> -------------Off-topic------------------
>>>> Not in $subject comments:
>>>>
>>>> AFAICS based on two files which define thermal zones:
>>>> sc7180-trogdor-homestar.dtsi
>>>> sc7180-trogdor-coachz.dtsi
>>>>
>>>> only the 'big' cores are used as cooling devices in the
>>>> 'skin_temp_thermal' - the CPU6 and CPU7.
>>>>
>>>> I assume you don't want to model at all the power usage
>>>> from the Little cluster (which is quite big: 6 CPUs), do you?
>>>> I can see that the Little CPUs have small dyn-power-coeff
>>>> ~30% of the big and lower max freq, but still might be worth
>>>> to add them to IPA. You might give them more 'weight', to
>>>> make sure they receive more power during power split.
>>
>> In experiments we saw that including the little cores as cooling
>> devices for 'skin_temp_thermal' didn't have a significant impact on
>> thermals, so we left them out.
>>
>>>> You also don't have GPU cooling device in that thermal zone.
>>>> Based on my experience if your GPU is a power hungry one,
>>>> e.g. 2-4Watts, you might get better results when you model
>>>> this 'hot' device (which impacts your temp sensor reported value).
>>>
>>> I think the two boards you point at (homestar and coachz) are just the
>>> two that override the default defined in the SoC dtsi file. If you
>>> look in sc7180.dtsi you'll see 'gpuss1-thermal' which has a cooling
>>> map. You can also see the cooling maps for the littles.
>>
>> Yep, plus thermal zones with cooling maps for the big cores.
>>
>>> I guess we don't have a `dynamic-power-coefficient` for the GPU,
>>> though? Seems like we should, but I haven't dug through all the code
>>> here...
>>
>> To my knowledge the SC7x80 GPU doesn't register an energy model, which is
>> one of the reasons the GPU wasn't included as cooling device for
>> 'skin_temp_thermal'.
>>
> 
> You can give it a try by editing the DT and adding in the
> GPU node the 'dynamic-power-coefficient' + probably
> small modification in the driver code.

That should not work for sc7180, as the DT does not specify the voltage
values. The QCom backend would have to retrieve the voltage values for
each OPP from a register or from a hardcoded table coming from the SoC
documentation.

However, I can confirm that dynamic-power-coefficient will add an
energy model for the GPU (or any devfreq device) if the voltage is
specified in the OPP description. Checked with the panfrost driver and
the memory controller on rk3399.

> If the GPU driver registers the cooling device in the new way, you
> would also get EM registered thanks to the devfreq cooling new code
> (commit: 84e0d87c9944eb36ae6037a).
> 
> You can check an example from Panfrost GPU driver [1].
> 
> I can see some upstream MSM GPU driver, but I don't know if that is
> your GPU driver. It registers the 'old' way the devfreq cooling [2]
> but it would be easy to change to use the new function.
> The GPU driver would use the same dev_pm_opp_of_register_em() as
> your CPUs do, so EM would be in 'milli-Watts' (so should be fine).
> 
> 
> [1] 
> https://elixir.bootlin.com/linux/v5.16/source/drivers/gpu/drm/panfrost/panfrost_devfreq.c#L153 
> 
> [2] 
> https://elixir.bootlin.com/linux/v5.16/source/drivers/gpu/drm/msm/msm_gpu_devfreq.c#L129 
>
Lukasz Luba Feb. 17, 2022, 10:47 a.m. UTC | #15
Hi Daniel,

On 2/17/22 10:10 AM, Daniel Lezcano wrote:
> On 16/02/2022 18:33, Doug Anderson wrote:
>> Hi,
>>
>> On Wed, Feb 16, 2022 at 7:35 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
>>>
>>> Hi Matthias,
>>>
>>> On 2/9/22 10:17 PM, Matthias Kaehlcke wrote:
>>>> On Wed, Feb 09, 2022 at 11:16:36AM +0000, Lukasz Luba wrote:
>>>>>
>>>>>
>>>>> On 2/8/22 5:25 PM, Matthias Kaehlcke wrote:
>>>>>> On Tue, Feb 08, 2022 at 09:32:28AM +0000, Lukasz Luba wrote:
>>>>>>>
>>>>>>>
>>>
>>> [snip]
>>>
>>>>>>> Could you point me to those devices please?
>>>>>>
>>>>>> arch/arm64/boot/dts/qcom/sc7180-trogdor-*
>>>>>>
>>>>>> Though as per above they shouldn't be impacted by your change, 
>>>>>> since the
>>>>>> CPUs always pretend to use milli-Watts.
>>>>>>
>>>>>> [skipped some questions/answers since sc7180 isn't actually 
>>>>>> impacted by
>>>>>>     the change]
>>>>>
>>>>> Thank you Matthias. I will investigate your setup to get better
>>>>> understanding.
>>>>
>>>> Thanks!
>>>>
>>>
>>> I've checked those DT files and related code.
>>> As you already said, this patch is safe for them.
>>> So we can apply it IMO.
>>>
>>>
>>> -------------Off-topic------------------
>>> Not in $subject comments:
>>>
>>> AFAICS based on two files which define thermal zones:
>>> sc7180-trogdor-homestar.dtsi
>>> sc7180-trogdor-coachz.dtsi
>>>
>>> only the 'big' cores are used as cooling devices in the
>>> 'skin_temp_thermal' - the CPU6 and CPU7.
>>>
>>> I assume you don't want to model at all the power usage
>>> from the Little cluster (which is quite big: 6 CPUs), do you?
>>> I can see that the Little CPUs have small dyn-power-coeff
>>> ~30% of the big and lower max freq, but still might be worth
>>> to add them to IPA. You might give them more 'weight', to
>>> make sure they receive more power during power split.
>>>
>>> You also don't have GPU cooling device in that thermal zone.
>>> Based on my experience if your GPU is a power hungry one,
>>> e.g. 2-4Watts, you might get better results when you model
>>> this 'hot' device (which impacts your temp sensor reported value).
>>
>> I think the two boards you point at (homestar and coachz) are just the
>> two that override the default defined in the SoC dtsi file. If you
>> look in sc7180.dtsi you'll see 'gpuss1-thermal' which has a cooling
>> map. You can also see the cooling maps for the littles.
>>
>> I guess we don't have a `dynamic-power-coefficient` for the GPU,
>> though? Seems like we should, but I haven't dug through all the code
>> here...
> 
> The dynamic-power-coefficient is available for OPPs which includes 
> CPUfreq and devfreq. As the GPU is managed by devfreq, setting the 
> dynamic-power-coefficient makes the energy model available for it.
> 
> However, the OPPs must define the frequency and the voltage. That is the 
> case for most platforms except on QCom platform.
> 
> That may not be specified as it uses a frequency index and the hardware 
> does the voltage change in our back. The QCom cpufreq backend get the 
> voltage table from a register (or whatever) and completes the voltage 
> values for the OPPs, thus adding the information which is missing in the 
> device tree. The energy model can then initializes itself and allows the 
> usage of the Energy Aware Scheduler.
> 
> However this piece of code is missing for the GPU part.
> 

Thank you for joining the discussion. I didn't know that the Qcom
GPU voltage information is missing.

If the voltage is not available (only the frequencies), there is
another way: the 'advanced' EM, which uses the registration function
em_dev_register_perf_domain(). It uses a local driver callback to get
the power for each discovered frequency. It has the benefit that there
is no restriction to 'fit' the math formula; instead, plain average
power values can be fed into the EM. It's called the 'advanced' EM [1].

Now we hit (again) the DT & EM issue (it's an old one; IIRC Morten
was proposing this upstream from ~2014, but EAS wasn't merged back
then): where to store these power-frequency values, which are then
used by the callback. We have the 'dynamic-power-coefficient' in DT,
but it has limitations. It would be good to have this simple array
attached to the GPU/CPU node. IMHO it meets the requirements of DT:
it describes the HW (it would have Hz and Watts values).

Doug, Matthias, could you have a look at that function and its
usage, please [1]?
If you guys would support me in this, I would start a discussion
on LKML with an RFC proposal.


[1] https://elixir.bootlin.com/linux/v5.17-rc4/source/Documentation/power/energy-model.rst#L87
Daniel Lezcano Feb. 17, 2022, 11:28 a.m. UTC | #16
On 17/02/2022 11:47, Lukasz Luba wrote:
> Hi Daniel,
> 
> On 2/17/22 10:10 AM, Daniel Lezcano wrote:
>> On 16/02/2022 18:33, Doug Anderson wrote:
>>> Hi,
>>>
>>> On Wed, Feb 16, 2022 at 7:35 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
>>>>
>>>> Hi Matthias,
>>>>
>>>> On 2/9/22 10:17 PM, Matthias Kaehlcke wrote:
>>>>> On Wed, Feb 09, 2022 at 11:16:36AM +0000, Lukasz Luba wrote:
>>>>>>
>>>>>>
>>>>>> On 2/8/22 5:25 PM, Matthias Kaehlcke wrote:
>>>>>>> On Tue, Feb 08, 2022 at 09:32:28AM +0000, Lukasz Luba wrote:
>>>>>>>>
>>>>>>>>
>>>>
>>>> [snip]
>>>>
>>>>>>>> Could you point me to those devices please?
>>>>>>>
>>>>>>> arch/arm64/boot/dts/qcom/sc7180-trogdor-*
>>>>>>>
>>>>>>> Though as per above they shouldn't be impacted by your change, 
>>>>>>> since the
>>>>>>> CPUs always pretend to use milli-Watts.
>>>>>>>
>>>>>>> [skipped some questions/answers since sc7180 isn't actually 
>>>>>>> impacted by
>>>>>>>     the change]
>>>>>>
>>>>>> Thank you Matthias. I will investigate your setup to get better
>>>>>> understanding.
>>>>>
>>>>> Thanks!
>>>>>
>>>>
>>>> I've checked those DT files and related code.
>>>> As you already said, this patch is safe for them.
>>>> So we can apply it IMO.
>>>>
>>>>
>>>> -------------Off-topic------------------
>>>> Not in $subject comments:
>>>>
>>>> AFAICS based on two files which define thermal zones:
>>>> sc7180-trogdor-homestar.dtsi
>>>> sc7180-trogdor-coachz.dtsi
>>>>
>>>> only the 'big' cores are used as cooling devices in the
>>>> 'skin_temp_thermal' - the CPU6 and CPU7.
>>>>
>>>> I assume you don't want to model at all the power usage
>>>> from the Little cluster (which is quite big: 6 CPUs), do you?
>>>> I can see that the Little CPUs have small dyn-power-coeff
>>>> ~30% of the big and lower max freq, but still might be worth
>>>> to add them to IPA. You might give them more 'weight', to
>>>> make sure they receive more power during power split.
>>>>
>>>> You also don't have GPU cooling device in that thermal zone.
>>>> Based on my experience if your GPU is a power hungry one,
>>>> e.g. 2-4Watts, you might get better results when you model
>>>> this 'hot' device (which impacts your temp sensor reported value).
>>>
>>> I think the two boards you point at (homestar and coachz) are just the
>>> two that override the default defined in the SoC dtsi file. If you
>>> look in sc7180.dtsi you'll see 'gpuss1-thermal' which has a cooling
>>> map. You can also see the cooling maps for the littles.
>>>
>>> I guess we don't have a `dynamic-power-coefficient` for the GPU,
>>> though? Seems like we should, but I haven't dug through all the code
>>> here...
>>
>> The dynamic-power-coefficient is available for OPPs which includes 
>> CPUfreq and devfreq. As the GPU is managed by devfreq, setting the 
>> dynamic-power-coefficient makes the energy model available for it.
>>
>> However, the OPPs must define the frequency and the voltage. That is 
>> the case for most platforms except on QCom platform.
>>
>> That may not be specified as it uses a frequency index and the 
>> hardware does the voltage change in our back. The QCom cpufreq backend 
>> get the voltage table from a register (or whatever) and completes the 
>> voltage values for the OPPs, thus adding the information which is 
>> missing in the device tree. The energy model can then initializes 
>> itself and allows the usage of the Energy Aware Scheduler.
>>
>> However this piece of code is missing for the GPU part.
>>
> 
> Thank you for joining the discussion. I don't know about that Qcom
> GPU voltage information is missing.
> 
> If the voltage is not available (only the frequencies), there is
> another way. There is an 'advanced' EM which uses registration function:
> em_dev_register_perf_domain(). It uses a local driver callback to get
> power for each found frequency. It has benefit because there is no
> restriction to 'fit' into the math formula, instead just avg power
> values can be feed into EM. It's called 'advanced' EM [1].
> 
> Now we hit (again) the DT & EM issue (it's an old one, IIRC Morten
> was proposing from ~2014 this upstream, but EAS wasn't merged back
> then):
> where to store these power-freq values, which are then used by the
> callback. 

Why not make it more generic and replace the frequency with a
performance index, so it can be used by any kind of perf limiter?

> We have the 'dynamic-power-coefficient' in DT, but
> it has limitations. It would be good to have this simple array
> attached to the GPU/CPU node. IMHO it meet the requirement of DT,
> it describes the HW (it would have HZ and Watts values).
> 
> Doug, Matthias could you have a look at that function and its
> usage, please [1]?
> If you guys would support me in this, I would start, with an RFC
> proposal, a discussion on LKML.
> 
> 
> [1] 
> https://elixir.bootlin.com/linux/v5.17-rc4/source/Documentation/power/energy-model.rst#L87 
>
Lukasz Luba Feb. 17, 2022, 12:11 p.m. UTC | #17
On 2/17/22 11:28 AM, Daniel Lezcano wrote:
> On 17/02/2022 11:47, Lukasz Luba wrote:
>> Hi Daniel,
>>
>> On 2/17/22 10:10 AM, Daniel Lezcano wrote:
>>> On 16/02/2022 18:33, Doug Anderson wrote:
>>>> Hi,
>>>>
>>>> On Wed, Feb 16, 2022 at 7:35 AM Lukasz Luba <lukasz.luba@arm.com> 
>>>> wrote:
>>>>>
>>>>> Hi Matthias,
>>>>>
>>>>> On 2/9/22 10:17 PM, Matthias Kaehlcke wrote:
>>>>>> On Wed, Feb 09, 2022 at 11:16:36AM +0000, Lukasz Luba wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 2/8/22 5:25 PM, Matthias Kaehlcke wrote:
>>>>>>>> On Tue, Feb 08, 2022 at 09:32:28AM +0000, Lukasz Luba wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>
>>>>> [snip]
>>>>>
>>>>>>>>> Could you point me to those devices please?
>>>>>>>>
>>>>>>>> arch/arm64/boot/dts/qcom/sc7180-trogdor-*
>>>>>>>>
>>>>>>>> Though as per above they shouldn't be impacted by your change, 
>>>>>>>> since the
>>>>>>>> CPUs always pretend to use milli-Watts.
>>>>>>>>
>>>>>>>> [skipped some questions/answers since sc7180 isn't actually 
>>>>>>>> impacted by
>>>>>>>>     the change]
>>>>>>>
>>>>>>> Thank you Matthias. I will investigate your setup to get better
>>>>>>> understanding.
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>
>>>>> I've checked those DT files and related code.
>>>>> As you already said, this patch is safe for them.
>>>>> So we can apply it IMO.
>>>>>
>>>>>
>>>>> -------------Off-topic------------------
>>>>> Not in $subject comments:
>>>>>
>>>>> AFAICS based on two files which define thermal zones:
>>>>> sc7180-trogdor-homestar.dtsi
>>>>> sc7180-trogdor-coachz.dtsi
>>>>>
>>>>> only the 'big' cores are used as cooling devices in the
>>>>> 'skin_temp_thermal' - the CPU6 and CPU7.
>>>>>
>>>>> I assume you don't want to model at all the power usage
>>>>> from the Little cluster (which is quite big: 6 CPUs), do you?
>>>>> I can see that the Little CPUs have small dyn-power-coeff
>>>>> ~30% of the big and lower max freq, but still might be worth
>>>>> to add them to IPA. You might give them more 'weight', to
>>>>> make sure they receive more power during power split.
>>>>>
>>>>> You also don't have GPU cooling device in that thermal zone.
>>>>> Based on my experience if your GPU is a power hungry one,
>>>>> e.g. 2-4Watts, you might get better results when you model
>>>>> this 'hot' device (which impacts your temp sensor reported value).
>>>>
>>>> I think the two boards you point at (homestar and coachz) are just the
>>>> two that override the default defined in the SoC dtsi file. If you
>>>> look in sc7180.dtsi you'll see 'gpuss1-thermal' which has a cooling
>>>> map. You can also see the cooling maps for the littles.
>>>>
>>>> I guess we don't have a `dynamic-power-coefficient` for the GPU,
>>>> though? Seems like we should, but I haven't dug through all the code
>>>> here...
>>>
>>> The dynamic-power-coefficient is available for OPPs which includes 
>>> CPUfreq and devfreq. As the GPU is managed by devfreq, setting the 
>>> dynamic-power-coefficient makes the energy model available for it.
>>>
>>> However, the OPPs must define the frequency and the voltage. That is 
>>> the case for most platforms except on QCom platform.
>>>
>>> That may not be specified as it uses a frequency index and the 
>>> hardware does the voltage change in our back. The QCom cpufreq 
>>> backend get the voltage table from a register (or whatever) and 
>>> completes the voltage values for the OPPs, thus adding the 
>>> information which is missing in the device tree. The energy model can 
>>> then initializes itself and allows the usage of the Energy Aware 
>>> Scheduler.
>>>
>>> However this piece of code is missing for the GPU part.
>>>
>>
>> Thank you for joining the discussion. I don't know about that Qcom
>> GPU voltage information is missing.
>>
>> If the voltage is not available (only the frequencies), there is
>> another way. There is an 'advanced' EM which uses registration function:
>> em_dev_register_perf_domain(). It uses a local driver callback to get
>> power for each found frequency. It has benefit because there is no
>> restriction to 'fit' into the math formula, instead just avg power
>> values can be feed into EM. It's called 'advanced' EM [1].
>>
>> Now we hit (again) the DT & EM issue (it's an old one, IIRC Morten
>> was proposing from ~2014 this upstream, but EAS wasn't merged back
>> then):
>> where to store these power-freq values, which are then used by the
>> callback. 
> 
> Why not make it more generic and replace the frequency by a performance 
> index, so it can be used by any kind of perf limiter?

For that DT array, yes, it can be an index, so effectively it could be
a simple 1d array.

something like:

msm_gpu_energy_model: msm-gpu-energy-model {
	compatible = "energy-model";
	/* Power values in micro-Watts, sorted in ascending OPP
	   (performance state) order. The number of entries must
	   match the number of OPPs. */
	power-microwatt = <100000>,
			<230000>,
			<380000>,
			<600000>;
};

then in the GPU node, instead of 'dynamic-power-coefficient'
(which is useless here because the voltage is missing), we would
have 'energy-model', like:

	energy-model = <&msm_gpu_energy_model>;


If you agree to continue this topic, I will send an RFC so we can
discuss this idea further; this $subject doesn't fit it well.

Thank you again for your feedback Daniel!
Daniel Lezcano Feb. 17, 2022, 12:33 p.m. UTC | #18
On 17/02/2022 13:11, Lukasz Luba wrote:

[ ... ]

>> Why not make it more generic and replace the frequency by a 
>> performance index, so it can be used by any kind of perf limiter?
> 
> For that DT array, yes, it can be an index, so effectively it could be
> a simple 1d array.
> 
> something like:
> 
> msm_gpu_energy_model: msm-gpu-energy-model {
>      compatible = "energy-model"
>      /* Values are sorted micro-Watts which correspond to each OPP
>         or performance state. The total amount of them must match
>         number of OPPs. */
>      power-microwatt = <100000>,
>              <230000>,
>              <380000>,
>              <600000>;
> };
> 
> then in gpu node instead of having 'dynamic-power-coefficient',
> which is useless because voltage is missing, we would have
> 'energy-model', like:
> 
>      energy-model = <&msm_gpu_energy_model>;
> 
> 
> If you agree to continue this topic. I will send an RFC so we could
> further discuss this idea. This $subject doesn't fit well.

Yes, I definitely agree to continue on this topic.
Lukasz Luba Feb. 17, 2022, 12:37 p.m. UTC | #19
On 2/17/22 12:33 PM, Daniel Lezcano wrote:
> On 17/02/2022 13:11, Lukasz Luba wrote:
> 
> [ ... ]
> 
>>> Why not make it more generic and replace the frequency by a 
>>> performance index, so it can be used by any kind of perf limiter?
>>
>> For that DT array, yes, it can be an index, so effectively it could be
>> a simple 1d array.
>>
>> something like:
>>
>> msm_gpu_energy_model: msm-gpu-energy-model {
>>      compatible = "energy-model"
>>      /* Values are sorted micro-Watts which correspond to each OPP
>>         or performance state. The total amount of them must match
>>         number of OPPs. */
>>      power-microwatt = <100000>,
>>              <230000>,
>>              <380000>,
>>              <600000>;
>> };
>>
>> then in gpu node instead of having 'dynamic-power-coefficient',
>> which is useless because voltage is missing, we would have
>> 'energy-model', like:
>>
>>      energy-model = <&msm_gpu_energy_model>;
>>
>>
>> If you agree to continue this topic. I will send an RFC so we could
>> further discuss this idea. This $subject doesn't fit well.
> 
> Yes, definitively I agree to continue on this topic.
> 
> 

Great! I'm going to craft something...
Doug Anderson Feb. 17, 2022, 4:37 p.m. UTC | #20
Hi,

On Thu, Feb 17, 2022 at 2:47 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
>
> Hi Daniel,
>
> On 2/17/22 10:10 AM, Daniel Lezcano wrote:
> > On 16/02/2022 18:33, Doug Anderson wrote:
> >> Hi,
> >>
> >> On Wed, Feb 16, 2022 at 7:35 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
> >>>
> >>> Hi Matthias,
> >>>
> >>> On 2/9/22 10:17 PM, Matthias Kaehlcke wrote:
> >>>> On Wed, Feb 09, 2022 at 11:16:36AM +0000, Lukasz Luba wrote:
> >>>>>
> >>>>>
> >>>>> On 2/8/22 5:25 PM, Matthias Kaehlcke wrote:
> >>>>>> On Tue, Feb 08, 2022 at 09:32:28AM +0000, Lukasz Luba wrote:
> >>>>>>>
> >>>>>>>
> >>>
> >>> [snip]
> >>>
> >>>>>>> Could you point me to those devices please?
> >>>>>>
> >>>>>> arch/arm64/boot/dts/qcom/sc7180-trogdor-*
> >>>>>>
> >>>>>> Though as per above they shouldn't be impacted by your change,
> >>>>>> since the
> >>>>>> CPUs always pretend to use milli-Watts.
> >>>>>>
> >>>>>> [skipped some questions/answers since sc7180 isn't actually
> >>>>>> impacted by
> >>>>>>     the change]
> >>>>>
> >>>>> Thank you Matthias. I will investigate your setup to get better
> >>>>> understanding.
> >>>>
> >>>> Thanks!
> >>>>
> >>>
> >>> I've checked those DT files and related code.
> >>> As you already said, this patch is safe for them.
> >>> So we can apply it IMO.
> >>>
> >>>
> >>> -------------Off-topic------------------
> >>> Not in $subject comments:
> >>>
> >>> AFAICS based on two files which define thermal zones:
> >>> sc7180-trogdor-homestar.dtsi
> >>> sc7180-trogdor-coachz.dtsi
> >>>
> >>> only the 'big' cores are used as cooling devices in the
> >>> 'skin_temp_thermal' - the CPU6 and CPU7.
> >>>
> >>> I assume you don't want to model at all the power usage
> >>> from the Little cluster (which is quite big: 6 CPUs), do you?
> >>> I can see that the Little CPUs have small dyn-power-coeff
> >>> ~30% of the big and lower max freq, but still might be worth
> >>> to add them to IPA. You might give them more 'weight', to
> >>> make sure they receive more power during power split.
> >>>
> >>> You also don't have GPU cooling device in that thermal zone.
> >>> Based on my experience if your GPU is a power hungry one,
> >>> e.g. 2-4Watts, you might get better results when you model
> >>> this 'hot' device (which impacts your temp sensor reported value).
> >>
> >> I think the two boards you point at (homestar and coachz) are just the
> >> two that override the default defined in the SoC dtsi file. If you
> >> look in sc7180.dtsi you'll see 'gpuss1-thermal' which has a cooling
> >> map. You can also see the cooling maps for the littles.
> >>
> >> I guess we don't have a `dynamic-power-coefficient` for the GPU,
> >> though? Seems like we should, but I haven't dug through all the code
> >> here...
> >
> > The dynamic-power-coefficient is available for OPPs which includes
> > CPUfreq and devfreq. As the GPU is managed by devfreq, setting the
> > dynamic-power-coefficient makes the energy model available for it.
> >
> > However, the OPPs must define the frequency and the voltage. That is the
> > case for most platforms except on QCom platform.
> >
> > That may not be specified as it uses a frequency index and the hardware
> > does the voltage change in our back. The QCom cpufreq backend get the
> > voltage table from a register (or whatever) and completes the voltage
> > values for the OPPs, thus adding the information which is missing in the
> > device tree. The energy model can then initializes itself and allows the
> > usage of the Energy Aware Scheduler.
> >
> > However this piece of code is missing for the GPU part.
> >
>
> Thank you for joining the discussion. I don't know about that Qcom
> GPU voltage information is missing.
>
> If the voltage is not available (only the frequencies), there is
> another way. There is an 'advanced' EM which uses registration function:
> em_dev_register_perf_domain(). It uses a local driver callback to get
> power for each found frequency. It has benefit because there is no
> restriction to 'fit' into the math formula, instead just avg power
> values can be feed into EM. It's called 'advanced' EM [1].

It seems like there _should_ be a way to get the voltage out for GPU
operating points, like is done with cpufreq in
qcom_cpufreq_hw_read_lut(), but it might need someone with Qualcomm
documentation to help with it. Maybe Rajendra would be able to help?
Adding Jordon and Rob to this conversation in case they're aware of
anything.

As you said, we could just list a power for each frequency, though.

I'm actually not sure which one would be more accurate across a range
of devices with different "corners": specifying a dynamic power
coefficient used for all "corners" and then using the actual voltage
and doing the math, or specifying a power number for each frequency
and ignoring the actual voltage used. In any case we're trying to get
ballpark numbers and not every device will be exactly the same, so
probably it doesn't matter that much.


> Now we hit (again) the DT & EM issue (it's an old one, IIRC Morten
> was proposing from ~2014 this upstream, but EAS wasn't merged back
> then):
> where to store these power-freq values, which are then used by the
> callback. We have the 'dynamic-power-coefficient' in DT, but
> it has limitations. It would be good to have this simple array
> attached to the GPU/CPU node. IMHO it meet the requirement of DT,
> it describes the HW (it would have HZ and Watts values).
>
> Doug, Matthias could you have a look at that function and its
> usage, please [1]?
> If you guys would support me in this, I would start, with an RFC
> proposal, a discussion on LKML.
>
> [1]
> https://elixir.bootlin.com/linux/v5.17-rc4/source/Documentation/power/energy-model.rst#L87

Matthias: I think you've spent more time on the thermal stuff than me
so I'll assume you'll follow-up here. If not then please yell!

Ideally, though, someone from Qualcomm would jump in and own this.
Basically it allows more intelligently throttling the GPU and CPU
together in tandem instead of treating them separately IIUC, right?

-Doug
Matthias Kaehlcke Feb. 17, 2022, 4:46 p.m. UTC | #21
On Thu, Feb 17, 2022 at 12:11:27PM +0000, Lukasz Luba wrote:
> 
> 
> On 2/17/22 11:28 AM, Daniel Lezcano wrote:
> > On 17/02/2022 11:47, Lukasz Luba wrote:
> > > Hi Daniel,
> > > 
> > > On 2/17/22 10:10 AM, Daniel Lezcano wrote:
> > > > On 16/02/2022 18:33, Doug Anderson wrote:
> > > > > Hi,
> > > > > 
> > > > > On Wed, Feb 16, 2022 at 7:35 AM Lukasz Luba
> > > > > <lukasz.luba@arm.com> wrote:
> > > > > > 
> > > > > > Hi Matthias,
> > > > > > 
> > > > > > On 2/9/22 10:17 PM, Matthias Kaehlcke wrote:
> > > > > > > On Wed, Feb 09, 2022 at 11:16:36AM +0000, Lukasz Luba wrote:
> > > > > > > > 
> > > > > > > > 
> > > > > > > > On 2/8/22 5:25 PM, Matthias Kaehlcke wrote:
> > > > > > > > > On Tue, Feb 08, 2022 at 09:32:28AM +0000, Lukasz Luba wrote:
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > 
> > > > > > [snip]
> > > > > > 
> > > > > > > > > > Could you point me to those devices please?
> > > > > > > > > 
> > > > > > > > > arch/arm64/boot/dts/qcom/sc7180-trogdor-*
> > > > > > > > > 
> > > > > > > > > Though as per above they shouldn't be
> > > > > > > > > impacted by your change, since the
> > > > > > > > > CPUs always pretend to use milli-Watts.
> > > > > > > > > 
> > > > > > > > > [skipped some questions/answers since sc7180
> > > > > > > > > isn't actually impacted by
> > > > > > > > >     the change]
> > > > > > > > 
> > > > > > > > Thank you Matthias. I will investigate your setup to get better
> > > > > > > > understanding.
> > > > > > > 
> > > > > > > Thanks!
> > > > > > > 
> > > > > > 
> > > > > > I've checked those DT files and related code.
> > > > > > As you already said, this patch is safe for them.
> > > > > > So we can apply it IMO.
> > > > > > 
> > > > > > 
> > > > > > -------------Off-topic------------------
> > > > > > Not in $subject comments:
> > > > > > 
> > > > > > AFAICS based on two files which define thermal zones:
> > > > > > sc7180-trogdor-homestar.dtsi
> > > > > > sc7180-trogdor-coachz.dtsi
> > > > > > 
> > > > > > only the 'big' cores are used as cooling devices in the
> > > > > > 'skin_temp_thermal' - the CPU6 and CPU7.
> > > > > > 
> > > > > > I assume you don't want to model at all the power usage
> > > > > > from the Little cluster (which is quite big: 6 CPUs), do you?
> > > > > > I can see that the Little CPUs have small dyn-power-coeff
> > > > > > ~30% of the big and lower max freq, but still might be worth
> > > > > > to add them to IPA. You might give them more 'weight', to
> > > > > > make sure they receive more power during power split.
> > > > > > 
> > > > > > You also don't have GPU cooling device in that thermal zone.
> > > > > > Based on my experience if your GPU is a power hungry one,
> > > > > > e.g. 2-4Watts, you might get better results when you model
> > > > > > this 'hot' device (which impacts your temp sensor reported value).
> > > > > 
> > > > > I think the two boards you point at (homestar and coachz) are just the
> > > > > two that override the default defined in the SoC dtsi file. If you
> > > > > look in sc7180.dtsi you'll see 'gpuss1-thermal' which has a cooling
> > > > > map. You can also see the cooling maps for the littles.
> > > > > 
> > > > > I guess we don't have a `dynamic-power-coefficient` for the GPU,
> > > > > though? Seems like we should, but I haven't dug through all the code
> > > > > here...
> > > > 
> > > > The dynamic-power-coefficient is available for OPPs which
> > > > includes CPUfreq and devfreq. As the GPU is managed by devfreq,
> > > > setting the dynamic-power-coefficient makes the energy model
> > > > available for it.
> > > > 
> > > > However, the OPPs must define the frequency and the voltage.
> > > > That is the case for most platforms except on QCom platform.
> > > > 
> > > > That may not be specified as it uses a frequency index and the
> > > > hardware does the voltage change in our back. The QCom cpufreq
> > > > backend get the voltage table from a register (or whatever) and
> > > > completes the voltage values for the OPPs, thus adding the
> > > > information which is missing in the device tree. The energy
> > > > model can then initializes itself and allows the usage of the
> > > > Energy Aware Scheduler.
> > > > 
> > > > However this piece of code is missing for the GPU part.
> > > > 
> > > 
> > > Thank you for joining the discussion. I don't know about that Qcom
> > > GPU voltage information is missing.
> > > 
> > > If the voltage is not available (only the frequencies), there is
> > > another way. There is an 'advanced' EM which uses registration function:
> > > em_dev_register_perf_domain(). It uses a local driver callback to get
> > > power for each found frequency. It has benefit because there is no
> > > restriction to 'fit' into the math formula, instead just avg power
> > > values can be feed into EM. It's called 'advanced' EM [1].
> > > 
> > > Now we hit (again) the DT & EM issue (it's an old one, IIRC Morten
> > > was proposing from ~2014 this upstream, but EAS wasn't merged back
> > > then):
> > > where to store these power-freq values, which are then used by the
> > > callback.
> > 
> > Why not make it more generic and replace the frequency by a performance
> > index, so it can be used by any kind of perf limiter?
> 
> For that DT array, yes, it can be an index, so effectively it could be
> a simple 1d array.
> 
> something like:
> 
> msm_gpu_energy_model: msm-gpu-energy-model {
> 	compatible = "energy-model"
> 	/* Values are sorted micro-Watts which correspond to each OPP
> 	   or performance state. The total amount of them must match
> 	   number of OPPs. */
> 	power-microwatt = <100000>,
> 			<230000>,
> 			<380000>,
> 			<600000>;
> };

IIUC for the QCOM GPU the voltages/power consumption per OPP aren't fixed
but can vary between different SoCs of the same model. If the ranges aren't
too wide it might still be suitable to have a table with the average power
consumption.

Another question is whether QCOM would be willing to provide information
about the GPU power consumption. For the SC7180 CPUs they only provided
bogoWatt numbers. Once boards with a given SoC/GPU are available to the
public someone could come up with such a table based on measurements,
similar to what Doug did for the SC7180 CPUs (commit 82ea7d411d43f).

> then in gpu node instead of having 'dynamic-power-coefficient',
> which is useless because voltage is missing, we would have
> 'energy-model', like:
> 
> 	energy-model = <&msm_gpu_energy_model>;
> 
> 
> If you agree to continue this topic. I will send an RFC so we could
> further discuss this idea. This $subject doesn't fit well.
> 
> Thank you again for your feedback Daniel!

Thanks Lukasz and Daniel for looking into this!

m.
Doug Anderson Feb. 17, 2022, 4:50 p.m. UTC | #22
Hi,

On Wed, Feb 16, 2022 at 3:28 PM Lukasz Luba <lukasz.luba@arm.com> wrote:
>
> On 2/16/22 5:21 PM, Doug Anderson wrote:
> > Hi,
> >
> > On Tue, Feb 8, 2022 at 1:32 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
> >>
> >>>     Another important thing is the consistent scale of the power values
> >>>     provided by the cooling devices. All of the cooling devices in a single
> >>>     thermal zone should have power values reported either in milli-Watts
> >>>     or scaled to the same 'abstract scale'.
> >>
> >> This can change. We have removed the userspace governor from kernel
> >> recently. The trend is to implement thermal policy in FW. Dealing with
> >> some intermediate configurations are causing complicated design, support
> >> of the algorithm logic is also more complex.
> >
> > One thing that didn't get addressed is the whole "The trend is to
> > implement thermal policy in FW". I'm not sure I can get on board with
> > that trend. IMO "moving to FW" isn't a super great trend. FW is harder
> > to update than kernel and trying to keep it in sync with the kernel
> > isn't wonderful. Unless something _has_ to be in FW I personally
> > prefer it to be in the kernel.
>
> There are pros and cons for both approaches (as always).
>
> Although, there are some use cases, where the kernel is not able to
> react that fast, e.g. sudden power usage changes, which can cause
> that the power rail is not able to sustain within required conditions.
> When we are talking about tough requirements for those power & thermal
> policies, the mechanism must be fast, precised and reliable.
>
> Here you can find Arm reference FW implementation and an IPA clone
> in there (I have been reviewing this) [1][2].
>
> As you can see there is a new FW feature set:
> "MPMM, Traffic-cop and Thermal management".
>
> Apart from Arm implementation, there are already known thermal
> monitoring mechanisms in HW/FW. Like in the new Qcom SoCs which
> are using this driver code [3]. The driver receives an interrupt
> about throttling conditions and just populates the thermal pressure.

Yeah, this has come up in another context recently too. Right now on
the Qcom SoCs I'm working with (sc7180 on Chromebooks) we've
essentially disabled all the HW/FW throttling (LMH), preferring to let
Linux manage things. We chose to do it this way with the assumption
that Linux would be able to make better decisions than the firmware
and it was easier to understand / update than an opaque
vendor-provided blob. LMH is still there with super high limits in
case Linux goofs up (we don't want to damage the CPU) but it's not the
primary means of throttling.

As you said, Linux reacts a bit slower, though I've heard that might
be fixed soon-ish? So far on sc7180 Chromebooks it hasn't been a
problem because we have more total thermal mass and the CPUs in sc7180
don't actually generate that much heat compared to other CPUs. We also
have thermal interrupts enabled, which helps. That being said,
improvements are certainly welcome!


> > ...although now that I re-read this, I'm not sure which firmware you
> > might be talking about. Is this the AP firmware, or some companion
> > chip / coprocessor? Even so, I'd still rather see things done in the
> > kernel when possible...
>
> It's a FW run on a dedicated microprocessor. In Arm SoCs it's usually
> some Cortex-M. We communicated with it from the kernel via SCMI drivers
> (using shared memory and mailboxes). We recommend to use the SCMI
> protocol to send e.g. 'performance request' to the FW via 'fast
> channel' instead of having an implementation of PMIC and clock, and do
> the voltage & freq change in the kernel (using drivers & locking). That
> implementation allows to avoid costly locking and allows to go via
> that SCMI cpufreq driver [4] and SCMI perf layer [5] the task scheduler.
> We don't need a dedicated 'sugov' kthread in a Deadline policy to
> do that work and preempt the currently running task.
>
> IMHO the FW approach opens new opportunities.
>
> Regards,
> Lukasz
>
> [1] https://github.com/ARM-software/SCP-firmware/pull/588
> [2]
> https://github.com/ARM-software/SCP-firmware/pull/588/commits/59c62ead5eb66353ae805c367bfa86192e28c410
> [3]
> https://elixir.bootlin.com/linux/v5.17-rc4/source/drivers/cpufreq/qcom-cpufreq-hw.c#L287
> [4]
> https://elixir.bootlin.com/linux/latest/source/drivers/cpufreq/scmi-cpufreq.c#L65
> [5]
> https://elixir.bootlin.com/linux/v5.17-rc4/source/drivers/firmware/arm_scmi/perf.c#L465
Matthias Kaehlcke Feb. 17, 2022, 5:14 p.m. UTC | #23
On Thu, Feb 17, 2022 at 08:37:39AM -0800, Doug Anderson wrote:
> Hi,
> 
> On Thu, Feb 17, 2022 at 2:47 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
> >
> > Hi Daniel,
> >
> > On 2/17/22 10:10 AM, Daniel Lezcano wrote:
> > > On 16/02/2022 18:33, Doug Anderson wrote:
> > >> Hi,
> > >>
> > >> On Wed, Feb 16, 2022 at 7:35 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
> > >>>
> > >>> Hi Matthias,
> > >>>
> > >>> On 2/9/22 10:17 PM, Matthias Kaehlcke wrote:
> > >>>> On Wed, Feb 09, 2022 at 11:16:36AM +0000, Lukasz Luba wrote:
> > >>>>>
> > >>>>>
> > >>>>> On 2/8/22 5:25 PM, Matthias Kaehlcke wrote:
> > >>>>>> On Tue, Feb 08, 2022 at 09:32:28AM +0000, Lukasz Luba wrote:
> > >>>>>>>
> > >>>>>>>
> > >>>
> > >>> [snip]
> > >>>
> > >>>>>>> Could you point me to those devices please?
> > >>>>>>
> > >>>>>> arch/arm64/boot/dts/qcom/sc7180-trogdor-*
> > >>>>>>
> > >>>>>> Though as per above they shouldn't be impacted by your change,
> > >>>>>> since the
> > >>>>>> CPUs always pretend to use milli-Watts.
> > >>>>>>
> > >>>>>> [skipped some questions/answers since sc7180 isn't actually
> > >>>>>> impacted by
> > >>>>>>     the change]
> > >>>>>
> > >>>>> Thank you Matthias. I will investigate your setup to get better
> > >>>>> understanding.
> > >>>>
> > >>>> Thanks!
> > >>>>
> > >>>
> > >>> I've checked those DT files and related code.
> > >>> As you already said, this patch is safe for them.
> > >>> So we can apply it IMO.
> > >>>
> > >>>
> > >>> -------------Off-topic------------------
> > >>> Not in $subject comments:
> > >>>
> > >>> AFAICS based on two files which define thermal zones:
> > >>> sc7180-trogdor-homestar.dtsi
> > >>> sc7180-trogdor-coachz.dtsi
> > >>>
> > >>> only the 'big' cores are used as cooling devices in the
> > >>> 'skin_temp_thermal' - the CPU6 and CPU7.
> > >>>
> > >>> I assume you don't want to model at all the power usage
> > >>> from the Little cluster (which is quite big: 6 CPUs), do you?
> > >>> I can see that the Little CPUs have small dyn-power-coeff
> > >>> ~30% of the big and lower max freq, but still might be worth
> > >>> to add them to IPA. You might give them more 'weight', to
> > >>> make sure they receive more power during power split.
> > >>>
> > >>> You also don't have GPU cooling device in that thermal zone.
> > >>> Based on my experience if your GPU is a power hungry one,
> > >>> e.g. 2-4Watts, you might get better results when you model
> > >>> this 'hot' device (which impacts your temp sensor reported value).
> > >>
> > >> I think the two boards you point at (homestar and coachz) are just the
> > >> two that override the default defined in the SoC dtsi file. If you
> > >> look in sc7180.dtsi you'll see 'gpuss1-thermal' which has a cooling
> > >> map. You can also see the cooling maps for the littles.
> > >>
> > >> I guess we don't have a `dynamic-power-coefficient` for the GPU,
> > >> though? Seems like we should, but I haven't dug through all the code
> > >> here...
> > >
> > > The dynamic-power-coefficient is available for OPPs which includes
> > > CPUfreq and devfreq. As the GPU is managed by devfreq, setting the
> > > dynamic-power-coefficient makes the energy model available for it.
> > >
> > > However, the OPPs must define the frequency and the voltage. That is the
> > > case for most platforms except on QCom platform.
> > >
> > > That may not be specified as it uses a frequency index and the hardware
> > > does the voltage change in our back. The QCom cpufreq backend get the
> > > voltage table from a register (or whatever) and completes the voltage
> > > values for the OPPs, thus adding the information which is missing in the
> > > device tree. The energy model can then initializes itself and allows the
> > > usage of the Energy Aware Scheduler.
> > >
> > > However this piece of code is missing for the GPU part.
> > >
> >
> > Thank you for joining the discussion. I don't know about that Qcom
> > GPU voltage information is missing.
> >
> > If the voltage is not available (only the frequencies), there is
> > another way. There is an 'advanced' EM which uses registration function:
> > em_dev_register_perf_domain(). It uses a local driver callback to get
> > power for each found frequency. It has benefit because there is no
> > restriction to 'fit' into the math formula, instead just avg power
> > values can be feed into EM. It's called 'advanced' EM [1].
> 
> It seems like there _should_ be a way to get the voltage out for GPU
> operating points, like is done with cpufreq in
> qcom_cpufreq_hw_read_lut(), but it might need someone with Qualcomm
> documentation to help with it. Maybe Rajendra would be able to help?
> Adding Jordon and Rob to this conversation in case they're aware of
> anything.
> 
> As you said, we could just list a power for each frequency, though.
> 
> I'm actually not sure which one would be more accurate across a range
> of devices with different "corners": specifying a dynamic power
> coefficient used for all "corners" and then using the actual voltage
> and doing the math, or specifying a power number for each frequency
> and ignoring the actual voltage used. In any case we're trying to get
> ballpark numbers and not every device will be exactly the same, so
> probably it doesn't matter that much.
> 
> 
> > Now we hit (again) the DT & EM issue (it's an old one, IIRC Morten
> > was proposing from ~2014 this upstream, but EAS wasn't merged back
> > then):
> > where to store these power-freq values, which are then used by the
> > callback. We have the 'dynamic-power-coefficient' in DT, but
> > it has limitations. It would be good to have this simple array
> > attached to the GPU/CPU node. IMHO it meet the requirement of DT,
> > it describes the HW (it would have HZ and Watts values).
> >
> > Doug, Matthias could you have a look at that function and its
> > usage, please [1]?
> > If you guys would support me in this, I would start, with an RFC
> > proposal, a discussion on LKML.
> >
> > [1]
> > https://elixir.bootlin.com/linux/v5.17-rc4/source/Documentation/power/energy-model.rst#L87
> 
> Matthias: I think you've spent more time on the thermal stuff than me
> so I'll assume you'll follow-up here. If not then please yell!
> 
> Ideally, though, someone from Qualcomm would jump in an own this.
> Basically it allows more intelligently throttling the GPU and CPU
> together in tandem instead of treating them separately IIUC, right?

Yes, I think for the em_dev_register_perf_domain() route support from
Qualcomm would be needed since "Drivers must provide a callback
function returning <frequency, power> tuples for each performance
state. ".
Lukasz Luba Feb. 17, 2022, 5:58 p.m. UTC | #24
On 2/17/22 4:50 PM, Doug Anderson wrote:
> Hi,
> 
> On Wed, Feb 16, 2022 at 3:28 PM Lukasz Luba <lukasz.luba@arm.com> wrote:
>>
>> On 2/16/22 5:21 PM, Doug Anderson wrote:
>>> Hi,
>>>
>>> On Tue, Feb 8, 2022 at 1:32 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
>>>>
>>>>>      Another important thing is the consistent scale of the power values
>>>>>      provided by the cooling devices. All of the cooling devices in a single
>>>>>      thermal zone should have power values reported either in milli-Watts
>>>>>      or scaled to the same 'abstract scale'.
>>>>
>>>> This can change. We have removed the userspace governor from kernel
>>>> recently. The trend is to implement thermal policy in FW. Dealing with
>>>> some intermediate configurations are causing complicated design, support
>>>> of the algorithm logic is also more complex.
>>>
>>> One thing that didn't get addressed is the whole "The trend is to
>>> implement thermal policy in FW". I'm not sure I can get on board with
>>> that trend. IMO "moving to FW" isn't a super great trend. FW is harder
>>> to update than kernel and trying to keep it in sync with the kernel
>>> isn't wonderful. Unless something _has_ to be in FW I personally
>>> prefer it to be in the kernel.
>>
>> There are pros and cons for both approaches (as always).
>>
>> Although, there are some use cases, where the kernel is not able to
>> react that fast, e.g. sudden power usage changes, which can cause
>> that the power rail is not able to sustain within required conditions.
>> When we are talking about tough requirements for those power & thermal
>> policies, the mechanism must be fast, precised and reliable.
>>
>> Here you can find Arm reference FW implementation and an IPA clone
>> in there (I have been reviewing this) [1][2].
>>
>> As you can see there is a new FW feature set:
>> "MPMM, Traffic-cop and Thermal management".
>>
>> Apart from Arm implementation, there are already known thermal
>> monitoring mechanisms in HW/FW. Like in the new Qcom SoCs which
>> are using this driver code [3]. The driver receives an interrupt
>> about throttling conditions and just populates the thermal pressure.
> 
> Yeah, this has come up in another context recently too. Right on on
> the Qcom SoCs I'm working with (sc7180 on Chromebooks) we've
> essentially disabled all the HW/FW throttling (LMH), preferring to let
> Linux manage things. We chose to do it this way with the assumption
> that Linux would be able to make better decisions than the firmware
> and it was easier to understand / update than an opaque
> vendor-provided blob. LMH is still there with super high limits in
> case Linux goofs up (we don't want to damage the CPU) but it's not the
> primary means of throttling.
> 
> As you said, Linux reacts a bit slower, though I've heard that might
> be fixed soon-ish? So far on sc7180 Chromebooks it hasn't been a
> problem because we have more total thermal mass and the CPUs in sc7180
> don't actually generate that much heat compared to other CPUs. We also
> have thermal interrupts enabled, which helps. That being said,
> improvements are certainly welcome!
> 

Thanks Doug for sharing this with me. I'll keep your requirements,
configuration and usage in mind.
I've learned recently that some SoCs start throttling very early during
boot, even before the cpumask is available. That was causing the kernel
to blow up (a dereference of a NULL cpumask pointer in that LMH driver [1]).

[1] 
https://lore.kernel.org/linux-pm/20220128032554.155132-2-bjorn.andersson@linaro.org/
Lukasz Luba Feb. 17, 2022, 6:05 p.m. UTC | #25
On 2/17/22 5:14 PM, Matthias Kaehlcke wrote:
> On Thu, Feb 17, 2022 at 08:37:39AM -0800, Doug Anderson wrote:
>> Hi,
>>
>> On Thu, Feb 17, 2022 at 2:47 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
>>>
>>> Hi Daniel,
>>>
>>> On 2/17/22 10:10 AM, Daniel Lezcano wrote:
>>>> On 16/02/2022 18:33, Doug Anderson wrote:
>>>>> Hi,
>>>>>
>>>>> On Wed, Feb 16, 2022 at 7:35 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
>>>>>>
>>>>>> Hi Matthias,
>>>>>>
>>>>>> On 2/9/22 10:17 PM, Matthias Kaehlcke wrote:
>>>>>>> On Wed, Feb 09, 2022 at 11:16:36AM +0000, Lukasz Luba wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 2/8/22 5:25 PM, Matthias Kaehlcke wrote:
>>>>>>>>> On Tue, Feb 08, 2022 at 09:32:28AM +0000, Lukasz Luba wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>
>>>>>> [snip]
>>>>>>
>>>>>>>>>> Could you point me to those devices please?
>>>>>>>>>
>>>>>>>>> arch/arm64/boot/dts/qcom/sc7180-trogdor-*
>>>>>>>>>
>>>>>>>>> Though as per above they shouldn't be impacted by your change,
>>>>>>>>> since the
>>>>>>>>> CPUs always pretend to use milli-Watts.
>>>>>>>>>
>>>>>>>>> [skipped some questions/answers since sc7180 isn't actually
>>>>>>>>> impacted by
>>>>>>>>>      the change]
>>>>>>>>
>>>>>>>> Thank you Matthias. I will investigate your setup to get better
>>>>>>>> understanding.
>>>>>>>
>>>>>>> Thanks!
>>>>>>>
>>>>>>
>>>>>> I've checked those DT files and related code.
>>>>>> As you already said, this patch is safe for them.
>>>>>> So we can apply it IMO.
>>>>>>
>>>>>>
>>>>>> -------------Off-topic------------------
>>>>>> Not in $subject comments:
>>>>>>
>>>>>> AFAICS based on two files which define thermal zones:
>>>>>> sc7180-trogdor-homestar.dtsi
>>>>>> sc7180-trogdor-coachz.dtsi
>>>>>>
>>>>>> only the 'big' cores are used as cooling devices in the
>>>>>> 'skin_temp_thermal' - the CPU6 and CPU7.
>>>>>>
>>>>>> I assume you don't want to model at all the power usage
>>>>>> from the Little cluster (which is quite big: 6 CPUs), do you?
>>>>>> I can see that the Little CPUs have small dyn-power-coeff
>>>>>> ~30% of the big and lower max freq, but still might be worth
>>>>>> to add them to IPA. You might give them more 'weight', to
>>>>>> make sure they receive more power during power split.
>>>>>>
>>>>>> You also don't have GPU cooling device in that thermal zone.
>>>>>> Based on my experience if your GPU is a power hungry one,
>>>>>> e.g. 2-4Watts, you might get better results when you model
>>>>>> this 'hot' device (which impacts your temp sensor reported value).
>>>>>
>>>>> I think the two boards you point at (homestar and coachz) are just the
>>>>> two that override the default defined in the SoC dtsi file. If you
>>>>> look in sc7180.dtsi you'll see 'gpuss1-thermal' which has a cooling
>>>>> map. You can also see the cooling maps for the littles.
>>>>>
>>>>> I guess we don't have a `dynamic-power-coefficient` for the GPU,
>>>>> though? Seems like we should, but I haven't dug through all the code
>>>>> here...
>>>>
>>>> The dynamic-power-coefficient is available for OPPs which includes
>>>> CPUfreq and devfreq. As the GPU is managed by devfreq, setting the
>>>> dynamic-power-coefficient makes the energy model available for it.
>>>>
>>>> However, the OPPs must define the frequency and the voltage. That is the
>>>> case for most platforms except on QCom platform.
>>>>
>>>> That may not be specified as it uses a frequency index and the hardware
>>>> does the voltage change in our back. The QCom cpufreq backend get the
>>>> voltage table from a register (or whatever) and completes the voltage
>>>> values for the OPPs, thus adding the information which is missing in the
>>>> device tree. The energy model can then initializes itself and allows the
>>>> usage of the Energy Aware Scheduler.
>>>>
>>>> However this piece of code is missing for the GPU part.
>>>>
>>>
>>> Thank you for joining the discussion. I don't know about that Qcom
>>> GPU voltage information is missing.
>>>
>>> If the voltage is not available (only the frequencies), there is
>>> another way. There is an 'advanced' EM which uses registration function:
>>> em_dev_register_perf_domain(). It uses a local driver callback to get
>>> power for each found frequency. It has benefit because there is no
>>> restriction to 'fit' into the math formula, instead just avg power
>>> values can be feed into EM. It's called 'advanced' EM [1].
>>
>> It seems like there _should_ be a way to get the voltage out for GPU
>> operating points, like is done with cpufreq in
>> qcom_cpufreq_hw_read_lut(), but it might need someone with Qualcomm
>> documentation to help with it. Maybe Rajendra would be able to help?
>> Adding Jordon and Rob to this conversation in case they're aware of
>> anything.
>>
>> As you said, we could just list a power for each frequency, though.
>>
>> I'm actually not sure which one would be more accurate across a range
>> of devices with different "corners": specifying a dynamic power
>> coefficient used for all "corners" and then using the actual voltage
>> and doing the math, or specifying a power number for each frequency
>> and ignoring the actual voltage used. In any case we're trying to get
>> ballpark numbers and not every device will be exactly the same, so
>> probably it doesn't matter that much.
>>
>>
>>> Now we hit (again) the DT & EM issue (it's an old one, IIRC Morten
>>> was proposing from ~2014 this upstream, but EAS wasn't merged back
>>> then):
>>> where to store these power-freq values, which are then used by the
>>> callback. We have the 'dynamic-power-coefficient' in DT, but
>>> it has limitations. It would be good to have this simple array
>>> attached to the GPU/CPU node. IMHO it meet the requirement of DT,
>>> it describes the HW (it would have HZ and Watts values).
>>>
>>> Doug, Matthias could you have a look at that function and its
>>> usage, please [1]?
>>> If you guys would support me in this, I would start, with an RFC
>>> proposal, a discussion on LKML.
>>>
>>> [1]
>>> https://elixir.bootlin.com/linux/v5.17-rc4/source/Documentation/power/energy-model.rst#L87
>>
>> Matthias: I think you've spent more time on the thermal stuff than me
>> so I'll assume you'll follow-up here. If not then please yell!
>>
>> Ideally, though, someone from Qualcomm would jump in an own this.
>> Basically it allows more intelligently throttling the GPU and CPU
>> together in tandem instead of treating them separately IIUC, right?
> 
> Yes, I think for the em_dev_register_perf_domain() route support from
> Qualcomm would be needed since "Drivers must provide a callback
> function returning <frequency, power> tuples for each performance
> state. ".
> 

Not necessarily. It might be done 'generically' by the framework.

There are other benefits of this 'energy-model' entry in the DT.
I'll list them in the cover letter. Let me send an RFC, so we can
discuss them there.

Thanks guys!

Regards,
Lukasz
Lukasz Luba Feb. 17, 2022, 6:18 p.m. UTC | #26
Hi Daniel,


On 2/7/22 7:30 AM, Lukasz Luba wrote:
> The Energy Model supports power values either in Watts or in some abstract
> scale. When the 2nd option is in use, the thermal governor IPA should not
> be allowed to operate, since the relation between cooling devices is not
> properly defined. Thus, it might be possible that big GPU has lower power
> values in abstract scale than a Little CPU. To mitigate a misbehaviour
> of the thermal control algorithm, simply not register a cooling device
> capable of working with IPA.
> 
> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
> ---
>   drivers/thermal/cpufreq_cooling.c |  2 +-
>   drivers/thermal/devfreq_cooling.c | 16 +++++++++++++---
>   2 files changed, 14 insertions(+), 4 deletions(-)

The discussion below this patch went slightly off-topic, but it was
valuable. It also clarified that no existing platforms are broken by
this change.

Could you take the patch into the thermal tree, please?

Regards,
Lukasz
Daniel Lezcano Feb. 17, 2022, 6:27 p.m. UTC | #27
Adding Manaf (QCom) in the loop

On 17/02/2022 17:37, Doug Anderson wrote:
> Hi,
> 
> On Thu, Feb 17, 2022 at 2:47 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
>>
>> Hi Daniel,
>>
>> On 2/17/22 10:10 AM, Daniel Lezcano wrote:
>>> On 16/02/2022 18:33, Doug Anderson wrote:
>>>> Hi,
>>>>
>>>> On Wed, Feb 16, 2022 at 7:35 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
>>>>>
>>>>> Hi Matthias,
>>>>>
>>>>> On 2/9/22 10:17 PM, Matthias Kaehlcke wrote:
>>>>>> On Wed, Feb 09, 2022 at 11:16:36AM +0000, Lukasz Luba wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 2/8/22 5:25 PM, Matthias Kaehlcke wrote:
>>>>>>>> On Tue, Feb 08, 2022 at 09:32:28AM +0000, Lukasz Luba wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>
>>>>> [snip]
>>>>>
>>>>>>>>> Could you point me to those devices please?
>>>>>>>>
>>>>>>>> arch/arm64/boot/dts/qcom/sc7180-trogdor-*
>>>>>>>>
>>>>>>>> Though as per above they shouldn't be impacted by your change,
>>>>>>>> since the
>>>>>>>> CPUs always pretend to use milli-Watts.
>>>>>>>>
>>>>>>>> [skipped some questions/answers since sc7180 isn't actually
>>>>>>>> impacted by
>>>>>>>>      the change]
>>>>>>>
>>>>>>> Thank you Matthias. I will investigate your setup to get better
>>>>>>> understanding.
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>
>>>>> I've checked those DT files and related code.
>>>>> As you already said, this patch is safe for them.
>>>>> So we can apply it IMO.
>>>>>
>>>>>
>>>>> -------------Off-topic------------------
>>>>> Not in $subject comments:
>>>>>
>>>>> AFAICS based on two files which define thermal zones:
>>>>> sc7180-trogdor-homestar.dtsi
>>>>> sc7180-trogdor-coachz.dtsi
>>>>>
>>>>> only the 'big' cores are used as cooling devices in the
>>>>> 'skin_temp_thermal' - the CPU6 and CPU7.
>>>>>
>>>>> I assume you don't want to model at all the power usage
>>>>> from the Little cluster (which is quite big: 6 CPUs), do you?
>>>>> I can see that the Little CPUs have small dyn-power-coeff
>>>>> ~30% of the big and lower max freq, but still might be worth
>>>>> to add them to IPA. You might give them more 'weight', to
>>>>> make sure they receive more power during power split.
>>>>>
>>>>> You also don't have GPU cooling device in that thermal zone.
>>>>> Based on my experience if your GPU is a power hungry one,
>>>>> e.g. 2-4Watts, you might get better results when you model
>>>>> this 'hot' device (which impacts your temp sensor reported value).
>>>>
>>>> I think the two boards you point at (homestar and coachz) are just the
>>>> two that override the default defined in the SoC dtsi file. If you
>>>> look in sc7180.dtsi you'll see 'gpuss1-thermal' which has a cooling
>>>> map. You can also see the cooling maps for the littles.
>>>>
>>>> I guess we don't have a `dynamic-power-coefficient` for the GPU,
>>>> though? Seems like we should, but I haven't dug through all the code
>>>> here...
>>>
>>> The dynamic-power-coefficient is available for OPPs which includes
>>> CPUfreq and devfreq. As the GPU is managed by devfreq, setting the
>>> dynamic-power-coefficient makes the energy model available for it.
>>>
>>> However, the OPPs must define the frequency and the voltage. That is the
>>> case for most platforms except on QCom platform.
>>>
>>> That may not be specified as it uses a frequency index and the hardware
>>> does the voltage change in our back. The QCom cpufreq backend get the
>>> voltage table from a register (or whatever) and completes the voltage
>>> values for the OPPs, thus adding the information which is missing in the
>>> device tree. The energy model can then initializes itself and allows the
>>> usage of the Energy Aware Scheduler.
>>>
>>> However this piece of code is missing for the GPU part.
>>>
>>
>> Thank you for joining the discussion. I don't know about that Qcom
>> GPU voltage information is missing.
>>
>> If the voltage is not available (only the frequencies), there is
>> another way. There is an 'advanced' EM which uses registration function:
>> em_dev_register_perf_domain(). It uses a local driver callback to get
>> power for each found frequency. It has benefit because there is no
>> restriction to 'fit' into the math formula, instead just avg power
>> values can be feed into EM. It's called 'advanced' EM [1].
> 
> It seems like there _should_ be a way to get the voltage out for GPU
> operating points, like is done with cpufreq in
> qcom_cpufreq_hw_read_lut(), but it might need someone with Qualcomm
> documentation to help with it. Maybe Rajendra would be able to help?
> Adding Jordon and Rob to this conversation in case they're aware of
> anything.
> 
> As you said, we could just list a power for each frequency, though.
> 
> I'm actually not sure which one would be more accurate across a range
> of devices with different "corners": specifying a dynamic power
> coefficient used for all "corners" and then using the actual voltage
> and doing the math, or specifying a power number for each frequency
> and ignoring the actual voltage used. In any case we're trying to get
> ballpark numbers and not every device will be exactly the same, so
> probably it doesn't matter that much.
> 
> 
>> Now we hit (again) the DT & EM issue (it's an old one, IIRC Morten
>> was proposing from ~2014 this upstream, but EAS wasn't merged back
>> then):
>> where to store these power-freq values, which are then used by the
>> callback. We have the 'dynamic-power-coefficient' in DT, but
>> it has limitations. It would be good to have this simple array
>> attached to the GPU/CPU node. IMHO it meet the requirement of DT,
>> it describes the HW (it would have HZ and Watts values).
>>
>> Doug, Matthias could you have a look at that function and its
>> usage, please [1]?
>> If you guys would support me in this, I would start, with an RFC
>> proposal, a discussion on LKML.
>>
>> [1]
>> https://elixir.bootlin.com/linux/v5.17-rc4/source/Documentation/power/energy-model.rst#L87
> 
> Matthias: I think you've spent more time on the thermal stuff than me
> so I'll assume you'll follow-up here. If not then please yell!
> 
> Ideally, though, someone from Qualcomm would jump in an own this.
> Basically it allows more intelligently throttling the GPU and CPU
> together in tandem instead of treating them separately IIUC, right?
> 
> -Doug
Lukasz Luba Feb. 22, 2022, 5:05 p.m. UTC | #28
Hi Daniel,

gentle ping

On 2/17/22 18:18, Lukasz Luba wrote:
> Hi Daniel,
> 
> 
> On 2/7/22 7:30 AM, Lukasz Luba wrote:
>> The Energy Model supports power values either in Watts or in some 
>> abstract
>> scale. When the 2nd option is in use, the thermal governor IPA should not
>> be allowed to operate, since the relation between cooling devices is not
>> properly defined. Thus, it might be possible that big GPU has lower power
>> values in abstract scale than a Little CPU. To mitigate a misbehaviour
>> of the thermal control algorithm, simply not register a cooling device
>> capable of working with IPA.
>>
>> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
>> ---
>>   drivers/thermal/cpufreq_cooling.c |  2 +-
>>   drivers/thermal/devfreq_cooling.c | 16 +++++++++++++---
>>   2 files changed, 14 insertions(+), 4 deletions(-)
> 
> The discussion in below this patch went slightly off-topic but it was
> valuable. It clarified also there are no broken platforms with this
> change.
> 
> Could you take the patch into the thermal tree, please?
> 
> Regards,
> Lukasz
Daniel Lezcano Feb. 22, 2022, 6:12 p.m. UTC | #29
Hi Lukasz,

I don't think it makes sense to remove support for the energy model
if the units are abstract.

IIUC, regarding your previous answer, we don't really know what the SoC
vendors will do with these numbers, and they will likely provide
consistent abstract values which won't prevent correct behavior.

What would be the benefit of providing inconsistent abstract values,
which would be unusable, other than a broken energy model?

Your proposed changes would be acceptable if the energy model had a
'broken' flag, IMO.

On 22/02/2022 18:05, Lukasz Luba wrote:
> Hi Daniel,
> 
> gentle ping
> 
> On 2/17/22 18:18, Lukasz Luba wrote:
>> Hi Daniel,
>>
>>
>> On 2/7/22 7:30 AM, Lukasz Luba wrote:
>>> The Energy Model supports power values either in Watts or in some 
>>> abstract
>>> scale. When the 2nd option is in use, the thermal governor IPA should 
>>> not
>>> be allowed to operate, since the relation between cooling devices is not
>>> properly defined. Thus, it might be possible that big GPU has lower 
>>> power
>>> values in abstract scale than a Little CPU. To mitigate a misbehaviour
>>> of the thermal control algorithm, simply not register a cooling device
>>> capable of working with IPA.
>>>
>>> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
>>> ---
>>>   drivers/thermal/cpufreq_cooling.c |  2 +-
>>>   drivers/thermal/devfreq_cooling.c | 16 +++++++++++++---
>>>   2 files changed, 14 insertions(+), 4 deletions(-)
>>
>> The discussion in below this patch went slightly off-topic but it was
>> valuable. It clarified also there are no broken platforms with this
>> change.
>>
>> Could you take the patch into the thermal tree, please?
>>
>> Regards,
>> Lukasz
Lukasz Luba Feb. 22, 2022, 6:31 p.m. UTC | #30
On 2/22/22 18:12, Daniel Lezcano wrote:
> 
> Hi Lukasz,
> 
> I don't think it makes sense to remove the support of the energy model 
> if the units are abstracts.
> 
> IIUC, regarding your previous answer, we don't really know what will do 
> the SoC vendor with these numbers and likely they will provide 
> consistent abstract values which won't prevent a correct behavior.
> 
> What would be the benefit of giving inconsistent abstract values which 
> will be unusable except of giving a broken energy model?

The power values in an EM which uses an abstract scale would make sense
to EAS, but not to IPA or DTPM. Platforms which want to enable EAS, but
don't need IPA, would register such an '<a_good_name_here>' EM.

> 
> Your proposed changes would be acceptable if the energy model has a 
> broken flag IMO

That is doable. I can add that flag, so we could call it an 'artificial'
EM (when this new flag is set).

Let me craft the RFC patch with this new flag proposal then.
Do you agree? Can I also add you as 'Suggested-by'?

Thank you for coming back to me with the comments.
Daniel Lezcano Feb. 22, 2022, 10:10 p.m. UTC | #31
Hi Lukasz,

On 22/02/2022 19:31, Lukasz Luba wrote:
> 
> 
> On 2/22/22 18:12, Daniel Lezcano wrote:
>>
>> Hi Lukasz,
>>
>> I don't think it makes sense to remove the support of the energy model 
>> if the units are abstracts.
>>
>> IIUC, regarding your previous answer, we don't really know what will 
>> do the SoC vendor with these numbers and likely they will provide 
>> consistent abstract values which won't prevent a correct behavior.
>>
>> What would be the benefit of giving inconsistent abstract values which 
>> will be unusable except of giving a broken energy model?
> 
> The power values in the EM which has abstract scale, would make sense to 
> EAS, but not for IPA or DTPM. Those platforms which want to enable EAS,
> but don't need IPA, would register such '<a_good_name_here>' EM.

Sorry, but I don't understand why DTPM cannot deal with abstract values?


>> Your proposed changes would be acceptable if the energy model has a 
>> broken flag IMO
> 
> That is doable. I can add that flag, so we can call it 'artificial' EM
> (when this new flag is set).

It is too soon IMO; I would like to see the numbers first so we can make 
an informed decision. Right now, it is unclear what the numbers will be.


> Let me craft the RFC patch with this new flag proposal then.
> Do you agree? Can I also add you as 'Suggested-by'?
> 
> Thank you for coming back to me with the comments.
Lukasz Luba Feb. 23, 2022, 9:10 a.m. UTC | #32
On 2/22/22 22:10, Daniel Lezcano wrote:
> 
> Hi Lukasz,
> 
> On 22/02/2022 19:31, Lukasz Luba wrote:
>>
>>
>> On 2/22/22 18:12, Daniel Lezcano wrote:
>>>
>>> Hi Lukasz,
>>>
>>> I don't think it makes sense to remove the support of the energy 
>>> model if the units are abstracts.
>>>
>>> IIUC, regarding your previous answer, we don't really know what will 
>>> do the SoC vendor with these numbers and likely they will provide 
>>> consistent abstract values which won't prevent a correct behavior.
>>>
>>> What would be the benefit of giving inconsistent abstract values 
>>> which will be unusable except of giving a broken energy model?
>>
>> The power values in the EM which has abstract scale, would make sense 
>> to EAS, but not for IPA or DTPM. Those platforms which want to enable 
>> EAS,
>> but don't need IPA, would register such '<a_good_name_here>' EM.
> 
> Sorry, but I don't understand why DTPM can not deal with abstract values?

They will be totally meaningless/bogus.

> 
> 
>>> Your proposed changes would be acceptable if the energy model has a 
>>> broken flag IMO
>>
>> That is doable. I can add that flag, so we can call it 'artificial' EM
>> (when this new flag is set).
> 
> It is too soon IMO, I would like to see the numbers first so we can take 
> an enlighten decision. Right now, it is unclear what the numbers will be.

We are going to add new support, per our roadmap, for platforms which
don't have this power information but are going to use EAS.

I'm going to send some patches soon which add that support. Pierre
is going to send the platform code. I want to make sure that such a
platform won't register power actors for IPA. Other thermal governors
will still work, since they don't use the EM to make decisions.
diff mbox series

Patch

diff --git a/drivers/thermal/cpufreq_cooling.c b/drivers/thermal/cpufreq_cooling.c
index 43b1ae8a7789..f831ed40b333 100644
--- a/drivers/thermal/cpufreq_cooling.c
+++ b/drivers/thermal/cpufreq_cooling.c
@@ -328,7 +328,7 @@  static inline bool em_is_sane(struct cpufreq_cooling_device *cpufreq_cdev,
 	struct cpufreq_policy *policy;
 	unsigned int nr_levels;
 
-	if (!em)
+	if (!em || !(em->flags & EM_PERF_DOMAIN_MILLIWATTS))
 		return false;
 
 	policy = cpufreq_cdev->policy;
diff --git a/drivers/thermal/devfreq_cooling.c b/drivers/thermal/devfreq_cooling.c
index 4310cb342a9f..7e8bd1368cab 100644
--- a/drivers/thermal/devfreq_cooling.c
+++ b/drivers/thermal/devfreq_cooling.c
@@ -336,6 +336,14 @@  static int devfreq_cooling_gen_tables(struct devfreq_cooling_device *dfc,
 	return 0;
 }
 
+static inline bool em_is_sane(struct em_perf_domain *em)
+{
+	if (!em || !(em->flags & EM_PERF_DOMAIN_MILLIWATTS))
+		return false;
+	else
+		return true;
+}
+
 /**
  * of_devfreq_cooling_register_power() - Register devfreq cooling device,
  *                                      with OF and power information.
@@ -358,6 +366,7 @@  of_devfreq_cooling_register_power(struct device_node *np, struct devfreq *df,
 	struct thermal_cooling_device *cdev;
 	struct device *dev = df->dev.parent;
 	struct devfreq_cooling_device *dfc;
+	struct em_perf_domain *em;
 	char *name;
 	int err, num_opps;
 
@@ -367,8 +376,9 @@  of_devfreq_cooling_register_power(struct device_node *np, struct devfreq *df,
 
 	dfc->devfreq = df;
 
-	dfc->em_pd = em_pd_get(dev);
-	if (dfc->em_pd) {
+	em = em_pd_get(dev);
+	if (em_is_sane(em)) {
+		dfc->em_pd = em;
 		devfreq_cooling_ops.get_requested_power =
 			devfreq_cooling_get_requested_power;
 		devfreq_cooling_ops.state2power = devfreq_cooling_state2power;
@@ -379,7 +389,7 @@  of_devfreq_cooling_register_power(struct device_node *np, struct devfreq *df,
 		num_opps = em_pd_nr_perf_states(dfc->em_pd);
 	} else {
 		/* Backward compatibility for drivers which do not use IPA */
-		dev_dbg(dev, "missing EM for cooling device\n");
+		dev_dbg(dev, "missing proper EM for cooling device\n");
 
 		num_opps = dev_pm_opp_get_opp_count(dev);