diff mbox series

[v1] thermal: core: Move passive polling management to the core

Message ID 5938055.MhkbZ0Pkbq@kreacher
State Superseded
Headers show
Series [v1] thermal: core: Move passive polling management to the core | expand

Commit Message

Rafael J. Wysocki April 25, 2024, 2:11 p.m. UTC
From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Passive polling is enabled by setting the 'passive' field in
struct thermal_zone_device to a positive value so long as the
'passive_delay_jiffies' field is greater than zero.  It causes
the thermal core to actively check the thermal zone temperature
periodically which in theory should be done after crossing a
passive trip point on the way up in order to allow governors to
react more rapidly to temperature changes and adjust mitigation
more precisely.

However, the 'passive' field in struct thermal_zone_device is currently
managed by governors which is quite problematic.  First of all, only
two governors, Step-Wise and Power Allocator, update that field at
all, so the other governors do not benefit from passive polling,
although in principle they should.  Moreover, if the zone governor is
changed from, say, Step-Wise to Fair-Share after 'passive' has been
incremented by the former, it is not going to be reset back to zero by
the latter even if the zone temperature falls down below all passive
trip points.

For this reason, make handle_thermal_trip() increment 'passive'
to enable passive polling for the given thermal zone whenever a
passive trip point is crossed on the way up and decrement it
whenever a passive trip point is crossed on the way down.  Also
remove the 'passive' field updates from governors and additionally
clear it in thermal_zone_device_init() to prevent passive polling
from being enabled after a system resume just beacuse it was enabled
before suspending the system.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---

This has been mentioned here:

https://lore.kernel.org/linux-pm/61560bc6-d453-4b0c-a4ea-b375d547b143@linaro.org/

and I need someone to double check if the Power Allocator governor does not
need to be adjusted more for this change.

---
 drivers/thermal/gov_power_allocator.c |   12 +++++++-----
 drivers/thermal/gov_step_wise.c       |   10 ----------
 drivers/thermal/thermal_core.c        |   10 ++++++++--
 3 files changed, 15 insertions(+), 17 deletions(-)

Comments

Lukasz Luba April 29, 2024, 8:09 a.m. UTC | #1
On 4/25/24 15:11, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> Passive polling is enabled by setting the 'passive' field in
> struct thermal_zone_device to a positive value so long as the
> 'passive_delay_jiffies' field is greater than zero.  It causes
> the thermal core to actively check the thermal zone temperature
> periodically which in theory should be done after crossing a
> passive trip point on the way up in order to allow governors to
> react more rapidly to temperature changes and adjust mitigation
> more precisely.
> 
> However, the 'passive' field in struct thermal_zone_device is currently
> managed by governors which is quite problematic.  First of all, only
> two governors, Step-Wise and Power Allocator, update that field at
> all, so the other governors do not benefit from passive polling,
> although in principle they should.  Moreover, if the zone governor is
> changed from, say, Step-Wise to Fair-Share after 'passive' has been
> incremented by the former, it is not going to be reset back to zero by
> the latter even if the zone temperature falls down below all passive
> trip points.
> 
> For this reason, make handle_thermal_trip() increment 'passive'
> to enable passive polling for the given thermal zone whenever a
> passive trip point is crossed on the way up and decrement it
> whenever a passive trip point is crossed on the way down.  Also
> remove the 'passive' field updates from governors and additionally
> clear it in thermal_zone_device_init() to prevent passive polling
> from being enabled after a system resume just beacuse it was enabled
> before suspending the system.
> 
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
> 
> This has been mentioned here:
> 
> https://lore.kernel.org/linux-pm/61560bc6-d453-4b0c-a4ea-b375d547b143@linaro.org/
> 
> and I need someone to double check if the Power Allocator governor does not
> need to be adjusted more for this change.

I'll do this today (and the rest of the patch as well).
Lukasz Luba April 29, 2024, 9:21 p.m. UTC | #2
Hi Rafael,

On 4/25/24 15:11, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> Passive polling is enabled by setting the 'passive' field in
> struct thermal_zone_device to a positive value so long as the
> 'passive_delay_jiffies' field is greater than zero.  It causes
> the thermal core to actively check the thermal zone temperature
> periodically which in theory should be done after crossing a
> passive trip point on the way up in order to allow governors to
> react more rapidly to temperature changes and adjust mitigation
> more precisely.
> 
> However, the 'passive' field in struct thermal_zone_device is currently
> managed by governors which is quite problematic.  First of all, only
> two governors, Step-Wise and Power Allocator, update that field at
> all, so the other governors do not benefit from passive polling,
> although in principle they should.  Moreover, if the zone governor is
> changed from, say, Step-Wise to Fair-Share after 'passive' has been
> incremented by the former, it is not going to be reset back to zero by
> the latter even if the zone temperature falls down below all passive
> trip points.
> 
> For this reason, make handle_thermal_trip() increment 'passive'
> to enable passive polling for the given thermal zone whenever a
> passive trip point is crossed on the way up and decrement it
> whenever a passive trip point is crossed on the way down.  Also
> remove the 'passive' field updates from governors and additionally
> clear it in thermal_zone_device_init() to prevent passive polling
> from being enabled after a system resume just beacuse it was enabled
> before suspending the system.
> 
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
> 
> This has been mentioned here:
> 
> https://lore.kernel.org/linux-pm/61560bc6-d453-4b0c-a4ea-b375d547b143@linaro.org/
> 
> and I need someone to double check if the Power Allocator governor does not
> need to be adjusted more for this change.
> 
> ---
>   drivers/thermal/gov_power_allocator.c |   12 +++++++-----
>   drivers/thermal/gov_step_wise.c       |   10 ----------
>   drivers/thermal/thermal_core.c        |   10 ++++++++--
>   3 files changed, 15 insertions(+), 17 deletions(-)
> 
> Index: linux-pm/drivers/thermal/thermal_core.c
> ===================================================================
> --- linux-pm.orig/drivers/thermal/thermal_core.c
> +++ linux-pm/drivers/thermal/thermal_core.c
> @@ -389,6 +389,9 @@ static void handle_thermal_trip(struct t
>   		if (tz->temperature < trip->temperature - trip->hysteresis) {
>   			list_add(&td->notify_list_node, way_down_list);
>   			td->notify_temp = trip->temperature - trip->hysteresis;
> +
> +			if (trip->type == THERMAL_TRIP_PASSIVE)
> +				tz->passive--;

This gets negative values and than the core switches to fast 'polling'
mode. The values is decremented from 0 each time the
thermal_zone_device_enable() is called.

Then IPA is actually called every 100ms after boot w/ low temp,
while it should 1s.

Please see the logs below:
'short log' after boot
----------------------------------------------

[    1.632670] thermal_sys: TZ: tz_id=0 passive-- = -1
[    1.637984] thermal_sys: TZ: tz_id=0 passive-- = -2
[    1.643641] thermal_sys: TZ: tz_id=1 passive-- = -1
----------------------------------------------

long log with call stack dumped
----------------------------------------------

[    1.632973] thermal_sys: TZ: tz_id=0 passive-- = -1
[    1.638295] CPU: 4 PID: 1 Comm: swapper/0 Not tainted 6.9.0-rc5+ #28
[    1.645409] Hardware name: Radxa ROCK 4SE (DT)
[    1.650376] Call trace:
[    1.653109]  dump_backtrace+0x9c/0x100
[    1.657309]  show_stack+0x20/0x38
[    1.661017]  dump_stack_lvl+0xc0/0xd0
[    1.665119]  dump_stack+0x18/0x28
[    1.668828]  __thermal_zone_device_update+0x1fc/0x550
[    1.674484]  thermal_zone_device_set_mode+0x64/0xc0
[    1.679943]  thermal_zone_device_enable+0x1c/0x30
[    1.685206]  thermal_of_zone_register+0x34c/0x590
[    1.690473]  devm_thermal_of_zone_register+0x6c/0xc0
[    1.696031]  rockchip_thermal_probe+0x238/0x5e8
[    1.701106]  platform_probe+0x70/0xe8
[    1.705208]  really_probe+0xc4/0x278
[    1.709205]  __driver_probe_device+0x80/0x140
[    1.714078]  driver_probe_device+0x48/0x130
[    1.718756]  __driver_attach+0x7c/0x138
[    1.723045]  bus_for_each_dev+0x80/0xf0
[    1.727342]  driver_attach+0x2c/0x40
[    1.731340]  bus_add_driver+0xec/0x1f8
[    1.735539]  driver_register+0x68/0x138
[    1.739828]  __platform_driver_register+0x30/0x48
[    1.745093]  rockchip_thermal_driver_init+0x24/0x38
[    1.750551]  do_one_initcall+0x50/0x2d8
[    1.754844]  kernel_init_freeable+0x204/0x440
[    1.759722]  kernel_init+0x28/0x140
[    1.763631]  ret_from_fork+0x10/0x20
[    1.767802] thermal_sys: TZ: tz_id=0 passive-- = -2
[    1.773086] CPU: 4 PID: 1 Comm: swapper/0 Not tainted 6.9.0-rc5+ #28
[    1.780196] Hardware name: Radxa ROCK 4SE (DT)
[    1.785162] Call trace:
[    1.787893]  dump_backtrace+0x9c/0x100
[    1.792087]  show_stack+0x20/0x38
[    1.795795]  dump_stack_lvl+0xc0/0xd0
[    1.799895]  dump_stack+0x18/0x28
[    1.803604]  __thermal_zone_device_update+0x1fc/0x550
[    1.809257]  thermal_zone_device_set_mode+0x64/0xc0
[    1.814715]  thermal_zone_device_enable+0x1c/0x30
[    1.819977]  thermal_of_zone_register+0x34c/0x590
[    1.825242]  devm_thermal_of_zone_register+0x6c/0xc0
[    1.830799]  rockchip_thermal_probe+0x238/0x5e8
[    1.835874]  platform_probe+0x70/0xe8
[    1.839973]  really_probe+0xc4/0x278
[    1.843972]  __driver_probe_device+0x80/0x140
[    1.848846]  driver_probe_device+0x48/0x130
[    1.853524]  __driver_attach+0x7c/0x138
[    1.857813]  bus_for_each_dev+0x80/0xf0
[    1.862110]  driver_attach+0x2c/0x40
[    1.866109]  bus_add_driver+0xec/0x1f8
[    1.870307]  driver_register+0x68/0x138
[    1.874597]  __platform_driver_register+0x30/0x48
[    1.879861]  rockchip_thermal_driver_init+0x24/0x38
[    1.885317]  do_one_initcall+0x50/0x2d8
[    1.889609]  kernel_init_freeable+0x204/0x440
[    1.894486]  kernel_init+0x28/0x140
[    1.898394]  ret_from_fork+0x10/0x20
[    1.902879] thermal_sys: TZ: tz_id=1 passive-- = -1
[    1.908172] CPU: 4 PID: 1 Comm: swapper/0 Not tainted 6.9.0-rc5+ #28
[    1.915282] Hardware name: Radxa ROCK 4SE (DT)
[    1.920248] Call trace:
[    1.922979]  dump_backtrace+0x9c/0x100
[    1.927176]  show_stack+0x20/0x38
[    1.930883]  dump_stack_lvl+0xc0/0xd0
[    1.934982]  dump_stack+0x18/0x28
[    1.938691]  __thermal_zone_device_update+0x1fc/0x550
[    1.944342]  thermal_zone_device_set_mode+0x64/0xc0
[    1.949801]  thermal_zone_device_enable+0x1c/0x30
[    1.955063]  thermal_of_zone_register+0x34c/0x590
[    1.960328]  devm_thermal_of_zone_register+0x6c/0xc0
[    1.965886]  rockchip_thermal_probe+0x238/0x5e8
[    1.970951]  platform_probe+0x70/0xe8
[    1.975049]  really_probe+0xc4/0x278
[    1.979039]  __driver_probe_device+0x80/0x140
[    1.983911]  driver_probe_device+0x48/0x130
[    1.988589]  __driver_attach+0x7c/0x138
[    1.992880]  bus_for_each_dev+0x80/0xf0
[    1.997176]  driver_attach+0x2c/0x40
[    2.001173]  bus_add_driver+0xec/0x1f8
[    2.005372]  driver_register+0x68/0x138
[    2.009663]  __platform_driver_register+0x30/0x48
[    2.014927]  rockchip_thermal_driver_init+0x24/0x38
[    2.020383]  do_one_initcall+0x50/0x2d8
[    2.024674]  kernel_init_freeable+0x204/0x440
[    2.029549]  kernel_init+0x28/0x140
[    2.033456]  ret_from_fork+0x10/0x20

------------------------------------------


IMO we should check something like:
----------------------8<--------------------------
if (tz->passive)
	tz->passive--;
--------------------->8---------------------------
because we start from '0' as init value.

>   		} else {
>   			td->threshold -= trip->hysteresis;
>   		}
> @@ -402,8 +405,10 @@ static void handle_thermal_trip(struct t
>   		td->notify_temp = trip->temperature;
>   		td->threshold -= trip->hysteresis;
>   
> -		if (trip->type == THERMAL_TRIP_CRITICAL ||
> -		    trip->type == THERMAL_TRIP_HOT)
> +		if (trip->type == THERMAL_TRIP_PASSIVE)
> +			tz->passive++;

Also, we have to make sure here, that we are align after some change
and be able to return with value to 0 (to switch to 'passive'
slow checks).

Regards,
Lukasz
Rafael J. Wysocki April 30, 2024, 1:48 p.m. UTC | #3
Hi Lukasz,

On Mon, Apr 29, 2024 at 11:21 PM Lukasz Luba <lukasz.luba@arm.com> wrote:
>
> Hi Rafael,
>
> On 4/25/24 15:11, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >
> > Passive polling is enabled by setting the 'passive' field in
> > struct thermal_zone_device to a positive value so long as the
> > 'passive_delay_jiffies' field is greater than zero.  It causes
> > the thermal core to actively check the thermal zone temperature
> > periodically which in theory should be done after crossing a
> > passive trip point on the way up in order to allow governors to
> > react more rapidly to temperature changes and adjust mitigation
> > more precisely.
> >
> > However, the 'passive' field in struct thermal_zone_device is currently
> > managed by governors which is quite problematic.  First of all, only
> > two governors, Step-Wise and Power Allocator, update that field at
> > all, so the other governors do not benefit from passive polling,
> > although in principle they should.  Moreover, if the zone governor is
> > changed from, say, Step-Wise to Fair-Share after 'passive' has been
> > incremented by the former, it is not going to be reset back to zero by
> > the latter even if the zone temperature falls down below all passive
> > trip points.
> >
> > For this reason, make handle_thermal_trip() increment 'passive'
> > to enable passive polling for the given thermal zone whenever a
> > passive trip point is crossed on the way up and decrement it
> > whenever a passive trip point is crossed on the way down.  Also
> > remove the 'passive' field updates from governors and additionally
> > clear it in thermal_zone_device_init() to prevent passive polling
> > from being enabled after a system resume just beacuse it was enabled
> > before suspending the system.
> >
> > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > ---
> >
> > This has been mentioned here:
> >
> > https://lore.kernel.org/linux-pm/61560bc6-d453-4b0c-a4ea-b375d547b143@linaro.org/
> >
> > and I need someone to double check if the Power Allocator governor does not
> > need to be adjusted more for this change.
> >
> > ---
> >   drivers/thermal/gov_power_allocator.c |   12 +++++++-----
> >   drivers/thermal/gov_step_wise.c       |   10 ----------
> >   drivers/thermal/thermal_core.c        |   10 ++++++++--
> >   3 files changed, 15 insertions(+), 17 deletions(-)
> >
> > Index: linux-pm/drivers/thermal/thermal_core.c
> > ===================================================================
> > --- linux-pm.orig/drivers/thermal/thermal_core.c
> > +++ linux-pm/drivers/thermal/thermal_core.c
> > @@ -389,6 +389,9 @@ static void handle_thermal_trip(struct t
> >               if (tz->temperature < trip->temperature - trip->hysteresis) {
> >                       list_add(&td->notify_list_node, way_down_list);
> >                       td->notify_temp = trip->temperature - trip->hysteresis;
> > +
> > +                     if (trip->type == THERMAL_TRIP_PASSIVE)
> > +                             tz->passive--;
>
> This gets negative values and than the core switches to fast 'polling'
> mode. The values is decremented from 0 each time the
> thermal_zone_device_enable() is called.

Interesting.

This shouldn't happen because it means that the passive trip has been
crossed on the way down, but it wasn't crossed on the way up.

It looks like an initialization issue to me.

> Then IPA is actually called every 100ms after boot w/ low temp,
> while it should 1s.
>
> Please see the logs below:
> 'short log' after boot
> ----------------------------------------------
>
> [    1.632670] thermal_sys: TZ: tz_id=0 passive-- = -1
> [    1.637984] thermal_sys: TZ: tz_id=0 passive-- = -2
> [    1.643641] thermal_sys: TZ: tz_id=1 passive-- = -1
> ----------------------------------------------
>
> long log with call stack dumped
> ----------------------------------------------
>
> [    1.632973] thermal_sys: TZ: tz_id=0 passive-- = -1
> [    1.638295] CPU: 4 PID: 1 Comm: swapper/0 Not tainted 6.9.0-rc5+ #28
> [    1.645409] Hardware name: Radxa ROCK 4SE (DT)
> [    1.650376] Call trace:
> [    1.653109]  dump_backtrace+0x9c/0x100
> [    1.657309]  show_stack+0x20/0x38
> [    1.661017]  dump_stack_lvl+0xc0/0xd0
> [    1.665119]  dump_stack+0x18/0x28
> [    1.668828]  __thermal_zone_device_update+0x1fc/0x550
> [    1.674484]  thermal_zone_device_set_mode+0x64/0xc0
> [    1.679943]  thermal_zone_device_enable+0x1c/0x30
> [    1.685206]  thermal_of_zone_register+0x34c/0x590

So let's see.

thermal_of_zone_register() calls
thermal_zone_device_register_with_trips() which calls
thermal_zone_device_update() for the first time, but
__thermal_zone_device_update() sees that
thermal_zone_device_is_enabled() returns false, so it does nothing.

This is right after thermal_zone_device_init() has been called, so
tz->temperature == THERMAL_TEMP_INVALID and tz->passive == 0.

Next, thermal_zone_device_enable() is called by
thermal_of_zone_register() and it calls __thermal_zone_device_update()
via thermal_zone_device_set_mode().

This time thermal_zone_device_is_enabled() returns true, so
update_temperature() is called and, unless __thermal_zone_get_temp()
returns an error, it sets tz->last_temperature to THERMAL_TEMP_INVALID
and tz->temperature to the current zone temperature.

Next, handle_thermal_trip() is called for all trips and it sees that
tz->last_temperature == THERMAL_TEMP_INVALID, so it skips the branch
in which tz->passive is decremented.

The only case I can see in which something else can happen in when
__thermal_zone_get_temp() called by update_temperature() returns an
error code (and if it is -EAGAIN, it won't even trigger a warning
message) in which case the error is silently discarded and
__thermal_zone_device_update() happily proceeds with tz->temperature
== THERMAL_TEMP_INVALID and tz->last_temperature == 0.

This can lead to many surprises down the road, so IMV
__thermal_zone_device_update() should abort if it sees tz->temperature
== THERMAL_TEMP_INVALID past calling update_temperature().

So I'm wondering if the patch below (modulo white-space damage from
GMail) helps.

> [    1.690473]  devm_thermal_of_zone_register+0x6c/0xc0
> [    1.696031]  rockchip_thermal_probe+0x238/0x5e8
> [    1.701106]  platform_probe+0x70/0xe8
> [    1.705208]  really_probe+0xc4/0x278
> [    1.709205]  __driver_probe_device+0x80/0x140
> [    1.714078]  driver_probe_device+0x48/0x130
> [    1.718756]  __driver_attach+0x7c/0x138
> [    1.723045]  bus_for_each_dev+0x80/0xf0
> [    1.727342]  driver_attach+0x2c/0x40
> [    1.731340]  bus_add_driver+0xec/0x1f8
> [    1.735539]  driver_register+0x68/0x138
> [    1.739828]  __platform_driver_register+0x30/0x48
> [    1.745093]  rockchip_thermal_driver_init+0x24/0x38
> [    1.750551]  do_one_initcall+0x50/0x2d8
> [    1.754844]  kernel_init_freeable+0x204/0x440
> [    1.759722]  kernel_init+0x28/0x140
> [    1.763631]  ret_from_fork+0x10/0x20
> [    1.767802] thermal_sys: TZ: tz_id=0 passive-- = -2

---
 drivers/thermal/thermal_core.c |    3 +++
 1 file changed, 3 insertions(+)

Index: linux-pm/drivers/thermal/thermal_core.c
===================================================================
--- linux-pm.orig/drivers/thermal/thermal_core.c
+++ linux-pm/drivers/thermal/thermal_core.c
@@ -496,6 +496,9 @@ void __thermal_zone_device_update(struct

     update_temperature(tz);

+    if (tz->temperature == THERMAL_TEMP_INVALID)
+        return;
+
     tz->notify_event = event;

     for_each_trip_desc(tz, td)
Lukasz Luba April 30, 2024, 2:58 p.m. UTC | #4
On 4/30/24 14:48, Rafael J. Wysocki wrote:
> Hi Lukasz,
> 
> On Mon, Apr 29, 2024 at 11:21 PM Lukasz Luba <lukasz.luba@arm.com> wrote:
>>
>> Hi Rafael,
>>
>> On 4/25/24 15:11, Rafael J. Wysocki wrote:
>>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>>
>>> Passive polling is enabled by setting the 'passive' field in
>>> struct thermal_zone_device to a positive value so long as the
>>> 'passive_delay_jiffies' field is greater than zero.  It causes
>>> the thermal core to actively check the thermal zone temperature
>>> periodically which in theory should be done after crossing a
>>> passive trip point on the way up in order to allow governors to
>>> react more rapidly to temperature changes and adjust mitigation
>>> more precisely.
>>>
>>> However, the 'passive' field in struct thermal_zone_device is currently
>>> managed by governors which is quite problematic.  First of all, only
>>> two governors, Step-Wise and Power Allocator, update that field at
>>> all, so the other governors do not benefit from passive polling,
>>> although in principle they should.  Moreover, if the zone governor is
>>> changed from, say, Step-Wise to Fair-Share after 'passive' has been
>>> incremented by the former, it is not going to be reset back to zero by
>>> the latter even if the zone temperature falls down below all passive
>>> trip points.
>>>
>>> For this reason, make handle_thermal_trip() increment 'passive'
>>> to enable passive polling for the given thermal zone whenever a
>>> passive trip point is crossed on the way up and decrement it
>>> whenever a passive trip point is crossed on the way down.  Also
>>> remove the 'passive' field updates from governors and additionally
>>> clear it in thermal_zone_device_init() to prevent passive polling
>>> from being enabled after a system resume just beacuse it was enabled
>>> before suspending the system.
>>>
>>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>> ---
>>>
>>> This has been mentioned here:
>>>
>>> https://lore.kernel.org/linux-pm/61560bc6-d453-4b0c-a4ea-b375d547b143@linaro.org/
>>>
>>> and I need someone to double check if the Power Allocator governor does not
>>> need to be adjusted more for this change.
>>>
>>> ---
>>>    drivers/thermal/gov_power_allocator.c |   12 +++++++-----
>>>    drivers/thermal/gov_step_wise.c       |   10 ----------
>>>    drivers/thermal/thermal_core.c        |   10 ++++++++--
>>>    3 files changed, 15 insertions(+), 17 deletions(-)
>>>
>>> Index: linux-pm/drivers/thermal/thermal_core.c
>>> ===================================================================
>>> --- linux-pm.orig/drivers/thermal/thermal_core.c
>>> +++ linux-pm/drivers/thermal/thermal_core.c
>>> @@ -389,6 +389,9 @@ static void handle_thermal_trip(struct t
>>>                if (tz->temperature < trip->temperature - trip->hysteresis) {
>>>                        list_add(&td->notify_list_node, way_down_list);
>>>                        td->notify_temp = trip->temperature - trip->hysteresis;
>>> +
>>> +                     if (trip->type == THERMAL_TRIP_PASSIVE)
>>> +                             tz->passive--;
>>
>> This gets negative values and than the core switches to fast 'polling'
>> mode. The values is decremented from 0 each time the
>> thermal_zone_device_enable() is called.
> 
> Interesting.
> 
> This shouldn't happen because it means that the passive trip has been
> crossed on the way down, but it wasn't crossed on the way up.
> 
> It looks like an initialization issue to me.
> 
>> Then IPA is actually called every 100ms after boot w/ low temp,
>> while it should 1s.
>>
>> Please see the logs below:
>> 'short log' after boot
>> ----------------------------------------------
>>
>> [    1.632670] thermal_sys: TZ: tz_id=0 passive-- = -1
>> [    1.637984] thermal_sys: TZ: tz_id=0 passive-- = -2
>> [    1.643641] thermal_sys: TZ: tz_id=1 passive-- = -1
>> ----------------------------------------------
>>
>> long log with call stack dumped
>> ----------------------------------------------
>>
>> [    1.632973] thermal_sys: TZ: tz_id=0 passive-- = -1
>> [    1.638295] CPU: 4 PID: 1 Comm: swapper/0 Not tainted 6.9.0-rc5+ #28
>> [    1.645409] Hardware name: Radxa ROCK 4SE (DT)
>> [    1.650376] Call trace:
>> [    1.653109]  dump_backtrace+0x9c/0x100
>> [    1.657309]  show_stack+0x20/0x38
>> [    1.661017]  dump_stack_lvl+0xc0/0xd0
>> [    1.665119]  dump_stack+0x18/0x28
>> [    1.668828]  __thermal_zone_device_update+0x1fc/0x550
>> [    1.674484]  thermal_zone_device_set_mode+0x64/0xc0
>> [    1.679943]  thermal_zone_device_enable+0x1c/0x30
>> [    1.685206]  thermal_of_zone_register+0x34c/0x590
> 
> So let's see.
> 
> thermal_of_zone_register() calls
> thermal_zone_device_register_with_trips() which calls
> thermal_zone_device_update() for the first time, but
> __thermal_zone_device_update() sees that
> thermal_zone_device_is_enabled() returns false, so it does nothing.
> 
> This is right after thermal_zone_device_init() has been called, so
> tz->temperature == THERMAL_TEMP_INVALID and tz->passive == 0.
> 
> Next, thermal_zone_device_enable() is called by
> thermal_of_zone_register() and it calls __thermal_zone_device_update()
> via thermal_zone_device_set_mode().
> 
> This time thermal_zone_device_is_enabled() returns true, so
> update_temperature() is called and, unless __thermal_zone_get_temp()
> returns an error, it sets tz->last_temperature to THERMAL_TEMP_INVALID
> and tz->temperature to the current zone temperature.
> 
> Next, handle_thermal_trip() is called for all trips and it sees that
> tz->last_temperature == THERMAL_TEMP_INVALID, so it skips the branch
> in which tz->passive is decremented.
> 
> The only case I can see in which something else can happen in when
> __thermal_zone_get_temp() called by update_temperature() returns an
> error code (and if it is -EAGAIN, it won't even trigger a warning
> message) in which case the error is silently discarded and
> __thermal_zone_device_update() happily proceeds with tz->temperature
> == THERMAL_TEMP_INVALID and tz->last_temperature == 0.

That correct.

> 
> This can lead to many surprises down the road, so IMV
> __thermal_zone_device_update() should abort if it sees tz->temperature
> == THERMAL_TEMP_INVALID past calling update_temperature().

agree

> 
> So I'm wondering if the patch below (modulo white-space damage from
> GMail) helps.
> 
>> [    1.690473]  devm_thermal_of_zone_register+0x6c/0xc0
>> [    1.696031]  rockchip_thermal_probe+0x238/0x5e8
>> [    1.701106]  platform_probe+0x70/0xe8
>> [    1.705208]  really_probe+0xc4/0x278
>> [    1.709205]  __driver_probe_device+0x80/0x140
>> [    1.714078]  driver_probe_device+0x48/0x130
>> [    1.718756]  __driver_attach+0x7c/0x138
>> [    1.723045]  bus_for_each_dev+0x80/0xf0
>> [    1.727342]  driver_attach+0x2c/0x40
>> [    1.731340]  bus_add_driver+0xec/0x1f8
>> [    1.735539]  driver_register+0x68/0x138
>> [    1.739828]  __platform_driver_register+0x30/0x48
>> [    1.745093]  rockchip_thermal_driver_init+0x24/0x38
>> [    1.750551]  do_one_initcall+0x50/0x2d8
>> [    1.754844]  kernel_init_freeable+0x204/0x440
>> [    1.759722]  kernel_init+0x28/0x140
>> [    1.763631]  ret_from_fork+0x10/0x20
>> [    1.767802] thermal_sys: TZ: tz_id=0 passive-- = -2
> 
> ---
>   drivers/thermal/thermal_core.c |    3 +++
>   1 file changed, 3 insertions(+)
> 
> Index: linux-pm/drivers/thermal/thermal_core.c
> ===================================================================
> --- linux-pm.orig/drivers/thermal/thermal_core.c
> +++ linux-pm/drivers/thermal/thermal_core.c
> @@ -496,6 +496,9 @@ void __thermal_zone_device_update(struct
> 
>       update_temperature(tz);
> 
> +    if (tz->temperature == THERMAL_TEMP_INVALID)
> +        return;
> +
>       tz->notify_event = event;
> 
>       for_each_trip_desc(tz, td)

Yes, it prevents that previous situation. I have added a debug print
before your return in the code above and the it's in the log:

[    1.632520] thermal_sys: TZ: THERMAL_TEMP_INVALID, return
[    1.638899] thermal_sys: TZ: THERMAL_TEMP_INVALID, return

The tracing also shows that we have only 1s slow polling.
It also works properly in IPA when crossing 2 trip points
and coming back to low temp.

So, that code above should be OK. Thanks!
Rafael J. Wysocki April 30, 2024, 3:08 p.m. UTC | #5
On Tue, Apr 30, 2024 at 4:58 PM Lukasz Luba <lukasz.luba@arm.com> wrote:
>
> On 4/30/24 14:48, Rafael J. Wysocki wrote:
> > Hi Lukasz,
> >
> > On Mon, Apr 29, 2024 at 11:21 PM Lukasz Luba <lukasz.luba@arm.com> wrote:
> >>
> >> Hi Rafael,
> >>
> >> On 4/25/24 15:11, Rafael J. Wysocki wrote:
> >>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >>>
> >>> Passive polling is enabled by setting the 'passive' field in
> >>> struct thermal_zone_device to a positive value so long as the
> >>> 'passive_delay_jiffies' field is greater than zero.  It causes
> >>> the thermal core to actively check the thermal zone temperature
> >>> periodically which in theory should be done after crossing a
> >>> passive trip point on the way up in order to allow governors to
> >>> react more rapidly to temperature changes and adjust mitigation
> >>> more precisely.
> >>>
> >>> However, the 'passive' field in struct thermal_zone_device is currently
> >>> managed by governors which is quite problematic.  First of all, only
> >>> two governors, Step-Wise and Power Allocator, update that field at
> >>> all, so the other governors do not benefit from passive polling,
> >>> although in principle they should.  Moreover, if the zone governor is
> >>> changed from, say, Step-Wise to Fair-Share after 'passive' has been
> >>> incremented by the former, it is not going to be reset back to zero by
> >>> the latter even if the zone temperature falls down below all passive
> >>> trip points.
> >>>
> >>> For this reason, make handle_thermal_trip() increment 'passive'
> >>> to enable passive polling for the given thermal zone whenever a
> >>> passive trip point is crossed on the way up and decrement it
> >>> whenever a passive trip point is crossed on the way down.  Also
> >>> remove the 'passive' field updates from governors and additionally
> >>> clear it in thermal_zone_device_init() to prevent passive polling
> >>> from being enabled after a system resume just beacuse it was enabled
> >>> before suspending the system.
> >>>
> >>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >>> ---
> >>>
> >>> This has been mentioned here:
> >>>
> >>> https://lore.kernel.org/linux-pm/61560bc6-d453-4b0c-a4ea-b375d547b143@linaro.org/
> >>>
> >>> and I need someone to double check if the Power Allocator governor does not
> >>> need to be adjusted more for this change.
> >>>
> >>> ---
> >>>    drivers/thermal/gov_power_allocator.c |   12 +++++++-----
> >>>    drivers/thermal/gov_step_wise.c       |   10 ----------
> >>>    drivers/thermal/thermal_core.c        |   10 ++++++++--
> >>>    3 files changed, 15 insertions(+), 17 deletions(-)
> >>>
> >>> Index: linux-pm/drivers/thermal/thermal_core.c
> >>> ===================================================================
> >>> --- linux-pm.orig/drivers/thermal/thermal_core.c
> >>> +++ linux-pm/drivers/thermal/thermal_core.c
> >>> @@ -389,6 +389,9 @@ static void handle_thermal_trip(struct t
> >>>                if (tz->temperature < trip->temperature - trip->hysteresis) {
> >>>                        list_add(&td->notify_list_node, way_down_list);
> >>>                        td->notify_temp = trip->temperature - trip->hysteresis;
> >>> +
> >>> +                     if (trip->type == THERMAL_TRIP_PASSIVE)
> >>> +                             tz->passive--;
> >>
> >> This gets negative values and than the core switches to fast 'polling'
> >> mode. The values is decremented from 0 each time the
> >> thermal_zone_device_enable() is called.
> >
> > Interesting.
> >
> > This shouldn't happen because it means that the passive trip has been
> > crossed on the way down, but it wasn't crossed on the way up.
> >
> > It looks like an initialization issue to me.
> >
> >> Then IPA is actually called every 100ms after boot w/ low temp,
> >> while it should 1s.
> >>
> >> Please see the logs below:
> >> 'short log' after boot
> >> ----------------------------------------------
> >>
> >> [    1.632670] thermal_sys: TZ: tz_id=0 passive-- = -1
> >> [    1.637984] thermal_sys: TZ: tz_id=0 passive-- = -2
> >> [    1.643641] thermal_sys: TZ: tz_id=1 passive-- = -1
> >> ----------------------------------------------
> >>
> >> long log with call stack dumped
> >> ----------------------------------------------
> >>
> >> [    1.632973] thermal_sys: TZ: tz_id=0 passive-- = -1
> >> [    1.638295] CPU: 4 PID: 1 Comm: swapper/0 Not tainted 6.9.0-rc5+ #28
> >> [    1.645409] Hardware name: Radxa ROCK 4SE (DT)
> >> [    1.650376] Call trace:
> >> [    1.653109]  dump_backtrace+0x9c/0x100
> >> [    1.657309]  show_stack+0x20/0x38
> >> [    1.661017]  dump_stack_lvl+0xc0/0xd0
> >> [    1.665119]  dump_stack+0x18/0x28
> >> [    1.668828]  __thermal_zone_device_update+0x1fc/0x550
> >> [    1.674484]  thermal_zone_device_set_mode+0x64/0xc0
> >> [    1.679943]  thermal_zone_device_enable+0x1c/0x30
> >> [    1.685206]  thermal_of_zone_register+0x34c/0x590
> >
> > So let's see.
> >
> > thermal_of_zone_register() calls
> > thermal_zone_device_register_with_trips() which calls
> > thermal_zone_device_update() for the first time, but
> > __thermal_zone_device_update() sees that
> > thermal_zone_device_is_enabled() returns false, so it does nothing.
> >
> > This is right after thermal_zone_device_init() has been called, so
> > tz->temperature == THERMAL_TEMP_INVALID and tz->passive == 0.
> >
> > Next, thermal_zone_device_enable() is called by
> > thermal_of_zone_register() and it calls __thermal_zone_device_update()
> > via thermal_zone_device_set_mode().
> >
> > This time thermal_zone_device_is_enabled() returns true, so
> > update_temperature() is called and, unless __thermal_zone_get_temp()
> > returns an error, it sets tz->last_temperature to THERMAL_TEMP_INVALID
> > and tz->temperature to the current zone temperature.
> >
> > Next, handle_thermal_trip() is called for all trips and it sees that
> > tz->last_temperature == THERMAL_TEMP_INVALID, so it skips the branch
> > in which tz->passive is decremented.
> >
> > The only case I can see in which something else can happen in when
> > __thermal_zone_get_temp() called by update_temperature() returns an
> > error code (and if it is -EAGAIN, it won't even trigger a warning
> > message) in which case the error is silently discarded and
> > __thermal_zone_device_update() happily proceeds with tz->temperature
> > == THERMAL_TEMP_INVALID and tz->last_temperature == 0.
>
> That correct.
>
> >
> > This can lead to many surprises down the road, so IMV
> > __thermal_zone_device_update() should abort if it sees tz->temperature
> > == THERMAL_TEMP_INVALID past calling update_temperature().
>
> agree
>
> >
> > So I'm wondering if the patch below (modulo white-space damage from
> > GMail) helps.
> >
> >> [    1.690473]  devm_thermal_of_zone_register+0x6c/0xc0
> >> [    1.696031]  rockchip_thermal_probe+0x238/0x5e8
> >> [    1.701106]  platform_probe+0x70/0xe8
> >> [    1.705208]  really_probe+0xc4/0x278
> >> [    1.709205]  __driver_probe_device+0x80/0x140
> >> [    1.714078]  driver_probe_device+0x48/0x130
> >> [    1.718756]  __driver_attach+0x7c/0x138
> >> [    1.723045]  bus_for_each_dev+0x80/0xf0
> >> [    1.727342]  driver_attach+0x2c/0x40
> >> [    1.731340]  bus_add_driver+0xec/0x1f8
> >> [    1.735539]  driver_register+0x68/0x138
> >> [    1.739828]  __platform_driver_register+0x30/0x48
> >> [    1.745093]  rockchip_thermal_driver_init+0x24/0x38
> >> [    1.750551]  do_one_initcall+0x50/0x2d8
> >> [    1.754844]  kernel_init_freeable+0x204/0x440
> >> [    1.759722]  kernel_init+0x28/0x140
> >> [    1.763631]  ret_from_fork+0x10/0x20
> >> [    1.767802] thermal_sys: TZ: tz_id=0 passive-- = -2
> >
> > ---
> >   drivers/thermal/thermal_core.c |    3 +++
> >   1 file changed, 3 insertions(+)
> >
> > Index: linux-pm/drivers/thermal/thermal_core.c
> > ===================================================================
> > --- linux-pm.orig/drivers/thermal/thermal_core.c
> > +++ linux-pm/drivers/thermal/thermal_core.c
> > @@ -496,6 +496,9 @@ void __thermal_zone_device_update(struct
> >
> >       update_temperature(tz);
> >
> > +    if (tz->temperature == THERMAL_TEMP_INVALID)
> > +        return;
> > +
> >       tz->notify_event = event;
> >
> >       for_each_trip_desc(tz, td)
>
> Yes, it prevents that previous situation. I have added a debug print
> before your return in the code above and the it's in the log:
>
> [    1.632520] thermal_sys: TZ: THERMAL_TEMP_INVALID, return
> [    1.638899] thermal_sys: TZ: THERMAL_TEMP_INVALID, return
>
> The tracing also shows that we have only 1s slow polling.
> It also works properly in IPA when crossing 2 trip points
> and coming back to low temp.
>
> So, that code above should be OK. Thanks!

Thanks for the confirmation!

Let me put the fix patch and the $subject one in a series and resend
them together.
diff mbox series

Patch

Index: linux-pm/drivers/thermal/thermal_core.c
===================================================================
--- linux-pm.orig/drivers/thermal/thermal_core.c
+++ linux-pm/drivers/thermal/thermal_core.c
@@ -389,6 +389,9 @@  static void handle_thermal_trip(struct t
 		if (tz->temperature < trip->temperature - trip->hysteresis) {
 			list_add(&td->notify_list_node, way_down_list);
 			td->notify_temp = trip->temperature - trip->hysteresis;
+
+			if (trip->type == THERMAL_TRIP_PASSIVE)
+				tz->passive--;
 		} else {
 			td->threshold -= trip->hysteresis;
 		}
@@ -402,8 +405,10 @@  static void handle_thermal_trip(struct t
 		td->notify_temp = trip->temperature;
 		td->threshold -= trip->hysteresis;
 
-		if (trip->type == THERMAL_TRIP_CRITICAL ||
-		    trip->type == THERMAL_TRIP_HOT)
+		if (trip->type == THERMAL_TRIP_PASSIVE)
+			tz->passive++;
+		else if (trip->type == THERMAL_TRIP_CRITICAL ||
+			 trip->type == THERMAL_TRIP_HOT)
 			handle_critical_trips(tz, trip);
 	}
 }
@@ -444,6 +449,7 @@  static void thermal_zone_device_init(str
 	INIT_DELAYED_WORK(&tz->poll_queue, thermal_zone_device_check);
 
 	tz->temperature = THERMAL_TEMP_INVALID;
+	tz->passive = 0;
 	tz->prev_low_trip = -INT_MAX;
 	tz->prev_high_trip = INT_MAX;
 	list_for_each_entry(pos, &tz->thermal_instances, tz_node)
Index: linux-pm/drivers/thermal/gov_step_wise.c
===================================================================
--- linux-pm.orig/drivers/thermal/gov_step_wise.c
+++ linux-pm/drivers/thermal/gov_step_wise.c
@@ -93,16 +93,6 @@  static void thermal_zone_trip_update(str
 		if (instance->initialized && old_target == instance->target)
 			continue;
 
-		if (trip->type == THERMAL_TRIP_PASSIVE) {
-			/* If needed, update the status of passive polling. */
-			if (old_target == THERMAL_NO_TARGET &&
-			    instance->target != THERMAL_NO_TARGET)
-				tz->passive++;
-			else if (old_target != THERMAL_NO_TARGET &&
-				 instance->target == THERMAL_NO_TARGET)
-				tz->passive--;
-		}
-
 		instance->initialized = true;
 
 		mutex_lock(&instance->cdev->lock);
Index: linux-pm/drivers/thermal/gov_power_allocator.c
===================================================================
--- linux-pm.orig/drivers/thermal/gov_power_allocator.c
+++ linux-pm/drivers/thermal/gov_power_allocator.c
@@ -66,6 +66,7 @@  struct power_actor {
  * struct power_allocator_params - parameters for the power allocator governor
  * @allocated_tzp:	whether we have allocated tzp for this thermal zone and
  *			it needs to be freed on unbind
+ * @update_cdevs:	whether or not update cdevs on the next run
  * @err_integral:	accumulated error in the PID controller.
  * @prev_err:	error in the previous iteration of the PID controller.
  *		Used to calculate the derivative term.
@@ -84,6 +85,7 @@  struct power_actor {
  */
 struct power_allocator_params {
 	bool allocated_tzp;
+	bool update_cdevs;
 	s64 err_integral;
 	s32 prev_err;
 	u32 sustainable_power;
@@ -533,7 +535,7 @@  static void reset_pid_controller(struct
 	params->prev_err = 0;
 }
 
-static void allow_maximum_power(struct thermal_zone_device *tz, bool update)
+static void allow_maximum_power(struct thermal_zone_device *tz)
 {
 	struct power_allocator_params *params = tz->governor_data;
 	struct thermal_cooling_device *cdev;
@@ -555,7 +557,7 @@  static void allow_maximum_power(struct t
 		 */
 		cdev->ops->get_requested_power(cdev, &req_power);
 
-		if (update)
+		if (params->update_cdevs)
 			__thermal_cdev_update(cdev);
 
 		mutex_unlock(&cdev->lock);
@@ -752,13 +754,13 @@  static void power_allocator_manage(struc
 
 	if (trip && tz->temperature < trip->temperature) {
 		reset_pid_controller(params);
-		allow_maximum_power(tz, tz->passive);
-		tz->passive = 0;
+		allow_maximum_power(tz);
+		params->update_cdevs = false;
 		return;
 	}
 
 	allocate_power(tz, params->trip_max->temperature);
-	tz->passive = 1;
+	params->update_cdevs = true;
 }
 
 static struct thermal_governor thermal_gov_power_allocator = {