diff mbox series

thermal: core: Restore behavior regarding invalid trip points

Message ID 20230314155010.3692869-1-idosch@nvidia.com
State Accepted
Commit f1b80a3878b2d76ced46a275fdfd7fb80b4f083b
Headers show
Series thermal: core: Restore behavior regarding invalid trip points | expand

Commit Message

Ido Schimmel March 14, 2023, 3:50 p.m. UTC
Cited commit stopped marking trip points with a zero temperature as
disabled, behavior that was originally introduced in commit 81ad4276b505
("Thermal: Ignore invalid trip points").

When using the mlxsw driver we see that when such trip points are not
disabled, the thermal subsystem repeatedly tries to set the state of the
associated cooling devices to the maximum state.

Fix this by restoring the original behavior and mark trip points with a
zero temperature as disabled.

Fixes: 7c3d5c20dc16 ("thermal/core: Add a generic thermal_zone_get_trip() function")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
---
 drivers/thermal/thermal_core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Rafael J. Wysocki March 14, 2023, 6:03 p.m. UTC | #1
On Tue, Mar 14, 2023 at 4:50 PM Ido Schimmel <idosch@nvidia.com> wrote:
>
> Cited commit stopped marking trip points with a zero temperature as
> disabled, behavior that was originally introduced in commit 81ad4276b505
> ("Thermal: Ignore invalid trip points").
>
> When using the mlxsw driver we see that when such trip points are not
> disabled, the thermal subsystem repeatedly tries to set the state of the
> associated cooling devices to the maximum state.
>
> Fix this by restoring the original behavior and mark trip points with a
> zero temperature as disabled.

What if the temperature is negative?  I think that you'd still want to
disable the trip in that case, wouldn't you?

Daniel, what's your take on this?

> Fixes: 7c3d5c20dc16 ("thermal/core: Add a generic thermal_zone_get_trip() function")
> Signed-off-by: Ido Schimmel <idosch@nvidia.com>
> ---
>  drivers/thermal/thermal_core.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
> index 5ae72f314683..63583df4498d 100644
> --- a/drivers/thermal/thermal_core.c
> +++ b/drivers/thermal/thermal_core.c
> @@ -1309,7 +1309,7 @@ thermal_zone_device_register_with_trips(const char *type, struct thermal_trip *t
>                 struct thermal_trip trip;
>
>                 result = thermal_zone_get_trip(tz, count, &trip);
> -               if (result)
> +               if (result || !trip.temperature)
>                         set_bit(count, &tz->trips_disabled);
>         }
>
> --
Rafael J. Wysocki March 22, 2023, 7:04 p.m. UTC | #2
On Tue, Mar 14, 2023 at 7:35 PM Ido Schimmel <idosch@nvidia.com> wrote:
>
> Hi Rafael,
>
> On Tue, Mar 14, 2023 at 07:03:07PM +0100, Rafael J. Wysocki wrote:
> > What if the temperature is negative?  I think that you'd still want to
> > disable the trip in that case, wouldn't you?
>
> Personally, no. This patch merely restores the behavior that was
> inadvertently removed by 7c3d5c20dc16. Specifically by this hunk:
>
> ```
> @@ -1252,9 +1319,10 @@ thermal_zone_device_register_with_trips(const char *type, struct thermal_trip *t
>                 goto release_device;
>
>         for (count = 0; count < num_trips; count++) {
> -               if (tz->ops->get_trip_type(tz, count, &trip_type) ||
> -                   tz->ops->get_trip_temp(tz, count, &trip_temp) ||
> -                   !trip_temp)
> +               struct thermal_trip trip;
> +
> +               result = thermal_zone_get_trip(tz, count, &trip);
> +               if (result)
>                         set_bit(count, &tz->trips_disabled);
>         }
> ```
>
> > Daniel, what's your take on this?
>
> Discussed this with Daniel yesterday:
> https://lore.kernel.org/linux-pm/ZA8TPDpEVanOpjEp@shredder/
>
> We agreed to rework mlxsw to not rely on the fact that trip points with
> a zero temperature are disabled, but it's not rc material, unlike this
> patch.

So I've applied this as 6.3-rc material for the sake of avoiding a
driver regression in 6.3, under the assumption that we'll get the
mlxsw driver update on time for the 6.4 merge window.

Thanks!
Ido Schimmel March 23, 2023, 9:25 p.m. UTC | #3
On Wed, Mar 22, 2023 at 08:04:54PM +0100, Rafael J. Wysocki wrote:
> So I've applied this as 6.3-rc material for the sake of avoiding a
> driver regression in 6.3, under the assumption that we'll get the
> mlxsw driver update on time for the 6.4 merge window.

Correct. Sent the patches for internal review and will submit them to
netdev (where this driver is maintained) next week assuming review is OK
and nothing explodes in our regression. Will copy Daniel and you.

Thanks
diff mbox series

Patch

diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
index 5ae72f314683..63583df4498d 100644
--- a/drivers/thermal/thermal_core.c
+++ b/drivers/thermal/thermal_core.c
@@ -1309,7 +1309,7 @@  thermal_zone_device_register_with_trips(const char *type, struct thermal_trip *t
 		struct thermal_trip trip;
 
 		result = thermal_zone_get_trip(tz, count, &trip);
-		if (result)
+		if (result || !trip.temperature)
 			set_bit(count, &tz->trips_disabled);
 	}