[v2,1/5] thermal/core: Update cooling device during thermal zone unregistration

Message ID 20230324070807.6342-1-rui.zhang@intel.com
State New
Series [v2,1/5] thermal/core: Update cooling device during thermal zone unregistration

Commit Message

Zhang, Rui March 24, 2023, 7:08 a.m. UTC
When unregistering a thermal zone device, update all cooling devices
bound to the thermal zone device.

This fixes a problem where the frequency of ACPI processors remains
limited after the ACPI thermal driver is unloaded while ACPI passive
cooling is active.

Cc: stable@vger.kernel.org
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
---
v1 -> v2
	Changelog update.
	Rearrange the code to eliminate an "iterator used outside loop"
warning.
---
 drivers/thermal/thermal_core.c | 27 ++++++++++++++++++++++++++-
 1 file changed, 26 insertions(+), 1 deletion(-)
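
The effect described in the commit message can be modelled in user space. The sketch below is not kernel code: struct instance, struct cooling_dev, cdev_bind(), cdev_update() and zone_unregister() are hypothetical names, and it assumes that the cooling device is driven to the deepest state requested by any remaining thermal instance, falling back to 0 when none is left.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_INSTANCES 8

/* One binding between a thermal zone and the cooling device (hypothetical model). */
struct instance {
	int zone_id;	/* zone that requested this state */
	long target;	/* cooling state requested by that zone */
};

struct cooling_dev {
	struct instance inst[MAX_INSTANCES];
	size_t n;	/* number of live instances */
	long cur_state;	/* state currently applied to the device */
};

/* Apply the deepest state still requested; 0 when nothing is requested. */
static void cdev_update(struct cooling_dev *cdev)
{
	long target = 0;
	size_t i;

	assert(cdev != NULL);
	for (i = 0; i < cdev->n; i++)
		if (cdev->inst[i].target > target)
			target = cdev->inst[i].target;
	cdev->cur_state = target;
}

/* Record a zone's request; returns -1 when the table is full. */
static int cdev_bind(struct cooling_dev *cdev, int zone_id, long target)
{
	if (cdev->n == MAX_INSTANCES)
		return -1;
	cdev->inst[cdev->n].zone_id = zone_id;
	cdev->inst[cdev->n].target = target;
	cdev->n++;
	return 0;
}

/* Drop every instance owned by the zone, then re-evaluate the device. */
static void zone_unregister(struct cooling_dev *cdev, int zone_id)
{
	size_t i, kept = 0;

	for (i = 0; i < cdev->n; i++)
		if (cdev->inst[i].zone_id != zone_id)
			cdev->inst[kept++] = cdev->inst[i];
	cdev->n = kept;
	cdev_update(cdev);	/* without this call the old state would stick */
}
```

In this model, leaving out the final cdev_update() call in zone_unregister() reproduces the reported symptom: the instances are gone, but cur_state keeps the value the removed zone asked for.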

Comments

Rafael J. Wysocki March 24, 2023, 1:19 p.m. UTC | #1
On Fri, Mar 24, 2023 at 8:08 AM Zhang Rui <rui.zhang@intel.com> wrote:
>
> When unregistering a cooling device, it is possible that the cooling
> device has been activated. And once the cooling device is unregistered,
> no one will deactivate it anymore.
>
> Reset cooling state during cooling device unregistration.
>
> Signed-off-by: Zhang Rui <rui.zhang@intel.com>
> ---
> In theory, the problem that this patch fixes can be triggered on a
> platform with ACPI Active cooling, by
> 1. overheat the system to trigger ACPI active cooling
> 2. unload ACPI fan driver
> 3. check if the fan is still spinning
> But I don't have such a system so I didn't trigger the problem and I
> only did build & boot test.

So I'm not sure if this change is actually safe.

In the example above, the system will still need the fan to spin after
the ACPI fan driver is unloaded in order to cool down, won't it?

> ---
>  drivers/thermal/thermal_core.c | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
> index 30ff39154598..fd54e6c10b60 100644
> --- a/drivers/thermal/thermal_core.c
> +++ b/drivers/thermal/thermal_core.c
> @@ -1192,6 +1192,10 @@ void thermal_cooling_device_unregister(struct thermal_cooling_device *cdev)
>                 }
>         }
>
> +       mutex_lock(&cdev->lock);
> +       cdev->ops->set_cur_state(cdev, 0);
> +       mutex_unlock(&cdev->lock);
> +
>         mutex_unlock(&thermal_list_lock);
>
>         device_unregister(&cdev->device);
> --
> 2.25.1
>
Rafael J. Wysocki March 24, 2023, 1:25 p.m. UTC | #2
On Fri, Mar 24, 2023 at 2:24 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> On Fri, Mar 24, 2023 at 8:08 AM Zhang Rui <rui.zhang@intel.com> wrote:
> >
> > The .bind/.unbind callbacks are designed to allow the thermal zone
> > device to bind to/unbind from a matched cooling device, with thermal
> > instances created/deleted.
> >
> > In this sense, .bind/.unbind callbacks must exist in pairs.
> >
> > Signed-off-by: Zhang Rui <rui.zhang@intel.com>
> > ---
> >  drivers/thermal/thermal_core.c | 5 +++++
> >  1 file changed, 5 insertions(+)
> >
> > diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
> > index 5225d65fb0e0..9c447f22cb39 100644
> > --- a/drivers/thermal/thermal_core.c
> > +++ b/drivers/thermal/thermal_core.c
> > @@ -1258,6 +1258,11 @@ thermal_zone_device_register_with_trips(const char *type, struct thermal_trip *t
> >         if (num_trips > 0 && (!ops->get_trip_type || !ops->get_trip_temp) && !trips)
> >                 return ERR_PTR(-EINVAL);
> >
> > +       if ((ops->bind && !ops->unbind) || (!ops->bind && ops->unbind)) {
>
> This can be written as
>
>         if (!!ops->bind != !!ops->unbind) {

Or even

        if (!ops->bind != !ops->unbind) {

for that matter.
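
A quick user-space check of the three spellings, as a sketch only: struct zone_ops and dummy_cb() below are hypothetical stand-ins for the real ops structure and its callbacks. It relies on the fact that ! always yields 0 or 1 in C, so comparing the negated pointers is enough to detect a mismatch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the .bind/.unbind members of the zone ops. */
struct zone_ops {
	int (*bind)(void);
	int (*unbind)(void);
};

static int dummy_cb(void)
{
	return 0;
}

/* Form used in the patch: both mismatch cases spelled out. */
static bool unpaired_verbose(const struct zone_ops *ops)
{
	return (ops->bind && !ops->unbind) || (!ops->bind && ops->unbind);
}

/* First suggestion: normalize each pointer to 0/1 with !!, then compare. */
static bool unpaired_double_neg(const struct zone_ops *ops)
{
	return !!ops->bind != !!ops->unbind;
}

/* Second suggestion: a single ! already yields 0/1, so it suffices. */
static bool unpaired_single_neg(const struct zone_ops *ops)
{
	return !ops->bind != !ops->unbind;
}

/* Walk all four NULL/non-NULL combinations and assert the forms agree. */
static void check_forms_agree(void)
{
	int (*vals[2])(void) = { NULL, dummy_cb };
	int i, j;

	for (i = 0; i < 2; i++) {
		for (j = 0; j < 2; j++) {
			struct zone_ops ops = { vals[i], vals[j] };

			assert(unpaired_verbose(&ops) == unpaired_double_neg(&ops));
			assert(unpaired_verbose(&ops) == unpaired_single_neg(&ops));
		}
	}
}
```

Calling check_forms_agree() exercises every combination; the shorter forms differ from the original only in readability, not in result.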

>
> > +               pr_err("Thermal zone device .bind/.unbind not paired\n");
>
> And surely none of the existing drivers do that?  Because it would be
> a functional regression if they did.
>
> > +               return ERR_PTR(-EINVAL);
> > +       }
> > +
> >         if (!thermal_class)
> >                 return ERR_PTR(-ENODEV);
> >
> > --
Rafael J. Wysocki March 27, 2023, 3:13 p.m. UTC | #3
On Mon, Mar 27, 2023 at 4:50 PM Zhang, Rui <rui.zhang@intel.com> wrote:
>
> On Fri, 2023-03-24 at 14:19 +0100, Rafael J. Wysocki wrote:
> > On Fri, Mar 24, 2023 at 8:08 AM Zhang Rui <rui.zhang@intel.com>
> > wrote:
> > > When unregistering a cooling device, it is possible that the
> > > cooling
> > > device has been activated. And once the cooling device is
> > > unregistered,
> > > no one will deactivate it anymore.
> > >
> > > Reset cooling state during cooling device unregistration.
> > >
> > > Signed-off-by: Zhang Rui <rui.zhang@intel.com>
> > > ---
> > > > In theory, the problem that this patch fixes can be triggered on a
> > > > platform with ACPI Active cooling, by
> > > > 1. overheat the system to trigger ACPI active cooling
> > > > 2. unload ACPI fan driver
> > > > 3. check if the fan is still spinning
> > > > But I don't have such a system so I didn't trigger the problem and
> > > > I
> > > > only did build & boot test.
> >
> > So I'm not sure if this change is actually safe.
> >
> > In the example above, the system will still need the fan to spin
> > after
> > the ACPI fan driver is unloaded in order to cool down, won't it?
>
> Then we can argue that the ACPI fan driver should not be unloaded in
> this case.

I don't think that whether or not the driver is expected to be
unloaded at a given time has any bearing on how it should behave when
actually unloaded.

Leaving the cooling device in its current state is "safe" from the
thermal control perspective, but it may affect the general user
experience (which may include performance too) going forward, so there
is a tradeoff.

You can argue that even if the cooling device is reset on the driver
removal, there should be another thermal control mechanism in place
that will take care of the overheat condition instead of it, but that
mechanism may be an emergency system shutdown.

What do the other cooling device drivers do in general when they get removed?

> Actually, this is the same situation as patch 1/5.
> Patch 1/5 fixes the problem that the cooling state is not restored to 0
> when unloading the thermal driver, and this fixes the same problem when
> unloading the cooling device driver.

Right, it is analogous.
Rafael J. Wysocki March 28, 2023, 5:54 p.m. UTC | #4
On Tue, Mar 28, 2023 at 4:46 AM Zhang, Rui <rui.zhang@intel.com> wrote:
>
> On Mon, 2023-03-27 at 17:13 +0200, Rafael J. Wysocki wrote:
> > On Mon, Mar 27, 2023 at 4:50 PM Zhang, Rui <rui.zhang@intel.com>
> > wrote:
> > > On Fri, 2023-03-24 at 14:19 +0100, Rafael J. Wysocki wrote:
> > > > On Fri, Mar 24, 2023 at 8:08 AM Zhang Rui <rui.zhang@intel.com>
> > > > wrote:
> > > > > When unregistering a cooling device, it is possible that the
> > > > > cooling
> > > > > device has been activated. And once the cooling device is
> > > > > unregistered,
> > > > > no one will deactivate it anymore.
> > > > >
> > > > > Reset cooling state during cooling device unregistration.
> > > > >
> > > > > Signed-off-by: Zhang Rui <rui.zhang@intel.com>
> > > > > ---
> > > > > > In theory, the problem that this patch fixes can be triggered
> > > > > > on a
> > > > > > platform with ACPI Active cooling, by
> > > > > > 1. overheat the system to trigger ACPI active cooling
> > > > > > 2. unload ACPI fan driver
> > > > > > 3. check if the fan is still spinning
> > > > > > But I don't have such a system so I didn't trigger the problem
> > > > > > and
> > > > > > I
> > > > > > only did build & boot test.
> > > >
> > > > So I'm not sure if this change is actually safe.
> > > >
> > > > In the example above, the system will still need the fan to spin
> > > > after
> > > > the ACPI fan driver is unloaded in order to cool down, won't it?
> > >
> > > Then we can argue that the ACPI fan driver should not be unloaded
> > > in
> > > this case.
> >
> > I don't think that whether or not the driver is expected to be
> > unloaded at a given time has any bearing on how it should behave when
> > actually unloaded.
> >
> > Leaving the cooling device in its current state is "safe" from the
> > thermal control perspective, but it may affect the general user
> > experience (which may include performance too) going forward, so
> > there
> > is a tradeoff.
>
> Right.
> If we don't have a third choice, then the question is simple.
> "thermal safety" vs. "user experience"?
>
> I'd vote for "thermal safety" and drop this patch series.

Works for me.

> > What do the other cooling device drivers do in general when they get
> > removed?
>
> No cooling device driver has extra handling after cdev unregistration.

However, the question regarding what to do when the driver of a
cooling device in use is being removed is a valid one.

One possible approach that comes to mind could be to defer the driver
removal until the overheat condition goes away, but anyway it would be
better to do that in the core IMV.

Patch

diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
index cfd4c1afeae7..30ff39154598 100644
--- a/drivers/thermal/thermal_core.c
+++ b/drivers/thermal/thermal_core.c
@@ -1497,9 +1497,24 @@  void thermal_zone_device_unregister(struct thermal_zone_device *tz)
 
 	/* Unbind all cdevs associated with 'this' thermal zone */
 	list_for_each_entry(cdev, &thermal_cdev_list, node) {
+		struct thermal_instance *ti;
+
+		mutex_lock(&tz->lock);
+		list_for_each_entry(ti, &tz->thermal_instances, tz_node) {
+			if (ti->cdev == cdev) {
+				mutex_unlock(&tz->lock);
+				goto unbind;
+			}
+		}
+
+		/* The cooling device is not bound to the current thermal zone */
+		mutex_unlock(&tz->lock);
+		continue;
+
+unbind:
 		if (tz->ops->unbind) {
 			tz->ops->unbind(tz, cdev);
-			continue;
+			goto deactivate;
 		}
 
 		if (!tzp || !tzp->tbp)
@@ -1511,6 +1526,16 @@  void thermal_zone_device_unregister(struct thermal_zone_device *tz)
 				tzp->tbp[i].cdev = NULL;
 			}
 		}
+
+deactivate:
+		/*
+		 * The thermal instances for the current thermal zone have been
+		 * removed. Update the cooling device in case it was activated
+		 * by the current thermal zone device.
+		 */
+		mutex_lock(&cdev->lock);
+		__thermal_cdev_update(cdev);
+		mutex_unlock(&cdev->lock);
 	}
 
 	mutex_unlock(&thermal_list_lock);