diff mbox series

[1/1] thermal: sysfs: avoid actual readings from sysfs

Message ID 20230607003721.834038-1-evalenti@kernel.org
State New
Headers show
Series [1/1] thermal: sysfs: avoid actual readings from sysfs | expand

Commit Message

Eduardo Valentin June 7, 2023, 12:37 a.m. UTC
From: Eduardo Valentin <eduval@amazon.com>

As the thermal zone caches the current and last temperature
value, the sysfs interface can use that instead of
forcing an actual update or read from the device.
This way, if multiple userspace requests are coming
in, we avoid storming the device with multiple reads
and potentially clogging the timing requirement
for the governors.

Cc: "Rafael J. Wysocki" <rafael@kernel.org> (supporter:THERMAL)
Cc: Daniel Lezcano <daniel.lezcano@linaro.org> (supporter:THERMAL)
Cc: Amit Kucheria <amitk@kernel.org> (reviewer:THERMAL)
Cc: Zhang Rui <rui.zhang@intel.com> (reviewer:THERMAL)
Cc: linux-pm@vger.kernel.org (open list:THERMAL)
Cc: linux-kernel@vger.kernel.org (open list)

Signed-off-by: Eduardo Valentin <eduval@amazon.com>
---
 drivers/thermal/thermal_sysfs.c | 21 ++++++++++++++++-----
 1 file changed, 16 insertions(+), 5 deletions(-)

Comments

Russell Haley June 10, 2023, 5:24 p.m. UTC | #1
On 6/7/23 11:38, Eduardo Valentin wrote:
>> Can you elaborate 'the timing requirement for the governors' ? I'm
>> missing the point
> 
> 
> The point is to avoid contention on the device update path.
> Governor that use differential equations on temperature over time
> will be very time sensitive. Step wise, power allocator, or any
> PID will be very sensitive to time. So, If userspace is hitting
> this API too often we can see cases where the updates needed to
> service userspace may defer/delay the execution of the governor
> logic.
> 
> Despite that, there is really no point to have more updates than
> what was configured for the thermal zone to support. Say that
> we configure a thermal zone to update itself every 500ms, yet
> userspace keeps sending reads every 100ms, we do not need necessarily
> to do a trip to the device every single time to update the temperature,
> as per the design for the thermal zone.

A userspace governor might *also* use PID or filter multiple samples
taken at high rate. I specifically switched my python fan control script
from the Intel coretemp hwmon to the x86_pkg_tmp thermal zone because of
the coretemp driver's annoying 1-second caching behavior.
Daniel Lezcano June 12, 2023, 8:17 a.m. UTC | #2
Hi Eduardo,

On 08/06/2023 19:44, Eduardo Valentin wrote:

[ ... ]

>> Do you have a use case with some measurements to spot an issue or is it
>> a potential issue you identified ?
> 
> 
> yes, a governor that is using I2C device as input, behind I2C fast mode (100KHz)
> and needs to update the zone every 100ms. Each read in this bus, if done alone
> would be around 500us, takes 10bytes to read the device, it is 10 clocks per byte,
> well technically 9, but rounding for the sake of the example, which gets you
> 50 / 100KHz = 500 us. That is for a single read. You add one single extra
> userspace read triggering an unused device update, that is already a 1ms drift.
> Basically you looking at 0.5% for each extra userspace read competing in this
> sysfs node. You add extra devices in the same I2C bus, your governor is looking
> at more than 1% overhead. And I am talking also about a main CPU of ~800MHz.
> I did not even include the lock overhead considered for this CPU ;-)
> 
> Again, this is not about controlling the DIE temperature of the CPU you
> are running the thermal subsystem. This is about controlling
> a target device.

Ok. The target device is on a bus which is slow and prone to contention.

This hardware is not designed to be monitored with a high precision, so 
reading the temperature at a high rate does not really make sense.

Moreover (putting apart a potential contention), the delayed read does 
not change the time interval, which remains the same from the governor 
point of view.

In addition, i2c sensors are usually handled in the hwmon subsystem 
which are registered in the thermal framework from there. Those have 
most of their 'read' callback with a cached value in a jiffies based way 
eg. [1].

So the feature already exists for slow devices and are handled in the 
drivers directly via the hwmon subsystem.

 From my POV, the feature is not needed in the thermal framework.



[1] 
https://git.kernel.org/pub/scm/linux/kernel/git/thermal/linux.git/tree/drivers/hwmon/lm95234.c#n163
Eduardo Valentin June 21, 2023, 5:06 a.m. UTC | #3
On Mon, Jun 12, 2023 at 10:17:51AM +0200, Daniel Lezcano wrote:
> 
> 
> 
> Hi Eduardo,
> 
> On 08/06/2023 19:44, Eduardo Valentin wrote:
> 
> [ ... ]
> 
> > > Do you have a use case with some measurements to spot an issue or is it
> > > a potential issue you identified ?
> > 
> > 
> > yes, a governor that is using I2C device as input, behind I2C fast mode (100KHz)
> > and needs to update the zone every 100ms. Each read in this bus, if done alone
> > would be around 500us, takes 10bytes to read the device, it is 10 clocks per byte,
> > well technically 9, but rounding for the sake of the example, which gets you
> > 50 / 100KHz = 500 us. That is for a single read. You add one single extra
> > userspace read triggering an unused device update, that is already a 1ms drift.
> > Basically you looking at 0.5% for each extra userspace read competing in this
> > sysfs node. You add extra devices in the same I2C bus, your governor is looking
> > at more than 1% overhead. And I am talking also about a main CPU of ~800MHz.
> > I did not even include the lock overhead considered for this CPU ;-)
> > 
> > Again, this is not about controlling the DIE temperature of the CPU you
> > are running the thermal subsystem. This is about controlling
> > a target device.
> 
> Ok. The target device is on a bus which is slow and prone to contention.
> 
> This hardware is not designed to be monitored with a high precision, so
> reading the temperature at a high rate does not really make sense.

On the contrary, it needs even more precision and any extra delay adds to
loss on accuracy :-)

> 
> Moreover (putting apart a potential contention), the delayed read does
> not change the time interval, which remains the same from the governor
> point of view.

It does not change the governor update interval and that is a property of
the thermal zone. Correct. And that is the intention of the change.
The actual temperature updates driven by the governor will always
result in a driver call. While a userspace call will not be in the way
of the governor update.

Sysfs reads, However, with the current code as is, it may cause
jittering on the actual execution of the governor throttle function.
 causing the computation of the desired outcome cooling device being skewed.

> 
> In addition, i2c sensors are usually handled in the hwmon subsystem
> which are registered in the thermal framework from there. Those have
> most of their 'read' callback with a cached value in a jiffies based way
> eg. [1].

I guess what you are really saying is: go read the hwmon sysfs node,
or, hwmon solves this for us, which unfortunately is not true for all devices.


> 
> So the feature already exists for slow devices and are handled in the
> drivers directly via the hwmon subsystem.
> 
> From my POV, the feature is not needed in the thermal framework.

The fact that hwmon does it in some way is another evidence of the
actual problem. Telling that this has to be solved by another subsystem
for a sysfs node that is part of thermal subsystem does not really solve
the problem. Also as I mentioned, this is not common on all hwmon
devices, and not all I2C devices are hwmon devices. In fact, I2C
was just one example of a slow device. There are more I can quote
that are not necessarily under the hwmon case.


Not sure if you missed, but an alternative for the difference of
opinion on how this should behave is to have caching for response
of sysfs read of tz/temp  as an option/configuration. Then we let
userspace to choose which behavior it wants.

> 
> 
> 
> [1]
> https://git.kernel.org/pub/scm/linux/kernel/git/thermal/linux.git/tree/drivers/hwmon/lm95234.c#n163
> 
> 
> 
> --
> <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs
> 
> Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
> <http://twitter.com/#!/linaroorg> Twitter |
> <http://www.linaro.org/linaro-blog/> Blog
>
Daniel Lezcano June 21, 2023, 11:43 a.m. UTC | #4
On 21/06/2023 07:06, Eduardo Valentin wrote:
> On Mon, Jun 12, 2023 at 10:17:51AM +0200, Daniel Lezcano wrote:
>>
>>
>>
>> Hi Eduardo,
>>
>> On 08/06/2023 19:44, Eduardo Valentin wrote:
>>
>> [ ... ]
>>
>>>> Do you have a use case with some measurements to spot an issue or is it
>>>> a potential issue you identified ?
>>>
>>>
>>> yes, a governor that is using I2C device as input, behind I2C fast mode (100KHz)
>>> and needs to update the zone every 100ms. Each read in this bus, if done alone
>>> would be around 500us, takes 10bytes to read the device, it is 10 clocks per byte,
>>> well technically 9, but rounding for the sake of the example, which gets you
>>> 50 / 100KHz = 500 us. That is for a single read. You add one single extra
>>> userspace read triggering an unused device update, that is already a 1ms drift.
>>> Basically you looking at 0.5% for each extra userspace read competing in this
>>> sysfs node. You add extra devices in the same I2C bus, your governor is looking
>>> at more than 1% overhead. And I am talking also about a main CPU of ~800MHz.
>>> I did not even include the lock overhead considered for this CPU ;-)
>>>
>>> Again, this is not about controlling the DIE temperature of the CPU you
>>> are running the thermal subsystem. This is about controlling
>>> a target device.
>>
>> Ok. The target device is on a bus which is slow and prone to contention.
>>
>> This hardware is not designed to be monitored with a high precision, so
>> reading the temperature at a high rate does not really make sense.
> 
> On the contrary, it needs even more precision and any extra delay adds to
> loss on accuracy :-)

What I meant is if the hardware designer thought there could be a 
problem with the thermal zone they would have put another kind of 
sensor, not one with a i2c based communication.


>> Moreover (putting apart a potential contention), the delayed read does
>> not change the time interval, which remains the same from the governor
>> point of view.
> 
> It does not change the governor update interval and that is a property of
> the thermal zone. Correct. And that is the intention of the change.
> The actual temperature updates driven by the governor will always
> result in a driver call. While a userspace call will not be in the way
> of the governor update.
> 
> Sysfs reads, However, with the current code as is, it may cause
> jittering on the actual execution of the governor throttle function.
>   causing the computation of the desired outcome cooling device being skewed.
> 
>>
>> In addition, i2c sensors are usually handled in the hwmon subsystem
>> which are registered in the thermal framework from there. Those have
>> most of their 'read' callback with a cached value in a jiffies based way
>> eg. [1].
> 
> I guess what you are really saying is: go read the hwmon sysfs node,
> or, hwmon solves this for us, which unfortunately is not true for all devices.

I meant the i2c sensors are under the hwmon subsystem. This subsystem is 
connected with the thermal framework, so when a hwmon sensor is created, 
it register this sensor as a thermal zone.


>> So the feature already exists for slow devices and are handled in the
>> drivers directly via the hwmon subsystem.
>>
>>  From my POV, the feature is not needed in the thermal framework.
> 
> The fact that hwmon does it in some way is another evidence of the
> actual problem.

Not really, it shows the i2c sensors are in the hwmon subsystems.


> Telling that this has to be solved by another subsystem
> for a sysfs node that is part of thermal subsystem does not really solve
> the problem. Also as I mentioned, this is not common on all hwmon
> devices, and not all I2C devices are hwmon devices. In fact, I2C
> was just one example of a slow device. There are more I can quote
> that are not necessarily under the hwmon case.

Yes, please. Can you give examples with existing drivers in the thermal 
framework and observed issues on specific platforms ? Numbers would help.

> Not sure if you missed, but an alternative for the difference of
> opinion on how this should behave is to have caching for response
> of sysfs read of tz/temp  as an option/configuration. Then we let
> userspace to choose which behavior it wants.

Before that, you have to prove the feature is really needed and show how 
the patches solves an issue. At this point, this is not demonstrated.
Eduardo Valentin June 22, 2023, 4:56 a.m. UTC | #5
On Wed, Jun 21, 2023 at 01:43:26PM +0200, Daniel Lezcano wrote:
> 
> 
> 
> On 21/06/2023 07:06, Eduardo Valentin wrote:
> > On Mon, Jun 12, 2023 at 10:17:51AM +0200, Daniel Lezcano wrote:
> > > 
> > > 
> > > 
> > > Hi Eduardo,
> > > 
> > > On 08/06/2023 19:44, Eduardo Valentin wrote:
> > > 
> > > [ ... ]
> > > 
> > > > > Do you have a use case with some measurements to spot an issue or is it
> > > > > a potential issue you identified ?
> > > > 
> > > > 
> > > > yes, a governor that is using I2C device as input, behind I2C fast mode (100KHz)
> > > > and needs to update the zone every 100ms. Each read in this bus, if done alone
> > > > would be around 500us, takes 10bytes to read the device, it is 10 clocks per byte,
> > > > well technically 9, but rounding for the sake of the example, which gets you
> > > > 50 / 100KHz = 500 us. That is for a single read. You add one single extra
> > > > userspace read triggering an unused device update, that is already a 1ms drift.
> > > > Basically you looking at 0.5% for each extra userspace read competing in this
> > > > sysfs node. You add extra devices in the same I2C bus, your governor is looking
> > > > at more than 1% overhead. And I am talking also about a main CPU of ~800MHz.
> > > > I did not even include the lock overhead considered for this CPU ;-)
> > > > 
> > > > Again, this is not about controlling the DIE temperature of the CPU you
> > > > are running the thermal subsystem. This is about controlling
> > > > a target device.
> > > 
> > > Ok. The target device is on a bus which is slow and prone to contention.
> > > 
> > > This hardware is not designed to be monitored with a high precision, so
> > > reading the temperature at a high rate does not really make sense.
> > 
> > On the contrary, it needs even more precision and any extra delay adds to
> > loss on accuracy :-)
> 
> What I meant is if the hardware designer thought there could be a
> problem with the thermal zone they would have put another kind of
> sensor, not one with a i2c based communication.

No, that is not a problem in the physical thermal zone. Or to cover
for a hardware design flaw. This is an actual typical case. However,
yes, designer must take into account any sort of delays or jittering
in the control system, and typically one would add some thermal margins
on the control system. But to the original point here, we should eliminate
unnecessary jittering or delay in the control system.

> 
> 
> > > Moreover (putting apart a potential contention), the delayed read does
> > > not change the time interval, which remains the same from the governor
> > > point of view.
> > 
> > It does not change the governor update interval and that is a property of
> > the thermal zone. Correct. And that is the intention of the change.
> > The actual temperature updates driven by the governor will always
> > result in a driver call. While a userspace call will not be in the way
> > of the governor update.
> > 
> > Sysfs reads, However, with the current code as is, it may cause
> > jittering on the actual execution of the governor throttle function.
> >   causing the computation of the desired outcome cooling device being skewed.
> > 
> > > 
> > > In addition, i2c sensors are usually handled in the hwmon subsystem
> > > which are registered in the thermal framework from there. Those have
> > > most of their 'read' callback with a cached value in a jiffies based way
> > > eg. [1].
> > 
> > I guess what you are really saying is: go read the hwmon sysfs node,
> > or, hwmon solves this for us, which unfortunately is not true for all devices.
> 
> I meant the i2c sensors are under the hwmon subsystem. This subsystem is
> connected with the thermal framework, so when a hwmon sensor is created,
> it register this sensor as a thermal zone.
> 
> 
> > > So the feature already exists for slow devices and are handled in the
> > > drivers directly via the hwmon subsystem.
> > > 
> > >  From my POV, the feature is not needed in the thermal framework.
> > 
> > The fact that hwmon does it in some way is another evidence of the
> > actual problem.
> 
> Not really, it shows the i2c sensors are in the hwmon subsystems.

They are there too. And hwmon also sees same problem of too frequent
device update. The problem is there regardless of the subsystem we use
to represent the device. Also, I dont buy the argument that this is
a problem of hwmon because it is already supported to plug in
hwmon devices in the thermal subsystem. That is actually the original
design in fact :-). So running a control in the thermal subsystem
on top of inputs from hwmon, which happens to support I2C devices,
is not a problem for hwmon to solve, since the control is in the
thermal subsystem, and hwmon does not offer control solutions.


> 
> 
> > Telling that this has to be solved by another subsystem
> > for a sysfs node that is part of thermal subsystem does not really solve
> > the problem. Also as I mentioned, this is not common on all hwmon
> > devices, and not all I2C devices are hwmon devices. In fact, I2C
> > was just one example of a slow device. There are more I can quote
> > that are not necessarily under the hwmon case.
> 
> Yes, please. Can you give examples with existing drivers in the thermal
> framework and observed issues on specific platforms ? Numbers would help.

I believe I already gave you numbers. And as I explained above,
the driver does not need to sit on thermal subsystem only,
we already support plugging in I2C/pmbus devices on the thermal
subsystem, so basically all drivers on hwmon that has the REGISTER_TZ
flag are actual samples for this problem
Rafael J. Wysocki June 23, 2023, 5:31 p.m. UTC | #6
On Thu, Jun 22, 2023 at 6:56 AM Eduardo Valentin <evalenti@kernel.org> wrote:
>
> On Wed, Jun 21, 2023 at 01:43:26PM +0200, Daniel Lezcano wrote:
> >
> >
> >
> > On 21/06/2023 07:06, Eduardo Valentin wrote:
> > > On Mon, Jun 12, 2023 at 10:17:51AM +0200, Daniel Lezcano wrote:
> > > >
> > > >
> > > >
> > > > Hi Eduardo,
> > > >
> > > > On 08/06/2023 19:44, Eduardo Valentin wrote:
> > > >
> > > > [ ... ]
> > > >
> > > > > > Do you have a use case with some measurements to spot an issue or is it
> > > > > > a potential issue you identified ?
> > > > >
> > > > >
> > > > > yes, a governor that is using I2C device as input, behind I2C fast mode (100KHz)
> > > > > and needs to update the zone every 100ms. Each read in this bus, if done alone
> > > > > would be around 500us, takes 10bytes to read the device, it is 10 clocks per byte,
> > > > > well technically 9, but rounding for the sake of the example, which gets you
> > > > > 50 / 100KHz = 500 us. That is for a single read. You add one single extra
> > > > > userspace read triggering an unused device update, that is already a 1ms drift.
> > > > > Basically you looking at 0.5% for each extra userspace read competing in this
> > > > > sysfs node. You add extra devices in the same I2C bus, your governor is looking
> > > > > at more than 1% overhead. And I am talking also about a main CPU of ~800MHz.
> > > > > I did not even include the lock overhead considered for this CPU ;-)
> > > > >
> > > > > Again, this is not about controlling the DIE temperature of the CPU you
> > > > > are running the thermal subsystem. This is about controlling
> > > > > a target device.
> > > >
> > > > Ok. The target device is on a bus which is slow and prone to contention.
> > > >
> > > > This hardware is not designed to be monitored with a high precision, so
> > > > reading the temperature at a high rate does not really make sense.
> > >
> > > On the contrary, it needs even more precision and any extra delay adds to
> > > loss on accuracy :-)
> >
> > What I meant is if the hardware designer thought there could be a
> > problem with the thermal zone they would have put another kind of
> > sensor, not one with a i2c based communication.
>
> No, that is not a problem in the physical thermal zone. Or to cover
> for a hardware design flaw. This is an actual typical case. However,
> yes, designer must take into account any sort of delays or jittering
> in the control system, and typically one would add some thermal margins
> on the control system. But to the original point here, we should eliminate
> unnecessary jittering or delay in the control system.
>
> >
> >
> > > > Moreover (putting apart a potential contention), the delayed read does
> > > > not change the time interval, which remains the same from the governor
> > > > point of view.
> > >
> > > It does not change the governor update interval and that is a property of
> > > the thermal zone. Correct. And that is the intention of the change.
> > > The actual temperature updates driven by the governor will always
> > > result in a driver call. While a userspace call will not be in the way
> > > of the governor update.
> > >
> > > Sysfs reads, However, with the current code as is, it may cause
> > > jittering on the actual execution of the governor throttle function.
> > >   causing the computation of the desired outcome cooling device being skewed.
> > >
> > > >
> > > > In addition, i2c sensors are usually handled in the hwmon subsystem
> > > > which are registered in the thermal framework from there. Those have
> > > > most of their 'read' callback with a cached value in a jiffies based way
> > > > eg. [1].
> > >
> > > I guess what you are really saying is: go read the hwmon sysfs node,
> > > or, hwmon solves this for us, which unfortunately is not true for all devices.
> >
> > I meant the i2c sensors are under the hwmon subsystem. This subsystem is
> > connected with the thermal framework, so when a hwmon sensor is created,
> > it register this sensor as a thermal zone.
> >
> >
> > > > So the feature already exists for slow devices and are handled in the
> > > > drivers directly via the hwmon subsystem.
> > > >
> > > >  From my POV, the feature is not needed in the thermal framework.
> > >
> > > The fact that hwmon does it in some way is another evidence of the
> > > actual problem.
> >
> > Not really, it shows the i2c sensors are in the hwmon subsystems.
>
> They are there too. And hwmon also sees same problem of too frequent
> device update. The problem is there regardless of the subsystem we use
> to represent the device. Also, I dont buy the argument that this is
> a problem of hwmon because it is already supported to plug in
> hwmon devices in the thermal subsystem. That is actually the original
> design in fact :-). So running a control in the thermal subsystem
> on top of inputs from hwmon, which happens to support I2C devices,
> is not a problem for hwmon to solve, since the control is in the
> thermal subsystem, and hwmon does not offer control solutions.

Regardless of where the problem is etc, if my understanding of the
patch is correct, it is proposing to change the behavior of a
well-known sysfs interface in a way that is likely to be incompatible
with at least some of its users.  This is an obvious no-go in kernel
development and I would expect you to be well aware of it.

IOW, if you want the cached value to be returned, a new interface (eg.
a new sysfs attribute) is needed.

And I think that the use case is not really i2c sensors in general,
because at least in some cases they work just fine AFAICS, but a
platform with a control loop running in the kernel that depends on
sensor reads carried out at a specific, approximately constant, rate
that can be disturbed by user space checking the sensor temperature
via sysfs at "wrong" times.  If at the same time the user space
program in question doesn't care about the most recent values reported
by the sensor, it may very well use the values cached by the in-kernel
control loop.
Eduardo Valentin June 28, 2023, 9:10 p.m. UTC | #7
On Fri, Jun 23, 2023 at 07:31:43PM +0200, Rafael J. Wysocki wrote:
> 
> 
> 
> On Thu, Jun 22, 2023 at 6:56 AM Eduardo Valentin <evalenti@kernel.org> wrote:
> >
> > On Wed, Jun 21, 2023 at 01:43:26PM +0200, Daniel Lezcano wrote:
> > >
> > >
> > >
> > > On 21/06/2023 07:06, Eduardo Valentin wrote:
> > > > On Mon, Jun 12, 2023 at 10:17:51AM +0200, Daniel Lezcano wrote:
> > > > >
> > > > >
> > > > >
> > > > > Hi Eduardo,
> > > > >
> > > > > On 08/06/2023 19:44, Eduardo Valentin wrote:
> > > > >
> > > > > [ ... ]
> > > > >
> > > > > > > Do you have a use case with some measurements to spot an issue or is it
> > > > > > > a potential issue you identified ?
> > > > > >
> > > > > >
> > > > > > yes, a governor that is using I2C device as input, behind I2C fast mode (100KHz)
> > > > > > and needs to update the zone every 100ms. Each read in this bus, if done alone
> > > > > > would be around 500us, takes 10bytes to read the device, it is 10 clocks per byte,
> > > > > > well technically 9, but rounding for the sake of the example, which gets you
> > > > > > 50 / 100KHz = 500 us. That is for a single read. You add one single extra
> > > > > > userspace read triggering an unused device update, that is already a 1ms drift.
> > > > > > Basically you looking at 0.5% for each extra userspace read competing in this
> > > > > > sysfs node. You add extra devices in the same I2C bus, your governor is looking
> > > > > > at more than 1% overhead. And I am talking also about a main CPU of ~800MHz.
> > > > > > I did not even include the lock overhead considered for this CPU ;-)
> > > > > >
> > > > > > Again, this is not about controlling the DIE temperature of the CPU you
> > > > > > are running the thermal subsystem. This is about controlling
> > > > > > a target device.
> > > > >
> > > > > Ok. The target device is on a bus which is slow and prone to contention.
> > > > >
> > > > > This hardware is not designed to be monitored with a high precision, so
> > > > > reading the temperature at a high rate does not really make sense.
> > > >
> > > > On the contrary, it needs even more precision and any extra delay adds to
> > > > loss on accuracy :-)
> > >
> > > What I meant is if the hardware designer thought there could be a
> > > problem with the thermal zone they would have put another kind of
> > > sensor, not one with a i2c based communication.
> >
> > No, that is not a problem in the physical thermal zone. Or to cover
> > for a hardware design flaw. This is an actual typical case. However,
> > yes, designer must take into account any sort of delays or jittering
> > in the control system, and typically one would add some thermal margins
> > on the control system. But to the original point here, we should eliminate
> > unnecessary jittering or delay in the control system.
> >
> > >
> > >
> > > > > Moreover (putting apart a potential contention), the delayed read does
> > > > > not change the time interval, which remains the same from the governor
> > > > > point of view.
> > > >
> > > > It does not change the governor update interval and that is a property of
> > > > the thermal zone. Correct. And that is the intention of the change.
> > > > The actual temperature updates driven by the governor will always
> > > > result in a driver call. While a userspace call will not be in the way
> > > > of the governor update.
> > > >
> > > > Sysfs reads, However, with the current code as is, it may cause
> > > > jittering on the actual execution of the governor throttle function.
> > > >   causing the computation of the desired outcome cooling device being skewed.
> > > >
> > > > >
> > > > > In addition, i2c sensors are usually handled in the hwmon subsystem
> > > > > which are registered in the thermal framework from there. Those have
> > > > > most of their 'read' callback with a cached value in a jiffies based way
> > > > > eg. [1].
> > > >
> > > > I guess what you are really saying is: go read the hwmon sysfs node,
> > > > or, hwmon solves this for us, which unfortunately is not true for all devices.
> > >
> > > I meant the i2c sensors are under the hwmon subsystem. This subsystem is
> > > connected with the thermal framework, so when a hwmon sensor is created,
> > > it register this sensor as a thermal zone.
> > >
> > >
> > > > > So the feature already exists for slow devices and are handled in the
> > > > > drivers directly via the hwmon subsystem.
> > > > >
> > > > >  From my POV, the feature is not needed in the thermal framework.
> > > >
> > > > The fact that hwmon does it in some way is another evidence of the
> > > > actual problem.
> > >
> > > Not really, it shows the i2c sensors are in the hwmon subsystems.
> >
> > They are there too. And hwmon also sees same problem of too frequent
> > device update. The problem is there regardless of the subsystem we use
> > to represent the device. Also, I dont buy the argument that this is
> > a problem of hwmon because it is already supported to plug in
> > hwmon devices in the thermal subsystem. That is actually the original
> > design in fact :-). So running a control in the thermal subsystem
> > on top of inputs from hwmon, which happens to support I2C devices,
> > is not a problem for hwmon to solve, since the control is in the
> > thermal subsystem, and hwmon does not offer control solutions.
> 
> Regardless of where the problem is etc, if my understanding of the
> patch is correct, it is proposing to change the behavior of a
> well-known sysfs interface in a way that is likely to be incompatible
> with at least some of its users.  This is an obvious no-go in kernel
> development and I would expect you to be well aware of it.

yeah I get it.

> 
> IOW, if you want the cached value to be returned, a new interface (eg.
> a new sysfs attribute) is needed.

Yeah, I am fine with either a new sysfs entry to return the cached value,
or a new sysfs entry to change the behavior of the existing /temp, as I
mentioned before, either way works for me, if changing the existing one
is too intrusive.

> 
> And I think that the use case is not really i2c sensors in general,

I2C was just the factual example I had, but you are right, the use case
is not isolated to I2C sensor. Rather, to be clear I am not blaming I2C,
the actual issue just happen to be easier to see when I2C devices, slower
than typical MMIO devices, are being used as input for the control.

> because at least in some cases they work just fine AFAICS, but a
> platform with a control loop running in the kernel that depends on
> sensor reads carried out at a specific, approximately constant, rate
> that can be disturbed by user space checking the sensor temperature
> via sysfs at "wrong" times.  If at the same time the user space
> program in question doesn't care about the most recent values reported
> by the sensor, it may very well use the values cached by the in-kernel
> control loop.

That is right, the balance between supporting user space reads and
running the control timely is the actual original concern. The problem
fades out a bit when you have device reads in the us / ns time scale
and control update is in 100s of ms. But becomes more apparent on slower
devices, when reads are in ms and policy update is in the 100s ms, that is
why the I2C case was quoted. But nothing wrong with I2C, or supporting
I2C on the thermal subsystem as we do today via the hwmon interface REGISTER_TZ,
the problem is on having to support the control in kernel and a read in
userspace that can create jitter to the control.

And as you properly stated, for this use case, the userspace does not care
about the most recent value of the device, so that is why the change
proposes to give cached values.

On the flip side though, there may be user space based policies that
require the most recent device value. But in this case, the expectation
is to disable the in kernel policy and switch the thermal zone to
mode == disabled. And that is also why this patch will go the path
to request the most recent device value when the /temp sysfs entry
is read and the mode is disabled.

I would suggest to have an addition sysfs entry that sets the
thermal zone into cached mode or not, let's say for the sake of the
discussion, we call it 'cached_values', with default to 'not cached'.
This way, we could support:

a) Default, current situation, where all reads in /temp are always backed up
with an actual device .get_temp(). Nothing changes here, keeps reading
under /temp, and so long system designer is satisfied with jittering,
no need to change anything.

b) When one has control in kernel, and frequent userspace reads on /temp
but system designer wants to protect the control in kernel to avoid jittering.
Just keep reading from /temp but set the new sysfs property 'cached_values' to 'cached'.
Then userspace will get updated values as frequent as the kernel control has
and the kernel control will not be disturbed by frequent device reads.

c) When one has control in userspace, and wants to have the most frequent
device read. Here, one can simply disable the in kernel control by
setting the 'mode' sysfs entry to 'disabled', and making sure the new sysfs property is set
to 'not cached'. Well in fact, the way I thought this originally in this patch
was to simply always read the device when /temp is read is 'mode' is 'disabled'.

I believe you proposed to have another sysfs entry  sysfs entry for reading cached temperature.
Something like 'temp_cached'. Thinking of it, as I mentioned before, it will work.
The only possible downside is to have two entries to read temperature.

Strong opinions on any of the above?
Rafael J. Wysocki June 30, 2023, 8:16 a.m. UTC | #8
On Wed, Jun 28, 2023 at 11:10 PM Eduardo Valentin <evalenti@kernel.org> wrote:
>
> On Fri, Jun 23, 2023 at 07:31:43PM +0200, Rafael J. Wysocki wrote:
> >

[cut]

> >
> > Regardless of where the problem is etc, if my understanding of the
> > patch is correct, it is proposing to change the behavior of a
> > well-known sysfs interface in a way that is likely to be incompatible
> > with at least some of its users.  This is an obvious no-go in kernel
> > development and I would expect you to be well aware of it.
>
> yeah I get it.
>
> >
> > IOW, if you want the cached value to be returned, a new interface (eg.
> > a new sysfs attribute) is needed.
>
> Yeah, I am fine with either a new sysfs entry to return the cached value,
> or a new sysfs entry to change the behavior of the existing /temp, as I
> mentioned before, either way works for me, if changing the existing one
> is too intrusive.
>
> >
> > And I think that the use case is not really i2c sensors in general,
>
> I2C was just the factual example I had, but you are right, the use case
> is not isolated to I2C sensor. Rather, to be clear I am not blaming I2C,
> the actual issue just happen to be easier to see when I2C devices, slower
> than typical MMIO devices, are being used as input for the control.
>
> > because at least in some cases they work just fine AFAICS, but a
> > platform with a control loop running in the kernel that depends on
> > sensor reads carried out at a specific, approximately constant, rate
> > that can be disturbed by user space checking the sensor temperature
> > via sysfs at "wrong" times.  If at the same time the user space
> > program in question doesn't care about the most recent values reported
> > by the sensor, it may very well use the values cached by the in-kernel
> > control loop.
>
> That is right, the balance between supporting user space reads and
> running the control timely is the actual original concern. The problem
> fades out a bit when you have device reads in the us / ns time scale
> and control update is in 100s of ms. But becomes more apparent on slower
> devices, when reads are in ms and policy update is in the 100s ms, that is
> why the I2C case was quoted. But nothing wrong with I2C, or supporting
> I2C on the thermal subsystem as we do today via the hwmon interface REGISTER_TZ,
> the problem is on having to support the control in kernel and a read in
> userspace that can create jitter to the control.
>
> And as you properly stated, for this use case, the userspace does not care
> about the most recent value of the device, so that is why the change
> proposes to give cached values.
>
> On the flip side though, there may be user space based policies that
> require the most recent device value. But in this case, the expectation
> is to disable the in kernel policy and switch the thermal zone to
> mode == disabled. And that is also why this patch will go the path
> to request the most recent device value when the /temp sysfs entry
> is read and the mode is disabled.
>
> I would suggest to have an addition sysfs entry that sets the
> thermal zone into cached mode or not, let's say for the sake of the
> discussion, we call it 'cached_values', with default to 'not cached'.
> This way, we could support:
>
> a) Default, current situation, where all reads in /temp are always backed up
> with an actual device .get_temp(). Nothing changes here, keeps reading
> under /temp, and so long system designer is satisfied with jittering,
> no need to change anything.
>
> b) When one has control in kernel, and frequent userspace reads on /temp
> but system designer wants to protect the control in kernel to avoid jittering.
> Just keep reading from /temp but set the new sysfs property 'cached_values' to 'cached'.
> Then userspace will get updated values as frequent as the kernel control has
> and the kernel control will not be disturbed by frequent device reads.
>
> c) When one has control in userspace, and wants to have the most frequent
> device read. Here, one can simply disable the in kernel control by
> setting the 'mode' sysfs entry to 'disabled', and making sure the new sysfs property is set
> to 'not cached'. Well in fact, the way I thought this originally in this patch
> was to simply always read the device when /temp is read is 'mode' is 'disabled'.
>
> I believe you proposed to have another sysfs entry  sysfs entry for reading cached temperature.
> Something like 'temp_cached'. Thinking of it, as I mentioned before, it will work.
> The only possible downside is to have two entries to read temperature.
>
> Strong opinions on any of the above?

So what about adding a new zone attribute that can be used to specify
the preferred caching time for the temperature?

That is, if the time interval between two consecutive updates of the
cached temperature value is less than the value of the new attribute,
the cached temperature value will be returned by "temp".  Otherwise,
it will cause the sensor to be read and the value obtained from it
will be returned to user space and cached.

If the value of the new attribute is 0, everything will work as it
does now (which will also need to be the default behavior).
Daniel Lezcano June 30, 2023, 10:11 a.m. UTC | #9
Hi Rafael,

On 30/06/2023 10:16, Rafael J. Wysocki wrote:
> On Wed, Jun 28, 2023 at 11:10 PM Eduardo Valentin <evalenti@kernel.org> wrote:

[ ... ]

> So what about adding a new zone attribute that can be used to specify
> the preferred caching time for the temperature?
> 
> That is, if the time interval between two consecutive updates of the
> cached temperature value is less than the value of the new attribute,
> the cached temperature value will be returned by "temp".  Otherwise,
> it will cause the sensor to be read and the value obtained from it
> will be returned to user space and cached.
> 
> If the value of the new attribute is 0, everything will work as it
> does now (which will also need to be the default behavior).

I'm still not convinced about the feature.

Eduardo provided some numbers but they seem based on the characteristics 
of the I2C, not to a real use case. Eduardo?

Before adding more complexity in the thermal framework and yet another 
sysfs entry, it would be interesting to have an experiment and show the 
impact of both configurations, not from a timing point of view but with 
a temperature mitigation accuracy.

Without a real use case, this feature does make really sense IMO.
Rafael J. Wysocki June 30, 2023, 10:46 a.m. UTC | #10
Hi Daniel,

On Fri, Jun 30, 2023 at 12:11 PM Daniel Lezcano
<daniel.lezcano@linaro.org> wrote:
>
>
> Hi Rafael,
>
> On 30/06/2023 10:16, Rafael J. Wysocki wrote:
> > On Wed, Jun 28, 2023 at 11:10 PM Eduardo Valentin <evalenti@kernel.org> wrote:
>
> [ ... ]
>
> > So what about adding a new zone attribute that can be used to specify
> > the preferred caching time for the temperature?
> >
> > That is, if the time interval between two consecutive updates of the
> > cached temperature value is less than the value of the new attribute,
> > the cached temperature value will be returned by "temp".  Otherwise,
> > it will cause the sensor to be read and the value obtained from it
> > will be returned to user space and cached.
> >
> > If the value of the new attribute is 0, everything will work as it
> > does now (which will also need to be the default behavior).
>
> I'm still not convinced about the feature.
>
> Eduardo provided some numbers but they seem based on the characteristics
> of the I2C, not to a real use case. Eduardo?
>
> Before adding more complexity in the thermal framework and yet another
> sysfs entry, it would be interesting to have an experiment and show the
> impact of both configurations, not from a timing point of view but with
> a temperature mitigation accuracy.
>
> Without a real use case, this feature does make really sense IMO.

I'm kind of unsure why you think that it is not a good idea in general
to have a way to limit the rate of accessing a temperature sensor, for
energy-efficiency reasons if nothing more.
Daniel Lezcano June 30, 2023, 12:09 p.m. UTC | #11
On 30/06/2023 12:46, Rafael J. Wysocki wrote:
> Hi Daniel,
> 
> On Fri, Jun 30, 2023 at 12:11 PM Daniel Lezcano
> <daniel.lezcano@linaro.org> wrote:
>>
>>
>> Hi Rafael,
>>
>> On 30/06/2023 10:16, Rafael J. Wysocki wrote:
>>> On Wed, Jun 28, 2023 at 11:10 PM Eduardo Valentin <evalenti@kernel.org> wrote:
>>
>> [ ... ]
>>
>>> So what about adding a new zone attribute that can be used to specify
>>> the preferred caching time for the temperature?
>>>
>>> That is, if the time interval between two consecutive updates of the
>>> cached temperature value is less than the value of the new attribute,
>>> the cached temperature value will be returned by "temp".  Otherwise,
>>> it will cause the sensor to be read and the value obtained from it
>>> will be returned to user space and cached.
>>>
>>> If the value of the new attribute is 0, everything will work as it
>>> does now (which will also need to be the default behavior).
>>
>> I'm still not convinced about the feature.
>>
>> Eduardo provided some numbers but they seem based on the characteristics
>> of the I2C, not to a real use case. Eduardo?
>>
>> Before adding more complexity in the thermal framework and yet another
>> sysfs entry, it would be interesting to have an experiment and show the
>> impact of both configurations, not from a timing point of view but with
>> a temperature mitigation accuracy.
>>
>> Without a real use case, this feature does make really sense IMO.
> 
> I'm kind of unsure why you think that it is not a good idea in general
> to have a way to limit the rate of accessing a temperature sensor, for
> energy-efficiency reasons if nothing more.

I don't think it is not a good idea. I've no judgement with the proposed 
change.

But I'm not convinced it is really useful, that is why having a real use 
case and some numbers showing that feature solves the issue would be nice.

It is illogical we want a fast and accurate response on a specific 
hardware and then design it with slow sensors and contention prone bus.

In Eduardo's example, we have 100ms monitoring rate on a I2C. This rate 
is usually to monitor CPUs with very fast transitions. With a remote 
site, the monitoring rate would be much slower, so if there is a 
contention in the bus because a dumb process is reading constantly the 
temperature, then it should be negligible.

All that are hypothesis, that is why having a real use case would help 
to figure out the temperature limit drift at mitigation time.

Assuming it is really needed, I'm not sure that should be exported via 
sysfs. It is a driver issue and it may register the thermal zone with a 
parameter telling the userspace rate limit.

On the other side, hwmon and thermal are connected. hwmon drivers 
register a thermal zone and thermal drivers add themselves in the hwmon 
sysfs directory. The temperature cache is handled in the driver level in 
the hwmon subsystems and we want to handle the temperature cache at the 
thermal sysfs level. How will we cope with this inconsistency?

As a side note, slow drivers are usually going under drivers/hwmon.
Eduardo Valentin July 1, 2023, 1:37 a.m. UTC | #12
On Fri, Jun 30, 2023 at 10:16:38AM +0200, Rafael J. Wysocki wrote:
> 
> 
> 
> On Wed, Jun 28, 2023 at 11:10 PM Eduardo Valentin <evalenti@kernel.org> wrote:
> >
> > On Fri, Jun 23, 2023 at 07:31:43PM +0200, Rafael J. Wysocki wrote:
> > >
> 
> [cut]
> 
> > >
> > > Regardless of where the problem is etc, if my understanding of the
> > > patch is correct, it is proposing to change the behavior of a
> > > well-known sysfs interface in a way that is likely to be incompatible
> > > with at least some of its users.  This is an obvious no-go in kernel
> > > development and I would expect you to be well aware of it.
> >
> > yeah I get it.
> >
> > >
> > > IOW, if you want the cached value to be returned, a new interface (eg.
> > > a new sysfs attribute) is needed.
> >
> > Yeah, I am fine with either a new sysfs entry to return the cached value,
> > or a new sysfs entry to change the behavior of the existing /temp, as I
> > mentioned before, either way works for me, if changing the existing one
> > is too intrusive.
> >
> > >
> > > And I think that the use case is not really i2c sensors in general,
> >
> > I2C was just the factual example I had, but you are right, the use case
> > is not isolated to I2C sensor. Rather, to be clear I am not blaming I2C,
> > the actual issue just happen to be easier to see when I2C devices, slower
> > than typical MMIO devices, are being used as input for the control.
> >
> > > because at least in some cases they work just fine AFAICS, but a
> > > platform with a control loop running in the kernel that depends on
> > > sensor reads carried out at a specific, approximately constant, rate
> > > that can be disturbed by user space checking the sensor temperature
> > > via sysfs at "wrong" times.  If at the same time the user space
> > > program in question doesn't care about the most recent values reported
> > > by the sensor, it may very well use the values cached by the in-kernel
> > > control loop.
> >
> > That is right, the balance between supporting user space reads and
> > running the control timely is the actual original concern. The problem
> > fades out a bit when you have device reads in the us / ns time scale
> > and control update is in 100s of ms. But becomes more apparent on slower
> > devices, when reads are in ms and policy update is in the 100s ms, that is
> > why the I2C case was quoted. But nothing wrong with I2C, or supporting
> > I2C on the thermal subsystem as we do today via the hwmon interface REGISTER_TZ,
> > the problem is on having to support the control in kernel and a read in
> > userspace that can create jitter to the control.
> >
> > And as you properly stated, for this use case, the userspace does not care
> > about the most recent value of the device, so that is why the change
> > proposes to give cached values.
> >
> > On the flip side though, there may be user space based policies that
> > require the most recent device value. But in this case, the expectation
> > is to disable the in kernel policy and switch the thermal zone to
> > mode == disabled. And that is also why this patch will go the path
> > to request the most recent device value when the /temp sysfs entry
> > is read and the mode is disabled.
> >
> > I would suggest to have an addition sysfs entry that sets the
> > thermal zone into cached mode or not, let's say for the sake of the
> > discussion, we call it 'cached_values', with default to 'not cached'.
> > This way, we could support:
> >
> > a) Default, current situation, where all reads in /temp are always backed up
> > with an actual device .get_temp(). Nothing changes here, keeps reading
> > under /temp, and so long system designer is satisfied with jittering,
> > no need to change anything.
> >
> > b) When one has control in kernel, and frequent userspace reads on /temp
> > but system designer wants to protect the control in kernel to avoid jittering.
> > Just keep reading from /temp but set the new sysfs property 'cached_values' to 'cached'.
> > Then userspace will get updated values as frequent as the kernel control has
> > and the kernel control will not be disturbed by frequent device reads.
> >
> > c) When one has control in userspace, and wants to have the most frequent
> > device read. Here, one can simply disable the in kernel control by
> > setting the 'mode' sysfs entry to 'disabled', and making sure the new sysfs property is set
> > to 'not cached'. Well in fact, the way I thought this originally in this patch
> > was to simply always read the device when /temp is read is 'mode' is 'disabled'.
> >
> > I believe you proposed to have another sysfs entry  sysfs entry for reading cached temperature.
> > Something like 'temp_cached'. Thinking of it, as I mentioned before, it will work.
> > The only possible downside is to have two entries to read temperature.
> >
> > Strong opinions on any of the above?
> 
> So what about adding a new zone attribute that can be used to specify
> the preferred caching time for the temperature?
> 
> That is, if the time interval between two consecutive updates of the
> cached temperature value is less than the value of the new attribute,
> the cached temperature value will be returned by "temp".  Otherwise,
> it will cause the sensor to be read and the value obtained from it
> will be returned to user space and cached.
> 
> If the value of the new attribute is 0, everything will work as it
> does now (which will also need to be the default behavior).


Yeah, that makes sense to me. I can cook something up in the next version.
Eduardo Valentin July 1, 2023, 1:38 a.m. UTC | #13
Hey Daniel,

On Fri, Jun 30, 2023 at 12:11:25PM +0200, Daniel Lezcano wrote:
> 
> 
> 
> Hi Rafael,
> 
> On 30/06/2023 10:16, Rafael J. Wysocki wrote:
> > On Wed, Jun 28, 2023 at 11:10 PM Eduardo Valentin <evalenti@kernel.org> wrote:
> 
> [ ... ]
> 
> > So what about adding a new zone attribute that can be used to specify
> > the preferred caching time for the temperature?
> > 
> > That is, if the time interval between two consecutive updates of the
> > cached temperature value is less than the value of the new attribute,
> > the cached temperature value will be returned by "temp".  Otherwise,
> > it will cause the sensor to be read and the value obtained from it
> > will be returned to user space and cached.
> > 
> > If the value of the new attribute is 0, everything will work as it
> > does now (which will also need to be the default behavior).
> 
> I'm still not convinced about the feature.
> 
> Eduardo provided some numbers but they seem based on the characteristics
> of the I2C, not to a real use case. Eduardo?

Why I2C is not a real use case?
Eduardo Valentin July 1, 2023, 1:49 a.m. UTC | #14
Hello,

On Fri, Jun 30, 2023 at 02:09:44PM +0200, Daniel Lezcano wrote:
> 
> 
> 
> On 30/06/2023 12:46, Rafael J. Wysocki wrote:
> > Hi Daniel,
> > 
> > On Fri, Jun 30, 2023 at 12:11 PM Daniel Lezcano
> > <daniel.lezcano@linaro.org> wrote:
> > > 
> > > 
> > > Hi Rafael,
> > > 
> > > On 30/06/2023 10:16, Rafael J. Wysocki wrote:
> > > > On Wed, Jun 28, 2023 at 11:10 PM Eduardo Valentin <evalenti@kernel.org> wrote:
> > > 
> > > [ ... ]
> > > 
> > > > So what about adding a new zone attribute that can be used to specify
> > > > the preferred caching time for the temperature?
> > > > 
> > > > That is, if the time interval between two consecutive updates of the
> > > > cached temperature value is less than the value of the new attribute,
> > > > the cached temperature value will be returned by "temp".  Otherwise,
> > > > it will cause the sensor to be read and the value obtained from it
> > > > will be returned to user space and cached.
> > > > 
> > > > If the value of the new attribute is 0, everything will work as it
> > > > does now (which will also need to be the default behavior).
> > > 
> > > I'm still not convinced about the feature.
> > > 
> > > Eduardo provided some numbers but they seem based on the characteristics
> > > of the I2C, not to a real use case. Eduardo?
> > > 
> > > Before adding more complexity in the thermal framework and yet another
> > > sysfs entry, it would be interesting to have an experiment and show the
> > > impact of both configurations, not from a timing point of view but with
> > > a temperature mitigation accuracy.
> > > 
> > > Without a real use case, this feature does make really sense IMO.
> > 
> > I'm kind of unsure why you think that it is not a good idea in general
> > to have a way to limit the rate of accessing a temperature sensor, for
> > energy-efficiency reasons if nothing more.
> 
> I don't think it is not a good idea. I've no judgement with the proposed
> change.
> 
> But I'm not convinced it is really useful, that is why having a real use
> case and some numbers showing that feature solves the issue would be nice.
> 
> It is illogical we want a fast and accurate response on a specific
> hardware and then design it with slow sensors and contention prone bus.

Totally agree, but at same time, this is real world :-)

> 
> In Eduardo's example, we have 100ms monitoring rate on a I2C. This rate
> is usually to monitor CPUs with very fast transitions. With a remote
> site, the monitoring rate would be much slower, so if there is a
> contention in the bus because a dumb process is reading constantly the
> temperature, then it should be negligible.
> 
> All that are hypothesis, that is why having a real use case would help
> to figure out the temperature limit drift at mitigation time.

Yeah, I guess the problem here is that you are assuming I2C is not a real
use case, not sure why. But it is and very common design in fact.

> 
> Assuming it is really needed, I'm not sure that should be exported via
> sysfs. It is a driver issue and it may register the thermal zone with a
> parameter telling the userspace rate limit.
> 
> On the other side, hwmon and thermal are connected. hwmon drivers
> register a thermal zone and thermal drivers add themselves in the hwmon
> sysfs directory. The temperature cache is handled in the driver level in
> the hwmon subsystems and we want to handle the temperature cache at the
> thermal sysfs level. How will we cope with this inconsistency?

Yeah, I do not see this, again, as where to handle cache type of design problem only.
This is really a protective / defensive code on the thermal core to avoid
userspace interfering on a kernel based control.


I agree that drivers may be free to go and defend themselves against
too frequent userspace requests, like they do, as you already shared
a link in another email. But saying that it is up to the driver to do this
is basically saying that the thermal subsystem do not care about their
own threads being delayed by a too frequent reads on a sysfs entry
created by the thermal subsystem, just because it is drivers responsability
to cache. To that is a missing defensive code. 

> 
> As a side note, slow drivers are usually going under drivers/hwmon.

Have you seen this code?
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/hwmon/lm75.c#n517
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/hwmon/hwmon.c#n850


I also do not understand when you say slow drivers are usually going under
drivers/hwmon, does it really matter? One can design a thermal zone
that is connected to a hwmon device as input. Why would that be illogical?


> 
> --
> <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs
> 
> Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
> <http://twitter.com/#!/linaroorg> Twitter |
> <http://www.linaro.org/linaro-blog/> Blog
>
Daniel Lezcano July 1, 2023, 7:28 a.m. UTC | #15
Eduardo,

On 01/07/2023 03:49, Eduardo Valentin wrote:

[ ... ]

>> All that are hypothesis, that is why having a real use case would help
>> to figure out the temperature limit drift at mitigation time.
> 
> Yeah, I guess the problem here is that you are assuming I2C is not a real
> use case, not sure why. But it is and very common design in fact.

If it is so common you should be able to reproduce the issue and give 
numbers. At this point, what I read is "that may happen because I2C is 
slow and we may monitor it at an insane rate, so let's cache the value".

>> Assuming it is really needed, I'm not sure that should be exported via
>> sysfs. It is a driver issue and it may register the thermal zone with a
>> parameter telling the userspace rate limit.
>>
>> On the other side, hwmon and thermal are connected. hwmon drivers
>> register a thermal zone and thermal drivers add themselves in the hwmon
>> sysfs directory. The temperature cache is handled in the driver level in
>> the hwmon subsystems and we want to handle the temperature cache at the
>> thermal sysfs level. How will we cope with this inconsistency?
> 
> Yeah, I do not see this, again, as where to handle cache type of design problem only.
> This is really a protective / defensive code on the thermal core to avoid
> userspace interfering on a kernel based control.
> 
> 
> I agree that drivers may be free to go and defend themselves against
> too frequent userspace requests, like they do, as you already shared
> a link in another email. But saying that it is up to the driver to do this
> is basically saying that the thermal subsystem do not care about their
> own threads being delayed by a too frequent reads on a sysfs entry
> created by the thermal subsystem, just because it is drivers responsability
> to cache. To that is a missing defensive code.

No, the core code has not to be defensive against bad hardware design.

If multiple processes are reading in an infinite loop the temperature, 
they will constantly take the lock, and as the monitoring thread is a 
CFS task, this one will be considered as the readers and be delayed, 
with probably a mitigation temperature drift. Here we have a missing 
defensive / optimized code against a DoS but it is unrelated to the 
hardware and the fix is not caching the value.

>> As a side note, slow drivers are usually going under drivers/hwmon.
> 
> Have you seen this code?
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/hwmon/lm75.c#n517
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/hwmon/hwmon.c#n850

Yes, and ?

That is what I said, the hwmon and the thermal zone are connected.

> I also do not understand when you say slow drivers are usually going under
> drivers/hwmon, does it really matter? One can design a thermal zone
> that is connected to a hwmon device as input. Why would that be illogical?

I'm not saying it is illogical. I'm pointing the slow drivers are under 
hwmon subsystems and usually don't aim in-kernel mitigation. The 
get_temp ops is going through hwmon and the drivers may cache the 
values. So *if* there is an in-kernel mitigation, the value will be 
already cached usually.

I do believe I raised some valid concerns regarding the approach. Could 
please take them into account instead of eluding them?

1. A real use case with temperature limit drift (easy to reproduce 
because it is very common)

2. How about the consistency between hwmon and thermal? (one driver but 
two ways to access the temperature - one may cache and the other not)

Another question regarding the I2C example, if another subsystem is 
using the I2C, won't it take the bus lock and create the contention 
also? I mean it is possible to create a mitigation drift by reading 
constantly another sensor value (eg. voltage or whatever) ?
Daniel Lezcano July 1, 2023, 2:20 p.m. UTC | #16
Hi Eduardo,

On 01/07/2023 03:38, Eduardo Valentin wrote:
> Hey Daniel,
> 
> On Fri, Jun 30, 2023 at 12:11:25PM +0200, Daniel Lezcano wrote:
>>
>>
>>
>> Hi Rafael,
>>
>> On 30/06/2023 10:16, Rafael J. Wysocki wrote:
>>> On Wed, Jun 28, 2023 at 11:10 PM Eduardo Valentin <evalenti@kernel.org> wrote:
>>
>> [ ... ]
>>
>>> So what about adding a new zone attribute that can be used to specify
>>> the preferred caching time for the temperature?
>>>
>>> That is, if the time interval between two consecutive updates of the
>>> cached temperature value is less than the value of the new attribute,
>>> the cached temperature value will be returned by "temp".  Otherwise,
>>> it will cause the sensor to be read and the value obtained from it
>>> will be returned to user space and cached.
>>>
>>> If the value of the new attribute is 0, everything will work as it
>>> does now (which will also need to be the default behavior).
>>
>> I'm still not convinced about the feature.
>>
>> Eduardo provided some numbers but they seem based on the characteristics
>> of the I2C, not to a real use case. Eduardo?
> 
> Why I2C is not a real use case?

What I meant is "I2C is slow, ok. But what is the setup where the 
problem arises?"
Eduardo Valentin July 5, 2023, 10:49 p.m. UTC | #17
On Sat, Jul 01, 2023 at 09:28:31AM +0200, Daniel Lezcano wrote:
> 
> 
> 
> Eduardo,
> 
> On 01/07/2023 03:49, Eduardo Valentin wrote:
> 
> [ ... ]
> 
> > > All that are hypothesis, that is why having a real use case would help
> > > to figure out the temperature limit drift at mitigation time.
> > 
> > Yeah, I guess the problem here is that you are assuming I2C is not a real
> > use case, not sure why. But it is and very common design in fact.
> 
> If it is so common you should be able to reproduce the issue and give
> numbers. At this point, what I read is "that may happen because I2C is
> slow and we may monitor it at an insane rate, so let's cache the value".
> 
> > > Assuming it is really needed, I'm not sure that should be exported via
> > > sysfs. It is a driver issue and it may register the thermal zone with a
> > > parameter telling the userspace rate limit.
> > > 
> > > On the other side, hwmon and thermal are connected. hwmon drivers
> > > register a thermal zone and thermal drivers add themselves in the hwmon
> > > sysfs directory. The temperature cache is handled in the driver level in
> > > the hwmon subsystems and we want to handle the temperature cache at the
> > > thermal sysfs level. How will we cope with this inconsistency?
> > 
> > Yeah, I do not see this, again, as where to handle cache type of design problem only.
> > This is really a protective / defensive code on the thermal core to avoid
> > userspace interfering on a kernel based control.
> > 
> > 
> > I agree that drivers may be free to go and defend themselves against
> > too frequent userspace requests, like they do, as you already shared
> > a link in another email. But saying that it is up to the driver to do this
> > is basically saying that the thermal subsystem do not care about their
> > own threads being delayed by a too frequent reads on a sysfs entry
> > created by the thermal subsystem, just because it is drivers responsability
> > to cache. To that is a missing defensive code.
> 
> No, the core code has not to be defensive against bad hardware design.

I do not understand why you are calling this a bad hardware design.

> 
> If multiple processes are reading in an infinite loop the temperature,
> they will constantly take the lock, and as the monitoring thread is a
> CFS task, this one will be considered as the readers and be delayed,
> with probably a mitigation temperature drift. Here we have a missing
> defensive / optimized code against a DoS but it is unrelated to the
> hardware and the fix is not caching the value.

I am not even going into the CFS discussion, yet. But if we go that direction,
here we are having a case of a userspace process contending into
a in kernel process, regardless of the userspace process priority.

But again that is not the main point of discussion for this change.

> 
> > > As a side note, slow drivers are usually going under drivers/hwmon.
> > 
> > Have you seen this code?
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/hwmon/lm75.c#n517
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/hwmon/hwmon.c#n850
> 
> Yes, and ?


I mean , I am sure can git grep for all occurences if that is what are looking for.
> 
> That is what I said, the hwmon and the thermal zone are connected.
> 
> > I also do not understand when you say slow drivers are usually going under
> > drivers/hwmon, does it really matter? One can design a thermal zone
> > that is connected to a hwmon device as input. Why would that be illogical?
> 
> I'm not saying it is illogical. I'm pointing the slow drivers are under
> hwmon subsystems and usually don't aim in-kernel mitigation. The
> get_temp ops is going through hwmon and the drivers may cache the
> values. So *if* there is an in-kernel mitigation, the value will be
> already cached usually.

Oh I see. yes, I totally give you that the most common case is to have
the in kernel policy written with drivers under thermal, but neglecting
the existence of drivers under /hwmon that connects into the thermal subsystem
is not fair, is it? They are there, they connect to thermal subsystem
and one can certainly have an in kernel control with hwmon drivers as input,
I am not understanding why protecting against those cases make the change
invalid or overcomplicates the subsystem design.

> 
> I do believe I raised some valid concerns regarding the approach. Could
> please take them into account instead of eluding them
> 
> 1. A real use case with temperature limit drift (easy to reproduce
> because it is very common)
> 
> 2. How about the consistency between hwmon and thermal? (one driver but
> two ways to access the temperature - one may cache and the other not)

I am not eluding anything. But this conversation seams to not move forward
because of the assumption that building in kernel control based on
hwmon drivers is either wrong or uncommon.  I fail to see that as the
main argument to be discussed mainly because it is a problem that exists.
Driver count is also not a meaningful metric to conclude if the problem
is common or not. How many motherboard can you think that are out there
that may have an lm75 to represent an actual temperature point in the
PCB? Or an lm75 that may come from a PCI board? That is what I meant
by common design.

Seams to me that you are trying to make a point that this problem
is not worth having a fix for, even if you already recognized the
basic point of it, because (a) this is not happening MMIO based
drivers which is the (today) common case for drivers under /thermal
and (b) hwmon drivers are supposed to feed in only control in userspace.

for both argument that seams to restrict the thermal subsystem
to only MMIO base devices drivers to feed into CPU temperature control,
which is a limited application of it in real world scenario, for
any real life scenario really, mobile. BMC, or any embedded case. 
When in fact, the abstraction and desing on the thermal subsystem today
is pure control of temperature and can be used in any application
that does it. Limiting the application of the thermal subsystem to
only MMIO based devices is, well, limiting :-).

> 
> Another question regarding the I2C example, if another subsystem is
> using the I2C, won't it take the bus lock and create the contention
> also? I mean it is possible to create a mitigation drift by reading
> constantly another sensor value (eg. voltage or whatever) ?

Yes, if a flood of events in the bus happen, then the drift will also happen,
specially if reads are aligned with the in kernel monitoring thread
update / call to .get_temp().

> 
> 
> --
> <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs
> 
> Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
> <http://twitter.com/#!/linaroorg> Twitter |
> <http://www.linaro.org/linaro-blog/> Blog
>
Rafael J. Wysocki July 6, 2023, 1:02 p.m. UTC | #18
On Sat, Jul 1, 2023 at 3:37 AM Eduardo Valentin <evalenti@kernel.org> wrote:
>
> On Fri, Jun 30, 2023 at 10:16:38AM +0200, Rafael J. Wysocki wrote:
> >
> >
> >
> > On Wed, Jun 28, 2023 at 11:10 PM Eduardo Valentin <evalenti@kernel.org> wrote:
> > >
> > > On Fri, Jun 23, 2023 at 07:31:43PM +0200, Rafael J. Wysocki wrote:
> > > >
> >
> > [cut]
> >
> > > >
> > > > Regardless of where the problem is etc, if my understanding of the
> > > > patch is correct, it is proposing to change the behavior of a
> > > > well-known sysfs interface in a way that is likely to be incompatible
> > > > with at least some of its users.  This is an obvious no-go in kernel
> > > > development and I would expect you to be well aware of it.
> > >
> > > yeah I get it.
> > >
> > > >
> > > > IOW, if you want the cached value to be returned, a new interface (eg.
> > > > a new sysfs attribute) is needed.
> > >
> > > Yeah, I am fine with either a new sysfs entry to return the cached value,
> > > or a new sysfs entry to change the behavior of the existing /temp, as I
> > > mentioned before, either way works for me, if changing the existing one
> > > is too intrusive.
> > >
> > > >
> > > > And I think that the use case is not really i2c sensors in general,
> > >
> > > I2C was just the factual example I had, but you are right, the use case
> > > is not isolated to I2C sensor. Rather, to be clear I am not blaming I2C,
> > > the actual issue just happen to be easier to see when I2C devices, slower
> > > than typical MMIO devices, are being used as input for the control.
> > >
> > > > because at least in some cases they work just fine AFAICS, but a
> > > > platform with a control loop running in the kernel that depends on
> > > > sensor reads carried out at a specific, approximately constant, rate
> > > > that can be disturbed by user space checking the sensor temperature
> > > > via sysfs at "wrong" times.  If at the same time the user space
> > > > program in question doesn't care about the most recent values reported
> > > > by the sensor, it may very well use the values cached by the in-kernel
> > > > control loop.
> > >
> > > That is right, the balance between supporting user space reads and
> > > running the control timely is the actual original concern. The problem
> > > fades out a bit when you have device reads in the us / ns time scale
> > > and control update is in 100s of ms. But becomes more apparent on slower
> > > devices, when reads are in ms and policy update is in the 100s ms, that is
> > > why the I2C case was quoted. But nothing wrong with I2C, or supporting
> > > I2C on the thermal subsystem as we do today via the hwmon interface REGISTER_TZ,
> > > the problem is on having to support the control in kernel and a read in
> > > userspace that can create jitter to the control.
> > >
> > > And as you properly stated, for this use case, the userspace does not care
> > > about the most recent value of the device, so that is why the change
> > > proposes to give cached values.
> > >
> > > On the flip side though, there may be user space based policies that
> > > require the most recent device value. But in this case, the expectation
> > > is to disable the in kernel policy and switch the thermal zone to
> > > mode == disabled. And that is also why this patch will go the path
> > > to request the most recent device value when the /temp sysfs entry
> > > is read and the mode is disabled.
> > >
> > > I would suggest to have an addition sysfs entry that sets the
> > > thermal zone into cached mode or not, let's say for the sake of the
> > > discussion, we call it 'cached_values', with default to 'not cached'.
> > > This way, we could support:
> > >
> > > a) Default, current situation, where all reads in /temp are always backed up
> > > with an actual device .get_temp(). Nothing changes here, keeps reading
> > > under /temp, and so long system designer is satisfied with jittering,
> > > no need to change anything.
> > >
> > > b) When one has control in kernel, and frequent userspace reads on /temp
> > > but system designer wants to protect the control in kernel to avoid jittering.
> > > Just keep reading from /temp but set the new sysfs property 'cached_values' to 'cached'.
> > > Then userspace will get updated values as frequent as the kernel control has
> > > and the kernel control will not be disturbed by frequent device reads.
> > >
> > > c) When one has control in userspace, and wants to have the most frequent
> > > device read. Here, one can simply disable the in kernel control by
> > > setting the 'mode' sysfs entry to 'disabled', and making sure the new sysfs property is set
> > > to 'not cached'. Well in fact, the way I thought this originally in this patch
> > > was to simply always read the device when /temp is read is 'mode' is 'disabled'.
> > >
> > > I believe you proposed to have another sysfs entry  sysfs entry for reading cached temperature.
> > > Something like 'temp_cached'. Thinking of it, as I mentioned before, it will work.
> > > The only possible downside is to have two entries to read temperature.
> > >
> > > Strong opinions on any of the above?
> >
> > So what about adding a new zone attribute that can be used to specify
> > the preferred caching time for the temperature?
> >
> > That is, if the time interval between two consecutive updates of the
> > cached temperature value is less than the value of the new attribute,
> > the cached temperature value will be returned by "temp".  Otherwise,
> > it will cause the sensor to be read and the value obtained from it
> > will be returned to user space and cached.
> >
> > If the value of the new attribute is 0, everything will work as it
> > does now (which will also need to be the default behavior).
>
> Yeah, that makes sense to me. I can cook something up in the next version.

Yes, please.

Also I think that the $subject patch was inspired by observations made
on a specific system in practice.  It would be good to say what the
system is and include some numbers illustrating how severe the problem
is (that is, what is expected and what is observed and why the
discrepancy is attributed to the effect of direct sensor accesses from
user space via sysfs).
Eduardo Valentin July 7, 2023, 5:14 p.m. UTC | #19
On Thu, Jul 06, 2023 at 03:22:55PM +0200, Rafael J. Wysocki wrote:
> 
> 
> 
> On Thu, Jul 6, 2023 at 12:50 AM Eduardo Valentin <evalenti@kernel.org> wrote:
> >
> > On Sat, Jul 01, 2023 at 09:28:31AM +0200, Daniel Lezcano wrote:
> > >
> > >
> > >
> > > Eduardo,
> > >
> > > On 01/07/2023 03:49, Eduardo Valentin wrote:
> > >
> > > [ ... ]
> > >
> > > > > All that are hypothesis, that is why having a real use case would help
> > > > > to figure out the temperature limit drift at mitigation time.
> > > >
> > > > Yeah, I guess the problem here is that you are assuming I2C is not a real
> > > > use case, not sure why. But it is and very common design in fact.
> > >
> > > If it is so common you should be able to reproduce the issue and give
> > > numbers. At this point, what I read is "that may happen because I2C is
> > > slow and we may monitor it at an insane rate, so let's cache the value".
> > >
> > > > > Assuming it is really needed, I'm not sure that should be exported via
> > > > > sysfs. It is a driver issue and it may register the thermal zone with a
> > > > > parameter telling the userspace rate limit.
> > > > >
> > > > > On the other side, hwmon and thermal are connected. hwmon drivers
> > > > > register a thermal zone and thermal drivers add themselves in the hwmon
> > > > > sysfs directory. The temperature cache is handled in the driver level in
> > > > > the hwmon subsystems and we want to handle the temperature cache at the
> > > > > thermal sysfs level. How will we cope with this inconsistency?
> > > >
> > > > Yeah, I do not see this, again, as where to handle cache type of design problem only.
> > > > This is really a protective / defensive code on the thermal core to avoid
> > > > userspace interfering on a kernel based control.
> > > >
> > > >
> > > > I agree that drivers may be free to go and defend themselves against
> > > > too frequent userspace requests, like they do, as you already shared
> > > > a link in another email. But saying that it is up to the driver to do this
> > > > is basically saying that the thermal subsystem do not care about their
> > > > own threads being delayed by a too frequent reads on a sysfs entry
> > > > created by the thermal subsystem, just because it is drivers responsability
> > > > to cache. To that is a missing defensive code.
> > >
> > > No, the core code has not to be defensive against bad hardware design.
> >
> > I do not understand why you are calling this a bad hardware design.
> >
> > >
> > > If multiple processes are reading in an infinite loop the temperature,
> > > they will constantly take the lock, and as the monitoring thread is a
> > > CFS task, this one will be considered as the readers and be delayed,
> > > with probably a mitigation temperature drift. Here we have a missing
> > > defensive / optimized code against a DoS but it is unrelated to the
> > > hardware and the fix is not caching the value.
> >
> > I am not even going into the CFS discussion, yet. But if we go that direction,
> > here we are having a case of a userspace process contending into
> > a in kernel process, regardless of the userspace process priority.
> >
> > But again that is not the main point of discussion for this change.
> >
> > >
> > > > > As a side note, slow drivers are usually going under drivers/hwmon.
> > > >
> > > > Have you seen this code?
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/hwmon/lm75.c#n517
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/hwmon/hwmon.c#n850
> > >
> > > Yes, and ?
> >
> >
> > I mean , I am sure can git grep for all occurences if that is what are looking for.
> > >
> > > That is what I said, the hwmon and the thermal zone are connected.
> > >
> > > > I also do not understand when you say slow drivers are usually going under
> > > > drivers/hwmon, does it really matter? One can design a thermal zone
> > > > that is connected to a hwmon device as input. Why would that be illogical?
> > >
> > > I'm not saying it is illogical. I'm pointing the slow drivers are under
> > > hwmon subsystems and usually don't aim in-kernel mitigation. The
> > > get_temp ops is going through hwmon and the drivers may cache the
> > > values. So *if* there is an in-kernel mitigation, the value will be
> > > already cached usually.
> >
> > Oh I see. yes, I totally give you that the most common case is to have
> > the in kernel policy written with drivers under thermal, but neglecting
> > the existence of drivers under /hwmon that connects into the thermal subsystem
> > is not fair, is it? They are there, they connect to thermal subsystem
> > and one can certainly have an in kernel control with hwmon drivers as input,
> > I am not understanding why protecting against those cases make the change
> > invalid or overcomplicates the subsystem design.
> >
> > >
> > > I do believe I raised some valid concerns regarding the approach. Could
> > > please take them into account instead of eluding them
> > >
> > > 1. A real use case with temperature limit drift (easy to reproduce
> > > because it is very common)
> > >
> > > 2. How about the consistency between hwmon and thermal? (one driver but
> > > two ways to access the temperature - one may cache and the other not)
> >
> > I am not eluding anything. But this conversation seams to not move forward
> > because of the assumption that building in kernel control based on
> > hwmon drivers is either wrong or uncommon.  I fail to see that as the
> > main argument to be discussed mainly because it is a problem that exists.
> > Driver count is also not a meaningful metric to conclude if the problem
> > is common or not. How many motherboard can you think that are out there
> > that may have an lm75 to represent an actual temperature point in the
> > PCB? Or an lm75 that may come from a PCI board? That is what I meant
> > by common design.
> >
> > Seams to me that you are trying to make a point that this problem
> > is not worth having a fix for, even if you already recognized the
> > basic point of it, because (a) this is not happening MMIO based
> > drivers which is the (today) common case for drivers under /thermal
> > and (b) hwmon drivers are supposed to feed in only control in userspace.
> >
> > for both argument that seams to restrict the thermal subsystem
> > to only MMIO base devices drivers to feed into CPU temperature control,
> > which is a limited application of it in real world scenario, for
> > any real life scenario really, mobile. BMC, or any embedded case.
> > When in fact, the abstraction and desing on the thermal subsystem today
> > is pure control of temperature and can be used in any application
> > that does it. Limiting the application of the thermal subsystem to
> > only MMIO based devices is, well, limiting :-).
> >
> > >
> > > Another question regarding the I2C example, if another subsystem is
> > > using the I2C, won't it take the bus lock and create the contention
> > > also? I mean it is possible to create a mitigation drift by reading
> > > constantly another sensor value (eg. voltage or whatever) ?
> >
> > Yes, if a flood of events in the bus happen, then the drift will also happen,
> > specially if reads are aligned with the in kernel monitoring thread
> > update / call to .get_temp().
> 
> This is all in general terms, though.
> 
> I think that Daniel has been asking for a specific example, because if
> a new sysfs i/f is added with a certain issue in mind and then it
> turns out to be never used by anyone, because the issue is theoretical
> or not severe enough for anyone to care, it is a pure maintenance
> burden.  It would be good to have at least some assurance that this
> will not be the case.
> 
> Daniel, even if you think that the hardware in question is
> misdesigned, Linux has a long record of supporting misdesigned
> hardware, so this wouldn't be the first time ever.  However, I do
> agree that it would be good to have a specific example supporting this
> case.


I will add some examples in the next version with the numbers I quoted in this thread.

Also, generally speaking, this series uses the thermal subsystem mostly
for board monitoring use cases, which includes applications like BMC,
and that has generated some churn because the current set of existing
drivers mainly targets in-kernel control for the running system CPU temperature monitoring.
However, even since time of thermal device tree binding conception, this limitation
was never part of the original thoughts on enhancing the thermal subsystem from
ACPI only to a more general solution.

For this reason, next iteration I will start with enhancing documentation.
I took a brief look in the existing master tree and plenty of
the existing documentation references to board monitoring have been lost
in reshuffle of thermal documentation files, including thermal device tree
bindings, which is a shame. I will recover those examples with refreshed examples.
Thermal doc has been not the prettiest anyways so it is always good to spend
some time enhancing it.
diff mbox series

Patch

diff --git a/drivers/thermal/thermal_sysfs.c b/drivers/thermal/thermal_sysfs.c
index b6daea2398da..a240c58d9e08 100644
--- a/drivers/thermal/thermal_sysfs.c
+++ b/drivers/thermal/thermal_sysfs.c
@@ -35,12 +35,23 @@  static ssize_t
 temp_show(struct device *dev, struct device_attribute *attr, char *buf)
 {
 	struct thermal_zone_device *tz = to_thermal_zone(dev);
-	int temperature, ret;
-
-	ret = thermal_zone_get_temp(tz, &temperature);
+	int temperature;
 
-	if (ret)
-		return ret;
+	/*
+	 * don't force new update from external reads
+	 * This way we avoid messing up with time constraints.
+	 */
+	if (tz->mode == THERMAL_DEVICE_DISABLED) {
+		int r;
+
+		r = thermal_zone_get_temp(tz, &temperature); /* holds tz->lock*/
+		if (r)
+			return r;
+	} else {
+		mutex_lock(&tz->lock);
+		temperature = tz->temperature;
+		mutex_unlock(&tz->lock);
+	}
 
 	return sprintf(buf, "%d\n", temperature);
 }