Message ID | 20220114031730.25621-1-benl@squareup.com |
---|---|
State | New |
Headers | show |
Series | [v2] drivers: thermal: tsens: respect thermal_device_mode in threshold irq reporting | expand |
On Fri, Jan 14, 2022 at 8:47 AM Benjamin Li <benl@squareup.com> wrote: > > 'echo disabled > .../thermal_zoneX/mode' will disable the thermal core's > polling mechanism to check for threshold trips. This is used sometimes to > run performance test cases. > > However, tsens supports an interrupt mechanism to receive notification of > trips, implemented in commit 634e11d5b450 ("drivers: thermal: tsens: Add > interrupt support"). > > Currently the thermal zone mode that's set by userspace is not checked > before propagating threshold trip events from IRQs. Let's fix this to > restore the abilty to disable thermal throttling at runtime. > > ==================== > > Tested on MSM8939 running 5.16.0. This platform has 8 cores; the first > four thermal zones control cpu0-3 and the last zone is for the other four > CPUs together. > > for f in /sys/class/thermal/thermal_zone*; do > echo "disabled" > $f/mode > echo $f | paste - $f/type $f/mode > done > > /sys/class/thermal/thermal_zone0 cpu0-thermal disabled > /sys/class/thermal/thermal_zone1 cpu1-thermal disabled > /sys/class/thermal/thermal_zone2 cpu2-thermal disabled > /sys/class/thermal/thermal_zone3 cpu3-thermal disabled > /sys/class/thermal/thermal_zone4 cpu4567-thermal disabled > > With mitigation thresholds at 75 degC and load running, we can now cruise > past temp=75000 without CPU throttling kicking in. > > watch -n 1 "grep '' /sys/class/thermal/*/temp > /sys/class/thermal/*/cur_state > /sys/bus/cpu/devices/cpu*/cpufreq/cpuinfo_cur_freq" > > /sys/class/thermal/thermal_zone0/temp:82000 > /sys/class/thermal/thermal_zone1/temp:84000 > /sys/class/thermal/thermal_zone2/temp:87000 > /sys/class/thermal/thermal_zone3/temp:84000 > /sys/class/thermal/thermal_zone4/temp:84000 > /sys/class/thermal/cooling_device0/cur_state:0 > /sys/class/thermal/cooling_device1/cur_state:0 > /sys/bus/cpu/devices/cpu0/cpufreq/cpuinfo_cur_freq:1113600 > /sys/bus/cpu/devices/cpu1/cpufreq/cpuinfo_cur_freq:1113600 > /sys/bus/cpu/devices/cpu2/cpufreq/cpuinfo_cur_freq:1113600 > /sys/bus/cpu/devices/cpu3/cpufreq/cpuinfo_cur_freq:1113600 > /sys/bus/cpu/devices/cpu4/cpufreq/cpuinfo_cur_freq:800000 > /sys/bus/cpu/devices/cpu5/cpufreq/cpuinfo_cur_freq:800000 > /sys/bus/cpu/devices/cpu6/cpufreq/cpuinfo_cur_freq:800000 > /sys/bus/cpu/devices/cpu7/cpufreq/cpuinfo_cur_freq:800000 > > Reported-by: Zac Crosby <zac@squareup.com> > Signed-off-by: Benjamin Li <benl@squareup.com> > --- > Changes in v2: > - Reordered sentences in first part of commit message to make sense. > > drivers/thermal/qcom/tsens.c | 15 +++++++++------ > 1 file changed, 9 insertions(+), 6 deletions(-) > > diff --git a/drivers/thermal/qcom/tsens.c b/drivers/thermal/qcom/tsens.c > index 99a8d9f3e03c..0b6299512e7c 100644 > --- a/drivers/thermal/qcom/tsens.c > +++ b/drivers/thermal/qcom/tsens.c > @@ -509,13 +509,16 @@ static irqreturn_t tsens_irq_thread(int irq, void *data) > spin_unlock_irqrestore(&priv->ul_lock, flags); > > if (trigger) { > - dev_dbg(priv->dev, "[%u] %s: TZ update trigger (%d mC)\n", > - hw_id, __func__, temp); > - thermal_zone_device_update(s->tzd, > - THERMAL_EVENT_UNSPECIFIED); > + if (s->tzd->mode == THERMAL_DEVICE_ENABLED) { > + dev_dbg(priv->dev, "[%u] %s: TZ update trigger (%d mC)\n", > + hw_id, __func__, temp); > + thermal_zone_device_update(s->tzd, THERMAL_EVENT_UNSPECIFIED); > + } else { > + dev_dbg(priv->dev, "[%u] %s: TZ update trigger (%d mC) skipped as zone disabled\n", Hmm. I don't like the fact that these messages won't be visible to users in dmesg unless they're debugging. This change puts the SoC in a potentially unsafe state. Perhaps we should print a ratelimited message in the logs that we're operating outside safety limits? > + hw_id, __func__, temp); > + } > } else { > - dev_dbg(priv->dev, "[%u] %s: no violation: %d\n", > - hw_id, __func__, temp); > + dev_dbg(priv->dev, "[%u] %s: no violation: %d\n", hw_id, __func__, temp); Get rid of this hunk, it is unrelated to the above change. > } > > if (tsens_version(priv) < VER_0_1) { > -- > 2.17.1 >
On 1/19/22 4:33 PM, Bjorn Andersson <bjorn.andersson@linaro.org> wrote: > Reviewed-by: Bjorn Andersson <bjorn.andersson@linaro.org> Thanks! On 1/20/22 3:40 AM, Amit Kucheria wrote: >> + dev_dbg(priv->dev, "[%u] %s: TZ update trigger (%d mC)\n", >> + hw_id, __func__, temp); >> + thermal_zone_device_update(s->tzd, THERMAL_EVENT_UNSPECIFIED); >> + } else { >> + dev_dbg(priv->dev, "[%u] %s: TZ update trigger (%d mC) skipped as zone disabled\n", > > Hmm. I don't like the fact that these messages won't be visible to > users in dmesg unless they're debugging. This change puts the SoC in a > potentially unsafe state. Perhaps we should print a ratelimited > message in the logs that we're operating outside safety limits? That seems fine, I'll change to dev_info_ratelimited and make the message a bit scarier. > >> + hw_id, __func__, temp); >> + } >> } else { >> - dev_dbg(priv->dev, "[%u] %s: no violation: %d\n", >> - hw_id, __func__, temp); >> + dev_dbg(priv->dev, "[%u] %s: no violation: %d\n", hw_id, __func__, temp); > > Get rid of this hunk, it is unrelated to the above change. Will do. > >> } >> >> if (tsens_version(priv) < VER_0_1) { >> -- >> 2.17.1 >>
diff --git a/drivers/thermal/qcom/tsens.c b/drivers/thermal/qcom/tsens.c index 99a8d9f3e03c..0b6299512e7c 100644 --- a/drivers/thermal/qcom/tsens.c +++ b/drivers/thermal/qcom/tsens.c @@ -509,13 +509,16 @@ static irqreturn_t tsens_irq_thread(int irq, void *data) spin_unlock_irqrestore(&priv->ul_lock, flags); if (trigger) { - dev_dbg(priv->dev, "[%u] %s: TZ update trigger (%d mC)\n", - hw_id, __func__, temp); - thermal_zone_device_update(s->tzd, - THERMAL_EVENT_UNSPECIFIED); + if (s->tzd->mode == THERMAL_DEVICE_ENABLED) { + dev_dbg(priv->dev, "[%u] %s: TZ update trigger (%d mC)\n", + hw_id, __func__, temp); + thermal_zone_device_update(s->tzd, THERMAL_EVENT_UNSPECIFIED); + } else { + dev_dbg(priv->dev, "[%u] %s: TZ update trigger (%d mC) skipped as zone disabled\n", + hw_id, __func__, temp); + } } else { - dev_dbg(priv->dev, "[%u] %s: no violation: %d\n", - hw_id, __func__, temp); + dev_dbg(priv->dev, "[%u] %s: no violation: %d\n", hw_id, __func__, temp); } if (tsens_version(priv) < VER_0_1) {
'echo disabled > .../thermal_zoneX/mode' will disable the thermal core's polling mechanism to check for threshold trips. This is used sometimes to run performance test cases. However, tsens supports an interrupt mechanism to receive notification of trips, implemented in commit 634e11d5b450 ("drivers: thermal: tsens: Add interrupt support"). Currently the thermal zone mode that's set by userspace is not checked before propagating threshold trip events from IRQs. Let's fix this to restore the abilty to disable thermal throttling at runtime. ==================== Tested on MSM8939 running 5.16.0. This platform has 8 cores; the first four thermal zones control cpu0-3 and the last zone is for the other four CPUs together. for f in /sys/class/thermal/thermal_zone*; do echo "disabled" > $f/mode echo $f | paste - $f/type $f/mode done /sys/class/thermal/thermal_zone0 cpu0-thermal disabled /sys/class/thermal/thermal_zone1 cpu1-thermal disabled /sys/class/thermal/thermal_zone2 cpu2-thermal disabled /sys/class/thermal/thermal_zone3 cpu3-thermal disabled /sys/class/thermal/thermal_zone4 cpu4567-thermal disabled With mitigation thresholds at 75 degC and load running, we can now cruise past temp=75000 without CPU throttling kicking in. watch -n 1 "grep '' /sys/class/thermal/*/temp /sys/class/thermal/*/cur_state /sys/bus/cpu/devices/cpu*/cpufreq/cpuinfo_cur_freq" /sys/class/thermal/thermal_zone0/temp:82000 /sys/class/thermal/thermal_zone1/temp:84000 /sys/class/thermal/thermal_zone2/temp:87000 /sys/class/thermal/thermal_zone3/temp:84000 /sys/class/thermal/thermal_zone4/temp:84000 /sys/class/thermal/cooling_device0/cur_state:0 /sys/class/thermal/cooling_device1/cur_state:0 /sys/bus/cpu/devices/cpu0/cpufreq/cpuinfo_cur_freq:1113600 /sys/bus/cpu/devices/cpu1/cpufreq/cpuinfo_cur_freq:1113600 /sys/bus/cpu/devices/cpu2/cpufreq/cpuinfo_cur_freq:1113600 /sys/bus/cpu/devices/cpu3/cpufreq/cpuinfo_cur_freq:1113600 /sys/bus/cpu/devices/cpu4/cpufreq/cpuinfo_cur_freq:800000 /sys/bus/cpu/devices/cpu5/cpufreq/cpuinfo_cur_freq:800000 /sys/bus/cpu/devices/cpu6/cpufreq/cpuinfo_cur_freq:800000 /sys/bus/cpu/devices/cpu7/cpufreq/cpuinfo_cur_freq:800000 Reported-by: Zac Crosby <zac@squareup.com> Signed-off-by: Benjamin Li <benl@squareup.com> --- Changes in v2: - Reordered sentences in first part of commit message to make sense. drivers/thermal/qcom/tsens.c | 15 +++++++++------ 1 file changed, 9 insertions(+), 6 deletions(-)