mbox series

[v7,00/11] Stop monitoring disabled devices

Message ID 20200629122925.21729-1-andrzej.p@collabora.com
Headers show
Series Stop monitoring disabled devices | expand

Message

Andrzej Pietrasiewicz June 29, 2020, 12:29 p.m. UTC
A respin of v6 with these changes:

v6..v7:
- removed duplicate S-o-b line from patch 6/11

v5..v6:
- staticized thermal_zone_device_set_mode() (kbuild test robot)

v4..v5:

- EXPORT_SYMBOL -> EXPORT_SYMBOL_GPL (Daniel)
- dropped unnecessary thermal_zone_device_enable() in int3400_thermal.c
and in thermal_of.c (Bartlomiej)

Andrzej Pietrasiewicz (11):
  acpi: thermal: Fix error handling in the register function
  thermal: Store thermal mode in a dedicated enum
  thermal: Add current mode to thermal zone device
  thermal: Store device mode in struct thermal_zone_device
  thermal: remove get_mode() operation of drivers
  thermal: Add mode helpers
  thermal: Use mode helpers in drivers
  thermal: Explicitly enable non-changing thermal zone devices
  thermal: core: Stop polling DISABLED thermal devices
  thermal: Simplify or eliminate unnecessary set_mode() methods
  thermal: Rename set_mode() to change_mode()

 drivers/acpi/thermal.c                        | 75 +++++----------
 .../ethernet/chelsio/cxgb4/cxgb4_thermal.c    |  8 ++
 .../ethernet/mellanox/mlxsw/core_thermal.c    | 91 ++++---------------
 drivers/net/wireless/intel/iwlwifi/mvm/tt.c   |  9 +-
 drivers/platform/x86/acerhdf.c                | 33 +++----
 drivers/platform/x86/intel_mid_thermal.c      |  6 ++
 drivers/power/supply/power_supply_core.c      |  9 +-
 drivers/thermal/armada_thermal.c              |  6 ++
 drivers/thermal/da9062-thermal.c              | 16 +---
 drivers/thermal/dove_thermal.c                |  6 ++
 drivers/thermal/hisi_thermal.c                |  6 +-
 drivers/thermal/imx_thermal.c                 | 57 ++++--------
 .../intel/int340x_thermal/int3400_thermal.c   | 38 ++------
 .../int340x_thermal/int340x_thermal_zone.c    |  5 +
 drivers/thermal/intel/intel_pch_thermal.c     |  5 +
 .../thermal/intel/intel_quark_dts_thermal.c   | 34 ++-----
 drivers/thermal/intel/intel_soc_dts_iosf.c    |  3 +
 drivers/thermal/intel/x86_pkg_temp_thermal.c  |  6 ++
 drivers/thermal/kirkwood_thermal.c            |  7 ++
 drivers/thermal/rcar_thermal.c                |  9 +-
 drivers/thermal/rockchip_thermal.c            |  6 +-
 drivers/thermal/spear_thermal.c               |  7 ++
 drivers/thermal/sprd_thermal.c                |  6 +-
 drivers/thermal/st/st_thermal.c               |  5 +
 drivers/thermal/thermal_core.c                | 76 ++++++++++++++--
 drivers/thermal/thermal_of.c                  | 41 +--------
 drivers/thermal/thermal_sysfs.c               | 37 +-------
 include/linux/thermal.h                       | 19 +++-
 28 files changed, 275 insertions(+), 351 deletions(-)


base-commit: 9ebcfadb0610322ac537dd7aa5d9cbc2b2894c68

Comments

Andrzej Pietrasiewicz July 2, 2020, 5:19 p.m. UTC | #1
Hi,

W dniu 02.07.2020 o 19:01, Daniel Lezcano pisze:
> On 02/07/2020 15:53, Andrzej Pietrasiewicz wrote:

>> Hi Daniel,

>>

>> <snip>

>>

>>>>>>>

>>>>>>> I did reproduce:

>>>>>>>

>>>>>>> v5.8-rc3 + series => imx6 hang at boot time

>>>>>>> v5.8-rc3 => imx6 boots correctly

>>>

>>> So finally I succeeded to reproduce it on my imx7 locally. The sensor

>>> was failing to initialize for another reason related to the legacy

>>> cooling device, this is why it is not appearing on the imx7.

>>>

>>> I can now git-bisect :)

>>>

>>

>> That would be very kind of you, thank you!

> 

> With the lock correctness option enabled:

> 

> [    4.179223] imx_thermal tempmon: Extended Commercial CPU temperature

> grade - max:105C critical:100C passive:95C

> [    4.189557]

> [    4.191060] ============================================

> [    4.196378] WARNING: possible recursive locking detected

> [    4.201699] 5.8.0-rc3-00011-gf5e50bf4d3ef #42 Not tainted

> [    4.207102] --------------------------------------------

> [    4.212421] kworker/0:3/54 is trying to acquire lock:

> [    4.217480] ca09a3e4 (&tz->lock){+.+.}-{3:3}, at:

> thermal_zone_device_is_enabled+0x18/0x34

> [    4.225777]

> [    4.225777] but task is already holding lock:

> [    4.231615] ca09a3e4 (&tz->lock){+.+.}-{3:3}, at:

> thermal_zone_get_temp+0x38/0x6c

> [    4.239121]

> [    4.239121] other info that might help us debug this:

> [    4.245655]  Possible unsafe locking scenario:

> [    4.245655]

> [    4.251579]        CPU0

> [    4.254031]        ----

> [    4.256481]   lock(&tz->lock);

> [    4.259544]   lock(&tz->lock);

> [    4.262608]

> [    4.262608]  *** DEADLOCK ***

> [    4.262608]

> [    4.268533]  May be due to missing lock nesting notation

> [    4.268533]

> [    4.275329] 4 locks held by kworker/0:3/54:

> [    4.279517]  #0: cb0066a8 ((wq_completion)events){+.+.}-{0:0}, at:

> process_one_work+0x224/0x808

> [    4.288241]  #1: ca075f10 (deferred_probe_work){+.+.}-{0:0}, at:

> process_one_work+0x224/0x808

> [    4.296787]  #2: cb1a48d8 (&dev->mutex){....}-{3:3}, at:

> __device_attach+0x30/0x140

> [    4.304468]  #3: ca09a3e4 (&tz->lock){+.+.}-{3:3}, at:

> thermal_zone_get_temp+0x38/0x6c

> [    4.312408]

> [    4.312408] stack backtrace:

> [    4.316778] CPU: 0 PID: 54 Comm: kworker/0:3 Not tainted

> 5.8.0-rc3-00011-gf5e50bf4d3ef #42

> [    4.325048] Hardware name: Freescale i.MX7 Dual (Device Tree)

> [    4.330809] Workqueue: events deferred_probe_work_func

> [    4.335973] [<c0312384>] (unwind_backtrace) from [<c030c580>]

> (show_stack+0x10/0x14)

> [    4.343734] [<c030c580>] (show_stack) from [<c079d7d8>]

> (dump_stack+0xe8/0x114)

> [    4.351062] [<c079d7d8>] (dump_stack) from [<c03abf78>]

> (__lock_acquire+0xbfc/0x2cb4)

> [    4.358909] [<c03abf78>] (__lock_acquire) from [<c03ae9c4>]

> (lock_acquire+0xf4/0x4e4)

> [    4.366758] [<c03ae9c4>] (lock_acquire) from [<c10630fc>]

> (__mutex_lock+0xb0/0xaa8)

> [    4.374431] [<c10630fc>] (__mutex_lock) from [<c1063b10>]

> (mutex_lock_nested+0x1c/0x24)

> [    4.382452] [<c1063b10>] (mutex_lock_nested) from [<c0d932c0>]

> (thermal_zone_device_is_enabled+0x18/0x34)

> [    4.392036] [<c0d932c0>] (thermal_zone_device_is_enabled) from

> [<c0d9da90>] (imx_get_temp+0x30/0x208)

> [    4.401271] [<c0d9da90>] (imx_get_temp) from [<c0d97484>]

> (thermal_zone_get_temp+0x4c/0x6c)

> [    4.409640] [<c0d97484>] (thermal_zone_get_temp) from [<c0d93df0>]

> (thermal_zone_device_update+0x8c/0x258)

> [    4.419310] [<c0d93df0>] (thermal_zone_device_update) from

> [<c0d9401c>] (thermal_zone_device_set_mode+0x60/0x88)

> [    4.429500] [<c0d9401c>] (thermal_zone_device_set_mode) from

> [<c0d9e1d4>] (imx_thermal_probe+0x3e4/0x578)

> [    4.439082] [<c0d9e1d4>] (imx_thermal_probe) from [<c0a78388>]

> (platform_drv_probe+0x48/0x98)

> [    4.447622] [<c0a78388>] (platform_drv_probe) from [<c0a7603c>]

> (really_probe+0x218/0x348)

> [    4.455903] [<c0a7603c>] (really_probe) from [<c0a76278>]

> (driver_probe_device+0x5c/0xb4)

> [    4.464098] [<c0a76278>] (driver_probe_device) from [<c0a743bc>]

> (bus_for_each_drv+0x58/0xb8)

> [    4.472638] [<c0a743bc>] (bus_for_each_drv) from [<c0a75db0>]

> (__device_attach+0xd4/0x140)

> [    4.480919] [<c0a75db0>] (__device_attach) from [<c0a750b0>]

> (bus_probe_device+0x88/0x90)

> [    4.489112] [<c0a750b0>] (bus_probe_device) from [<c0a75564>]

> (deferred_probe_work_func+0x68/0x98)

> [    4.498088] [<c0a75564>] (deferred_probe_work_func) from [<c0369988>]

> (process_one_work+0x2e0/0x808)

> [    4.507237] [<c0369988>] (process_one_work) from [<c036a150>]

> (worker_thread+0x2a0/0x59c)

> [    4.515432] [<c036a150>] (worker_thread) from [<c0372208>]

> (kthread+0x16c/0x178)

> [    4.522843] [<c0372208>] (kthread) from [<c0300174>]

> (ret_from_fork+0x14/0x20)

> [    4.530074] Exception stack(0xca075fb0 to 0xca075ff8)

> [    4.535138] 5fa0:                                     00000000

> 00000000 00000000 00000000

> [    4.543328] 5fc0: 00000000 00000000 00000000 00000000 00000000

> 00000000 00000000 00000000

> [    4.551516] 5fe0: 00000000 00000000 00000000 00000000 00000013 00000000

> 

> 

> 


Thanks!

That confirms your suspicions.

So the reason is that ->get_temp() is called while the mutex is held and
thermal_zone_device_is_enabled() wants to take the same mutex.

Is adding a comment to thermal_zone_device_is_enabled() to never call
it while the mutex is held and adding another version of it which does
not take the mutex ok?

Andrzej
Zhang, Rui July 3, 2020, 1:49 a.m. UTC | #2
On Thu, 2020-07-02 at 19:49 +0200, Daniel Lezcano wrote:
> On 02/07/2020 19:19, Andrzej Pietrasiewicz wrote:

> > Hi,

> > 

> > W dniu 02.07.2020 o 19:01, Daniel Lezcano pisze:

> > > On 02/07/2020 15:53, Andrzej Pietrasiewicz wrote:

> > > > Hi Daniel,

> > > > 

> > > > <snip>

> > > > 

> > > > > > > > > 

> > > > > > > > > I did reproduce:

> > > > > > > > > 

> > > > > > > > > v5.8-rc3 + series => imx6 hang at boot time

> > > > > > > > > v5.8-rc3 => imx6 boots correctly

> > > > > 

> > > > > So finally I succeeded to reproduce it on my imx7 locally.

> > > > > The sensor

> > > > > was failing to initialize for another reason related to the

> > > > > legacy

> > > > > cooling device, this is why it is not appearing on the imx7.

> > > > > 

> > > > > I can now git-bisect :)

> > > > > 

> > > > 

> > > > That would be very kind of you, thank you!

> > > 

> > > With the lock correctness option enabled:

> > > 

> > > [    4.179223] imx_thermal tempmon: Extended Commercial CPU

> > > temperature

> > > grade - max:105C critical:100C passive:95C

> > > [    4.189557]

> > > [    4.191060] ============================================

> > > [    4.196378] WARNING: possible recursive locking detected

> > > [    4.201699] 5.8.0-rc3-00011-gf5e50bf4d3ef #42 Not tainted

> > > [    4.207102] --------------------------------------------

> > > [    4.212421] kworker/0:3/54 is trying to acquire lock:

> > > [    4.217480] ca09a3e4 (&tz->lock){+.+.}-{3:3}, at:

> > > thermal_zone_device_is_enabled+0x18/0x34

> > > [    4.225777]

> > > [    4.225777] but task is already holding lock:

> > > [    4.231615] ca09a3e4 (&tz->lock){+.+.}-{3:3}, at:

> > > thermal_zone_get_temp+0x38/0x6c

> > > [    4.239121]

> > > [    4.239121] other info that might help us debug this:

> > > [    4.245655]  Possible unsafe locking scenario:

> > > [    4.245655]

> > > [    4.251579]        CPU0

> > > [    4.254031]        ----

> > > [    4.256481]   lock(&tz->lock);

> > > [    4.259544]   lock(&tz->lock);

> > > [    4.262608]

> > > [    4.262608]  *** DEADLOCK ***

> > > [    4.262608]

> > > [    4.268533]  May be due to missing lock nesting notation

> > > [    4.268533]

> > > [    4.275329] 4 locks held by kworker/0:3/54:

> > > [    4.279517]  #0: cb0066a8 ((wq_completion)events){+.+.}-{0:0}, 

> > > at:

> > > process_one_work+0x224/0x808

> > > [    4.288241]  #1: ca075f10 (deferred_probe_work){+.+.}-{0:0},

> > > at:

> > > process_one_work+0x224/0x808

> > > [    4.296787]  #2: cb1a48d8 (&dev->mutex){....}-{3:3}, at:

> > > __device_attach+0x30/0x140

> > > [    4.304468]  #3: ca09a3e4 (&tz->lock){+.+.}-{3:3}, at:

> > > thermal_zone_get_temp+0x38/0x6c

> > > [    4.312408]

> > > [    4.312408] stack backtrace:

> > > [    4.316778] CPU: 0 PID: 54 Comm: kworker/0:3 Not tainted

> > > 5.8.0-rc3-00011-gf5e50bf4d3ef #42

> > > [    4.325048] Hardware name: Freescale i.MX7 Dual (Device Tree)

> > > [    4.330809] Workqueue: events deferred_probe_work_func

> > > [    4.335973] [<c0312384>] (unwind_backtrace) from [<c030c580>]

> > > (show_stack+0x10/0x14)

> > > [    4.343734] [<c030c580>] (show_stack) from [<c079d7d8>]

> > > (dump_stack+0xe8/0x114)

> > > [    4.351062] [<c079d7d8>] (dump_stack) from [<c03abf78>]

> > > (__lock_acquire+0xbfc/0x2cb4)

> > > [    4.358909] [<c03abf78>] (__lock_acquire) from [<c03ae9c4>]

> > > (lock_acquire+0xf4/0x4e4)

> > > [    4.366758] [<c03ae9c4>] (lock_acquire) from [<c10630fc>]

> > > (__mutex_lock+0xb0/0xaa8)

> > > [    4.374431] [<c10630fc>] (__mutex_lock) from [<c1063b10>]

> > > (mutex_lock_nested+0x1c/0x24)

> > > [    4.382452] [<c1063b10>] (mutex_lock_nested) from [<c0d932c0>]

> > > (thermal_zone_device_is_enabled+0x18/0x34)

> > > [    4.392036] [<c0d932c0>] (thermal_zone_device_is_enabled) from

> > > [<c0d9da90>] (imx_get_temp+0x30/0x208)

> > > [    4.401271] [<c0d9da90>] (imx_get_temp) from [<c0d97484>]

> > > (thermal_zone_get_temp+0x4c/0x6c)

> > > [    4.409640] [<c0d97484>] (thermal_zone_get_temp) from

> > > [<c0d93df0>]

> > > (thermal_zone_device_update+0x8c/0x258)

> > > [    4.419310] [<c0d93df0>] (thermal_zone_device_update) from

> > > [<c0d9401c>] (thermal_zone_device_set_mode+0x60/0x88)

> > > [    4.429500] [<c0d9401c>] (thermal_zone_device_set_mode) from

> > > [<c0d9e1d4>] (imx_thermal_probe+0x3e4/0x578)

> > > [    4.439082] [<c0d9e1d4>] (imx_thermal_probe) from [<c0a78388>]

> > > (platform_drv_probe+0x48/0x98)

> > > [    4.447622] [<c0a78388>] (platform_drv_probe) from

> > > [<c0a7603c>]

> > > (really_probe+0x218/0x348)

> > > [    4.455903] [<c0a7603c>] (really_probe) from [<c0a76278>]

> > > (driver_probe_device+0x5c/0xb4)

> > > [    4.464098] [<c0a76278>] (driver_probe_device) from

> > > [<c0a743bc>]

> > > (bus_for_each_drv+0x58/0xb8)

> > > [    4.472638] [<c0a743bc>] (bus_for_each_drv) from [<c0a75db0>]

> > > (__device_attach+0xd4/0x140)

> > > [    4.480919] [<c0a75db0>] (__device_attach) from [<c0a750b0>]

> > > (bus_probe_device+0x88/0x90)

> > > [    4.489112] [<c0a750b0>] (bus_probe_device) from [<c0a75564>]

> > > (deferred_probe_work_func+0x68/0x98)

> > > [    4.498088] [<c0a75564>] (deferred_probe_work_func) from

> > > [<c0369988>]

> > > (process_one_work+0x2e0/0x808)

> > > [    4.507237] [<c0369988>] (process_one_work) from [<c036a150>]

> > > (worker_thread+0x2a0/0x59c)

> > > [    4.515432] [<c036a150>] (worker_thread) from [<c0372208>]

> > > (kthread+0x16c/0x178)

> > > [    4.522843] [<c0372208>] (kthread) from [<c0300174>]

> > > (ret_from_fork+0x14/0x20)

> > > [    4.530074] Exception stack(0xca075fb0 to 0xca075ff8)

> > > [    4.535138] 5fa0:                                     00000000

> > > 00000000 00000000 00000000

> > > [    4.543328] 5fc0: 00000000 00000000 00000000 00000000 00000000

> > > 00000000 00000000 00000000

> > > [    4.551516] 5fe0: 00000000 00000000 00000000 00000000 00000013

> > > 00000000

> > > 

> > > 

> > > 

> > 

> > Thanks!

> > 

> > That confirms your suspicions.

> > 

> > So the reason is that ->get_temp() is called while the mutex is

> > held and

> > thermal_zone_device_is_enabled() wants to take the same mutex.

> 

> Yes, that's correct.

> 

> > Is adding a comment to thermal_zone_device_is_enabled() to never

> > call

> > it while the mutex is held and adding another version of it which

> > does

> > not take the mutex ok?

> 

> The thermal_zone_device_is_enabled() is only used in two places, acpi

> and this imx driver, and given:

> 

> 1. as soon as the mutex is released, there is no guarantee the

> thermal

> zone won't be changed right after, the lock is pointless, thus the

> information also.

> 

> 2. from a design point of view, I don't see why a driver should know

> if

> a thermal zone is disabled or not

> 

> It would make sense to end with this function and do not give the

> different drivers an opportunity to access this information.


I agree.
> 

> Why not add change_mode for the acpi in order to enable or disable

> the

> events


thermal_zone_device_is_enabled() is invoked in acpi thermal driver
because we only want to do thermal_zone_device_update() when the acpi
thermal zone is enabled.

As thermal_zone_device_update() can handle a disabled thermal zone now,
we can just remove the check.

thanks,
rui

>  and for imx_thermal use irq_enabled flag instead of the thermal

> zone mode? Moreover it is very unclear why this function is needed in

> imx_get_temp(), and I suspect we should be able to get rid of it.

> 

>
Daniel Lezcano July 3, 2020, 6:38 a.m. UTC | #3
On 03/07/2020 03:49, Zhang Rui wrote:
> On Thu, 2020-07-02 at 19:49 +0200, Daniel Lezcano wrote:


[ ... ]

>>> So the reason is that ->get_temp() is called while the mutex is

>>> held and

>>> thermal_zone_device_is_enabled() wants to take the same mutex.

>>

>> Yes, that's correct.

>>

>>> Is adding a comment to thermal_zone_device_is_enabled() to never

>>> call

>>> it while the mutex is held and adding another version of it which

>>> does

>>> not take the mutex ok?

>>

>> The thermal_zone_device_is_enabled() is only used in two places, acpi

>> and this imx driver, and given:

>>

>> 1. as soon as the mutex is released, there is no guarantee the

>> thermal

>> zone won't be changed right after, the lock is pointless, thus the

>> information also.

>>

>> 2. from a design point of view, I don't see why a driver should know

>> if

>> a thermal zone is disabled or not

>>

>> It would make sense to end with this function and do not give the

>> different drivers an opportunity to access this information.

> 

> I agree.

>>

>> Why not add change_mode for the acpi in order to enable or disable

>> the

>> events

> 

> thermal_zone_device_is_enabled() is invoked in acpi thermal driver

> because we only want to do thermal_zone_device_update() when the acpi

> thermal zone is enabled.

> 

> As thermal_zone_device_update() can handle a disabled thermal zone now,

> we can just remove the check.


Ah yes, good point!



-- 
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog