diff mbox series

watchdog: qcom: Move suspend/resume to suspend_late/resume_early

Message ID 20210310202004.1436-1-saiprakash.ranjan@codeaurora.org
State New
Headers show
Series watchdog: qcom: Move suspend/resume to suspend_late/resume_early | expand

Commit Message

Sai Prakash Ranjan March 10, 2021, 8:20 p.m. UTC
During suspend/resume usecases and tests, it is common to see issues
such as lockups either in suspend path or resume path because of the
bugs in the corresponding device driver pm handling code. In such cases,
it is important that watchdog is active to make sure that we either
receive a watchdog pretimeout notification or a bite causing reset
instead of a hang causing us to hard reset the machine.

There are good reasons as to why we need this because:

* We can have a watchdog pretimeout governor set to panic in which
  case we can have a backtrace which would help identify the issue
  with the particular driver and cause a normal reboot.

* Even in case where there is no pretimeout support, a watchdog
  bite is still useful because some firmware has debug support to dump
  CPU core context on watchdog bite for post-mortem analysis.

* One more usecase which comes to mind is of warm reboot. In case we
  hard reset the target, a cold reboot could be induced resulting in
  lose of ddr contents thereby losing all the debug info.

Currently, the watchdog pm callback just invokes the usual suspend
and resume callback which do not have any special ordering in the
sense that a watchdog can be suspended before the buggy device driver
suspend callback and watchdog resume can happen after the buggy device
driver resume callback. This would mean that the watchdog will not be
active when the buggy driver cause the lockups thereby hanging the
system. So to make sure this doesn't happen, move the watchdog pm to
use late/early system pm callbacks which will ensure that the watchdog
is suspended late and resumed early so that it can catch such issues.

Signed-off-by: Sai Prakash Ranjan <saiprakash.ranjan@codeaurora.org>
---
 drivers/watchdog/qcom-wdt.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

Comments

Doug Anderson March 10, 2021, 8:27 p.m. UTC | #1
Hi,

On Wed, Mar 10, 2021 at 12:20 PM Sai Prakash Ranjan
<saiprakash.ranjan@codeaurora.org> wrote:
>
> During suspend/resume usecases and tests, it is common to see issues
> such as lockups either in suspend path or resume path because of the
> bugs in the corresponding device driver pm handling code. In such cases,
> it is important that watchdog is active to make sure that we either
> receive a watchdog pretimeout notification or a bite causing reset
> instead of a hang causing us to hard reset the machine.
>
> There are good reasons as to why we need this because:
>
> * We can have a watchdog pretimeout governor set to panic in which
>   case we can have a backtrace which would help identify the issue
>   with the particular driver and cause a normal reboot.
>
> * Even in case where there is no pretimeout support, a watchdog
>   bite is still useful because some firmware has debug support to dump
>   CPU core context on watchdog bite for post-mortem analysis.
>
> * One more usecase which comes to mind is of warm reboot. In case we
>   hard reset the target, a cold reboot could be induced resulting in
>   lose of ddr contents thereby losing all the debug info.
>
> Currently, the watchdog pm callback just invokes the usual suspend
> and resume callback which do not have any special ordering in the
> sense that a watchdog can be suspended before the buggy device driver
> suspend callback and watchdog resume can happen after the buggy device
> driver resume callback. This would mean that the watchdog will not be
> active when the buggy driver cause the lockups thereby hanging the
> system. So to make sure this doesn't happen, move the watchdog pm to
> use late/early system pm callbacks which will ensure that the watchdog
> is suspended late and resumed early so that it can catch such issues.
>
> Signed-off-by: Sai Prakash Ranjan <saiprakash.ranjan@codeaurora.org>
> ---
>  drivers/watchdog/qcom-wdt.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)

Reviewed-by: Douglas Anderson <dianders@chromium.org>
Sai Prakash Ranjan April 20, 2021, 5:43 a.m. UTC | #2
Hi Guenter,

On 2021-03-11 01:53, Guenter Roeck wrote:
> On Thu, Mar 11, 2021 at 01:50:04AM +0530, Sai Prakash Ranjan wrote:
>> During suspend/resume usecases and tests, it is common to see issues
>> such as lockups either in suspend path or resume path because of the
>> bugs in the corresponding device driver pm handling code. In such 
>> cases,
>> it is important that watchdog is active to make sure that we either
>> receive a watchdog pretimeout notification or a bite causing reset
>> instead of a hang causing us to hard reset the machine.
>> 
>> There are good reasons as to why we need this because:
>> 
>> * We can have a watchdog pretimeout governor set to panic in which
>>   case we can have a backtrace which would help identify the issue
>>   with the particular driver and cause a normal reboot.
>> 
>> * Even in case where there is no pretimeout support, a watchdog
>>   bite is still useful because some firmware has debug support to dump
>>   CPU core context on watchdog bite for post-mortem analysis.
>> 
>> * One more usecase which comes to mind is of warm reboot. In case we
>>   hard reset the target, a cold reboot could be induced resulting in
>>   lose of ddr contents thereby losing all the debug info.
>> 
>> Currently, the watchdog pm callback just invokes the usual suspend
>> and resume callback which do not have any special ordering in the
>> sense that a watchdog can be suspended before the buggy device driver
>> suspend callback and watchdog resume can happen after the buggy device
>> driver resume callback. This would mean that the watchdog will not be
>> active when the buggy driver cause the lockups thereby hanging the
>> system. So to make sure this doesn't happen, move the watchdog pm to
>> use late/early system pm callbacks which will ensure that the watchdog
>> is suspended late and resumed early so that it can catch such issues.
>> 
>> Signed-off-by: Sai Prakash Ranjan <saiprakash.ranjan@codeaurora.org>
> 
> Reviewed-by: Guenter Roeck <linux@roeck-us.net>
> 

Gentle Ping. I don't see this in linux-next or linux-watchdog, please 
let
me know if anything is pending from my side.

Thanks,
Sai

>> ---
>>  drivers/watchdog/qcom-wdt.c | 4 +++-
>>  1 file changed, 3 insertions(+), 1 deletion(-)
>> 
>> diff --git a/drivers/watchdog/qcom-wdt.c b/drivers/watchdog/qcom-wdt.c
>> index e38a87ffe5f5..0d2209c5eaca 100644
>> --- a/drivers/watchdog/qcom-wdt.c
>> +++ b/drivers/watchdog/qcom-wdt.c
>> @@ -329,7 +329,9 @@ static int __maybe_unused qcom_wdt_resume(struct 
>> device *dev)
>>  	return 0;
>>  }
>> 
>> -static SIMPLE_DEV_PM_OPS(qcom_wdt_pm_ops, qcom_wdt_suspend, 
>> qcom_wdt_resume);
>> +static const struct dev_pm_ops qcom_wdt_pm_ops = {
>> +	SET_LATE_SYSTEM_SLEEP_PM_OPS(qcom_wdt_suspend, qcom_wdt_resume)
>> +};
>> 
>>  static const struct of_device_id qcom_wdt_of_table[] = {
>>  	{ .compatible = "qcom,kpss-timer", .data = &match_data_apcs_tmr },
>> --
>> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a 
>> member
>> of Code Aurora Forum, hosted by The Linux Foundation
>>
Sai Prakash Ranjan May 24, 2021, 2:38 a.m. UTC | #3
On 2021-04-20 11:13, Sai Prakash Ranjan wrote:
> Hi Guenter,

> 

> On 2021-03-11 01:53, Guenter Roeck wrote:

>> On Thu, Mar 11, 2021 at 01:50:04AM +0530, Sai Prakash Ranjan wrote:

>>> During suspend/resume usecases and tests, it is common to see issues

>>> such as lockups either in suspend path or resume path because of the

>>> bugs in the corresponding device driver pm handling code. In such 

>>> cases,

>>> it is important that watchdog is active to make sure that we either

>>> receive a watchdog pretimeout notification or a bite causing reset

>>> instead of a hang causing us to hard reset the machine.

>>> 

>>> There are good reasons as to why we need this because:

>>> 

>>> * We can have a watchdog pretimeout governor set to panic in which

>>>   case we can have a backtrace which would help identify the issue

>>>   with the particular driver and cause a normal reboot.

>>> 

>>> * Even in case where there is no pretimeout support, a watchdog

>>>   bite is still useful because some firmware has debug support to 

>>> dump

>>>   CPU core context on watchdog bite for post-mortem analysis.

>>> 

>>> * One more usecase which comes to mind is of warm reboot. In case we

>>>   hard reset the target, a cold reboot could be induced resulting in

>>>   lose of ddr contents thereby losing all the debug info.

>>> 

>>> Currently, the watchdog pm callback just invokes the usual suspend

>>> and resume callback which do not have any special ordering in the

>>> sense that a watchdog can be suspended before the buggy device driver

>>> suspend callback and watchdog resume can happen after the buggy 

>>> device

>>> driver resume callback. This would mean that the watchdog will not be

>>> active when the buggy driver cause the lockups thereby hanging the

>>> system. So to make sure this doesn't happen, move the watchdog pm to

>>> use late/early system pm callbacks which will ensure that the 

>>> watchdog

>>> is suspended late and resumed early so that it can catch such issues.

>>> 

>>> Signed-off-by: Sai Prakash Ranjan <saiprakash.ranjan@codeaurora.org>

>> 

>> Reviewed-by: Guenter Roeck <linux@roeck-us.net>

>> 

> 

> Gentle Ping. I don't see this in linux-next or linux-watchdog, please 

> let

> me know if anything is pending from my side.

> 


Gentle Ping !!

Thanks,
Sai
-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a 
member
of Code Aurora Forum, hosted by The Linux Foundation
Guenter Roeck May 24, 2021, 4:03 a.m. UTC | #4
On 5/23/21 7:38 PM, Sai Prakash Ranjan wrote:
> On 2021-04-20 11:13, Sai Prakash Ranjan wrote:

>> Hi Guenter,

>>

>> On 2021-03-11 01:53, Guenter Roeck wrote:

>>> On Thu, Mar 11, 2021 at 01:50:04AM +0530, Sai Prakash Ranjan wrote:

>>>> During suspend/resume usecases and tests, it is common to see issues

>>>> such as lockups either in suspend path or resume path because of the

>>>> bugs in the corresponding device driver pm handling code. In such cases,

>>>> it is important that watchdog is active to make sure that we either

>>>> receive a watchdog pretimeout notification or a bite causing reset

>>>> instead of a hang causing us to hard reset the machine.

>>>>

>>>> There are good reasons as to why we need this because:

>>>>

>>>> * We can have a watchdog pretimeout governor set to panic in which

>>>>   case we can have a backtrace which would help identify the issue

>>>>   with the particular driver and cause a normal reboot.

>>>>

>>>> * Even in case where there is no pretimeout support, a watchdog

>>>>   bite is still useful because some firmware has debug support to dump

>>>>   CPU core context on watchdog bite for post-mortem analysis.

>>>>

>>>> * One more usecase which comes to mind is of warm reboot. In case we

>>>>   hard reset the target, a cold reboot could be induced resulting in

>>>>   lose of ddr contents thereby losing all the debug info.

>>>>

>>>> Currently, the watchdog pm callback just invokes the usual suspend

>>>> and resume callback which do not have any special ordering in the

>>>> sense that a watchdog can be suspended before the buggy device driver

>>>> suspend callback and watchdog resume can happen after the buggy device

>>>> driver resume callback. This would mean that the watchdog will not be

>>>> active when the buggy driver cause the lockups thereby hanging the

>>>> system. So to make sure this doesn't happen, move the watchdog pm to

>>>> use late/early system pm callbacks which will ensure that the watchdog

>>>> is suspended late and resumed early so that it can catch such issues.

>>>>

>>>> Signed-off-by: Sai Prakash Ranjan <saiprakash.ranjan@codeaurora.org>

>>>

>>> Reviewed-by: Guenter Roeck <linux@roeck-us.net>

>>>

>>

>> Gentle Ping. I don't see this in linux-next or linux-watchdog, please let

>> me know if anything is pending from my side.

>>

> 

> Gentle Ping !!

> 


It is my watchdog-next branch. At some point Wim will hopefully pick it up.

Guenter
Sai Prakash Ranjan May 24, 2021, 4:12 a.m. UTC | #5
On 2021-05-24 09:33, Guenter Roeck wrote:
> On 5/23/21 7:38 PM, Sai Prakash Ranjan wrote:

>> On 2021-04-20 11:13, Sai Prakash Ranjan wrote:

>>> Hi Guenter,

>>> 

>>> On 2021-03-11 01:53, Guenter Roeck wrote:

>>>> On Thu, Mar 11, 2021 at 01:50:04AM +0530, Sai Prakash Ranjan wrote:

>>>>> During suspend/resume usecases and tests, it is common to see 

>>>>> issues

>>>>> such as lockups either in suspend path or resume path because of 

>>>>> the

>>>>> bugs in the corresponding device driver pm handling code. In such 

>>>>> cases,

>>>>> it is important that watchdog is active to make sure that we either

>>>>> receive a watchdog pretimeout notification or a bite causing reset

>>>>> instead of a hang causing us to hard reset the machine.

>>>>> 

>>>>> There are good reasons as to why we need this because:

>>>>> 

>>>>> * We can have a watchdog pretimeout governor set to panic in which

>>>>>   case we can have a backtrace which would help identify the issue

>>>>>   with the particular driver and cause a normal reboot.

>>>>> 

>>>>> * Even in case where there is no pretimeout support, a watchdog

>>>>>   bite is still useful because some firmware has debug support to 

>>>>> dump

>>>>>   CPU core context on watchdog bite for post-mortem analysis.

>>>>> 

>>>>> * One more usecase which comes to mind is of warm reboot. In case 

>>>>> we

>>>>>   hard reset the target, a cold reboot could be induced resulting 

>>>>> in

>>>>>   lose of ddr contents thereby losing all the debug info.

>>>>> 

>>>>> Currently, the watchdog pm callback just invokes the usual suspend

>>>>> and resume callback which do not have any special ordering in the

>>>>> sense that a watchdog can be suspended before the buggy device 

>>>>> driver

>>>>> suspend callback and watchdog resume can happen after the buggy 

>>>>> device

>>>>> driver resume callback. This would mean that the watchdog will not 

>>>>> be

>>>>> active when the buggy driver cause the lockups thereby hanging the

>>>>> system. So to make sure this doesn't happen, move the watchdog pm 

>>>>> to

>>>>> use late/early system pm callbacks which will ensure that the 

>>>>> watchdog

>>>>> is suspended late and resumed early so that it can catch such 

>>>>> issues.

>>>>> 

>>>>> Signed-off-by: Sai Prakash Ranjan 

>>>>> <saiprakash.ranjan@codeaurora.org>

>>>> 

>>>> Reviewed-by: Guenter Roeck <linux@roeck-us.net>

>>>> 

>>> 

>>> Gentle Ping. I don't see this in linux-next or linux-watchdog, please 

>>> let

>>> me know if anything is pending from my side.

>>> 

>> 

>> Gentle Ping !!

>> 

> 

> It is my watchdog-next branch. At some point Wim will hopefully pick it 

> up.

> 


Ah I see, thanks for the info.

-Sai

-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a 
member
of Code Aurora Forum, hosted by The Linux Foundation
diff mbox series

Patch

diff --git a/drivers/watchdog/qcom-wdt.c b/drivers/watchdog/qcom-wdt.c
index e38a87ffe5f5..0d2209c5eaca 100644
--- a/drivers/watchdog/qcom-wdt.c
+++ b/drivers/watchdog/qcom-wdt.c
@@ -329,7 +329,9 @@  static int __maybe_unused qcom_wdt_resume(struct device *dev)
 	return 0;
 }
 
-static SIMPLE_DEV_PM_OPS(qcom_wdt_pm_ops, qcom_wdt_suspend, qcom_wdt_resume);
+static const struct dev_pm_ops qcom_wdt_pm_ops = {
+	SET_LATE_SYSTEM_SLEEP_PM_OPS(qcom_wdt_suspend, qcom_wdt_resume)
+};
 
 static const struct of_device_id qcom_wdt_of_table[] = {
 	{ .compatible = "qcom,kpss-timer", .data = &match_data_apcs_tmr },