diff mbox

[v10,5/5] Watchdog: ARM SBSA Generic Watchdog half timeout panic support

Message ID 1454519923-25230-6-git-send-email-fu.wei@linaro.org
State New
Headers show

Commit Message

Fu Wei Fu Feb. 3, 2016, 5:18 p.m. UTC
From: Fu Wei <fu.wei@linaro.org>


This patch registers the WS0 interrupt routine to trigger panic,
when the watchdog reachs the first stage (the half timeout).
This function can help administrator to backup the system context
info by panic console output or kdump (if supported), once system
goes wrong (doesn't feed the watchdog in the half timeout).

User also can skip panic by setting panic_enabled (module parameter) as 0

Signed-off-by: Fu Wei <fu.wei@linaro.org>

---
 Documentation/watchdog/watchdog-parameters.txt |  1 +
 drivers/watchdog/Kconfig                       | 10 +++++
 drivers/watchdog/sbsa_gwdt.c                   | 54 +++++++++++++++++++++++---
 3 files changed, 60 insertions(+), 5 deletions(-)

-- 
2.5.0

Comments

Fu Wei Fu Feb. 3, 2016, 5:49 p.m. UTC | #1
Hi Timur,

Thanks for your rapid feedback :-)

On 4 February 2016 at 01:27, Timur Tabi <timur@codeaurora.org> wrote:
> fu.wei@linaro.org wrote:

>>

>> +#ifdef CONFIG_ARM_SBSA_WATCHDOG_PANIC

>> +static bool panic_enabled = true;

>

>

> I think this should default to 'false', because IMHO, this seems like an odd


yes, It make sense to make it default to 'false'.

> feature.  I'm not crazy about the fact that there's a Kconfig option for it

> either, but I'm not going to NACK this patch.

>

> I personally would prefer to drop this patch, and just wait for full-blown

> pre-timeout support.  It feels like a debugging feature that doesn't really


sorry, are you saying : using pre-timeout instead of this half timeout?

But even we have pre-timeout support, pre-timeout  == timeout / 2, it
can not be configured without touch timeout.

if you want pre-timeout  != timeout / 2, we have to modify WCV in the
interrupt routine.
 (because of the explicit watchdog refresh  mechanism)

Could you let me know why we need pre-timeout  here ?? :-)

> belong upstream.  But like I said, it's just my opinion, and I won't

> complain if I'm outvoted.


 I think this debugging feature is the  purpose of the two-stage
watchdog, if I understand correctly



-- 
Best regards,

Fu Wei
Software Engineer
Red Hat Software (Beijing) Co.,Ltd.Shanghai Branch
Ph: +86 21 61221326(direct)
Ph: +86 186 2020 4684 (mobile)
Room 1512, Regus One Corporate Avenue,Level 15,
One Corporate Avenue,222 Hubin Road,Huangpu District,
Shanghai,China 200021
Fu Wei Fu Feb. 3, 2016, 6:06 p.m. UTC | #2
Hi Timur

On 4 February 2016 at 01:53, Timur Tabi <timur@codeaurora.org> wrote:
> Fu Wei wrote:

>>

>> sorry, are you saying : using pre-timeout instead of this half timeout?

>>

>> But even we have pre-timeout support, pre-timeout  == timeout / 2, it

>> can not be configured without touch timeout.

>>

>> if you want pre-timeout  != timeout / 2, we have to modify WCV in the

>> interrupt routine.

>>   (because of the explicit watchdog refresh  mechanism)

>>

>> Could you let me know why we need pre-timeout  here ??:-)

>

>

> What I meant was that if we had full-blown pre-timeout support in the

> watchdog layer, then you could use that to implement the

> panic-on-half-timeout feature.

>

> When pre-timeout is implemented, will you modify the interrupt handler to

> use it?


Sorry I am little confused.

Actually I am taking your suggestion to avoid touching WCV in
interrupt routine.
So even we have pre-timeout support , it is useless for this
panic-on-half-timeout feature,
because pre-timeout  == timeout / 2 (always).

So maybe I misunderstand your suggestion,
could you let me know : why we want pre-timeout here?


>

>>> >belong upstream.  But like I said, it's just my opinion, and I won't

>>> >complain if I'm outvoted.

>>

>>   I think this debugging feature is the  purpose of the two-stage

>> watchdog, if I understand correctly

>

>

> Hmmm... that make sense.  I think maybe you should drop the Kconfig option,

> and just have "static bool panic_enabled = false;"  Also, then do this:

>

> if (panic_enabled) {

>         ret = devm_request_irq(dev, irq, sbsa_gwdt_interrupt, 0,

>                                pdev->name, gwdt);

>         if (ret) {

>                 dev_err(dev, "unable to request IRQ %d\n", irq);

>                 return ret;

>         }

> }


yes, agree

>

> That way, the interrupt handler is never registered if the command-line

> parameter is not specified.

>




-- 
Best regards,

Fu Wei
Software Engineer
Red Hat Software (Beijing) Co.,Ltd.Shanghai Branch
Ph: +86 21 61221326(direct)
Ph: +86 186 2020 4684 (mobile)
Room 1512, Regus One Corporate Avenue,Level 15,
One Corporate Avenue,222 Hubin Road,Huangpu District,
Shanghai,China 200021
Fu Wei Fu Feb. 5, 2016, 6:21 p.m. UTC | #3
On 5 February 2016 at 22:42, Guenter Roeck <linux@roeck-us.net> wrote:
> On 02/05/2016 01:51 AM, Fu Wei wrote:

>>

>> Hi Guenter,

>>

>> On 4 February 2016 at 13:17, Guenter Roeck <linux@roeck-us.net> wrote:

>>>

>>> On 02/03/2016 03:00 PM, Fu Wei wrote:

>>>>

>>>>

>>>> On 4 February 2016 at 02:45, Timur Tabi <timur@codeaurora.org> wrote:

>>>>>

>>>>>

>>>>> Fu Wei wrote:

>>>>>>

>>>>>>

>>>>>>

>>>>>> As you know I have made the pre-timeout support patch, If people like

>>>>>> it, i am happy to go on upstream it separately.

>>>>>>

>>>>>> If we want to use pre-timeout here, user only can use get_pretimeout

>>>>>> and disable panic by setting pretimeout to 0

>>>>>> but user can not really set pretimeout, because "pre-timeout  ==

>>>>>> timeout / 2 (always)".

>>>>>> if user want to change pretimeout, he/she has to set_time instead.

>>>>>

>>>>>

>>>>>

>>>>>

>>>>> Ok, I think patches 4 and 5 should be combined, and I think the Kconfig

>>>>> entry should be removed and just use panic_enabled.

>>>

>>>

>>>

>>> Agreed.

>>

>>

>> np, will do

>>

>>>>

>>>>

>>>> NP, will update this patchset like that ,  thanks :-)

>>>>

>>>

>>> Also, if panic is enabled, the timeout needs to be adjusted accordingly

>>> (to only panic after the entire timeout period has expired, not after

>>> half of it). We can not panic the system after timeout / 2.

>>

>>

>> OK, my thought is

>>

>> if panic is enabled :

>> |--------WOR-------WS0--------WOR-------WS1

>> |------timeout------(panic)------timeout-----reset

>>

>> if panic is disabled .

>> |--------WOR-------WS0--------WOR-------WS1

>> |---------------------timeout---------------------reset

>>

>>   panic_enabled only can be configured when module is loaded by module

>> parameter

>>

>> But user should know that max_timeout(panic_enable) =

>> max_timeout(panic_disable) / 2

>>

>

> That means you'll have to update max_timeout accordingly.


panic_enabled only can be configured when module is loaded, so we
don't need to update it.

max_timeout will only be set up in the init stage.

Does it make sense ? :-)

>

>>>

>>> I am not too happy with the parameter name (panic_enabled). How about

>>> "action", to match machzwd ?

>>

>>

>> yes, makes sense. Maybe we can do something  like this:

>>

>> /*

>>   * action refers to action taken when watchdog gets WS0

>>   * 0 = SKIP

>>   * 1 = PANIC

>>   * defaults to SKIP (0)

>>   */

>> static int action;

>> module_param(action, int, 0);

>> MODULE_PARM_DESC(action, "after watchdog gets WS0 interrupt, do: "

>> "0 = SKIP(*)  1 = PANIC");

>>

> Yes, though I would suggest to use lower case letters.


yes,  NP, will do , Thanks :-)

>

> Thanks,

> Guenter

>




-- 
Best regards,

Fu Wei
Software Engineer
Red Hat Software (Beijing) Co.,Ltd.Shanghai Branch
Ph: +86 21 61221326(direct)
Ph: +86 186 2020 4684 (mobile)
Room 1512, Regus One Corporate Avenue,Level 15,
One Corporate Avenue,222 Hubin Road,Huangpu District,
Shanghai,China 200021
diff mbox

Patch

diff --git a/Documentation/watchdog/watchdog-parameters.txt b/Documentation/watchdog/watchdog-parameters.txt
index 300eb4d..31641e2 100644
--- a/Documentation/watchdog/watchdog-parameters.txt
+++ b/Documentation/watchdog/watchdog-parameters.txt
@@ -286,6 +286,7 @@  nowayout: Watchdog cannot be stopped once started
 -------------------------------------------------
 sbsa_gwdt:
 timeout: Watchdog timeout in seconds. (default 20s)
+panic_enabled: Enable panic at half timeout. (default=true)
 nowayout: Watchdog cannot be stopped once started
 	(default=kernel config parameter)
 -------------------------------------------------
diff --git a/drivers/watchdog/Kconfig b/drivers/watchdog/Kconfig
index 4ab1b05..42adfdf 100644
--- a/drivers/watchdog/Kconfig
+++ b/drivers/watchdog/Kconfig
@@ -218,6 +218,16 @@  config ARM_SBSA_WATCHDOG
 	  To compile this driver as module, choose M here: The module
 	  will be called sbsa_gwdt.
 
+config ARM_SBSA_WATCHDOG_PANIC
+	bool "ARM SBSA Generic Watchdog triggers panic at the half timeout"
+	depends on ARM_SBSA_WATCHDOG
+	help
+	  ARM SBSA Generic Watchdog will trigger panic in the first signal
+	  (WS0) interrupt routine when the half timeout is reached.
+	  This function can help administrator to backup the system context
+	  info by panic console output or kdump (if supported).
+	  But user can skip panic by setting moduleparam panic_enabled as 0.
+
 config ASM9260_WATCHDOG
 	tristate "Alphascale ASM9260 watchdog"
 	depends on MACH_ASM9260
diff --git a/drivers/watchdog/sbsa_gwdt.c b/drivers/watchdog/sbsa_gwdt.c
index 5a2dba3..d18cf37 100644
--- a/drivers/watchdog/sbsa_gwdt.c
+++ b/drivers/watchdog/sbsa_gwdt.c
@@ -16,18 +16,22 @@ 
  * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
  * GNU General Public License for more details.
  *
- * This SBSA Generic watchdog driver is a single stage timeout version.
+ * This SBSA Generic watchdog driver is a two stages version.
  * Since this watchdog timer has two stages, and each stage is determined
  * by WOR. So the timeout is (WOR * 2).
- * When first timeout is reached, WS0 is triggered, the interrupt
- * triggered by WS0 will be ignored, then the second watch period starts;
- * when second timeout is reached, then WS1 is triggered, system reset.
+ * When the first stage(the half timeout) is reached, WS0 interrupt is
+ * triggered, at this moment the second watch period starts;
+ * In the WS0 interrupt routine, panic will be triggered for saving the
+ * system context.
+ * If the system is getting into trouble and cannot be reset by panic or
+ * restart properly by the kdump kernel(if supported), then the second
+ * stage (the timeout) will be reached, system will be reset by WS1.
  *
  * More details about the hardware specification of this device:
  * ARM DEN0029B - Server Base System Architecture (SBSA)
  *
  * SBSA GWDT: |--------WOR-------WS0--------WOR-------WS1
- *            |----------------timeout----------------reset
+ *            |--half_timeout--(panic)--half_timeout--reset
  *
  */
 
@@ -84,6 +88,13 @@  MODULE_PARM_DESC(timeout,
 		 "Watchdog timeout in seconds. (>=0, default="
 		 __MODULE_STRING(DEFAULT_TIMEOUT) ")");
 
+#ifdef CONFIG_ARM_SBSA_WATCHDOG_PANIC
+static bool panic_enabled = true;
+module_param(panic_enabled, bool, 0);
+MODULE_PARM_DESC(panic_enabled,
+		 "enable panic at half timeout. (default=true)");
+#endif
+
 static bool nowayout = WATCHDOG_NOWAYOUT;
 module_param(nowayout, bool, S_IRUGO);
 MODULE_PARM_DESC(nowayout,
@@ -159,6 +170,16 @@  static int sbsa_gwdt_stop(struct watchdog_device *wdd)
 	return 0;
 }
 
+#ifdef CONFIG_ARM_SBSA_WATCHDOG_PANIC
+static irqreturn_t sbsa_gwdt_interrupt(int irq, void *dev_id)
+{
+	if (panic_enabled)
+		panic("SBSA Watchdog half timeout");
+
+	return IRQ_HANDLED;
+}
+#endif
+
 static struct watchdog_info sbsa_gwdt_info = {
 	.identity	= "SBSA Generic Watchdog",
 	.options	= WDIOF_SETTIMEOUT |
@@ -186,6 +207,9 @@  static int sbsa_gwdt_probe(struct platform_device *pdev)
 	struct resource *res;
 	u32 status;
 	int ret;
+#ifdef CONFIG_ARM_SBSA_WATCHDOG_PANIC
+	int irq;
+#endif
 
 	gwdt = devm_kzalloc(dev, sizeof(*gwdt), GFP_KERNEL);
 	if (!gwdt)
@@ -202,6 +226,14 @@  static int sbsa_gwdt_probe(struct platform_device *pdev)
 	if (IS_ERR(rf_base))
 		return PTR_ERR(rf_base);
 
+#ifdef CONFIG_ARM_SBSA_WATCHDOG_PANIC
+	irq = platform_get_irq(pdev, 0);
+	if (irq < 0) {
+		dev_err(dev, "unable to get ws0 interrupt.\n");
+		return irq;
+	}
+#endif
+
 	/*
 	 * Get the frequency of system counter from the cp15 interface of ARM
 	 * Generic timer. We don't need to check it, because if it returns "0",
@@ -228,6 +260,14 @@  static int sbsa_gwdt_probe(struct platform_device *pdev)
 		dev_warn(dev, "System reset by WDT.\n");
 		wdd->bootstatus |= WDIOF_CARDRESET;
 	}
+#ifdef CONFIG_ARM_SBSA_WATCHDOG_PANIC
+	ret = devm_request_irq(dev, irq, sbsa_gwdt_interrupt, 0,
+			       pdev->name, gwdt);
+	if (ret) {
+		dev_err(dev, "unable to request IRQ %d\n", irq);
+		return ret;
+	}
+#endif
 
 	ret = watchdog_register_device(wdd);
 	if (ret)
@@ -242,6 +282,10 @@  static int sbsa_gwdt_probe(struct platform_device *pdev)
 
 	dev_info(dev, "Initialized with %ds timeout @ %u Hz%s\n", wdd->timeout,
 		 gwdt->clk, status & SBSA_GWDT_WCS_EN ? " [enabled]" : "");
+#ifdef CONFIG_ARM_SBSA_WATCHDOG_PANIC
+	dev_info(dev, "Half timeout panic %s.\n",
+		 panic_enabled ? "enabled" : "disabled");
+#endif
 
 	return 0;
 }