[v9,02/10] reboot: Add hardware protection power-off

Message ID 97260f8e150abb898a262fade25860609b460912.1620645507.git.matti.vaittinen@fi.rohmeurope.com
State Accepted
Commit dfa19b11385d4cf8f0242fd93e2073e25183c331
Series
  • Extend regulator notification support

Commit Message

Vaittinen, Matti May 10, 2021, 11:28 a.m.
There can be a few cases when we need to shut down the system in order to
protect the hardware. Currently this is done at least by the thermal core
when the temperature rises over a certain limit.

Some PMICs can also generate interrupts, for example for over-current,
over-voltage, voltage drops, short-circuits, etc. On some systems
these are a sign of hardware failure and the only thing to do is to try to
protect the rest of the hardware by shutting down the system.

Add shutdown logic which can be used by all subsystems instead of
implementing the shutdown in each subsystem. The logic is stolen from
thermal_core, with the difference of using atomic_t instead of a mutex in
order to allow calls directly from IRQ context.

Signed-off-by: Matti Vaittinen <matti.vaittinen@fi.rohmeurope.com>

---

Changelog:
v8: (changes suggested by Daniel Lezcano)
 - replace a protection implemented by a flag + spin_lock_irqsave() with
   simple atomic_dec_and_test().
 - Split thermal-core changes and adding the new API to separate patches
v7:
 - New patch
---
 include/linux/reboot.h |  1 +
 kernel/reboot.c        | 80 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 81 insertions(+)
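
The once-only guard described in the commit message (atomic_t instead of
a mutex) can be illustrated outside the kernel. The following is a minimal
userspace sketch, assuming C11 <stdatomic.h> as a stand-in for the kernel's
atomic_t; dec_and_test(), request_shutdown() and shutdowns_started are
hypothetical names used only for illustration and are not part of the patch.

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Userspace stand-in for the kernel's atomic_dec_and_test():
 * decrement the counter and report whether the new value is zero.
 */
static bool dec_and_test(atomic_int *v)
{
	/* atomic_fetch_sub() returns the value held before the decrement. */
	return atomic_fetch_sub(v, 1) == 1;
}

/* Starts at 1, so only the first decrement reaches zero. */
static atomic_int allow_proceed = 1;

/* Hypothetical counter, present only so the effect can be observed. */
int shutdowns_started;

/* Returns true only for the single caller that gets to start the shutdown. */
bool request_shutdown(void)
{
	if (!dec_and_test(&allow_proceed))
		return false;	/* a shutdown is already pending */

	shutdowns_started++;
	return true;
}
```

No lock is taken and nothing sleeps, which is what makes this kind of check
usable from interrupt context. Later callers simply see a non-zero result
and return.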

Comments

Petr Mladek May 12, 2021, 8:20 a.m. | #1
On Mon 2021-05-10 14:28:30, Matti Vaittinen wrote:
> There can be few cases when we need to shut-down the system in order to
> protect the hardware. Currently this is done at east by the thermal core
> when temperature raises over certain limit.
> 
> Some PMICs can also generate interrupts for example for over-current or
> over-voltage, voltage drops, short-circuit, ... etc. On some systems
> these are a sign of hardware failure and only thing to do is try to
> protect the rest of the hardware by shutting down the system.
> 
> Add shut-down logic which can be used by all subsystems instead of
> implementing the shutdown in each subsystem. The logic is stolen from
> thermal_core with difference of using atomic_t instead of a mutex in
> order to allow calls directly from IRQ context.
> 
> Signed-off-by: Matti Vaittinen <matti.vaittinen@fi.rohmeurope.com>
> 
> diff --git a/kernel/reboot.c b/kernel/reboot.c
> index a6ad5eb2fa73..5da8c80a2647 100644
> --- a/kernel/reboot.c
> +++ b/kernel/reboot.c
> @@ -518,6 +519,85 @@ void orderly_reboot(void)
>  }
>  EXPORT_SYMBOL_GPL(orderly_reboot);
>  
> +/**
> + * hw_failure_emergency_poweroff_func - emergency poweroff work after a known delay
> + * @work: work_struct associated with the emergency poweroff function
> + *
> + * This function is called in very critical situations to force
> + * a kernel poweroff after a configurable timeout value.
> + */
> +static void hw_failure_emergency_poweroff_func(struct work_struct *work)
> +{
> +	/*
> +	 * We have reached here after the emergency shutdown waiting period has
> +	 * expired. This means orderly_poweroff has not been able to shut off
> +	 * the system for some reason.
> +	 *
> +	 * Try to shut down the system immediately using kernel_power_off
> +	 * if populated
> +	 */
> +	WARN(1, "Hardware protection timed-out. Trying forced poweroff\n");
> +	kernel_power_off();

WARN() looks like overkill here. It prints many lines that are not
very useful in this case. The function is called from a well-known
context (workqueue worker).

Also be aware that the "panic_on_warn" command-line option will trigger
panic() here.


> +	/*
> +	 * Worst of the worst case trigger emergency restart
> +	 */
> +	WARN(1,
> +	     "Hardware protection shutdown failed. Trying emergency restart\n");
> +	emergency_restart();

Two consecutive WARN() calls are even less useful. They are
eye-catching, but it is hard to find the only useful line with
the custom message.

Best Regards,
Petr
Vaittinen, Matti May 12, 2021, noon | #2
Hi Petr,

Thanks for the review!

On Wed, 2021-05-12 at 10:20 +0200, Petr Mladek wrote:
> On Mon 2021-05-10 14:28:30, Matti Vaittinen wrote:
> > There can be few cases when we need to shut-down the system in
> > order to
> > protect the hardware. Currently this is done at east by the thermal
> > core
> > when temperature raises over certain limit.
> > 
> > Some PMICs can also generate interrupts for example for over-
> > current or
> > over-voltage, voltage drops, short-circuit, ... etc. On some
> > systems
> > these are a sign of hardware failure and only thing to do is try to
> > protect the rest of the hardware by shutting down the system.
> > 
> > Add shut-down logic which can be used by all subsystems instead of
> > implementing the shutdown in each subsystem. The logic is stolen
> > from
> > thermal_core with difference of using atomic_t instead of a mutex
> > in
> > order to allow calls directly from IRQ context.
> > 
> > Signed-off-by: Matti Vaittinen <matti.vaittinen@fi.rohmeurope.com>
> > 
> > diff --git a/kernel/reboot.c b/kernel/reboot.c
> > index a6ad5eb2fa73..5da8c80a2647 100644
> > --- a/kernel/reboot.c
> > +++ b/kernel/reboot.c
> > @@ -518,6 +519,85 @@ void orderly_reboot(void)
> >  }
> >  EXPORT_SYMBOL_GPL(orderly_reboot);
> >  
> > +/**
> > + * hw_failure_emergency_poweroff_func - emergency poweroff work
> > after a known delay
> > + * @work: work_struct associated with the emergency poweroff
> > function
> > + *
> > + * This function is called in very critical situations to force
> > + * a kernel poweroff after a configurable timeout value.
> > + */
> > +static void hw_failure_emergency_poweroff_func(struct work_struct
> > *work)
> > +{
> > +	/*
> > +	 * We have reached here after the emergency shutdown waiting
> > period has
> > +	 * expired. This means orderly_poweroff has not been able to
> > shut off
> > +	 * the system for some reason.
> > +	 *
> > +	 * Try to shut down the system immediately using
> > kernel_power_off
> > +	 * if populated
> > +	 */
> > +	WARN(1, "Hardware protection timed-out. Trying forced
> > poweroff\n");
> > +	kernel_power_off();
> 
> WARN() look like an overkill here. It prints many lines that are not
> much useful in this case. The function is called from well-known
> context (workqueue worker).


This was the existing code which I stole from the thermal_core. I kind
of think that an eye-catching WARN is actually a good choice here. Doing
an autonomous power-off without a WARNing does not sound good to me :)

> Also be aware that "panic_on_warn" commandline option will trigger
> panic() here.


Hmm.. If panic() hangs the system, that might indeed be a problem. Now
we are (again) in territory which I don't know well. I'd appreciate
any input from thermal folks and Mark. I don't like the idea of doing
extreme things like a power-off without a clearly visible log trace.
Thus I would like to have a WARN()-like eye-catcher, even if the
call trace does not vary much. It will at least point to this worker.
Any better suggestions than WARN()?

> 
> > +	/*
> > +	 * Worst of the worst case trigger emergency restart
> > +	 */
> > +	WARN(1,
> > +	     "Hardware protection shutdown failed. Trying emergency
> > restart\n");
> > +	emergency_restart();
> 
> Two consecutive WARN() calls are even less useful. They are eye
> catching but it is hard to find the only useful line with
> the custom message.


I think you are right. One WARN should be enough to point here. This
last one could be just an additional print.

Best Regards
	--Matti Vaittinen
Petr Mladek May 13, 2021, 8:34 a.m. | #3
On Wed 2021-05-12 12:00:46, Vaittinen, Matti wrote:
> On Wed, 2021-05-12 at 10:20 +0200, Petr Mladek wrote:
> > On Mon 2021-05-10 14:28:30, Matti Vaittinen wrote:
> > > There can be few cases when we need to shut-down the system in
> > > order to
> > > protect the hardware. Currently this is done at east by the thermal
> > > core
> > > when temperature raises over certain limit.
> > > 
> > > Some PMICs can also generate interrupts for example for over-
> > > current or
> > > over-voltage, voltage drops, short-circuit, ... etc. On some
> > > systems
> > > these are a sign of hardware failure and only thing to do is try to
> > > protect the rest of the hardware by shutting down the system.
> > > 
> > > Add shut-down logic which can be used by all subsystems instead of
> > > implementing the shutdown in each subsystem. The logic is stolen
> > > from
> > > thermal_core with difference of using atomic_t instead of a mutex
> > > in
> > > order to allow calls directly from IRQ context.
> > > 
> > > Signed-off-by: Matti Vaittinen <matti.vaittinen@fi.rohmeurope.com>
> > > 
> > > diff --git a/kernel/reboot.c b/kernel/reboot.c
> > > index a6ad5eb2fa73..5da8c80a2647 100644
> > > --- a/kernel/reboot.c
> > > +++ b/kernel/reboot.c
> > > @@ -518,6 +519,85 @@ void orderly_reboot(void)
> > >  }
> > >  EXPORT_SYMBOL_GPL(orderly_reboot);
> > >  
> > > +/**
> > > + * hw_failure_emergency_poweroff_func - emergency poweroff work
> > > after a known delay
> > > + * @work: work_struct associated with the emergency poweroff
> > > function
> > > + *
> > > + * This function is called in very critical situations to force
> > > + * a kernel poweroff after a configurable timeout value.
> > > + */
> > > +static void hw_failure_emergency_poweroff_func(struct work_struct
> > > *work)
> > > +{
> > > +	/*
> > > +	 * We have reached here after the emergency shutdown waiting
> > > period has
> > > +	 * expired. This means orderly_poweroff has not been able to
> > > shut off
> > > +	 * the system for some reason.
> > > +	 *
> > > +	 * Try to shut down the system immediately using
> > > kernel_power_off
> > > +	 * if populated
> > > +	 */
> > > +	WARN(1, "Hardware protection timed-out. Trying forced
> > > poweroff\n");
> > > +	kernel_power_off();
> > 
> > WARN() look like an overkill here. It prints many lines that are not
> > much useful in this case. The function is called from well-known
> > context (workqueue worker).
> 
> This was the existing code which I stole from the thermal_core. I kind
> of think that eye-catching WARN is actually a good choice here. Doing
> autonomous power-off without a WARNing does not sound good to me :)
> 
> > Also be aware that "panic_on_warn" commandline option will trigger
> > panic() here.
> 
> Hmm.. If panic() hangs the system that might indeed be a problem. Now
> we are (again) on a territory which I don't know well. I'd appreciate
> any input from thermal folks and Mark. I don't like the idea of making
> extreme things like power-off w/o well visible log-trace. Thus I would
> like to have WARN()-like eye-catcher, even if the call-trace was not
> too varying. It will at least point to this worker. Any better
> suggestions than WARN()?

Heh, it might make sense to create a system-wide API for these. I am
sure that WARN() is misused this way in many other locations.

There are already two locations that use another eye-catching text.
A common API might help to avoid duplication of the common parts,
see
https://lore.kernel.org/lkml/20210305194206.3165917-2-elver@google.com/

Well, it might be out of scope for this patchset.

Best Regards,
Petr
Vaittinen, Matti May 17, 2021, 4:57 a.m. | #4
On Thu, 2021-05-13 at 10:34 +0200, Petr Mladek wrote:
> On Wed 2021-05-12 12:00:46, Vaittinen, Matti wrote:
> > On Wed, 2021-05-12 at 10:20 +0200, Petr Mladek wrote:
> > > On Mon 2021-05-10 14:28:30, Matti Vaittinen wrote:
> > > > There can be few cases when we need to shut-down the system in
> > > > order to
> > > > protect the hardware. Currently this is done at east by the
> > > > thermal
> > > > core
> > > > when temperature raises over certain limit.
> > > > 
> > > > Some PMICs can also generate interrupts for example for over-
> > > > current or
> > > > over-voltage, voltage drops, short-circuit, ... etc. On some
> > > > systems
> > > > these are a sign of hardware failure and only thing to do is
> > > > try to
> > > > protect the rest of the hardware by shutting down the system.
> > > > 
> > > > Add shut-down logic which can be used by all subsystems instead
> > > > of
> > > > implementing the shutdown in each subsystem. The logic is
> > > > stolen
> > > > from
> > > > thermal_core with difference of using atomic_t instead of a
> > > > mutex
> > > > in
> > > > order to allow calls directly from IRQ context.
> > > > 
> > > > Signed-off-by: Matti Vaittinen <
> > > > matti.vaittinen@fi.rohmeurope.com>
> > > > 
> > > > diff --git a/kernel/reboot.c b/kernel/reboot.c
> > > > index a6ad5eb2fa73..5da8c80a2647 100644
> > > > --- a/kernel/reboot.c
> > > > +++ b/kernel/reboot.c
> > > > @@ -518,6 +519,85 @@ void orderly_reboot(void)
> > > >  }
> > > >  EXPORT_SYMBOL_GPL(orderly_reboot);
> > > >  
> > > > +/**
> > > > + * hw_failure_emergency_poweroff_func - emergency poweroff
> > > > work
> > > > after a known delay
> > > > + * @work: work_struct associated with the emergency poweroff
> > > > function
> > > > + *
> > > > + * This function is called in very critical situations to
> > > > force
> > > > + * a kernel poweroff after a configurable timeout value.
> > > > + */
> > > > +static void hw_failure_emergency_poweroff_func(struct
> > > > work_struct
> > > > *work)
> > > > +{
> > > > +	/*
> > > > +	 * We have reached here after the emergency shutdown
> > > > waiting
> > > > period has
> > > > +	 * expired. This means orderly_poweroff has not been
> > > > able to
> > > > shut off
> > > > +	 * the system for some reason.
> > > > +	 *
> > > > +	 * Try to shut down the system immediately using
> > > > kernel_power_off
> > > > +	 * if populated
> > > > +	 */
> > > > +	WARN(1, "Hardware protection timed-out. Trying forced
> > > > poweroff\n");
> > > > +	kernel_power_off();
> > > 
> > > WARN() look like an overkill here. It prints many lines that are
> > > not
> > > much useful in this case. The function is called from well-known
> > > context (workqueue worker).
> > 
> > This was the existing code which I stole from the thermal_core. I
> > kind
> > of think that eye-catching WARN is actually a good choice here.
> > Doing
> > autonomous power-off without a WARNing does not sound good to me :)
> > 
> > > Also be aware that "panic_on_warn" commandline option will
> > > trigger
> > > panic() here.
> > 
> > Hmm.. If panic() hangs the system that might indeed be a problem.
> > Now
> > we are (again) on a territory which I don't know well. I'd
> > appreciate
> > any input from thermal folks and Mark. I don't like the idea of
> > making
> > extreme things like power-off w/o well visible log-trace. Thus I
> > would
> > like to have WARN()-like eye-catcher, even if the call-trace was
> > not
> > too varying. It will at least point to this worker. Any better
> > suggestions than WARN()?
> 
> Heh, it might make sense to create a system wide API for these. I am
> sure that WARN() is mis-used this way on many other locations.
> 
> There already are two locations that use another eye-catching text.
> A common API might help to avoid duplication of the common parts,
> see
> https://lore.kernel.org/lkml/20210305194206.3165917-2-elver@google.com/
> 
> Well, it might be out of scope for this patchset.


I just had a very brief "chat" with Geert (3 IRC messages, posted
over 4 or 5 days :]) - and Geert pointed me to this:

https://lore.kernel.org/linux-iommu/20210331093104.383705-4-geert+renesas@glider.be/

So, maybe I'll just go with a simple pr_emerg() and trust that a print
at the emerg level will catch attention, as a print at that level
probably should. I'll respin the patch series (probably tomorrow) - let's
see what the thermal and regulator folks say :)

Thanks for all the help thus far!

Best Regards
	Matti Vaittinen
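
For illustration, the fallback chain discussed in this thread (log, forced
poweroff, log again, emergency restart) with the WARN() calls swapped for
pr_emerg() could look roughly like the sketch below. It is a userspace
mock, not kernel code: pr_emerg(), kernel_power_off() and
emergency_restart() are replaced by stubs that only record the call order,
and call_log, log_call(), emergency_poweroff_sketch() and calls_made() are
hypothetical helpers that exist only in this example.

```c
#include <stdio.h>
#include <string.h>

/* --- Userspace stubs: assumptions for illustration only --- */
static char call_log[128];	/* records the order in which stubs ran */

static void log_call(const char *name)
{
	strncat(call_log, name, sizeof(call_log) - strlen(call_log) - 1);
	strncat(call_log, ";", sizeof(call_log) - strlen(call_log) - 1);
}

/* Stand-in for the kernel's pr_emerg(), which prints at KERN_EMERG. */
#define pr_emerg(...) \
	do { log_call("pr_emerg"); fprintf(stderr, __VA_ARGS__); } while (0)

/* These stubs return, modelling a poweroff that did not take effect. */
static void kernel_power_off(void)  { log_call("kernel_power_off"); }
static void emergency_restart(void) { log_call("emergency_restart"); }

/* Worker body with a single-line emergency message before each step. */
void emergency_poweroff_sketch(void)
{
	pr_emerg("Hardware protection timed-out. Trying forced poweroff\n");
	kernel_power_off();

	/* Only reached if the forced poweroff returned. */
	pr_emerg("Hardware protection shutdown failed. Trying emergency restart\n");
	emergency_restart();
}

/* Accessor for the recorded call order. */
const char *calls_made(void)
{
	return call_log;
}
```

Each message is a single log line with no backtrace, and a plain print of
this kind does not go through the WARN() path that "panic_on_warn" hooks
into.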

Patch

diff --git a/include/linux/reboot.h b/include/linux/reboot.h
index 3734cd8f38a8..af907a3d68d1 100644
--- a/include/linux/reboot.h
+++ b/include/linux/reboot.h
@@ -79,6 +79,7 @@  extern char poweroff_cmd[POWEROFF_CMD_PATH_LEN];
 
 extern void orderly_poweroff(bool force);
 extern void orderly_reboot(void);
+void hw_protection_shutdown(const char *reason, int ms_until_forced);
 
 /*
  * Emergency restart, callable from an interrupt handler.
diff --git a/kernel/reboot.c b/kernel/reboot.c
index a6ad5eb2fa73..5da8c80a2647 100644
--- a/kernel/reboot.c
+++ b/kernel/reboot.c
@@ -7,6 +7,7 @@ 
 
 #define pr_fmt(fmt)	"reboot: " fmt
 
+#include <linux/atomic.h>
 #include <linux/ctype.h>
 #include <linux/export.h>
 #include <linux/kexec.h>
@@ -518,6 +519,85 @@  void orderly_reboot(void)
 }
 EXPORT_SYMBOL_GPL(orderly_reboot);
 
+/**
+ * hw_failure_emergency_poweroff_func - emergency poweroff work after a known delay
+ * @work: work_struct associated with the emergency poweroff function
+ *
+ * This function is called in very critical situations to force
+ * a kernel poweroff after a configurable timeout value.
+ */
+static void hw_failure_emergency_poweroff_func(struct work_struct *work)
+{
+	/*
+	 * We have reached here after the emergency shutdown waiting period has
+	 * expired. This means orderly_poweroff has not been able to shut off
+	 * the system for some reason.
+	 *
+	 * Try to shut down the system immediately using kernel_power_off
+	 * if populated
+	 */
+	WARN(1, "Hardware protection timed-out. Trying forced poweroff\n");
+	kernel_power_off();
+
+	/*
+	 * Worst of the worst case trigger emergency restart
+	 */
+	WARN(1,
+	     "Hardware protection shutdown failed. Trying emergency restart\n");
+	emergency_restart();
+}
+
+static DECLARE_DELAYED_WORK(hw_failure_emergency_poweroff_work,
+			    hw_failure_emergency_poweroff_func);
+
+/**
+ * hw_failure_emergency_poweroff - Trigger an emergency system poweroff
+ *
+ * This may be called from any critical situation to trigger a system shutdown
+ * after a given delay. If the delay is zero or negative, nothing is scheduled.
+ */
+static void hw_failure_emergency_poweroff(int poweroff_delay_ms)
+{
+	if (poweroff_delay_ms <= 0)
+		return;
+	schedule_delayed_work(&hw_failure_emergency_poweroff_work,
+			      msecs_to_jiffies(poweroff_delay_ms));
+}
+
+/**
+ * hw_protection_shutdown - Trigger an emergency system poweroff
+ *
+ * @reason:		Reason of emergency shutdown to be printed.
+ * @ms_until_forced:	Time to wait for orderly shutdown before triggering a
+ *			forced shutdown. A zero or negative value disables
+ *			the forced shutdown.
+ *
+ * Initiate an emergency system shutdown in order to protect hardware from
+ * further damage. Usage examples include thermal protection and voltage or
+ * current regulator failures.
+ * NOTE: The request is ignored if protection shutdown is already pending even
+ * if the previous request has given a large timeout for forced shutdown.
+ * Can be called from any context.
+ */
+void hw_protection_shutdown(const char *reason, int ms_until_forced)
+{
+	static atomic_t allow_proceed = ATOMIC_INIT(1);
+
+	pr_emerg("HARDWARE PROTECTION shutdown (%s)\n", reason);
+
+	/* Shutdown should be initiated only once. */
+	if (!atomic_dec_and_test(&allow_proceed))
+		return;
+
+	/*
+	 * Queue a backup emergency shutdown in the event of
+	 * orderly_poweroff failure
+	 */
+	hw_failure_emergency_poweroff(ms_until_forced);
+	orderly_poweroff(true);
+}
+EXPORT_SYMBOL_GPL(hw_protection_shutdown);
+
 static int __init reboot_setup(char *str)
 {
 	for (;;) {