diff mbox series

acpi/ghes: Make ghes_panic_timeout adjustable as a parameter

Message ID 20241227095422.44147-1-feng.tang@linux.alibaba.com
State New
Headers show
Series acpi/ghes: Make ghes_panic_timeout adjustable as a parameter | expand

Commit Message

Feng Tang Dec. 27, 2024, 9:54 a.m. UTC
There is a problem report that when debugging a hard-to-reproduce panic
issue, user wanted the kernel to not reboot by adding "panic=0" in
kernel cmdline, so that the panic context could be kept, say the panic
was caught randomly in the mid-night, and user hoped to check it in
the morning. GHES panic handler may overwrite that user setting and
force to reboot after 'ghes_panic_timeout'(30) seconds.

Make 'ghes_panic_timeout' a parameter can provide user some flexibility
to change the timeout on demand, without changing current behavior.

Signed-off-by: Feng Tang <feng.tang@linux.alibaba.com>
---
 drivers/acpi/apei/ghes.c | 1 +
 1 file changed, 1 insertion(+)

Comments

Borislav Petkov Dec. 27, 2024, 10:09 a.m. UTC | #1
On December 27, 2024 10:54:22 AM GMT+01:00, Feng Tang <feng.tang@linux.alibaba.com> wrote:
>There is a problem report that when debugging a hard-to-reproduce panic
>issue, user wanted the kernel to not reboot by adding "panic=0" in
>kernel cmdline, so that the panic context could be kept, say the panic
>was caught randomly in the mid-night, and user hoped to check it in
>the morning. GHES panic handler may overwrite that user setting and
>force to reboot after 'ghes_panic_timeout'(30) seconds.

Why doesn't the ghes panic handler honor a panic=0 setting?
Huang, Ying Dec. 30, 2024, 5:54 a.m. UTC | #2
Hi, Boris,

Borislav Petkov <bp@alien8.de> writes:

> On December 27, 2024 10:54:22 AM GMT+01:00, Feng Tang <feng.tang@linux.alibaba.com> wrote:
>>There is a problem report that when debugging a hard-to-reproduce panic
>>issue, user wanted the kernel to not reboot by adding "panic=0" in
>>kernel cmdline, so that the panic context could be kept, say the panic
>>was caught randomly in the mid-night, and user hoped to check it in
>>the morning. GHES panic handler may overwrite that user setting and
>>force to reboot after 'ghes_panic_timeout'(30) seconds.
>
> Why doesn't the ghes panic handler honor a panic=0 setting?

It appears that I introduced the ghes_panic_timeout originally.

panic() is used for software errors, while ghes is used for hardware
errors.  They may have different requirements.  For example, it may be
OK to wait forever for a software error, but it may be better to reboot
the system to contain the influence of the hardware error for some
hardware errors.  So, we introduced another knob for that.

---
Best Regards,
Huang, Ying
Huang, Ying Dec. 30, 2024, 11:04 a.m. UTC | #3
Borislav Petkov <bp@alien8.de> writes:

> On Mon, Dec 30, 2024 at 01:54:36PM +0800, Huang, Ying wrote:
>> For example, it may be OK to wait forever for a software error, but it may
>> be better to reboot the system to contain the influence of the hardware
>> error for some hardware errors.
>
> A default panic timeout of 30 seconds for hw errors?! You do realize that 30
> seconds for a machine is an eternity and by that time your hardware error has
> long propagated and corrupted results, right?
>
> So your timeout is not even trying to do what you want.
>
> So unless I'm missing something, this ghes timeout needs to go - if you want
> to "contain the influence" you need to panic *immediately*! And not even that
> would help in some cases - hw has its own protections there so the OS
> panicking is meh. At least on x86, that is.

OK.  30 seconds isn't good enough for hw errors.

Another possible benefit of ghes_panic_timeout is,

rebooting instead of waiting forever can help us to log/report the
hardware errors earlier.

For example, the hardware errors could be logged in some simple
non-volatile storage (such as EFI variables) during panic or kdump, etc.
Then, after reboot, the new kernel could report the hardware errors in
some way.

>> So, we introduced another knob for that.
>
> No, that another knob is piling more of the silly ontop.

---
Best Regards,
Huang, Ying
Feng Tang Dec. 30, 2024, 11:40 a.m. UTC | #4
On Mon, Dec 30, 2024 at 12:26:08PM +0100, Borislav Petkov wrote:
> On Mon, Dec 30, 2024 at 07:04:59PM +0800, Huang, Ying wrote:
> > Another possible benefit of ghes_panic_timeout is,
> > 
> > rebooting instead of waiting forever can help us to log/report the
> > hardware errors earlier.
> 
> So you rip out ghes_panic_timeout and set the panic_timeout in the ghes code,
> when you get a hw error which requires reboot.

Still we may need to honor the user setting, say if user specifically set
"panic=XXX" in the cmdline, we should detect that case and skip overwritting
the panic_timeout? 

Thanks,
Feng

> 
> -- 
> Regards/Gruss,
>     Boris.
> 
> https://people.kernel.org/tglx/notes-about-netiquette
Feng Tang Dec. 30, 2024, 1:04 p.m. UTC | #5
On Mon, Dec 30, 2024 at 01:10:09PM +0100, Borislav Petkov wrote:
> On Mon, Dec 30, 2024 at 07:40:18PM +0800, Feng Tang wrote:
> > Still we may need to honor the user setting, say if user specifically set
> > "panic=XXX" in the cmdline, we should detect that case and skip overwritting
> > the panic_timeout? 
> 
> So the user has set panic timeout to something, machine encounters a hw error
> which you want to log/report earlier but user's panic setting prevents you
> from doing that.
> 
> So what do you do?
> 
> What has higher prio?
> 
> I guess we're something like this:
> 
> https://youtu.be/gLFQystE8vU?t=71

Hah!

As per kernel config, most ARCH has 'panic_timeout' as 0 by default, so
need to set the kcmdline. For the case in my commit log, where user had
clear requirement for not-reboot and wait, the manually set 'panic=0'
should take priority here?  

Thanks,
Feng

> 
> :-P
> 
> -- 
> Regards/Gruss,
>     Boris.
> 
> https://people.kernel.org/tglx/notes-about-netiquette
Feng Tang Dec. 31, 2024, 6:44 a.m. UTC | #6
On Mon, Dec 30, 2024 at 02:24:03PM +0100, Borislav Petkov wrote:
> On Mon, Dec 30, 2024 at 09:04:11PM +0800, Feng Tang wrote:
> > As per kernel config, most ARCH has 'panic_timeout' as 0 by default, so
> > need to set the kcmdline. For the case in my commit log, where user had
> > clear requirement for not-reboot and wait, the manually set 'panic=0'
> > should take priority here?
> 
> I think so.
> 
> I'm not convinced that lets-log-the-hw-error-faster should override the panic
> timeout setting but I'm open to real-life example scenarios...
> 

I see. Upon the discussion so far, how about the change below?

---
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 07789f0b59bc..113471b76d8d 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -173,8 +173,6 @@ static struct gen_pool *ghes_estatus_pool;
 static struct ghes_estatus_cache __rcu *ghes_estatus_caches[GHES_ESTATUS_CACHES_SIZE];
 static atomic_t ghes_estatus_cache_alloced;
 
-static int ghes_panic_timeout __read_mostly = 30;
-
 static void __iomem *ghes_map(u64 pfn, enum fixed_addresses fixmap_idx)
 {
 	phys_addr_t paddr;
@@ -987,9 +985,10 @@ static void __ghes_panic(struct ghes *ghes,
 
 	ghes_clear_estatus(ghes, estatus, buf_paddr, fixmap_idx);
 
-	/* reboot to log the error! */
-	if (!panic_timeout)
-		panic_timeout = ghes_panic_timeout;
+	/* If user hasn't specifically set panic timeout, reboot to log the error! */
+	if (!panic_timeout && !strstr(saved_command_line, "panic="))
+		panic_timeout = 30;
+
 	panic("Fatal hardware error!");
 }

Or we want to stick the orignal patch, which doesn't change the
original flow?

Thanks,
Feng
Feng Tang Dec. 31, 2024, 10:15 a.m. UTC | #7
On Tue, Dec 31, 2024 at 10:23:58AM +0100, Borislav Petkov wrote:
> On Tue, Dec 31, 2024 at 02:44:48PM +0800, Feng Tang wrote:
> > +	/* If user hasn't specifically set panic timeout, reboot to log the error! */
> > +	if (!panic_timeout && !strstr(saved_command_line, "panic="))
> 
> And you want to scan saved_command_line because?
> 
> Hint: look at how other code checks panic_timeout.
 
Thanks for the hint! IIUC, you are mentioning the set_arch_panic_timeout().
One thing is, most ARCHs' default timeout is 0, while in our case, the user
will also set 'panic=0' :), so we can't easily detect if the 0 is the user-set
value or the OS default one. Originally I even thought about adding a flag
of 'timeout_user_changed'.  Any suggestion?

> > Or we want to stick the orignal patch, which doesn't change the
> > original flow?
> 
> And pile more broken stuff ontop?

OK, will skip this.

Thanks,
Feng
Feng Tang Dec. 31, 2024, 12:02 p.m. UTC | #8
On Tue, Dec 31, 2024 at 12:13:14PM +0100, Borislav Petkov wrote:
> On Tue, Dec 31, 2024 at 06:15:59PM +0800, Feng Tang wrote:
> > Thanks for the hint! IIUC, you are mentioning the set_arch_panic_timeout().
> > One thing is, most ARCHs' default timeout is 0, while in our case, the user
> > will also set 'panic=0' :), so we can't easily detect if the 0 is the user-set
> > value or the OS default one. Originally I even thought about adding a flag
> > of 'timeout_user_changed'.  Any suggestion?
> 
> Ok, enough talking. Let's get concrete:
> 
> From: "Borislav Petkov (AMD)" <bp@alien8.de>
> Date: Tue, 31 Dec 2024 12:03:55 +0100
> Subject: [PATCH] APEI: GHES: Have GHES honor the panic= setting
> 
> The GHES driver overrides the panic= setting by rebooting the system
> after a fatal hw error has been reported. The intent being that such an
> error would be hopefully written out faster on non-volatile storage for
> later inspection.
> 
> However, this is not optimal when a hard-to-debug issue requires long
> time to reproduce and when that happens, the box will get rebooted after
> 30 seconds and thus destroy the whole hw context of when the error
> happened.
> 
> So rip out the default GHES panic timeout and honor the global one.
> 
> In the panic disabled (panic=0) case, the error will still be logged to
> dmesg for later inspection and if panic after a hw error is really
> required, then that can be controlled the usual way - use panic= on the
> cmdline or set it in the kernel .config's CONFIG_PANIC_TIMEOUT.
> 
> Reported-by: Feng Tang <feng.tang@linux.alibaba.com>
> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>


Looks good to me, thanks!

Reviewed-by: Feng Tang <feng.tang@linux.alibaba.com>

> ---
>  drivers/acpi/apei/ghes.c | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index 07789f0b59bc..b72772494655 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -173,8 +173,6 @@ static struct gen_pool *ghes_estatus_pool;
>  static struct ghes_estatus_cache __rcu *ghes_estatus_caches[GHES_ESTATUS_CACHES_SIZE];
>  static atomic_t ghes_estatus_cache_alloced;
>  
> -static int ghes_panic_timeout __read_mostly = 30;
> -
>  static void __iomem *ghes_map(u64 pfn, enum fixed_addresses fixmap_idx)
>  {
>  	phys_addr_t paddr;
> @@ -983,14 +981,16 @@ static void __ghes_panic(struct ghes *ghes,
>  			 struct acpi_hest_generic_status *estatus,
>  			 u64 buf_paddr, enum fixed_addresses fixmap_idx)
>  {
> +	const char *msg = GHES_PFX "Fatal hardware error";
> +
>  	__ghes_print_estatus(KERN_EMERG, ghes->generic, estatus);
>  
>  	ghes_clear_estatus(ghes, estatus, buf_paddr, fixmap_idx);
>  
> -	/* reboot to log the error! */
>  	if (!panic_timeout)
> -		panic_timeout = ghes_panic_timeout;
> -	panic("Fatal hardware error!");
> +		pr_emerg("%s but panic disabled\n", msg);
> +
> +	panic(msg);
>  }
>  
>  static int ghes_proc(struct ghes *ghes)
> -- 
> 2.43.0
> 
> 
> -- 
> Regards/Gruss,
>     Boris.
> 
> https://people.kernel.org/tglx/notes-about-netiquette
Huang, Ying Jan. 2, 2025, 2:46 a.m. UTC | #9
Borislav Petkov <bp@alien8.de> writes:

> On Tue, Dec 31, 2024 at 06:15:59PM +0800, Feng Tang wrote:
>> Thanks for the hint! IIUC, you are mentioning the set_arch_panic_timeout().
>> One thing is, most ARCHs' default timeout is 0, while in our case, the user
>> will also set 'panic=0' :), so we can't easily detect if the 0 is the user-set
>> value or the OS default one. Originally I even thought about adding a flag
>> of 'timeout_user_changed'.  Any suggestion?
>
> Ok, enough talking. Let's get concrete:
>
> From: "Borislav Petkov (AMD)" <bp@alien8.de>
> Date: Tue, 31 Dec 2024 12:03:55 +0100
> Subject: [PATCH] APEI: GHES: Have GHES honor the panic= setting
>
> The GHES driver overrides the panic= setting by rebooting the system
> after a fatal hw error has been reported. The intent being that such an
> error would be hopefully written out faster on non-volatile storage for
> later inspection.

IIUC, the hardware error will be written out on non-volatile storage at
the same time with or without ghes_panic_timeout overriding.  The
difference is that after rebooting, the error information in
non-volatile storage can be extracted and reported via UI, SNMP, or
MCTP earlier.

> However, this is not optimal when a hard-to-debug issue requires long
> time to reproduce and when that happens, the box will get rebooted after
> 30 seconds and thus destroy the whole hw context of when the error
> happened.
>
> So rip out the default GHES panic timeout and honor the global one.
>
> In the panic disabled (panic=0) case, the error will still be logged to
> dmesg for later inspection and if panic after a hw error is really
> required, then that can be controlled the usual way - use panic= on the
> cmdline or set it in the kernel .config's CONFIG_PANIC_TIMEOUT.
>
> Reported-by: Feng Tang <feng.tang@linux.alibaba.com>
> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
> ---
>  drivers/acpi/apei/ghes.c | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index 07789f0b59bc..b72772494655 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -173,8 +173,6 @@ static struct gen_pool *ghes_estatus_pool;
>  static struct ghes_estatus_cache __rcu *ghes_estatus_caches[GHES_ESTATUS_CACHES_SIZE];
>  static atomic_t ghes_estatus_cache_alloced;
>  
> -static int ghes_panic_timeout __read_mostly = 30;
> -
>  static void __iomem *ghes_map(u64 pfn, enum fixed_addresses fixmap_idx)
>  {
>  	phys_addr_t paddr;
> @@ -983,14 +981,16 @@ static void __ghes_panic(struct ghes *ghes,
>  			 struct acpi_hest_generic_status *estatus,
>  			 u64 buf_paddr, enum fixed_addresses fixmap_idx)
>  {
> +	const char *msg = GHES_PFX "Fatal hardware error";
> +
>  	__ghes_print_estatus(KERN_EMERG, ghes->generic, estatus);
>  
>  	ghes_clear_estatus(ghes, estatus, buf_paddr, fixmap_idx);
>  
> -	/* reboot to log the error! */
>  	if (!panic_timeout)
> -		panic_timeout = ghes_panic_timeout;
> -	panic("Fatal hardware error!");
> +		pr_emerg("%s but panic disabled\n", msg);
> +
> +	panic(msg);
>  }
>  
>  static int ghes_proc(struct ghes *ghes)

---
Best Regards,
Huang, Ying
diff mbox series

Patch

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 07789f0b59bc..a8a6310e476a 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -174,6 +174,7 @@  static struct ghes_estatus_cache __rcu *ghes_estatus_caches[GHES_ESTATUS_CACHES_
 static atomic_t ghes_estatus_cache_alloced;
 
 static int ghes_panic_timeout __read_mostly = 30;
+module_param(ghes_panic_timeout, int, 0644);
 
 static void __iomem *ghes_map(u64 pfn, enum fixed_addresses fixmap_idx)
 {