Message ID | 20241227095422.44147-1-feng.tang@linux.alibaba.com |
---|---|
State | New |
Headers | show |
Series | acpi/ghes: Make ghes_panic_timeout adjustable as a parameter | expand |
On December 27, 2024 10:54:22 AM GMT+01:00, Feng Tang <feng.tang@linux.alibaba.com> wrote: >There is a problem report that when debugging a hard-to-reproduce panic >issue, user wanted the kernel to not reboot by adding "panic=0" in >kernel cmdline, so that the panic context could be kept, say the panic >was caught randomly in the mid-night, and user hoped to check it in >the morning. GHES panic handler may overwrite that user setting and >force to reboot after 'ghes_panic_timeout'(30) seconds. Why doesn't the ghes panic handler honor a panic=0 setting?
Hi, Boris, Borislav Petkov <bp@alien8.de> writes: > On December 27, 2024 10:54:22 AM GMT+01:00, Feng Tang <feng.tang@linux.alibaba.com> wrote: >>There is a problem report that when debugging a hard-to-reproduce panic >>issue, user wanted the kernel to not reboot by adding "panic=0" in >>kernel cmdline, so that the panic context could be kept, say the panic >>was caught randomly in the mid-night, and user hoped to check it in >>the morning. GHES panic handler may overwrite that user setting and >>force to reboot after 'ghes_panic_timeout'(30) seconds. > > Why doesn't the ghes panic handler honor a panic=0 setting? It appears that I introduced the ghes_panic_timeout originally. panic() is used for software errors, while ghes is used for hardware errors. They may have different requirements. For example, it may be OK to wait forever for a software error, but it may be better to reboot the system to contain the influence of the hardware error for some hardware errors. So, we introduced another knob for that. --- Best Regards, Huang, Ying
Borislav Petkov <bp@alien8.de> writes: > On Mon, Dec 30, 2024 at 01:54:36PM +0800, Huang, Ying wrote: >> For example, it may be OK to wait forever for a software error, but it may >> be better to reboot the system to contain the influence of the hardware >> error for some hardware errors. > > A default panic timeout of 30 seconds for hw errors?! You do realize that 30 > seconds for a machine is an eternity and by that time your hardware error has > long propagated and corrupted results, right? > > So your timeout is not even trying to do what you want. > > So unless I'm missing something, this ghes timeout needs to go - if you want > to "contain the influence" you need to panic *immediately*! And not even that > would help in some cases - hw has its own protections there so the OS > panicking is meh. At least on x86, that is. OK. 30 seconds isn't good enough for hw errors. Another possible benefit of ghes_panic_timeout is, rebooting instead of waiting forever can help us to log/report the hardware errors earlier. For example, the hardware errors could be logged in some simple non-volatile storage (such as EFI variables) during panic or kdump, etc. Then, after reboot, the new kernel could report the hardware errors in some way. >> So, we introduced another knob for that. > > No, that another knob is piling more of the silly ontop. --- Best Regards, Huang, Ying
On Mon, Dec 30, 2024 at 12:26:08PM +0100, Borislav Petkov wrote: > On Mon, Dec 30, 2024 at 07:04:59PM +0800, Huang, Ying wrote: > > Another possible benefit of ghes_panic_timeout is, > > > > rebooting instead of waiting forever can help us to log/report the > > hardware errors earlier. > > So you rip out ghes_panic_timeout and set the panic_timeout in the ghes code, > when you get a hw error which requires reboot. Still we may need to honor the user setting, say if user specifically set "panic=XXX" in the cmdline, we should detect that case and skip overwritting the panic_timeout? Thanks, Feng > > -- > Regards/Gruss, > Boris. > > https://people.kernel.org/tglx/notes-about-netiquette
On Mon, Dec 30, 2024 at 01:10:09PM +0100, Borislav Petkov wrote: > On Mon, Dec 30, 2024 at 07:40:18PM +0800, Feng Tang wrote: > > Still we may need to honor the user setting, say if user specifically set > > "panic=XXX" in the cmdline, we should detect that case and skip overwritting > > the panic_timeout? > > So the user has set panic timeout to something, machine encounters a hw error > which you want to log/report earlier but user's panic setting prevents you > from doing that. > > So what do you do? > > What has higher prio? > > I guess we're something like this: > > https://youtu.be/gLFQystE8vU?t=71 Hah! As per kernel config, most ARCH has 'panic_timeout' as 0 by default, so need to set the kcmdline. For the case in my commit log, where user had clear requirement for not-reboot and wait, the manually set 'panic=0' should take priority here? Thanks, Feng > > :-P > > -- > Regards/Gruss, > Boris. > > https://people.kernel.org/tglx/notes-about-netiquette
On Mon, Dec 30, 2024 at 02:24:03PM +0100, Borislav Petkov wrote: > On Mon, Dec 30, 2024 at 09:04:11PM +0800, Feng Tang wrote: > > As per kernel config, most ARCH has 'panic_timeout' as 0 by default, so > > need to set the kcmdline. For the case in my commit log, where user had > > clear requirement for not-reboot and wait, the manually set 'panic=0' > > should take priority here? > > I think so. > > I'm not convinced that lets-log-the-hw-error-faster should override the panic > timeout setting but I'm open to real-life example scenarios... > I see. Upon the discussion so far, how about the change below? --- diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c index 07789f0b59bc..113471b76d8d 100644 --- a/drivers/acpi/apei/ghes.c +++ b/drivers/acpi/apei/ghes.c @@ -173,8 +173,6 @@ static struct gen_pool *ghes_estatus_pool; static struct ghes_estatus_cache __rcu *ghes_estatus_caches[GHES_ESTATUS_CACHES_SIZE]; static atomic_t ghes_estatus_cache_alloced; -static int ghes_panic_timeout __read_mostly = 30; - static void __iomem *ghes_map(u64 pfn, enum fixed_addresses fixmap_idx) { phys_addr_t paddr; @@ -987,9 +985,10 @@ static void __ghes_panic(struct ghes *ghes, ghes_clear_estatus(ghes, estatus, buf_paddr, fixmap_idx); - /* reboot to log the error! */ - if (!panic_timeout) - panic_timeout = ghes_panic_timeout; + /* If user hasn't specifically set panic timeout, reboot to log the error! */ + if (!panic_timeout && !strstr(saved_command_line, "panic=")) + panic_timeout = 30; + panic("Fatal hardware error!"); } Or we want to stick the orignal patch, which doesn't change the original flow? Thanks, Feng
On Tue, Dec 31, 2024 at 10:23:58AM +0100, Borislav Petkov wrote: > On Tue, Dec 31, 2024 at 02:44:48PM +0800, Feng Tang wrote: > > + /* If user hasn't specifically set panic timeout, reboot to log the error! */ > > + if (!panic_timeout && !strstr(saved_command_line, "panic=")) > > And you want to scan saved_command_line because? > > Hint: look at how other code checks panic_timeout. Thanks for the hint! IIUC, you are mentioning the set_arch_panic_timeout(). One thing is, most ARCHs' default timeout is 0, while in our case, the user will also set 'panic=0' :), so we can't easily detect if the 0 is the user-set value or the OS default one. Originally I even thought about adding a flag of 'timeout_user_changed'. Any suggestion? > > Or we want to stick the orignal patch, which doesn't change the > > original flow? > > And pile more broken stuff ontop? OK, will skip this. Thanks, Feng
On Tue, Dec 31, 2024 at 12:13:14PM +0100, Borislav Petkov wrote: > On Tue, Dec 31, 2024 at 06:15:59PM +0800, Feng Tang wrote: > > Thanks for the hint! IIUC, you are mentioning the set_arch_panic_timeout(). > > One thing is, most ARCHs' default timeout is 0, while in our case, the user > > will also set 'panic=0' :), so we can't easily detect if the 0 is the user-set > > value or the OS default one. Originally I even thought about adding a flag > > of 'timeout_user_changed'. Any suggestion? > > Ok, enough talking. Let's get concrete: > > From: "Borislav Petkov (AMD)" <bp@alien8.de> > Date: Tue, 31 Dec 2024 12:03:55 +0100 > Subject: [PATCH] APEI: GHES: Have GHES honor the panic= setting > > The GHES driver overrides the panic= setting by rebooting the system > after a fatal hw error has been reported. The intent being that such an > error would be hopefully written out faster on non-volatile storage for > later inspection. > > However, this is not optimal when a hard-to-debug issue requires long > time to reproduce and when that happens, the box will get rebooted after > 30 seconds and thus destroy the whole hw context of when the error > happened. > > So rip out the default GHES panic timeout and honor the global one. > > In the panic disabled (panic=0) case, the error will still be logged to > dmesg for later inspection and if panic after a hw error is really > required, then that can be controlled the usual way - use panic= on the > cmdline or set it in the kernel .config's CONFIG_PANIC_TIMEOUT. > > Reported-by: Feng Tang <feng.tang@linux.alibaba.com> > Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Looks good to me, thanks! Reviewed-by: Feng Tang <feng.tang@linux.alibaba.com> > --- > drivers/acpi/apei/ghes.c | 10 +++++----- > 1 file changed, 5 insertions(+), 5 deletions(-) > > diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c > index 07789f0b59bc..b72772494655 100644 > --- a/drivers/acpi/apei/ghes.c > +++ b/drivers/acpi/apei/ghes.c > @@ -173,8 +173,6 @@ static struct gen_pool *ghes_estatus_pool; > static struct ghes_estatus_cache __rcu *ghes_estatus_caches[GHES_ESTATUS_CACHES_SIZE]; > static atomic_t ghes_estatus_cache_alloced; > > -static int ghes_panic_timeout __read_mostly = 30; > - > static void __iomem *ghes_map(u64 pfn, enum fixed_addresses fixmap_idx) > { > phys_addr_t paddr; > @@ -983,14 +981,16 @@ static void __ghes_panic(struct ghes *ghes, > struct acpi_hest_generic_status *estatus, > u64 buf_paddr, enum fixed_addresses fixmap_idx) > { > + const char *msg = GHES_PFX "Fatal hardware error"; > + > __ghes_print_estatus(KERN_EMERG, ghes->generic, estatus); > > ghes_clear_estatus(ghes, estatus, buf_paddr, fixmap_idx); > > - /* reboot to log the error! */ > if (!panic_timeout) > - panic_timeout = ghes_panic_timeout; > - panic("Fatal hardware error!"); > + pr_emerg("%s but panic disabled\n", msg); > + > + panic(msg); > } > > static int ghes_proc(struct ghes *ghes) > -- > 2.43.0 > > > -- > Regards/Gruss, > Boris. > > https://people.kernel.org/tglx/notes-about-netiquette
Borislav Petkov <bp@alien8.de> writes: > On Tue, Dec 31, 2024 at 06:15:59PM +0800, Feng Tang wrote: >> Thanks for the hint! IIUC, you are mentioning the set_arch_panic_timeout(). >> One thing is, most ARCHs' default timeout is 0, while in our case, the user >> will also set 'panic=0' :), so we can't easily detect if the 0 is the user-set >> value or the OS default one. Originally I even thought about adding a flag >> of 'timeout_user_changed'. Any suggestion? > > Ok, enough talking. Let's get concrete: > > From: "Borislav Petkov (AMD)" <bp@alien8.de> > Date: Tue, 31 Dec 2024 12:03:55 +0100 > Subject: [PATCH] APEI: GHES: Have GHES honor the panic= setting > > The GHES driver overrides the panic= setting by rebooting the system > after a fatal hw error has been reported. The intent being that such an > error would be hopefully written out faster on non-volatile storage for > later inspection. IIUC, the hardware error will be written out on non-volatile storage at the same time with or without ghes_panic_timeout overriding. The difference is that after rebooting, the error information in non-volatile storage can be extracted and reported via UI, SNMP, or MCTP earlier. > However, this is not optimal when a hard-to-debug issue requires long > time to reproduce and when that happens, the box will get rebooted after > 30 seconds and thus destroy the whole hw context of when the error > happened. > > So rip out the default GHES panic timeout and honor the global one. > > In the panic disabled (panic=0) case, the error will still be logged to > dmesg for later inspection and if panic after a hw error is really > required, then that can be controlled the usual way - use panic= on the > cmdline or set it in the kernel .config's CONFIG_PANIC_TIMEOUT. > > Reported-by: Feng Tang <feng.tang@linux.alibaba.com> > Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> > --- > drivers/acpi/apei/ghes.c | 10 +++++----- > 1 file changed, 5 insertions(+), 5 deletions(-) > > diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c > index 07789f0b59bc..b72772494655 100644 > --- a/drivers/acpi/apei/ghes.c > +++ b/drivers/acpi/apei/ghes.c > @@ -173,8 +173,6 @@ static struct gen_pool *ghes_estatus_pool; > static struct ghes_estatus_cache __rcu *ghes_estatus_caches[GHES_ESTATUS_CACHES_SIZE]; > static atomic_t ghes_estatus_cache_alloced; > > -static int ghes_panic_timeout __read_mostly = 30; > - > static void __iomem *ghes_map(u64 pfn, enum fixed_addresses fixmap_idx) > { > phys_addr_t paddr; > @@ -983,14 +981,16 @@ static void __ghes_panic(struct ghes *ghes, > struct acpi_hest_generic_status *estatus, > u64 buf_paddr, enum fixed_addresses fixmap_idx) > { > + const char *msg = GHES_PFX "Fatal hardware error"; > + > __ghes_print_estatus(KERN_EMERG, ghes->generic, estatus); > > ghes_clear_estatus(ghes, estatus, buf_paddr, fixmap_idx); > > - /* reboot to log the error! */ > if (!panic_timeout) > - panic_timeout = ghes_panic_timeout; > - panic("Fatal hardware error!"); > + pr_emerg("%s but panic disabled\n", msg); > + > + panic(msg); > } > > static int ghes_proc(struct ghes *ghes) --- Best Regards, Huang, Ying
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c index 07789f0b59bc..a8a6310e476a 100644 --- a/drivers/acpi/apei/ghes.c +++ b/drivers/acpi/apei/ghes.c @@ -174,6 +174,7 @@ static struct ghes_estatus_cache __rcu *ghes_estatus_caches[GHES_ESTATUS_CACHES_ static atomic_t ghes_estatus_cache_alloced; static int ghes_panic_timeout __read_mostly = 30; +module_param(ghes_panic_timeout, int, 0644); static void __iomem *ghes_map(u64 pfn, enum fixed_addresses fixmap_idx) {
There is a problem report that when debugging a hard-to-reproduce panic issue, user wanted the kernel to not reboot by adding "panic=0" in kernel cmdline, so that the panic context could be kept, say the panic was caught randomly in the mid-night, and user hoped to check it in the morning. GHES panic handler may overwrite that user setting and force to reboot after 'ghes_panic_timeout'(30) seconds. Make 'ghes_panic_timeout' a parameter can provide user some flexibility to change the timeout on demand, without changing current behavior. Signed-off-by: Feng Tang <feng.tang@linux.alibaba.com> --- drivers/acpi/apei/ghes.c | 1 + 1 file changed, 1 insertion(+)