diff mbox series

[24/30] panic: Refactor the panic path

Message ID 20220427224924.592546-25-gpiccoli@igalia.com
State New
Headers show
Series The panic notifiers refactor | expand

Commit Message

Guilherme G. Piccoli April 27, 2022, 10:49 p.m. UTC
The panic() function is somewhat convoluted - a lot of changes were
made over the years, adding comments that might be misleading/outdated
now, it has a code structure that is a bit complex to follow, with
lots of conditionals, for example. The panic notifier list is something
else - a single list, with multiple callbacks of different purposes,
that run in a non-deterministic order and may affect hardly kdump
reliability - see the "crash_kexec_post_notifiers" workaround-ish flag.

This patch proposes a major refactor on the panic path based on Petr's
idea [0] - basically we split the notifiers list in three, having a set
of different call points in the panic path. Below a list of changes
proposed in this patch, culminating in the panic notifiers level
concept:

(a) First of all, we improved comments all over the function
and removed useless variables / includes. Also, as part of this
clean-up we concentrate the console flushing functions in a helper.

(b) As mentioned before, there is a split of the panic notifier list
in three, based on the purpose of the callback. The code contains
good documentation in form of comments, but a summary of the three
lists follows:

- the hypervisor list aims low-risk procedures to inform hypervisors
or firmware about the panic event, also includes LED-related functions;

- the informational list contains callbacks that provide more details,
like kernel offset or trace dump (if enabled) and also includes the
callbacks aimed at reducing log pollution or warns, like the RCU and
hung task disable callbacks;

- finally, the pre_reboot list is the old notifier list renamed,
containing the more risky callbacks that didn't fit the previous
lists. There is also a 4th list (the post_reboot one), but it's not
related with the original list - it contains late time architecture
callbacks aimed at stopping the machine, for example.

The 3 notifiers lists execute in different moments, hypervisor being
the first, followed by informational and finally the pre_reboot list.

(c) But then, there is the ordering problem of the notifiers against
the crash_kernel() call - kdump must be as reliable as possible.
For that, a simple binary "switch" as "crash_kexec_post_notifiers"
is not enough, hence we introduce here concept of panic notifier
levels: there are 5 levels, from 0 (no notifier executes before
kdump) until 4 (all notifiers run before kdump); the default level
is 2, in which the hypervisor and (iff we have any kmsg dumper)
the informational notifiers execute before kdump.

The detailed documentation of the levels is present in code comments
and in the kernel-parameters.txt file; as an analogy with the previous
panic() implementation, the level 0 is exactly the same as the old
behavior of notifiers, running all after kdump, and the level 4 is
the same as "crash_kexec_post_notifiers=Y" (we kept this parameter as
a deprecated one).

(d) Finally, an important change made here: we now use only the
function "crash_smp_send_stop()" to shut all the secondary CPUs
in the panic path. Before, there was a case of using the regular
"smp_send_stop()", but the better approach is to simplify the
code and try to use the function which was created exclusively
for the panic path. Experiments showed that it works fine, and
code was very simplified with that.

Functional change is expected from this refactor, since now we
call some notifiers by default before kdump, but the goal here
besides code clean-up is to have a better panic path, more
reliable and deterministic, but also very customizable.

[0] https://lore.kernel.org/lkml/YfPxvzSzDLjO5ldp@alley/

Suggested-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
---

Special thanks to Petr and Baoquan for the suggestion and feedback in a previous
email thread. There's some important design decisions that worth mentioning and
discussing:

* The default panic notifiers level is 2, based on Petr Mladek's suggestion,
which makes a lot of sense. Of course, this is customizable through the
parameter, but would be something worthwhile to have a KConfig option to set
the default level? It would help distros that want the old behavior
(no notifiers before kdump) as default.

* The implementation choice was to _avoid_ intricate if conditionals in the
panic path, which would _definitely_ be present with the panic notifiers levels
idea; so, instead of lots of if conditionals, the set/clear bits approach with
functions called in 2 points (but executing only in one of them) is much easier
to follow an was used here; the ordering helper function and the comments also
help a lot to avoid confusion (hopefully).

* Choice was to *always* use crash_smp_send_stop() instead of sometimes making
use of the regular smp_send_stop(); for most architectures they are the same,
including Xen (on x86). For the ones that override it, all should work fine,
in the powerpc case it's even more correct (see the subsequent patch
"powerpc: Do not force all panic notifiers to execute before kdump")

There seems to be 2 cases that requires some plumbing to work 100% right:
- ARM doesn't disable local interrupts / FIQs in the crash version of
send_stop(); we patched that early in this series;
- x86 could face an issue if we have VMX and do use crash_smp_send_stop()
_without_ kdump, but this is fixed in the first patch of the series (and
it's a bug present even before this refactor).

* Notice we didn't add a sysrq for panic notifiers level - should have it?
Alejandro proposed recently to add a sysrq for "crash_kexec_post_notifiers",
let me know if you feel the need here Alejandro, since the core parameters are
present in /sys, I didn't consider much gain in having a sysrq, but of course
I'm open to suggestions!

Thanks advance for the review!

 .../admin-guide/kernel-parameters.txt         |  42 ++-
 include/linux/panic_notifier.h                |   1 +
 kernel/kexec_core.c                           |   8 +-
 kernel/panic.c                                | 292 +++++++++++++-----
 .../selftests/pstore/pstore_crash_test        |   5 +-
 5 files changed, 252 insertions(+), 96 deletions(-)

Comments

d.hatayama@fujitsu.com May 9, 2022, 3:16 p.m. UTC | #1
Sorry for the delayed response. Unfortunately, I had 10 days holidays
until yesterday...

>  .../admin-guide/kernel-parameters.txt         |  42 ++-
>  include/linux/panic_notifier.h                |   1 +
>  kernel/kexec_core.c                           |   8 +-
>  kernel/panic.c                                | 292 +++++++++++++-----
>  .../selftests/pstore/pstore_crash_test        |   5 +-
>  5 files changed, 252 insertions(+), 96 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 3f1cc5e317ed..8d3524060ce3 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
...snip...
> @@ -3784,6 +3791,33 @@
>                         timeout < 0: reboot immediately
>                         Format: <timeout>
> 
> +       panic_notifiers_level=
> +                       [KNL] Set the panic notifiers execution order.
> +                       Format: <unsigned int>
> +                       We currently have 4 lists of panic notifiers; based
> +                       on the functionality and risk (for panic success) the
> +                       callbacks are added in a given list. The lists are:
> +                       - hypervisor/FW notification list (low risk);
> +                       - informational list (low/medium risk);
> +                       - pre_reboot list (higher risk);
> +                       - post_reboot list (only run late in panic and after
> +                       kdump, not configurable for now).
> +                       This parameter defines the ordering of the first 3
> +                       lists with regards to kdump; the levels determine
> +                       which set of notifiers execute before kdump. The
> +                       accepted levels are:
> +                       0: kdump is the first thing to run, NO list is
> +                       executed before kdump.
> +                       1: only the hypervisor list is executed before kdump.
> +                       2 (default level): the hypervisor list and (*if*

Hmmm, why are you trying to change default setting?

Based on the current design of kdump, it's natural to put what the
handlers for these level 1 and level 2 handlers do in
machine_crash_shutdown(), as these are necessary by default, right?

Or have you already tried that and figured out it's difficult in some
reason and reached the current design? If so, why is that difficult?
Could you point to if there is already such discussion online?

kdump is designed to perform as little things as possible before
transferring the execution to the 2nd kernel in order to increase
reliability. Just detour to panic() increases risks of kdump failure
in the sense of increasing the executed codes in the abnormal
situation, which is very the note in the explanation of
crash_kexec_post_notifiers.

Also, the current implementation of crash_kexec_post_notifiers uses
the panic notifier, but this is not from the technical
reason. Ideally, it should have been implemented in the context of
crash_kexec() independently of panic().

That is, it looks to me that, in addition to changing design of panic
notifier, you are trying to integrate shutdown code of the crash kexec
and the panic paths. If so, this is a big design change for kdump.
I'm concerned about increase of reliability. I'd like you to discuss
them carefully.

Thanks.
HATAYAMA, Daisuke
Guilherme G. Piccoli May 9, 2022, 4:39 p.m. UTC | #2
Hey Hatayma, thanks for your great analysis and no need for apologies!

I'll comment/respond properly inline below, just noticing here that I've
CCed Mark and Marc (from the ARM64 perspective), Michael (Hyper-V
perspective) and Hari (PowerPC perspective), besides the usual suspects
as Petr, Baoquan, etc.

On 09/05/2022 12:16, d.hatayama@fujitsu.com wrote:
> Sorry for the delayed response. Unfortunately, I had 10 days holidays
> until yesterday...
> [...] 
>> +                       We currently have 4 lists of panic notifiers; based
>> +                       on the functionality and risk (for panic success) the
>> +                       callbacks are added in a given list. The lists are:
>> +                       - hypervisor/FW notification list (low risk);
>> +                       - informational list (low/medium risk);
>> +                       - pre_reboot list (higher risk);
>> +                       - post_reboot list (only run late in panic and after
>> +                       kdump, not configurable for now).
>> +                       This parameter defines the ordering of the first 3
>> +                       lists with regards to kdump; the levels determine
>> +                       which set of notifiers execute before kdump. The
>> +                       accepted levels are:
>> +                       0: kdump is the first thing to run, NO list is
>> +                       executed before kdump.
>> +                       1: only the hypervisor list is executed before kdump.
>> +                       2 (default level): the hypervisor list and (*if*
> 
> Hmmm, why are you trying to change default setting?
> 
> Based on the current design of kdump, it's natural to put what the
> handlers for these level 1 and level 2 handlers do in
> machine_crash_shutdown(), as these are necessary by default, right?
> 
> Or have you already tried that and figured out it's difficult in some
> reason and reached the current design? If so, why is that difficult?
> Could you point to if there is already such discussion online?
> 
> kdump is designed to perform as little things as possible before
> transferring the execution to the 2nd kernel in order to increase
> reliability. Just detour to panic() increases risks of kdump failure
> in the sense of increasing the executed codes in the abnormal
> situation, which is very the note in the explanation of
> crash_kexec_post_notifiers.
> 
> Also, the current implementation of crash_kexec_post_notifiers uses
> the panic notifier, but this is not from the technical
> reason. Ideally, it should have been implemented in the context of
> crash_kexec() independently of panic().
> 
> That is, it looks to me that, in addition to changing design of panic
> notifier, you are trying to integrate shutdown code of the crash kexec
> and the panic paths. If so, this is a big design change for kdump.
> I'm concerned about increase of reliability. I'd like you to discuss
> them carefully.
Guilherme G. Piccoli May 15, 2022, 10:47 p.m. UTC | #3
On 12/05/2022 11:03, Petr Mladek wrote:
> Hello,
> 
> first, I am sorry for stepping into the discussion so late.
> I was busy with some other stuff and this patchset is far
> from trivial.
> 
> Second, thanks a lot for putting so much effort into it.
> Most of the changes look pretty good, especially all
> the fixes of particular notifiers and split into
> four lists.
> 
> Though this patch will need some more love. See below
> for more details.

Thanks a lot for your review Petr, it is much appreciated! No need for
apologies, there is no urgency here =)


> [...] 
> This talks only about kdump. The reality is much more complicated.
> The level affect the order of:
> 
>     + notifiers vs. kdump
>     + notifiers vs. crash_dump
>     + crash_dump vs. kdump

First of all, I'd like to ask you please to clarify to me *exactly* what
are the differences between "crash_dump" and "kdump". I'm sorry if
that's a silly question, I need to be 100% sure I understand the
concepts the same way you do.


> There might theoretically many variants of the ordering of kdump,
> crash_dump, and the 4 notifier list. Some variants do not make
> much sense. You choose 5 variants and tried to select them by
> a level number.
> 
> The question is if we really could easily describe the meaning this
> way. It is not only about a "level" of notifiers before kdump. It is
> also about the ordering of crash_dump vs. kdump. IMHO, "level"
> semantic does not fit there.
> 
> Maybe more parameters might be easier to understand the effect.
> Anyway, we first need to agree on the chosen variants.
> I am going to discuss it more in the code, see below.
> 
> 
> [...] 
> Here is the code using the above functions. It helps to discuss
> the design and logic.
> 
> <kernel/panic.c>
> 	order_panic_notifiers_and_kdump();
> 
> 	/* If no level, we should kdump ASAP. */
> 	if (!panic_notifiers_level)
> 		__crash_kexec(NULL);
> 
> 	crash_smp_send_stop();
> 	panic_notifier_hypervisor_once(buf);
> 
> 	if (panic_notifier_info_once(buf))
> 		kmsg_dump(KMSG_DUMP_PANIC);
> 
> 	panic_notifier_pre_reboot_once(buf);
> 
> 	__crash_kexec(NULL);
> 
> 	panic_notifier_hypervisor_once(buf);
> 
> 	if (panic_notifier_info_once(buf))
> 		kmsg_dump(KMSG_DUMP_PANIC);
> 
> 	panic_notifier_pre_reboot_once(buf);
> </kernel/panic.c>
> 
> I have to say that the logic is very unclear. Almost all
> functions are called twice:
> 
>    + __crash_kexec()
>    + kmsg_dump()
>    + panic_notifier_hypervisor_once()
>    + panic_notifier_pre_reboot_once()
>    + panic_notifier_info_once()
> 
> It is pretty hard to find what functions are always called in the same
> order and where the order can be inverted.
> 
> The really used code path is defined by order_panic_notifiers_and_kdump()
> that encodes "level" into "bits". The bits are then flipped in
> panic_notifier_*_once() calls that either do something or not.
> kmsg_dump() is called according to the bit flip.
> 
> It is an interesting approach. I guess that you wanted to avoid too
> many if/then/else levels in panic(). But honestly, it looks like
> a black magic to me.
> 
> IMHO, it is always easier to follow if/then/else logic than using
> a translation table that requires additional bit flips when
> a value is used more times.
> 
> Also I guess that it is good proof that "level" abstraction does
> not fit here. Normal levels would not need this kind of magic.

Heheh OK, I appreciate your opinion, but I guess we'll need to agree in
disagree here - I'm much more fond to this kind of code than a bunch of
if/else blocks that almost give headaches. Encoding such "level" logic
in the if/else scheme is very convoluted, generates a very big code. And
the functions aren't so black magic - they map a level in bits, and the
functions _once() are called...once! Although we switch the position in
the code, so there are 2 calls, one of them is called and the other not.

But that's totally fine to change - especially if we're moving away from
the "level" logic. I see below you propose a much simpler approach - if
we follow that, definitely we won't need the "black magic" approach heheh


> 
> OK, the question is how to make it better. Let's start with
> a clear picture of the problem:
> 
> 1. panic() has basically two funtions:
> 
>       + show/store debug information (optional ways and amount)
>       + do something with the system (reboot, stay hanged)
> 
> 
> 2. There are 4 ways how to show/store the information:
> 
>       + tell hypervisor to store what it is interested about
>       + crash_dump
>       + kmsg_dump()
>       + consoles
> 
>   , where crash_dump and consoles are special:
> 
>      + crash_dump does not return. Instead it ends up with reboot.
> 
>      + Consoles work transparently. They just need an extra flush
>        before reboot or staying hanged.
> 
> 
> 3. The various notifiers do things like:
> 
>      + tell hypervisor about the crash
>      + print more information (also stop watchdogs)
>      + prepare system for reboot (touch some interfaces)
>      + prepare system for staying hanged (blinking)
> 
>    Note that it pretty nicely matches the 4 notifier lists.
> 

I really appreciate the summary skill you have, to convert complex
problems in very clear and concise ideas. Thanks for that, very useful!
I agree with what was summarized above.


> Now, we need to decide about the ordering. The main area is how
> to store the debug information. Consoles are transparent so
> the quesition is about:
> 
>      + hypervisor
>      + crash_dump
>      + kmsg_dump
> 
> Some people need none and some people want all. There is a
> risk that system might hung at any stage. This why people want to
> make the order configurable.
> 
> But crash_dump() does not return when it succeeds. And kmsg_dump()
> users havn't complained about hypervisor problems yet. So, that
> two variants might be enough:
> 
>     + crash_dump (hypervisor, kmsg_dump as fallback)
>     + hypervisor, kmsg_dump, crash_dump
> 
> One option "panic_prefer_crash_dump" should be enough.
> And the code might look like:
> 
> void panic()
> {
> [...]
> 	dump_stack();
> 	kgdb_panic(buf);
> 
> 	< ---  here starts the reworked code --- >
> 
> 	/* crash dump is enough when enabled and preferred. */
> 	if (panic_prefer_crash_dump)
> 		__crash_kexec(NULL);
> 
> 	/* Stop other CPUs and focus on handling the panic state. */
> 	if (has_kexec_crash_image)
> 		crash_smp_send_stop();
> 	else
> 		smp_send_stop()
> 

Here we have a very important point. Why do we need 2 variants of SMP
CPU stopping functions? I disagree with that - my understanding of this
after some study in architectures is that the crash_() variant is
"stronger", should work in all cases and if not, we should fix that -
that'd be a bug.

Such variant either maps to smp_send_stop() (in various architectures,
including XEN/x86) or overrides the basic function with more proper
handling for panic() case...I don't see why we still need such
distinction, if you / others have some insight about that, I'd like to
hear =)


> 	/* Notify hypervisor about the system panic. */
> 	atomic_notifier_call_chain(&panic_hypervisor_list, 0, NULL);
> 
> 	/*
> 	 * No need to risk extra info when there is no kmsg dumper
> 	 * registered.
> 	 */
> 	if (!has_kmsg_dumper())
> 		__crash_kexec(NULL);
> 
> 	/* Add extra info from different subsystems. */
> 	atomic_notifier_call_chain(&panic_info_list, 0, NULL);
> 
> 	kmsg_dump(KMSG_DUMP_PANIC);
> 	__crash_kexec(NULL);
> 
> 	/* Flush console */
> 	unblank_screen();
> 	console_unblank();
> 	debug_locks_off();
> 	console_flush_on_panic(CONSOLE_FLUSH_PENDING);
> 
> 	if (panic_timeout > 0) {
> 		delay()
> 	}
> 
> 	/*
> 	 * Prepare system for eventual reboot and allow custom
> 	 * reboot handling.
> 	 */
> 	atomic_notifier_call_chain(&panic_reboot_list, 0, NULL);

You had the order of panic_reboot_list VS. consoles flushing inverted.
It might make sense, although I didn't do that in V1...
Are you OK in having a helper for console flushing, as I did in V1? It
makes code of panic() a bit less polluted / more focused I feel.


> 
> 	if (panic_timeout != 0) {
> 		reboot();
> 	}
> 
> 	/*
> 	 * Prepare system for the infinite waiting, for example,
> 	 * setup blinking.
> 	 */
> 	atomic_notifier_call_chain(&panic_loop_list, 0, NULL);
> 
> 	infinite_loop();
> }
> 
> 
> __crash_kexec() is there 3 times but otherwise the code looks
> quite straight forward.
> 
> Note 1: I renamed the two last notifier list. The name 'post-reboot'
> 	did sound strange from the logical POV ;-)
> 
> Note 2: We have to avoid the possibility to call "reboot" list
> 	before kmsg_dump(). All callbacks providing info
> 	have to be in the info list. It a callback combines
> 	info and reboot functionality then it should be split.
> 
> 	There must be another way to calm down problematic
> 	info callbacks. And it has to be solved when such
> 	a problem is reported. Is there any known issue, please?
> 
> It is possible that I have missed something important.
> But I would really like to make the logic as simple as possible.

OK, I agree with you! It's indeed simpler and if others agree, I can
happily change the logic to what you proposed. Although...currently the
"crash_kexec_post_notifiers" allows to call _all_ panic_reboot_list
callbacks _before kdump_.

We need to mention this change in the commit messages, but I really
would like to hear the opinions of heavy users of notifiers (as
Michael/Hyper-V) and the kdump interested parties (like Baoquan / Dave
Young / Hayatama). If we all agree on such approach, will change that
for V2 =)

Thanks again Petr, for the time spent in such detailed review!
Cheers,


Guilherme
Petr Mladek May 16, 2022, 10:21 a.m. UTC | #4
On Sun 2022-05-15 19:47:39, Guilherme G. Piccoli wrote:
> On 12/05/2022 11:03, Petr Mladek wrote:
> > This talks only about kdump. The reality is much more complicated.
> > The level affect the order of:
> > 
> >     + notifiers vs. kdump
> >     + notifiers vs. crash_dump
> >     + crash_dump vs. kdump
> 
> First of all, I'd like to ask you please to clarify to me *exactly* what
> are the differences between "crash_dump" and "kdump". I'm sorry if
> that's a silly question, I need to be 100% sure I understand the
> concepts the same way you do.

Ah, it should have been:

     + notifiers vs. kmsg_dump
     + notifiers vs. crash_dump
     + crash_dump vs. kmsg_dump

I am sorry for the confusion. Even "crash_dump" is slightly
misleading because there is no function with this name.
But it seems to be easier to understand than __crash_kexec().


> > There might theoretically many variants of the ordering of kdump,
> > crash_dump, and the 4 notifier list. Some variants do not make
> > much sense. You choose 5 variants and tried to select them by
> > a level number.
> > 
> > The question is if we really could easily describe the meaning this
> > way. It is not only about a "level" of notifiers before kdump. It is
> > also about the ordering of crash_dump vs. kdump. IMHO, "level"
> > semantic does not fit there.
> > 
> > Maybe more parameters might be easier to understand the effect.
> > Anyway, we first need to agree on the chosen variants.
> > I am going to discuss it more in the code, see below.
> > 
> > 
> > [...] 
> > Here is the code using the above functions. It helps to discuss
> > the design and logic.
> > 
> > I have to say that the logic is very unclear. Almost all
> > functions are called twice:
> > 
> > The really used code path is defined by order_panic_notifiers_and_kdump()
> > that encodes "level" into "bits". The bits are then flipped in
> > panic_notifier_*_once() calls that either do something or not.
> > kmsg_dump() is called according to the bit flip.
> > 
> > Also I guess that it is good proof that "level" abstraction does
> > not fit here. Normal levels would not need this kind of magic.
> 
> Heheh OK, I appreciate your opinion, but I guess we'll need to agree in
> disagree here - I'm much more fond to this kind of code than a bunch of
> if/else blocks that almost give headaches. Encoding such "level" logic
> in the if/else scheme is very convoluted, generates a very big code. And
> the functions aren't so black magic - they map a level in bits, and the
> functions _once() are called...once! Although we switch the position in
> the code, so there are 2 calls, one of them is called and the other not.

I see. Well, I would consider this as a warning that the approach is
too complex. If the code, using if/then/else, would cause headaches
then also understanding of the behavior would cause headaches for
both users and programmers.


> But that's totally fine to change - especially if we're moving away from
> the "level" logic. I see below you propose a much simpler approach - if
> we follow that, definitely we won't need the "black magic" approach heheh

I do not say that my proposal is fully correct. But we really need
this kind of simpler approach.


> > OK, the question is how to make it better.

> > One option "panic_prefer_crash_dump" should be enough.
> > And the code might look like:
> > 
> > void panic()
> > {
> > [...]
> > 	dump_stack();
> > 	kgdb_panic(buf);
> > 
> > 	< ---  here starts the reworked code --- >
> > 
> > 	/* crash dump is enough when enabled and preferred. */
> > 	if (panic_prefer_crash_dump)
> > 		__crash_kexec(NULL);
> > 
> > 	/* Stop other CPUs and focus on handling the panic state. */
> > 	if (has_kexec_crash_image)
> > 		crash_smp_send_stop();
> > 	else
> > 		smp_send_stop()
> > 
> 
> Here we have a very important point. Why do we need 2 variants of SMP
> CPU stopping functions? I disagree with that - my understanding of this
> after some study in architectures is that the crash_() variant is
> "stronger", should work in all cases and if not, we should fix that -
> that'd be a bug.
> 
> Such variant either maps to smp_send_stop() (in various architectures,
> including XEN/x86) or overrides the basic function with more proper
> handling for panic() case...I don't see why we still need such
> distinction, if you / others have some insight about that, I'd like to
> hear =)

The two variants were introduced by the commit 0ee59413c967c35a6dd
("x86/panic: replace smp_send_stop() with kdump friendly version in
panic path")

It points to https://lkml.org/lkml/2015/6/24/44 that talks about
still running watchdogs.

It is possible that the problem could be fixed another way. It is
even possible that it has already been fixed by the notifiers
that disable the watchdogs.

Anyway, any change of the smp_send_stop() behavior should be done
in a separate patch. It will help with bisection of possible
regression. Also it would require a good explanation in
the commit message. I would personally do it in a separate
patch(set).


> > 	/* Notify hypervisor about the system panic. */
> > 	atomic_notifier_call_chain(&panic_hypervisor_list, 0, NULL);
> > 
> > 	/*
> > 	 * No need to risk extra info when there is no kmsg dumper
> > 	 * registered.
> > 	 */
> > 	if (!has_kmsg_dumper())
> > 		__crash_kexec(NULL);
> > 
> > 	/* Add extra info from different subsystems. */
> > 	atomic_notifier_call_chain(&panic_info_list, 0, NULL);
> > 
> > 	kmsg_dump(KMSG_DUMP_PANIC);
> > 	__crash_kexec(NULL);
> > 
> > 	/* Flush console */
> > 	unblank_screen();
> > 	console_unblank();
> > 	debug_locks_off();
> > 	console_flush_on_panic(CONSOLE_FLUSH_PENDING);
> > 
> > 	if (panic_timeout > 0) {
> > 		delay()
> > 	}
> > 
> > 	/*
> > 	 * Prepare system for eventual reboot and allow custom
> > 	 * reboot handling.
> > 	 */
> > 	atomic_notifier_call_chain(&panic_reboot_list, 0, NULL);
> 
> You had the order of panic_reboot_list VS. consoles flushing inverted.
> It might make sense, although I didn't do that in V1...

IMHO, it makes sense:

  1. panic_reboot_list contains notifiers that do the reboot
     immediately, for example, xen_panic_event, alpha_panic_event.
     The consoles have to be flushed earlier.

  2. console_flush_on_panic() ignores the result of console_trylock()
     and always calls console_unlock(). As a result the lock should
     be unlocked at the end. And any further printk() should be able
     to printk the messages to the console immediately. It means
     that any messages printed by the reboot notifiers should appear
     on the console as well.

> Are you OK in having a helper for console flushing, as I did in V1? It
> makes code of panic() a bit less polluted / more focused I feel.

Yes, it makes sense. Well, it would better to do it in a separate
patch. The patch patch reworking the logic should be as small
as possible. It will simplify the review.


> > 	if (panic_timeout != 0) {
> > 		reboot();
> > 	}
> > 
> > 	/*
> > 	 * Prepare system for the infinite waiting, for example,
> > 	 * setup blinking.
> > 	 */
> > 	atomic_notifier_call_chain(&panic_loop_list, 0, NULL);
> > 
> > 	infinite_loop();
> > }
> > 
> > 
> > __crash_kexec() is there 3 times but otherwise the code looks
> > quite straight forward.
> > 
> > Note 1: I renamed the two last notifier list. The name 'post-reboot'
> > 	did sound strange from the logical POV ;-)
> > 
> > Note 2: We have to avoid the possibility to call "reboot" list
> > 	before kmsg_dump(). All callbacks providing info
> > 	have to be in the info list. It a callback combines
> > 	info and reboot functionality then it should be split.
> > 
> > 	There must be another way to calm down problematic
> > 	info callbacks. And it has to be solved when such
> > 	a problem is reported. Is there any known issue, please?
> > 
> > It is possible that I have missed something important.
> > But I would really like to make the logic as simple as possible.
> 
> OK, I agree with you! It's indeed simpler and if others agree, I can
> happily change the logic to what you proposed. Although...currently the
> "crash_kexec_post_notifiers" allows to call _all_ panic_reboot_list
> callbacks _before kdump_.
>
> We need to mention this change in the commit messages, but I really
> would like to hear the opinions of heavy users of notifiers (as
> Michael/Hyper-V) and the kdump interested parties (like Baoquan / Dave
> Young / Hayatama). If we all agree on such approach, will change that
> for V2 =)

Sure, we need to make sure that we call everything that is needed.
And it should be documented.

I believe that this is the right way because:

  + It was actually the motivation for this patchset. We split
    the notifiers into separate lists because we want to call
    only the really needed ones before kmsg_dump and crash_dump.

  + If anything is needed for crash_dump that it should be called
    even when crash_dump is called first. It should be either
    hardcoded into crash_dump() or we would need another notifier
    list that will be always called before crash_dump.


Thanks a lot for working on this.

Best Regards,
Petr
Guilherme G. Piccoli May 16, 2022, 4:32 p.m. UTC | #5
On 16/05/2022 07:21, Petr Mladek wrote:
> [...]
> Ah, it should have been:
> 
>      + notifiers vs. kmsg_dump
>      + notifiers vs. crash_dump
>      + crash_dump vs. kmsg_dump
> 
> I am sorry for the confusion. Even "crash_dump" is slightly
> misleading because there is no function with this name.
> But it seems to be easier to understand than __crash_kexec().

Cool, thanks! Now it's totally clear for me =)
I feel crash dump is the proper term, but I personally prefer kdump to
avoid mess-up with user space "core dump" concept heheh
Also, KDUMP is an entry on MAINTAINERS file.


> [...]
>> Heheh OK, I appreciate your opinion, but I guess we'll need to agree in
>> disagree here - I'm much more fond to this kind of code than a bunch of
>> if/else blocks that almost give headaches. Encoding such "level" logic
>> in the if/else scheme is very convoluted, generates a very big code. And
>> the functions aren't so black magic - they map a level in bits, and the
>> functions _once() are called...once! Although we switch the position in
>> the code, so there are 2 calls, one of them is called and the other not.
> 
> I see. Well, I would consider this as a warning that the approach is
> too complex. If the code, using if/then/else, would cause headaches
> then also understanding of the behavior would cause headaches for
> both users and programmers.
> 
> 
>> But that's totally fine to change - especially if we're moving away from
>> the "level" logic. I see below you propose a much simpler approach - if
>> we follow that, definitely we won't need the "black magic" approach heheh
> 
> I do not say that my proposal is fully correct. But we really need
> this kind of simpler approach.

It's cool, I agree that your idea is much simpler and makes sense - mine
seems to be an over-engineering effort. Let's see the opinions of the
interested parties, I'm curious to see if everybody agrees here, that'd
would be ideal (and kind of "wishful thinking" I guess heheh - panic
path is polemic).


> [...] 
>> Here we have a very important point. Why do we need 2 variants of SMP
>> CPU stopping functions? I disagree with that - my understanding of this
>> after some study in architectures is that the crash_() variant is
>> "stronger", should work in all cases and if not, we should fix that -
>> that'd be a bug.
>>
>> Such variant either maps to smp_send_stop() (in various architectures,
>> including XEN/x86) or overrides the basic function with more proper
>> handling for panic() case...I don't see why we still need such
>> distinction, if you / others have some insight about that, I'd like to
>> hear =)
> 
> The two variants were introduced by the commit 0ee59413c967c35a6dd
> ("x86/panic: replace smp_send_stop() with kdump friendly version in
> panic path")
> 
> It points to https://lkml.org/lkml/2015/6/24/44 that talks about
> still running watchdogs.
> 
> It is possible that the problem could be fixed another way. It is
> even possible that it has already been fixed by the notifiers
> that disable the watchdogs.
> 
> Anyway, any change of the smp_send_stop() behavior should be done
> in a separate patch. It will help with bisection of possible
> regression. Also it would require a good explanation in
> the commit message. I would personally do it in a separate
> patch(set).

Thanks for the archeology and interesting findings. I agree that is
better to split in smaller patches. I'm planning a split in 3 patches
for V2: clean-up (comment, console flushing idea, useless header), the
refactor itself and finally, this SMP change.


> [...] 
>> You had the order of panic_reboot_list VS. consoles flushing inverted.
>> It might make sense, although I didn't do that in V1...
> 
> IMHO, it makes sense:
> 
>   1. panic_reboot_list contains notifiers that do the reboot
>      immediately, for example, xen_panic_event, alpha_panic_event.
>      The consoles have to be flushed earlier.
> 
>   2. console_flush_on_panic() ignores the result of console_trylock()
>      and always calls console_unlock(). As a result the lock should
>      be unlocked at the end. And any further printk() should be able
>      to printk the messages to the console immediately. It means
>      that any messages printed by the reboot notifiers should appear
>      on the console as well.
> [...] 
>> OK, I agree with you! It's indeed simpler and if others agree, I can
>> happily change the logic to what you proposed. Although...currently the
>> "crash_kexec_post_notifiers" allows to call _all_ panic_reboot_list
>> callbacks _before kdump_.
>>
>> We need to mention this change in the commit messages, but I really
>> would like to hear the opinions of heavy users of notifiers (as
>> Michael/Hyper-V) and the kdump interested parties (like Baoquan / Dave
>> Young / Hayatama). If we all agree on such approach, will change that
>> for V2 =)
> 
> Sure, we need to make sure that we call everything that is needed.
> And it should be documented.
> 
> I believe that this is the right way because:
> 
>   + It was actually the motivation for this patchset. We split
>     the notifiers into separate lists because we want to call
>     only the really needed ones before kmsg_dump and crash_dump.
> 
>   + If anything is needed for crash_dump that it should be called
>     even when crash_dump is called first. It should be either
>     hardcoded into crash_dump() or we would need another notifier
>     list that will be always called before crash_dump.

Ack, makes sense! Will do that in V2 =)

For the "hardcoded" part, we have the custom machine_crash_kexec() in
some archs (like x86), unfortunately not in all of them - this path is
ideally for mandatory code that is required for a successful crash_kexec().

The problem is the "same old, same old" - architecture folks push that
to panic notifiers; notifiers folks push it to the arch custom shutdown
handler (see [0] heheh).
(CCed Marc / Mark in case they want to chime-in here...)


>[...] 
> Thanks a lot for working on this.
> 
> Best Regards,
> Petr


You're welcome, _thank you_ for the great and detailed reviews!
Cheers,


Guilherme


[0]
https://lore.kernel.org/lkml/427a8277-49f0-4317-d6c3-4a15d7070e55@igalia.com/
Baoquan He May 19, 2022, 11:45 p.m. UTC | #6
On 05/15/22 at 07:47pm, Guilherme G. Piccoli wrote:
> On 12/05/2022 11:03, Petr Mladek wrote:
...... 
> > OK, the question is how to make it better. Let's start with
> > a clear picture of the problem:
> > 
> > 1. panic() has basically two funtions:
> > 
> >       + show/store debug information (optional ways and amount)
> >       + do something with the system (reboot, stay hanged)
> > 
> > 
> > 2. There are 4 ways how to show/store the information:
> > 
> >       + tell hypervisor to store what it is interested about
> >       + crash_dump
> >       + kmsg_dump()
> >       + consoles
> > 
> >   , where crash_dump and consoles are special:
> > 
> >      + crash_dump does not return. Instead it ends up with reboot.
> > 
> >      + Consoles work transparently. They just need an extra flush
> >        before reboot or staying hanged.
> > 
> > 
> > 3. The various notifiers do things like:
> > 
> >      + tell hypervisor about the crash
> >      + print more information (also stop watchdogs)
> >      + prepare system for reboot (touch some interfaces)
> >      + prepare system for staying hanged (blinking)
> > 
> >    Note that it pretty nicely matches the 4 notifier lists.
> > 
> 
> I really appreciate the summary skill you have, to convert complex
> problems in very clear and concise ideas. Thanks for that, very useful!
> I agree with what was summarized above.

I want to say the similar words to Petr's reviewing comment when I went
through the patches and traced each reviewing sub-thread to try to
catch up. Petr has reivewed this series so carefully and given many
comments I want to ack immediately.

I agree with most of the suggestions from Petr to this patch, except of
one tiny concern, please see below inline comment.

> 
> 
> > Now, we need to decide about the ordering. The main area is how
> > to store the debug information. Consoles are transparent so
> > the quesition is about:
> > 
> >      + hypervisor
> >      + crash_dump
> >      + kmsg_dump
> > 
> > Some people need none and some people want all. There is a
> > risk that system might hung at any stage. This why people want to
> > make the order configurable.
> > 
> > But crash_dump() does not return when it succeeds. And kmsg_dump()
> > users havn't complained about hypervisor problems yet. So, that
> > two variants might be enough:
> > 
> >     + crash_dump (hypervisor, kmsg_dump as fallback)
> >     + hypervisor, kmsg_dump, crash_dump
> > 
> > One option "panic_prefer_crash_dump" should be enough.
> > And the code might look like:
> > 
> > void panic()
> > {
> > [...]
> > 	dump_stack();
> > 	kgdb_panic(buf);
> > 
> > 	< ---  here starts the reworked code --- >
> > 
> > 	/* crash dump is enough when enabled and preferred. */
> > 	if (panic_prefer_crash_dump)
> > 		__crash_kexec(NULL);

I like the proposed skeleton of panic() and code style suggested by
Petr very much. About panic_prefer_crash_dump which might need be added,
I hope it has a default value true. This makes crash_dump execute at
first by default just as before, unless people specify
panic_prefer_crash_dump=0|n|off to disable it. Otherwise we need add
panic_prefer_crash_dump=1 in kernel and in our distros to enable kdump,
this is inconsistent with the old behaviour.

> > 
> > 	/* Stop other CPUs and focus on handling the panic state. */
> > 	if (has_kexec_crash_image)
> > 		crash_smp_send_stop();
> > 	else
> > 		smp_send_stop()
> > 
> 
> Here we have a very important point. Why do we need 2 variants of SMP
> CPU stopping functions? I disagree with that - my understanding of this
> after some study in architectures is that the crash_() variant is
> "stronger", should work in all cases and if not, we should fix that -
> that'd be a bug.
> 
> Such variant either maps to smp_send_stop() (in various architectures,
> including XEN/x86) or overrides the basic function with more proper
> handling for panic() case...I don't see why we still need such
> distinction, if you / others have some insight about that, I'd like to
> hear =)
> 
> 
> > 	/* Notify hypervisor about the system panic. */
> > 	atomic_notifier_call_chain(&panic_hypervisor_list, 0, NULL);
> > 
> > 	/*
> > 	 * No need to risk extra info when there is no kmsg dumper
> > 	 * registered.
> > 	 */
> > 	if (!has_kmsg_dumper())
> > 		__crash_kexec(NULL);
> > 
> > 	/* Add extra info from different subsystems. */
> > 	atomic_notifier_call_chain(&panic_info_list, 0, NULL);
> > 
> > 	kmsg_dump(KMSG_DUMP_PANIC);
> > 	__crash_kexec(NULL);
> > 
> > 	/* Flush console */
> > 	unblank_screen();
> > 	console_unblank();
> > 	debug_locks_off();
> > 	console_flush_on_panic(CONSOLE_FLUSH_PENDING);
> > 
> > 	if (panic_timeout > 0) {
> > 		delay()
> > 	}
> > 
> > 	/*
> > 	 * Prepare system for eventual reboot and allow custom
> > 	 * reboot handling.
> > 	 */
> > 	atomic_notifier_call_chain(&panic_reboot_list, 0, NULL);
> 
> You had the order of panic_reboot_list VS. consoles flushing inverted.
> It might make sense, although I didn't do that in V1...
> Are you OK in having a helper for console flushing, as I did in V1? It
> makes code of panic() a bit less polluted / more focused I feel.
> 
> 
> > 
> > 	if (panic_timeout != 0) {
> > 		reboot();
> > 	}
> > 
> > 	/*
> > 	 * Prepare system for the infinite waiting, for example,
> > 	 * setup blinking.
> > 	 */
> > 	atomic_notifier_call_chain(&panic_loop_list, 0, NULL);
> > 
> > 	infinite_loop();
> > }
> > 
> > 
> > __crash_kexec() is there 3 times but otherwise the code looks
> > quite straight forward.
> > 
> > Note 1: I renamed the two last notifier list. The name 'post-reboot'
> > 	did sound strange from the logical POV ;-)
> > 
> > Note 2: We have to avoid the possibility to call "reboot" list
> > 	before kmsg_dump(). All callbacks providing info
> > 	have to be in the info list. It a callback combines
> > 	info and reboot functionality then it should be split.
> > 
> > 	There must be another way to calm down problematic
> > 	info callbacks. And it has to be solved when such
> > 	a problem is reported. Is there any known issue, please?
> > 
> > It is possible that I have missed something important.
> > But I would really like to make the logic as simple as possible.
> 
> OK, I agree with you! It's indeed simpler and if others agree, I can
> happily change the logic to what you proposed. Although...currently the
> "crash_kexec_post_notifiers" allows to call _all_ panic_reboot_list
> callbacks _before kdump_.
> 
> We need to mention this change in the commit messages, but I really
> would like to hear the opinions of heavy users of notifiers (as
> Michael/Hyper-V) and the kdump interested parties (like Baoquan / Dave
> Young / Hayatama). If we all agree on such approach, will change that
> for V2 =)
> 
> Thanks again Petr, for the time spent in such detailed review!
> Cheers,
> 
> 
> Guilherme
>
Guilherme G. Piccoli May 20, 2022, 11:23 a.m. UTC | #7
On 19/05/2022 20:45, Baoquan He wrote:
> [...]
>> I really appreciate the summary skill you have, to convert complex
>> problems in very clear and concise ideas. Thanks for that, very useful!
>> I agree with what was summarized above.
> 
> I want to say the similar words to Petr's reviewing comment when I went
> through the patches and traced each reviewing sub-thread to try to
> catch up. Petr has reivewed this series so carefully and given many
> comments I want to ack immediately.
> 
> I agree with most of the suggestions from Petr to this patch, except of
> one tiny concern, please see below inline comment.

Hi Baoquan, thanks! I'm glad you're also reviewing that =)


> [...]
> 
> I like the proposed skeleton of panic() and code style suggested by
> Petr very much. About panic_prefer_crash_dump which might need be added,
> I hope it has a default value true. This makes crash_dump execute at
> first by default just as before, unless people specify
> panic_prefer_crash_dump=0|n|off to disable it. Otherwise we need add
> panic_prefer_crash_dump=1 in kernel and in our distros to enable kdump,
> this is inconsistent with the old behaviour.

I'd like to understand better why the crash_kexec() must always be the
first thing in your use case. If we keep that behavior, we'll see all
sorts of workarounds - see the last patches of this series, Hyper-V and
PowerPC folks hardcoded "crash_kexec_post_notifiers" in order to force
execution of their relevant notifiers (like the vmbus disconnect,
specially in arm64 that has no custom machine_crash_shutdown, or the
fadump case in ppc). This led to more risk in kdump.

The thing is: with the notifiers' split, we tried to keep only the most
relevant/necessary stuff in this first list, things that ultimately
should improve kdump reliability or if not, at least not break it. My
feeling is that, with this series, we should change the idea/concept
that kdump must run first nevertheless, not matter what. We're here
trying to accommodate the antagonistic goals of hypervisors that need
some clean-up (even for kdump to work) VS. kdump users, that wish a
"pristine" system reboot ASAP after the crash.

Cheers,


Guilherme
Eric W. Biederman May 24, 2022, 2:44 p.m. UTC | #8
"Guilherme G. Piccoli" <gpiccoli@igalia.com> writes:

> The panic() function is somewhat convoluted - a lot of changes were
> made over the years, adding comments that might be misleading/outdated
> now, it has a code structure that is a bit complex to follow, with
> lots of conditionals, for example. The panic notifier list is something
> else - a single list, with multiple callbacks of different purposes,
> that run in a non-deterministic order and may affect hardly kdump
> reliability - see the "crash_kexec_post_notifiers" workaround-ish flag.
>
> This patch proposes a major refactor on the panic path based on Petr's
> idea [0] - basically we split the notifiers list in three, having a set
> of different call points in the panic path. Below a list of changes
> proposed in this patch, culminating in the panic notifiers level
> concept:
>
> (a) First of all, we improved comments all over the function
> and removed useless variables / includes. Also, as part of this
> clean-up we concentrate the console flushing functions in a helper.
>
> (b) As mentioned before, there is a split of the panic notifier list
> in three, based on the purpose of the callback. The code contains
> good documentation in form of comments, but a summary of the three
> lists follows:
>
> - the hypervisor list aims low-risk procedures to inform hypervisors
> or firmware about the panic event, also includes LED-related functions;
>
> - the informational list contains callbacks that provide more details,
> like kernel offset or trace dump (if enabled) and also includes the
> callbacks aimed at reducing log pollution or warns, like the RCU and
> hung task disable callbacks;
>
> - finally, the pre_reboot list is the old notifier list renamed,
> containing the more risky callbacks that didn't fit the previous
> lists. There is also a 4th list (the post_reboot one), but it's not
> related with the original list - it contains late time architecture
> callbacks aimed at stopping the machine, for example.
>
> The 3 notifiers lists execute in different moments, hypervisor being
> the first, followed by informational and finally the pre_reboot list.
>
> (c) But then, there is the ordering problem of the notifiers against
> the crash_kernel() call - kdump must be as reliable as possible.
> For that, a simple binary "switch" as "crash_kexec_post_notifiers"
> is not enough, hence we introduce here concept of panic notifier
> levels: there are 5 levels, from 0 (no notifier executes before
> kdump) until 4 (all notifiers run before kdump); the default level
> is 2, in which the hypervisor and (iff we have any kmsg dumper)
> the informational notifiers execute before kdump.
>
> The detailed documentation of the levels is present in code comments
> and in the kernel-parameters.txt file; as an analogy with the previous
> panic() implementation, the level 0 is exactly the same as the old
> behavior of notifiers, running all after kdump, and the level 4 is
> the same as "crash_kexec_post_notifiers=Y" (we kept this parameter as
> a deprecated one).
>
> (d) Finally, an important change made here: we now use only the
> function "crash_smp_send_stop()" to shut all the secondary CPUs
> in the panic path. Before, there was a case of using the regular
> "smp_send_stop()", but the better approach is to simplify the
> code and try to use the function which was created exclusively
> for the panic path. Experiments showed that it works fine, and
> code was very simplified with that.
>
> Functional change is expected from this refactor, since now we
> call some notifiers by default before kdump, but the goal here
> besides code clean-up is to have a better panic path, more
> reliable and deterministic, but also very customizable.
>
> [0] https://lore.kernel.org/lkml/YfPxvzSzDLjO5ldp@alley/

I am late to this discussion.  My apologies.

Unfortunately I am also very grouchy.

Notifiers before kexec on panic are fundamentally broken.  So please
just remove crash_kexec_post notifiers and be done with it.  Part of the
deep issue is that firmware always has a common and broken
implementation for anything that is not mission critical to
motherboards.

Notifiers in any sense on these paths are just bollocks.  Any kind of
notifier list is fundamentally fragile in the face of memory corruption
and very very difficult to review.

So I am going to refresh my ancient NACK on this.

I can certainly appreciate that there are pieces of the reboot paths
that can be improved.  I don't think making anything more feature full
or flexible is any kind of real improvement.

Eric



>
> Suggested-by: Petr Mladek <pmladek@suse.com>
> Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
> ---
>
> Special thanks to Petr and Baoquan for the suggestion and feedback in a previous
> email thread. There's some important design decisions that worth mentioning and
> discussing:
>
> * The default panic notifiers level is 2, based on Petr Mladek's suggestion,
> which makes a lot of sense. Of course, this is customizable through the
> parameter, but would be something worthwhile to have a KConfig option to set
> the default level? It would help distros that want the old behavior
> (no notifiers before kdump) as default.
>
> * The implementation choice was to _avoid_ intricate if conditionals in the
> panic path, which would _definitely_ be present with the panic notifiers levels
> idea; so, instead of lots of if conditionals, the set/clear bits approach with
> functions called in 2 points (but executing only in one of them) is much easier
> to follow an was used here; the ordering helper function and the comments also
> help a lot to avoid confusion (hopefully).
>
> * Choice was to *always* use crash_smp_send_stop() instead of sometimes making
> use of the regular smp_send_stop(); for most architectures they are the same,
> including Xen (on x86). For the ones that override it, all should work fine,
> in the powerpc case it's even more correct (see the subsequent patch
> "powerpc: Do not force all panic notifiers to execute before kdump")
>
> There seems to be 2 cases that requires some plumbing to work 100% right:
> - ARM doesn't disable local interrupts / FIQs in the crash version of
> send_stop(); we patched that early in this series;
> - x86 could face an issue if we have VMX and do use crash_smp_send_stop()
> _without_ kdump, but this is fixed in the first patch of the series (and
> it's a bug present even before this refactor).
>
> * Notice we didn't add a sysrq for panic notifiers level - should have it?
> Alejandro proposed recently to add a sysrq for "crash_kexec_post_notifiers",
> let me know if you feel the need here Alejandro, since the core parameters are
> present in /sys, I didn't consider much gain in having a sysrq, but of course
> I'm open to suggestions!
>
> Thanks advance for the review!
>
>  .../admin-guide/kernel-parameters.txt         |  42 ++-
>  include/linux/panic_notifier.h                |   1 +
>  kernel/kexec_core.c                           |   8 +-
>  kernel/panic.c                                | 292 +++++++++++++-----
>  .../selftests/pstore/pstore_crash_test        |   5 +-
>  5 files changed, 252 insertions(+), 96 deletions(-)
>
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 3f1cc5e317ed..8d3524060ce3 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -829,6 +829,13 @@
>  			It will be ignored when crashkernel=X,high is not used
>  			or memory reserved is below 4G.
>  
> +	crash_kexec_post_notifiers
> +			This was DEPRECATED - users should always prefer the
> +			parameter "panic_notifiers_level" - check its entry
> +			in this documentation for details on how it works.
> +			Setting this parameter is exactly the same as setting
> +			"panic_notifiers_level=4".
> +
>  	cryptomgr.notests
>  			[KNL] Disable crypto self-tests
>  
> @@ -3784,6 +3791,33 @@
>  			timeout < 0: reboot immediately
>  			Format: <timeout>
>  
> +	panic_notifiers_level=
> +			[KNL] Set the panic notifiers execution order.
> +			Format: <unsigned int>
> +			We currently have 4 lists of panic notifiers; based
> +			on the functionality and risk (for panic success) the
> +			callbacks are added in a given list. The lists are:
> +			- hypervisor/FW notification list (low risk);
> +			- informational list (low/medium risk);
> +			- pre_reboot list (higher risk);
> +			- post_reboot list (only run late in panic and after
> +			kdump, not configurable for now).
> +			This parameter defines the ordering of the first 3
> +			lists with regards to kdump; the levels determine
> +			which set of notifiers execute before kdump. The
> +			accepted levels are:
> +			0: kdump is the first thing to run, NO list is
> +			executed before kdump.
> +			1: only the hypervisor list is executed before kdump.
> +			2 (default level): the hypervisor list and (*if*
> +			there's any kmsg_dumper defined) the informational
> +			list are executed before kdump.
> +			3: both the hypervisor and the informational lists
> +			(always) execute before kdump.
> +			4: the 3 lists (hypervisor, info and pre_reboot)
> +			execute before kdump - this behavior is analog to the
> +			deprecated parameter "crash_kexec_post_notifiers".
> +
>  	panic_print=	Bitmask for printing system info when panic happens.
>  			User can chose combination of the following bits:
>  			bit 0: print all tasks info
> @@ -3814,14 +3848,6 @@
>  	panic_on_warn	panic() instead of WARN().  Useful to cause kdump
>  			on a WARN().
>  
> -	crash_kexec_post_notifiers
> -			Run kdump after running panic-notifiers and dumping
> -			kmsg. This only for the users who doubt kdump always
> -			succeeds in any situation.
> -			Note that this also increases risks of kdump failure,
> -			because some panic notifiers can make the crashed
> -			kernel more unstable.
> -
>  	parkbd.port=	[HW] Parallel port number the keyboard adapter is
>  			connected to, default is 0.
>  			Format: <parport#>
> diff --git a/include/linux/panic_notifier.h b/include/linux/panic_notifier.h
> index bcf6a5ea9d7f..b5041132321d 100644
> --- a/include/linux/panic_notifier.h
> +++ b/include/linux/panic_notifier.h
> @@ -10,6 +10,7 @@ extern struct atomic_notifier_head panic_info_list;
>  extern struct atomic_notifier_head panic_pre_reboot_list;
>  extern struct atomic_notifier_head panic_post_reboot_list;
>  
> +bool panic_notifiers_before_kdump(void);
>  extern bool crash_kexec_post_notifiers;
>  
>  enum panic_notifier_val {
> diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
> index 68480f731192..f8906db8ca72 100644
> --- a/kernel/kexec_core.c
> +++ b/kernel/kexec_core.c
> @@ -74,11 +74,11 @@ struct resource crashk_low_res = {
>  int kexec_should_crash(struct task_struct *p)
>  {
>  	/*
> -	 * If crash_kexec_post_notifiers is enabled, don't run
> -	 * crash_kexec() here yet, which must be run after panic
> -	 * notifiers in panic().
> +	 * In case any panic notifiers are set to execute before kexec,
> +	 * don't run crash_kexec() here yet, which must run after these
> +	 * panic notifiers are executed, in the panic() path.
>  	 */
> -	if (crash_kexec_post_notifiers)
> +	if (panic_notifiers_before_kdump())
>  		return 0;
>  	/*
>  	 * There are 4 panic() calls in make_task_dead() path, each of which
> diff --git a/kernel/panic.c b/kernel/panic.c
> index bf792102b43e..b7c055d4f87f 100644
> --- a/kernel/panic.c
> +++ b/kernel/panic.c
> @@ -15,7 +15,6 @@
>  #include <linux/kgdb.h>
>  #include <linux/kmsg_dump.h>
>  #include <linux/kallsyms.h>
> -#include <linux/notifier.h>
>  #include <linux/vt_kern.h>
>  #include <linux/module.h>
>  #include <linux/random.h>
> @@ -52,14 +51,23 @@ static unsigned long tainted_mask =
>  static int pause_on_oops;
>  static int pause_on_oops_flag;
>  static DEFINE_SPINLOCK(pause_on_oops_lock);
> -bool crash_kexec_post_notifiers;
> +
>  int panic_on_warn __read_mostly;
> +bool panic_on_taint_nousertaint;
>  unsigned long panic_on_taint;
> -bool panic_on_taint_nousertaint = false;
>  
>  int panic_timeout = CONFIG_PANIC_TIMEOUT;
>  EXPORT_SYMBOL_GPL(panic_timeout);
>  
> +/* Initialized with all notifiers set to run before kdump */
> +static unsigned long panic_notifiers_bits = 15;
> +
> +/* Default level is 2, see kernel-parameters.txt */
> +unsigned int panic_notifiers_level = 2;
> +
> +/* DEPRECATED in favor of panic_notifiers_level */
> +bool crash_kexec_post_notifiers;
> +
>  #define PANIC_PRINT_TASK_INFO		0x00000001
>  #define PANIC_PRINT_MEM_INFO		0x00000002
>  #define PANIC_PRINT_TIMER_INFO		0x00000004
> @@ -109,10 +117,14 @@ void __weak nmi_panic_self_stop(struct pt_regs *regs)
>  }
>  
>  /*
> - * Stop other CPUs in panic.  Architecture dependent code may override this
> - * with more suitable version.  For example, if the architecture supports
> - * crash dump, it should save registers of each stopped CPU and disable
> - * per-CPU features such as virtualization extensions.
> + * Stop other CPUs in panic context.
> + *
> + * Architecture dependent code may override this with more suitable version.
> + * For example, if the architecture supports crash dump, it should save the
> + * registers of each stopped CPU and disable per-CPU features such as
> + * virtualization extensions. When not overridden in arch code (and for
> + * x86/xen), this is exactly the same as execute smp_send_stop(), but
> + * guarded against duplicate execution.
>   */
>  void __weak crash_smp_send_stop(void)
>  {
> @@ -183,6 +195,112 @@ static void panic_print_sys_info(bool console_flush)
>  		ftrace_dump(DUMP_ALL);
>  }
>  
> +/*
> + * Helper that accumulates all console flushing routines executed on panic.
> + */
> +static void console_flushing(void)
> +{
> +#ifdef CONFIG_VT
> +	unblank_screen();
> +#endif
> +	console_unblank();
> +
> +	/*
> +	 * In this point, we may have disabled other CPUs, hence stopping the
> +	 * CPU holding the lock while still having some valuable data in the
> +	 * console buffer.
> +	 *
> +	 * Try to acquire the lock then release it regardless of the result.
> +	 * The release will also print the buffers out. Locks debug should
> +	 * be disabled to avoid reporting bad unlock balance when panic()
> +	 * is not being called from OOPS.
> +	 */
> +	debug_locks_off();
> +	console_flush_on_panic(CONSOLE_FLUSH_PENDING);
> +
> +	panic_print_sys_info(true);
> +}
> +
> +#define PN_HYPERVISOR_BIT	0
> +#define PN_INFO_BIT		1
> +#define PN_PRE_REBOOT_BIT	2
> +#define PN_POST_REBOOT_BIT	3
> +
> +/*
> + * Determine the order of panic notifiers with regards to kdump.
> + *
> + * This function relies in the "panic_notifiers_level" kernel parameter
> + * to determine how to order the notifiers with regards to kdump. We
> + * have currently 5 levels. For details, please check the kernel docs for
> + * "panic_notifiers_level" at Documentation/admin-guide/kernel-parameters.txt.
> + *
> + * Default level is 2, which means the panic hypervisor and informational
> + * (unless we don't have any kmsg_dumper) lists will execute before kdump.
> + */
> +static void order_panic_notifiers_and_kdump(void)
> +{
> +	/*
> +	 * The parameter "crash_kexec_post_notifiers" is deprecated, but
> +	 * valid. Users that set it want really all panic notifiers to
> +	 * execute before kdump, so it's effectively the same as setting
> +	 * the panic notifiers level to 4.
> +	 */
> +	if (panic_notifiers_level >= 4 || crash_kexec_post_notifiers)
> +		return;
> +
> +	/*
> +	 * Based on the level configured (smaller than 4), we clear the
> +	 * proper bits in "panic_notifiers_bits". Notice that this bitfield
> +	 * is initialized with all notifiers set.
> +	 */
> +	switch (panic_notifiers_level) {
> +	case 3:
> +		clear_bit(PN_PRE_REBOOT_BIT, &panic_notifiers_bits);
> +		break;
> +	case 2:
> +		clear_bit(PN_PRE_REBOOT_BIT, &panic_notifiers_bits);
> +
> +		if (!kmsg_has_dumpers())
> +			clear_bit(PN_INFO_BIT, &panic_notifiers_bits);
> +		break;
> +	case 1:
> +		clear_bit(PN_PRE_REBOOT_BIT, &panic_notifiers_bits);
> +		clear_bit(PN_INFO_BIT, &panic_notifiers_bits);
> +		break;
> +	case 0:
> +		clear_bit(PN_PRE_REBOOT_BIT, &panic_notifiers_bits);
> +		clear_bit(PN_INFO_BIT, &panic_notifiers_bits);
> +		clear_bit(PN_HYPERVISOR_BIT, &panic_notifiers_bits);
> +		break;
> +	}
> +}
> +
> +/*
> + * Set of helpers to execute the panic notifiers only once.
> + * Just the informational notifier cares about the return.
> + */
> +static inline bool notifier_run_once(struct atomic_notifier_head head,
> +				     char *buf, long bit)
> +{
> +	if (test_and_change_bit(bit, &panic_notifiers_bits)) {
> +		atomic_notifier_call_chain(&head, PANIC_NOTIFIER, buf);
> +		return true;
> +	}
> +	return false;
> +}
> +
> +#define panic_notifier_hypervisor_once(buf)\
> +	notifier_run_once(panic_hypervisor_list, buf, PN_HYPERVISOR_BIT)
> +
> +#define panic_notifier_info_once(buf)\
> +	notifier_run_once(panic_info_list, buf, PN_INFO_BIT)
> +
> +#define panic_notifier_pre_reboot_once(buf)\
> +	notifier_run_once(panic_pre_reboot_list, buf, PN_PRE_REBOOT_BIT)
> +
> +#define panic_notifier_post_reboot_once(buf)\
> +	notifier_run_once(panic_post_reboot_list, buf, PN_POST_REBOOT_BIT)
> +
>  /**
>   *	panic - halt the system
>   *	@fmt: The text string to print
> @@ -198,32 +316,29 @@ void panic(const char *fmt, ...)
>  	long i, i_next = 0, len;
>  	int state = 0;
>  	int old_cpu, this_cpu;
> -	bool _crash_kexec_post_notifiers = crash_kexec_post_notifiers;
>  
> -	if (panic_on_warn) {
> -		/*
> -		 * This thread may hit another WARN() in the panic path.
> -		 * Resetting this prevents additional WARN() from panicking the
> -		 * system on this thread.  Other threads are blocked by the
> -		 * panic_mutex in panic().
> -		 */
> -		panic_on_warn = 0;
> -	}
> +	/*
> +	 * This thread may hit another WARN() in the panic path, so
> +	 * resetting this option prevents additional WARN() from
> +	 * re-panicking the system here.
> +	 */
> +	panic_on_warn = 0;
>  
>  	/*
>  	 * Disable local interrupts. This will prevent panic_smp_self_stop
> -	 * from deadlocking the first cpu that invokes the panic, since
> -	 * there is nothing to prevent an interrupt handler (that runs
> -	 * after setting panic_cpu) from invoking panic() again.
> +	 * from deadlocking the first cpu that invokes the panic, since there
> +	 * is nothing to prevent an interrupt handler (that runs after setting
> +	 * panic_cpu) from invoking panic() again. Also disables preemption
> +	 * here - notice it's not safe to rely on interrupt disabling to avoid
> +	 * preemption, since any cond_resched() or cond_resched_lock() might
> +	 * trigger a reschedule if the preempt count is 0 (for reference, see
> +	 * Documentation/locking/preempt-locking.rst). Some functions called
> +	 * from here want preempt disabled, so no point enabling it later.
>  	 */
>  	local_irq_disable();
>  	preempt_disable_notrace();
>  
>  	/*
> -	 * It's possible to come here directly from a panic-assertion and
> -	 * not have preempt disabled. Some functions called from here want
> -	 * preempt to be disabled. No point enabling it later though...
> -	 *
>  	 * Only one CPU is allowed to execute the panic code from here. For
>  	 * multiple parallel invocations of panic, all other CPUs either
>  	 * stop themself or will wait until they are stopped by the 1st CPU
> @@ -266,73 +381,75 @@ void panic(const char *fmt, ...)
>  	kgdb_panic(buf);
>  
>  	/*
> -	 * If we have crashed and we have a crash kernel loaded let it handle
> -	 * everything else.
> -	 * If we want to run this after calling panic_notifiers, pass
> -	 * the "crash_kexec_post_notifiers" option to the kernel.
> +	 * Here lies one of the most subtle parts of the panic path,
> +	 * the panic notifiers and their order with regards to kdump.
> +	 * We currently have 4 sets of notifiers:
>  	 *
> -	 * Bypass the panic_cpu check and call __crash_kexec directly.
> +	 *  - the hypervisor list is composed by callbacks that are related
> +	 *  to warn the FW / hypervisor about panic, or non-invasive LED
> +	 *  controlling functions - (hopefully) low-risk for kdump, should
> +	 *  run early if possible.
> +	 *
> +	 *  - the informational list is composed by functions dumping data
> +	 *  like kernel offsets, device error registers or tracing buffer;
> +	 *  also log flooding prevention callbacks fit in this list. It is
> +	 *  relatively safe to run before kdump.
> +	 *
> +	 *  - the pre_reboot list basically is everything else, all the
> +	 *  callbacks that don't fit in the 2 previous lists. It should
> +	 *  run *after* kdump if possible, as it contains high-risk
> +	 *  functions that may break kdump.
> +	 *
> +	 *  - we also have a 4th list of notifiers, the post_reboot
> +	 *  callbacks. This is not strongly related to kdump since it's
> +	 *  always executed late in the panic path, after the restart
> +	 *  mechanism (if set); its goal is to provide a way for
> +	 *  architecture code effectively power-off/disable the system.
> +	 *
> +	 *  The kernel provides the "panic_notifiers_level" parameter
> +	 *  to adjust the ordering in which these notifiers should run
> +	 *  with regards to kdump - the default level is 2, so both the
> +	 *  hypervisor and informational notifiers should execute before
> +	 *  the __crash_kexec(); the info notifier won't run by default
> +	 *  unless there's some kmsg_dumper() registered. For details
> +	 *  about it, check Documentation/admin-guide/kernel-parameters.txt.
> +	 *
> +	 *  Notice that the code relies in bits set/clear operations to
> +	 *  determine the ordering, functions *_once() execute only one
> +	 *  time, as their name implies. The goal is to prevent too much
> +	 *  if conditionals and more confusion. Finally, regarding CPUs
> +	 *  disabling: unless NO panic notifier executes before kdump,
> +	 *  we always disable secondary CPUs before __crash_kexec() and
> +	 *  the notifiers execute.
>  	 */
> -	if (!_crash_kexec_post_notifiers) {
> +	order_panic_notifiers_and_kdump();
> +
> +	/* If no level, we should kdump ASAP. */
> +	if (!panic_notifiers_level)
>  		__crash_kexec(NULL);
>  
> -		/*
> -		 * Note smp_send_stop is the usual smp shutdown function, which
> -		 * unfortunately means it may not be hardened to work in a
> -		 * panic situation.
> -		 */
> -		smp_send_stop();
> -	} else {
> -		/*
> -		 * If we want to do crash dump after notifier calls and
> -		 * kmsg_dump, we will need architecture dependent extra
> -		 * works in addition to stopping other CPUs.
> -		 */
> -		crash_smp_send_stop();
> -	}
> +	crash_smp_send_stop();
> +	panic_notifier_hypervisor_once(buf);
>  
> -	/*
> -	 * Run any panic handlers, including those that might need to
> -	 * add information to the kmsg dump output.
> -	 */
> -	atomic_notifier_call_chain(&panic_hypervisor_list, PANIC_NOTIFIER, buf);
> -	atomic_notifier_call_chain(&panic_info_list, PANIC_NOTIFIER, buf);
> -	atomic_notifier_call_chain(&panic_pre_reboot_list, PANIC_NOTIFIER, buf);
> +	if (panic_notifier_info_once(buf)) {
> +		panic_print_sys_info(false);
> +		kmsg_dump(KMSG_DUMP_PANIC);
> +	}
>  
> -	panic_print_sys_info(false);
> +	panic_notifier_pre_reboot_once(buf);
>  
> -	kmsg_dump(KMSG_DUMP_PANIC);
> +	__crash_kexec(NULL);
>  
> -	/*
> -	 * If you doubt kdump always works fine in any situation,
> -	 * "crash_kexec_post_notifiers" offers you a chance to run
> -	 * panic_notifiers and dumping kmsg before kdump.
> -	 * Note: since some panic_notifiers can make crashed kernel
> -	 * more unstable, it can increase risks of the kdump failure too.
> -	 *
> -	 * Bypass the panic_cpu check and call __crash_kexec directly.
> -	 */
> -	if (_crash_kexec_post_notifiers)
> -		__crash_kexec(NULL);
> +	panic_notifier_hypervisor_once(buf);
>  
> -#ifdef CONFIG_VT
> -	unblank_screen();
> -#endif
> -	console_unblank();
> -
> -	/*
> -	 * We may have ended up stopping the CPU holding the lock (in
> -	 * smp_send_stop()) while still having some valuable data in the console
> -	 * buffer.  Try to acquire the lock then release it regardless of the
> -	 * result.  The release will also print the buffers out.  Locks debug
> -	 * should be disabled to avoid reporting bad unlock balance when
> -	 * panic() is not being callled from OOPS.
> -	 */
> -	debug_locks_off();
> -	console_flush_on_panic(CONSOLE_FLUSH_PENDING);
> +	if (panic_notifier_info_once(buf)) {
> +		panic_print_sys_info(false);
> +		kmsg_dump(KMSG_DUMP_PANIC);
> +	}
>  
> -	panic_print_sys_info(true);
> +	panic_notifier_pre_reboot_once(buf);
>  
> +	console_flushing();
>  	if (!panic_blink)
>  		panic_blink = no_blink;
>  
> @@ -363,8 +480,7 @@ void panic(const char *fmt, ...)
>  		emergency_restart();
>  	}
>  
> -	atomic_notifier_call_chain(&panic_post_reboot_list,
> -				   PANIC_NOTIFIER, buf);
> +	panic_notifier_post_reboot_once(buf);
>  
>  	pr_emerg("---[ end Kernel panic - not syncing: %s ]---\n", buf);
>  
> @@ -383,6 +499,15 @@ void panic(const char *fmt, ...)
>  
>  EXPORT_SYMBOL(panic);
>  
> +/*
> + * Helper used in the kexec code, to validate if any
> + * panic notifier is set to execute early, before kdump.
> + */
> +inline bool panic_notifiers_before_kdump(void)
> +{
> +	return panic_notifiers_level || crash_kexec_post_notifiers;
> +}
> +
>  /*
>   * TAINT_FORCED_RMMOD could be a per-module flag but the module
>   * is being removed anyway.
> @@ -692,6 +817,9 @@ core_param(panic, panic_timeout, int, 0644);
>  core_param(panic_print, panic_print, ulong, 0644);
>  core_param(pause_on_oops, pause_on_oops, int, 0644);
>  core_param(panic_on_warn, panic_on_warn, int, 0644);
> +core_param(panic_notifiers_level, panic_notifiers_level, uint, 0644);
> +
> +/* DEPRECATED in favor of panic_notifiers_level */
>  core_param(crash_kexec_post_notifiers, crash_kexec_post_notifiers, bool, 0644);
>  
>  static int __init oops_setup(char *s)
> diff --git a/tools/testing/selftests/pstore/pstore_crash_test b/tools/testing/selftests/pstore/pstore_crash_test
> index 2a329bbb4aca..1e60ce4501aa 100755
> --- a/tools/testing/selftests/pstore/pstore_crash_test
> +++ b/tools/testing/selftests/pstore/pstore_crash_test
> @@ -25,6 +25,7 @@ touch $REBOOT_FLAG
>  sync
>  
>  # cause crash
> -# Note: If you use kdump and want to see kmesg-* files after reboot, you should
> -#       specify 'crash_kexec_post_notifiers' in 1st kernel's cmdline.
> +# Note: If you use kdump and want to see kmsg-* files after reboot, you should
> +#       be sure that the parameter "panic_notifiers_level" is more than '2' (the
> +#       default value for this parameter is '2') in the first kernel's cmdline.
>  echo c > /proc/sysrq-trigger
Guilherme G. Piccoli May 26, 2022, 4:25 p.m. UTC | #9
Hey folks, first of all thanks a lot for the reviews / opinions about
this. I imagined that such change would be polemic, and I see I was
right heh


I'll try to "mix" all the relevant opinions in a single email, since
they happened in different responses and even different mail threads.

I've looped here the most interested parties based on the feedback
received, such as Baoquan (kdump), Hatayama (kdump), Eric (kexec), Mark
(arm64), Michael (Hyper-V), Petr (console/printk) and Vitaly (hyper-v /
kvm). I hope we can discuss and try to reach some consensus - my
apologies in advance for this long message!

So, here goes some feedback we received about this change and correlated
feedback from arm64 community - my apologies if I missed something
important, I've tried to collect the most relevant portions, while
keeping the summary "as short" as possible. I'll respond to such
feedback below, after the quotes.


On 24/05/2022 05:32, Baoquan He wrote:
>> [...] 
>> Firstly, kdump is not always the first thing. In any use case, if kdump
>> kernel is not loaded, it's not the first thing at all. Not to mention
>> if crash_kexec_post_notifiers is specified.
>> [...]
>> Changing this will cause regression. During these years, nobody ever doubt
>> kdump should execute firstly if crashkernel is reserved and kdump kernel is
>> loaded. That's not saying we can't change
>> this, but need a convincing justification.
>> [...] 
>> Secondly, even with the notifiers' split, we can't guarantee people will
>> absolutely add notifiers into right list in the future. Letting kdump
>> execute behind lists by default will put kdump into risk.
>> [...] 
>> As for Hyper-V, if it enforces to terminate VMbus connection, no matter
>> it's kdump or not, why not taking it out of panic notifiers list and
>> execute it before kdump unconditionally.


On 24/05/2022 05:01, Petr Mladek wrote:
>> [...]
>> Anyway, I see four possible solutions:
>> 
>>   1. The most conservative approach is to keep the current behavior
>>      and call kdump first by default.
>> 
>>   2. A medium conservative approach to change the default default
>>      behavior and call hypervisor and eventually the info notifiers
>>      before kdump. There still would be the possibility to call kdump
>>      first by the command line parameter.
>> 
>>   3. Remove the possibility to call kdump first completely. It would
>>      assume that all the notifiers in the info list are super safe
>>      or that they make kdump actually more safe.
>> 
>>   4. Create one more notifier list for operations that always should
>>      be called before crash_dump.
>> 
>> Regarding the extra notifier list (4th solution). It is not clear to
>> me whether it would be always called even before hypervisor list or
>> when kdump is not enabled. We must not over-engineer it.
>> 
>> 2nd proposal looks like a good compromise. But maybe we could do
>> this change few releases later. The notifiers split is a big
>> change on its own.


On 24/05/2022 07:18, Baoquan He wrote:
>>[...]
>> I would vote for 1 or 4 without any hesitation, and prefer 4. I ever
>> suggest the variant of solution 4 in v1 reviewing. That's taking those
>> notifiers out of list and enforcing to execute them before kdump. E.g
>> the one on HyperV to terminate VMbus connection. Maybe solution 4 is
>> better to provide a determinate way for people to add necessary code
>> at the earliest part.
>> [...] 
>>>
>>> Regarding the extra notifier list (4th solution). It is not clear to
>>> me whether it would be always called even before hypervisor list or
>>> when kdump is not enabled. We must not over-engineer it.
>> 
>> One thing I would like to notice is, no matter how perfect we split the
>> lists this time, we can't gurantee people will add notifiers reasonablly
>> in the future. And people from different sub-component may not do
>> sufficient investigation and add them to fulfil their local purpose.
>> 
>> The current panic notifers list is the best example. Hyper-V actually
>> wants to run some necessary code before kdump, but not all of them, they
>> just add it, ignoring the original purpose of
>> crash_kexec_post_notifiers. I guess they do like this just because it's
>> easy to do, no need to bother changing code in generic place.
>> 
>> Solution 4 can make this no doubt, that's why I like it better.
>> [...] 
>> As I replied to Guilherme, solution 2 will cause regression if not
>> calling kdump firstly. Solution 3 leaves people space to make mistake,
>> they could add nontifier into wrong list.
>> 
>> I would like to note again that the panic notifiers are optional to run,
>> while kdump is expectd once loaded, from the original purpose. I guess
>> people I know will still have this thought, e.g Hatayama, Masa, they are
>> truly often use panic notifiers like this on their company's system.


On 24/05/2022 11:44, Eric W. Biederman wrote:
> [...]
> Unfortunately I am also very grouchy.
> 
> Notifiers before kexec on panic are fundamentally broken.  So please
> just remove crash_kexec_post notifiers and be done with it.  Part of the
> deep issue is that firmware always has a common and broken
> implementation for anything that is not mission critical to
> motherboards.
> 
> Notifiers in any sense on these paths are just bollocks.  Any kind of
> notifier list is fundamentally fragile in the face of memory corruption
> and very very difficult to review.
> 
> So I am going to refresh my ancient NACK on this.
> 
> I can certainly appreciate that there are pieces of the reboot paths
> that can be improved.  I don't think making anything more feature full
> or flexible is any kind of real improvement.


Now, from the thread "Should arm64 have a custom crash shutdown
handler?" (link:
https://lore.kernel.org/lkml/427a8277-49f0-4317-d6c3-4a15d7070e55@igalia.com/),
we have:

On 05/05/2022 08:10, Mark Rutland wrote:
>> On Wed, May 04, 2022 at 05:00:42PM -0300, Guilherme G. Piccoli wrote:
>>> [...]
>>> Currently, when we kexec in arm64, the function machine_crash_shutdown()
>>> is called as a handler to disable CPUs and (potentially) do extra
>>> quiesce work. In the aforementioned architectures, there's a way to
>>> override this function, if for example an hypervisor wish to have its
>>> guests running their own custom shutdown machinery.
>> 
>> What exactly do you need to do in this custom shutdown machinery?
>> 
>> The general expectation for arm64 is that any hypervisor can implement PSCI,
>> and as long as you have that, CPUs (and the VM as a whole) can be shutdown in a
>> standard way.
>> 
>> I suspect what you're actually after is a mechanism to notify the hypervisor
>> when the guest crashes, rather than changing the way the shutdown itself
>> occurs? If so, we already have panic notifiers, and QEMU has a "pvpanic"
>> device using that. See drivers/misc/pvpanic/.


OK, so it seems we have some points in which agreement exists, and some
points that there is no agreement and instead, we have antagonistic /
opposite views and needs. Let's start with the easier part heh


It seems everybody agrees that *we shouldn't over-engineer things*, and
as per Eric good words: making the panic path more feature-full or
increasing flexibility isn't a good idea. So, as a "corollary": the
panic level approach I'm proposing is not a good fit, I'll drop it and
let's go with something simpler.

Another point of agreement seems to be that _notifier lists in the panic
path are dangerous_, for *2 different reasons*:

(a) We cannot guarantee that people won't add crazy callbacks there, we
can plan and document things the best as possible - it'll never be
enough, somebody eventually would slip a nonsense callback that would
break things and defeat the planned purpose of such a list;

(b) As per Eric point, in a panic/crash situation we might have memory
corruption exactly in the list code / pointers, etc, so the notifier
lists are, by nature, a bit fragile. But I think we shouldn't consider
it completely "bollocks", since this approach has been used for a while
with a good success rate. So, lists aren't perfect at all, but at the
same time, they aren't completely useless.


Now, to the points in which there are conflicting / antagonistic
needs/views:

(I) Kdump should be the first thing to run, as it's been like that since
forever. But...notice that "crash_kexec_post_notifiers" was created
exactly as a way to circumvent that, so we can see this is not an
absolute truth. Some users really *require to execute* some special code
*before kdump*.
Worth noticing here that regular kexec invokes the drivers .shutdown()
handlers, while kdump [aka crash_kexec()] does not, so we must have a
way to run code before kdump in a crash situation.

(II) If *we need* to have some code always running before kdump/reboot
on panic path (like the Hyper-V vmbus connection unload), *where to add
such code*? Again, conflicting views. Some would say we should hardcode
this in the panic() function. Others, that we should use the custom
machine_crash_shutdown() infrastructure - but notice that this isn't
available in all architectures, like arm64. Finally, others suggest
to...use notifier lists! Which was more or less the approach we took in
this patch.

How can we reach consensus on this? Not everybody will be 100% happy,
that's for sure. Also, I'd risk to say keep things as-is now or even
getting rid of "crash_kexec_post_notifiers" won't work at all, we have
users with legitimate needs of running code before a kdump/reboot when
crash happens. The *main goal* should be to have a *simple solution*
that doesn't require users to abuse parameters, like it's been done with
"crash_kexec_post_notifiers" (Hyper-V and PowerPC currently force this
parameter to be enabled, for example).


To avoid using a 4th list, especially given the list nature is a bit
fragile, I'd suggest one of the 3 following approaches - I *really
appreciate feedbacks* on that so I can implement the best solution and
avoid wasting time in some poor/disliked solution:

(1) We could have a helper function in the "beginning" of panic path,
i.e., after the logic to disable preemption/IRQs/secondary CPUs, but
*before* kdump. Then, users like Hyper-V that require to execute stuff
regardless of kdump or not, would run their callbacks from there,
directly, no lists involved.

- pros: simple, doesn't mess with arch code or involve lists.
- cons: header issues - will need to "export" such function from driver
code, for example, to run in core code. Also, some code might only be
required to run in some architectures, or only if kdump is set, etc.,
making the callbacks more complex / full of if conditionals.


(2) Similarly to previous solution, we could have a helper in the kexec
core code, not in the panic path. This way, users that require running
stuff *before a kdump* would add direct calls there; if kdump isn't
configured, and if such users also require that this code execute in
panic nevertheless, they'd need to also add a callback to some notifier
list.

- pros: also simple / doesn't mess with arch code or involve lists;
restricts the callbacks to kdump case.
- cons: also header issues, but might cause code duplicity too, since
some users would require both to run their code before a kdump and in
some panic notifier list.


(3) Have a way in arm64 (and all archs maybe) to run code before a kdump
- this is analog to the custom machine_crash_shutdown() we have nowadays
in some archs.

- pros: decouple kdump-required callbacks from panic notifiers, doesn't
involve lists, friendly to arch-dependent callbacks.
- cons: also header issues, might cause code duplicity (since some users
would also require to run their code in panic path regardless of kdump)
and involve changing arch code (some maintainers like Mark aren't fond
about that, with good reasons!).


So, hopefully we can converge to some direction even if not 100% of
users are happy - this problem is naturally full of trade-offs.
Thanks again for the reviews and the time you're spending reading these
long threads.

Cheers,


Guilherme
Guilherme G. Piccoli June 15, 2022, 9:36 a.m. UTC | #10
Perfect Petr, thanks for your feedback!

I'll be out for some weeks, but after that what I'm doing is to split
the series in 2 parts:

(a) The general fixes, which should be reviewed by subsystem maintainers
and even merged individually by them.

(b) The proper panic refactor, which includes the notifiers list split,
etc. I'll think about what I consider the best solution for the
crash_dump required ones, and will try to split in very simple patches
to make it easier to review.

Cheers,


Guilherme
diff mbox series

Patch

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 3f1cc5e317ed..8d3524060ce3 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -829,6 +829,13 @@ 
 			It will be ignored when crashkernel=X,high is not used
 			or memory reserved is below 4G.
 
+	crash_kexec_post_notifiers
+			This was DEPRECATED - users should always prefer the
+			parameter "panic_notifiers_level" - check its entry
+			in this documentation for details on how it works.
+			Setting this parameter is exactly the same as setting
+			"panic_notifiers_level=4".
+
 	cryptomgr.notests
 			[KNL] Disable crypto self-tests
 
@@ -3784,6 +3791,33 @@ 
 			timeout < 0: reboot immediately
 			Format: <timeout>
 
+	panic_notifiers_level=
+			[KNL] Set the panic notifiers execution order.
+			Format: <unsigned int>
+			We currently have 4 lists of panic notifiers; based
+			on the functionality and risk (for panic success) the
+			callbacks are added in a given list. The lists are:
+			- hypervisor/FW notification list (low risk);
+			- informational list (low/medium risk);
+			- pre_reboot list (higher risk);
+			- post_reboot list (only run late in panic and after
+			kdump, not configurable for now).
+			This parameter defines the ordering of the first 3
+			lists with regards to kdump; the levels determine
+			which set of notifiers execute before kdump. The
+			accepted levels are:
+			0: kdump is the first thing to run, NO list is
+			executed before kdump.
+			1: only the hypervisor list is executed before kdump.
+			2 (default level): the hypervisor list and (*if*
+			there's any kmsg_dumper defined) the informational
+			list are executed before kdump.
+			3: both the hypervisor and the informational lists
+			(always) execute before kdump.
+			4: the 3 lists (hypervisor, info and pre_reboot)
+			execute before kdump - this behavior is analog to the
+			deprecated parameter "crash_kexec_post_notifiers".
+
 	panic_print=	Bitmask for printing system info when panic happens.
 			User can chose combination of the following bits:
 			bit 0: print all tasks info
@@ -3814,14 +3848,6 @@ 
 	panic_on_warn	panic() instead of WARN().  Useful to cause kdump
 			on a WARN().
 
-	crash_kexec_post_notifiers
-			Run kdump after running panic-notifiers and dumping
-			kmsg. This only for the users who doubt kdump always
-			succeeds in any situation.
-			Note that this also increases risks of kdump failure,
-			because some panic notifiers can make the crashed
-			kernel more unstable.
-
 	parkbd.port=	[HW] Parallel port number the keyboard adapter is
 			connected to, default is 0.
 			Format: <parport#>
diff --git a/include/linux/panic_notifier.h b/include/linux/panic_notifier.h
index bcf6a5ea9d7f..b5041132321d 100644
--- a/include/linux/panic_notifier.h
+++ b/include/linux/panic_notifier.h
@@ -10,6 +10,7 @@  extern struct atomic_notifier_head panic_info_list;
 extern struct atomic_notifier_head panic_pre_reboot_list;
 extern struct atomic_notifier_head panic_post_reboot_list;
 
+bool panic_notifiers_before_kdump(void);
 extern bool crash_kexec_post_notifiers;
 
 enum panic_notifier_val {
diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index 68480f731192..f8906db8ca72 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -74,11 +74,11 @@  struct resource crashk_low_res = {
 int kexec_should_crash(struct task_struct *p)
 {
 	/*
-	 * If crash_kexec_post_notifiers is enabled, don't run
-	 * crash_kexec() here yet, which must be run after panic
-	 * notifiers in panic().
+	 * In case any panic notifiers are set to execute before kexec,
+	 * don't run crash_kexec() here yet, which must run after these
+	 * panic notifiers are executed, in the panic() path.
 	 */
-	if (crash_kexec_post_notifiers)
+	if (panic_notifiers_before_kdump())
 		return 0;
 	/*
 	 * There are 4 panic() calls in make_task_dead() path, each of which
diff --git a/kernel/panic.c b/kernel/panic.c
index bf792102b43e..b7c055d4f87f 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -15,7 +15,6 @@ 
 #include <linux/kgdb.h>
 #include <linux/kmsg_dump.h>
 #include <linux/kallsyms.h>
-#include <linux/notifier.h>
 #include <linux/vt_kern.h>
 #include <linux/module.h>
 #include <linux/random.h>
@@ -52,14 +51,23 @@  static unsigned long tainted_mask =
 static int pause_on_oops;
 static int pause_on_oops_flag;
 static DEFINE_SPINLOCK(pause_on_oops_lock);
-bool crash_kexec_post_notifiers;
+
 int panic_on_warn __read_mostly;
+bool panic_on_taint_nousertaint;
 unsigned long panic_on_taint;
-bool panic_on_taint_nousertaint = false;
 
 int panic_timeout = CONFIG_PANIC_TIMEOUT;
 EXPORT_SYMBOL_GPL(panic_timeout);
 
+/* Initialized with all notifiers set to run before kdump */
+static unsigned long panic_notifiers_bits = 15;
+
+/* Default level is 2, see kernel-parameters.txt */
+unsigned int panic_notifiers_level = 2;
+
+/* DEPRECATED in favor of panic_notifiers_level */
+bool crash_kexec_post_notifiers;
+
 #define PANIC_PRINT_TASK_INFO		0x00000001
 #define PANIC_PRINT_MEM_INFO		0x00000002
 #define PANIC_PRINT_TIMER_INFO		0x00000004
@@ -109,10 +117,14 @@  void __weak nmi_panic_self_stop(struct pt_regs *regs)
 }
 
 /*
- * Stop other CPUs in panic.  Architecture dependent code may override this
- * with more suitable version.  For example, if the architecture supports
- * crash dump, it should save registers of each stopped CPU and disable
- * per-CPU features such as virtualization extensions.
+ * Stop other CPUs in panic context.
+ *
+ * Architecture dependent code may override this with more suitable version.
+ * For example, if the architecture supports crash dump, it should save the
+ * registers of each stopped CPU and disable per-CPU features such as
+ * virtualization extensions. When not overridden in arch code (and for
+ * x86/xen), this is exactly the same as execute smp_send_stop(), but
+ * guarded against duplicate execution.
  */
 void __weak crash_smp_send_stop(void)
 {
@@ -183,6 +195,112 @@  static void panic_print_sys_info(bool console_flush)
 		ftrace_dump(DUMP_ALL);
 }
 
+/*
+ * Helper that accumulates all console flushing routines executed on panic.
+ */
+static void console_flushing(void)
+{
+#ifdef CONFIG_VT
+	unblank_screen();
+#endif
+	console_unblank();
+
+	/*
+	 * In this point, we may have disabled other CPUs, hence stopping the
+	 * CPU holding the lock while still having some valuable data in the
+	 * console buffer.
+	 *
+	 * Try to acquire the lock then release it regardless of the result.
+	 * The release will also print the buffers out. Locks debug should
+	 * be disabled to avoid reporting bad unlock balance when panic()
+	 * is not being called from OOPS.
+	 */
+	debug_locks_off();
+	console_flush_on_panic(CONSOLE_FLUSH_PENDING);
+
+	panic_print_sys_info(true);
+}
+
+#define PN_HYPERVISOR_BIT	0
+#define PN_INFO_BIT		1
+#define PN_PRE_REBOOT_BIT	2
+#define PN_POST_REBOOT_BIT	3
+
+/*
+ * Determine the order of panic notifiers with regards to kdump.
+ *
+ * This function relies in the "panic_notifiers_level" kernel parameter
+ * to determine how to order the notifiers with regards to kdump. We
+ * have currently 5 levels. For details, please check the kernel docs for
+ * "panic_notifiers_level" at Documentation/admin-guide/kernel-parameters.txt.
+ *
+ * Default level is 2, which means the panic hypervisor and informational
+ * (unless we don't have any kmsg_dumper) lists will execute before kdump.
+ */
+static void order_panic_notifiers_and_kdump(void)
+{
+	/*
+	 * The parameter "crash_kexec_post_notifiers" is deprecated, but
+	 * valid. Users that set it want really all panic notifiers to
+	 * execute before kdump, so it's effectively the same as setting
+	 * the panic notifiers level to 4.
+	 */
+	if (panic_notifiers_level >= 4 || crash_kexec_post_notifiers)
+		return;
+
+	/*
+	 * Based on the level configured (smaller than 4), we clear the
+	 * proper bits in "panic_notifiers_bits". Notice that this bitfield
+	 * is initialized with all notifiers set.
+	 */
+	switch (panic_notifiers_level) {
+	case 3:
+		clear_bit(PN_PRE_REBOOT_BIT, &panic_notifiers_bits);
+		break;
+	case 2:
+		clear_bit(PN_PRE_REBOOT_BIT, &panic_notifiers_bits);
+
+		if (!kmsg_has_dumpers())
+			clear_bit(PN_INFO_BIT, &panic_notifiers_bits);
+		break;
+	case 1:
+		clear_bit(PN_PRE_REBOOT_BIT, &panic_notifiers_bits);
+		clear_bit(PN_INFO_BIT, &panic_notifiers_bits);
+		break;
+	case 0:
+		clear_bit(PN_PRE_REBOOT_BIT, &panic_notifiers_bits);
+		clear_bit(PN_INFO_BIT, &panic_notifiers_bits);
+		clear_bit(PN_HYPERVISOR_BIT, &panic_notifiers_bits);
+		break;
+	}
+}
+
+/*
+ * Set of helpers to execute the panic notifiers only once.
+ * Just the informational notifier cares about the return.
+ */
+static inline bool notifier_run_once(struct atomic_notifier_head head,
+				     char *buf, long bit)
+{
+	if (test_and_change_bit(bit, &panic_notifiers_bits)) {
+		atomic_notifier_call_chain(&head, PANIC_NOTIFIER, buf);
+		return true;
+	}
+	return false;
+}
+
+#define panic_notifier_hypervisor_once(buf)\
+	notifier_run_once(panic_hypervisor_list, buf, PN_HYPERVISOR_BIT)
+
+#define panic_notifier_info_once(buf)\
+	notifier_run_once(panic_info_list, buf, PN_INFO_BIT)
+
+#define panic_notifier_pre_reboot_once(buf)\
+	notifier_run_once(panic_pre_reboot_list, buf, PN_PRE_REBOOT_BIT)
+
+#define panic_notifier_post_reboot_once(buf)\
+	notifier_run_once(panic_post_reboot_list, buf, PN_POST_REBOOT_BIT)
+
 /**
  *	panic - halt the system
  *	@fmt: The text string to print
@@ -198,32 +316,29 @@  void panic(const char *fmt, ...)
 	long i, i_next = 0, len;
 	int state = 0;
 	int old_cpu, this_cpu;
-	bool _crash_kexec_post_notifiers = crash_kexec_post_notifiers;
 
-	if (panic_on_warn) {
-		/*
-		 * This thread may hit another WARN() in the panic path.
-		 * Resetting this prevents additional WARN() from panicking the
-		 * system on this thread.  Other threads are blocked by the
-		 * panic_mutex in panic().
-		 */
-		panic_on_warn = 0;
-	}
+	/*
+	 * This thread may hit another WARN() in the panic path, so
+	 * resetting this option prevents additional WARN() from
+	 * re-panicking the system here.
+	 */
+	panic_on_warn = 0;
 
 	/*
 	 * Disable local interrupts. This will prevent panic_smp_self_stop
-	 * from deadlocking the first cpu that invokes the panic, since
-	 * there is nothing to prevent an interrupt handler (that runs
-	 * after setting panic_cpu) from invoking panic() again.
+	 * from deadlocking the first cpu that invokes the panic, since there
+	 * is nothing to prevent an interrupt handler (that runs after setting
+	 * panic_cpu) from invoking panic() again. Also disables preemption
+	 * here - notice it's not safe to rely on interrupt disabling to avoid
+	 * preemption, since any cond_resched() or cond_resched_lock() might
+	 * trigger a reschedule if the preempt count is 0 (for reference, see
+	 * Documentation/locking/preempt-locking.rst). Some functions called
+	 * from here want preempt disabled, so no point enabling it later.
 	 */
 	local_irq_disable();
 	preempt_disable_notrace();
 
 	/*
-	 * It's possible to come here directly from a panic-assertion and
-	 * not have preempt disabled. Some functions called from here want
-	 * preempt to be disabled. No point enabling it later though...
-	 *
 	 * Only one CPU is allowed to execute the panic code from here. For
 	 * multiple parallel invocations of panic, all other CPUs either
 	 * stop themself or will wait until they are stopped by the 1st CPU
@@ -266,73 +381,75 @@  void panic(const char *fmt, ...)
 	kgdb_panic(buf);
 
 	/*
-	 * If we have crashed and we have a crash kernel loaded let it handle
-	 * everything else.
-	 * If we want to run this after calling panic_notifiers, pass
-	 * the "crash_kexec_post_notifiers" option to the kernel.
+	 * Here lies one of the most subtle parts of the panic path,
+	 * the panic notifiers and their order with regards to kdump.
+	 * We currently have 4 sets of notifiers:
 	 *
-	 * Bypass the panic_cpu check and call __crash_kexec directly.
+	 *  - the hypervisor list is composed by callbacks that are related
+	 *  to warn the FW / hypervisor about panic, or non-invasive LED
+	 *  controlling functions - (hopefully) low-risk for kdump, should
+	 *  run early if possible.
+	 *
+	 *  - the informational list is composed by functions dumping data
+	 *  like kernel offsets, device error registers or tracing buffer;
+	 *  also log flooding prevention callbacks fit in this list. It is
+	 *  relatively safe to run before kdump.
+	 *
+	 *  - the pre_reboot list basically is everything else, all the
+	 *  callbacks that don't fit in the 2 previous lists. It should
+	 *  run *after* kdump if possible, as it contains high-risk
+	 *  functions that may break kdump.
+	 *
+	 *  - we also have a 4th list of notifiers, the post_reboot
+	 *  callbacks. This is not strongly related to kdump since it's
+	 *  always executed late in the panic path, after the restart
+	 *  mechanism (if set); its goal is to provide a way for
+	 *  architecture code effectively power-off/disable the system.
+	 *
+	 *  The kernel provides the "panic_notifiers_level" parameter
+	 *  to adjust the ordering in which these notifiers should run
+	 *  with regards to kdump - the default level is 2, so both the
+	 *  hypervisor and informational notifiers should execute before
+	 *  the __crash_kexec(); the info notifier won't run by default
+	 *  unless there's some kmsg_dumper() registered. For details
+	 *  about it, check Documentation/admin-guide/kernel-parameters.txt.
+	 *
+	 *  Notice that the code relies in bits set/clear operations to
+	 *  determine the ordering, functions *_once() execute only one
+	 *  time, as their name implies. The goal is to prevent too much
+	 *  if conditionals and more confusion. Finally, regarding CPUs
+	 *  disabling: unless NO panic notifier executes before kdump,
+	 *  we always disable secondary CPUs before __crash_kexec() and
+	 *  the notifiers execute.
 	 */
-	if (!_crash_kexec_post_notifiers) {
+	order_panic_notifiers_and_kdump();
+
+	/* If no level, we should kdump ASAP. */
+	if (!panic_notifiers_level)
 		__crash_kexec(NULL);
 
-		/*
-		 * Note smp_send_stop is the usual smp shutdown function, which
-		 * unfortunately means it may not be hardened to work in a
-		 * panic situation.
-		 */
-		smp_send_stop();
-	} else {
-		/*
-		 * If we want to do crash dump after notifier calls and
-		 * kmsg_dump, we will need architecture dependent extra
-		 * works in addition to stopping other CPUs.
-		 */
-		crash_smp_send_stop();
-	}
+	crash_smp_send_stop();
+	panic_notifier_hypervisor_once(buf);
 
-	/*
-	 * Run any panic handlers, including those that might need to
-	 * add information to the kmsg dump output.
-	 */
-	atomic_notifier_call_chain(&panic_hypervisor_list, PANIC_NOTIFIER, buf);
-	atomic_notifier_call_chain(&panic_info_list, PANIC_NOTIFIER, buf);
-	atomic_notifier_call_chain(&panic_pre_reboot_list, PANIC_NOTIFIER, buf);
+	if (panic_notifier_info_once(buf)) {
+		panic_print_sys_info(false);
+		kmsg_dump(KMSG_DUMP_PANIC);
+	}
 
-	panic_print_sys_info(false);
+	panic_notifier_pre_reboot_once(buf);
 
-	kmsg_dump(KMSG_DUMP_PANIC);
+	__crash_kexec(NULL);
 
-	/*
-	 * If you doubt kdump always works fine in any situation,
-	 * "crash_kexec_post_notifiers" offers you a chance to run
-	 * panic_notifiers and dumping kmsg before kdump.
-	 * Note: since some panic_notifiers can make crashed kernel
-	 * more unstable, it can increase risks of the kdump failure too.
-	 *
-	 * Bypass the panic_cpu check and call __crash_kexec directly.
-	 */
-	if (_crash_kexec_post_notifiers)
-		__crash_kexec(NULL);
+	panic_notifier_hypervisor_once(buf);
 
-#ifdef CONFIG_VT
-	unblank_screen();
-#endif
-	console_unblank();
-
-	/*
-	 * We may have ended up stopping the CPU holding the lock (in
-	 * smp_send_stop()) while still having some valuable data in the console
-	 * buffer.  Try to acquire the lock then release it regardless of the
-	 * result.  The release will also print the buffers out.  Locks debug
-	 * should be disabled to avoid reporting bad unlock balance when
-	 * panic() is not being callled from OOPS.
-	 */
-	debug_locks_off();
-	console_flush_on_panic(CONSOLE_FLUSH_PENDING);
+	if (panic_notifier_info_once(buf)) {
+		panic_print_sys_info(false);
+		kmsg_dump(KMSG_DUMP_PANIC);
+	}
 
-	panic_print_sys_info(true);
+	panic_notifier_pre_reboot_once(buf);
 
+	console_flushing();
 	if (!panic_blink)
 		panic_blink = no_blink;
 
@@ -363,8 +480,7 @@  void panic(const char *fmt, ...)
 		emergency_restart();
 	}
 
-	atomic_notifier_call_chain(&panic_post_reboot_list,
-				   PANIC_NOTIFIER, buf);
+	panic_notifier_post_reboot_once(buf);
 
 	pr_emerg("---[ end Kernel panic - not syncing: %s ]---\n", buf);
 
@@ -383,6 +499,15 @@  void panic(const char *fmt, ...)
 
 EXPORT_SYMBOL(panic);
 
+/*
+ * Helper used in the kexec code, to validate if any
+ * panic notifier is set to execute early, before kdump.
+ */
+inline bool panic_notifiers_before_kdump(void)
+{
+	return panic_notifiers_level || crash_kexec_post_notifiers;
+}
+
 /*
  * TAINT_FORCED_RMMOD could be a per-module flag but the module
  * is being removed anyway.
@@ -692,6 +817,9 @@  core_param(panic, panic_timeout, int, 0644);
 core_param(panic_print, panic_print, ulong, 0644);
 core_param(pause_on_oops, pause_on_oops, int, 0644);
 core_param(panic_on_warn, panic_on_warn, int, 0644);
+core_param(panic_notifiers_level, panic_notifiers_level, uint, 0644);
+
+/* DEPRECATED in favor of panic_notifiers_level */
 core_param(crash_kexec_post_notifiers, crash_kexec_post_notifiers, bool, 0644);
 
 static int __init oops_setup(char *s)
diff --git a/tools/testing/selftests/pstore/pstore_crash_test b/tools/testing/selftests/pstore/pstore_crash_test
index 2a329bbb4aca..1e60ce4501aa 100755
--- a/tools/testing/selftests/pstore/pstore_crash_test
+++ b/tools/testing/selftests/pstore/pstore_crash_test
@@ -25,6 +25,7 @@  touch $REBOOT_FLAG
 sync
 
 # cause crash
-# Note: If you use kdump and want to see kmesg-* files after reboot, you should
-#       specify 'crash_kexec_post_notifiers' in 1st kernel's cmdline.
+# Note: If you use kdump and want to see kmsg-* files after reboot, you should
+#       be sure that the parameter "panic_notifiers_level" is more than '2' (the
+#       default value for this parameter is '2') in the first kernel's cmdline.
 echo c > /proc/sysrq-trigger