
panic, kexec: Don't mutex_trylock() in __crash_kexec()

Message ID 20220616123709.347053-1-vschneid@redhat.com
State New
Series panic, kexec: Don't mutex_trylock() in __crash_kexec()

Commit Message

Valentin Schneider June 16, 2022, 12:37 p.m. UTC
Attempting to get a crash dump out of a debug PREEMPT_RT kernel via an NMI
panic() doesn't work. The cause of that lies in the PREEMPT_RT definition
of mutex_trylock():

	if (IS_ENABLED(CONFIG_DEBUG_RT_MUTEXES) && WARN_ON_ONCE(!in_task()))
		return 0;

This prevents an NMI panic() from executing the main body of
__crash_kexec() which does the actual kexec into the kdump kernel.
The warning and return are explained by:

  6ce47fd961fa ("rtmutex: Warn if trylock is called from hard/softirq context")
  [...]
  The reasons for this are:

      1) There is a potential deadlock in the slowpath

      2) Another cpu which blocks on the rtmutex will boost the task
	 which allegedly locked the rtmutex, but that cannot work
	 because the hard/softirq context borrows the task context.

Use a pair of barrier-ordered variables to serialize loading vs executing a
crash kernel.

Tested by triggering NMI panics via:

  $ echo 1 > /proc/sys/kernel/panic_on_unrecovered_nmi
  $ echo 1 > /proc/sys/kernel/unknown_nmi_panic
  $ echo 1 > /proc/sys/kernel/panic

  $ ipmitool power diag

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
Regarding the original explanation for the WARN & return:

I don't get why 2) is a problem - if the lock is acquired by the trylock
then the critical section will be run without interruption since it
cannot sleep, the interrupted task may get boosted but that will not
have any actual impact AFAICT.
Regardless, even if this doesn't sleep, the ->wait_lock in the slowpath
isn't NMI safe so this needs changing.

I've thought about trying to defer the kexec out of an NMI (or IRQ)
context, but that pretty much means deferring the panic() which I'm
not sure is such a great idea.
---
 include/linux/kexec.h |  2 ++
 kernel/kexec.c        | 18 ++++++++++++++----
 kernel/kexec_core.c   | 41 +++++++++++++++++++++++++----------------
 kernel/kexec_file.c   | 14 ++++++++++++++
 4 files changed, 55 insertions(+), 20 deletions(-)

Comments

Tao Zhou June 17, 2022, 10:42 a.m. UTC | #1
Hi Valentin,

On Thu, Jun 16, 2022 at 01:37:09PM +0100, Valentin Schneider wrote:
> Attempting to get a crash dump out of a debug PREEMPT_RT kernel via an NMI
> panic() doesn't work. The cause of that lies in the PREEMPT_RT definition
> of mutex_trylock():
> 
> 	if (IS_ENABLED(CONFIG_DEBUG_RT_MUTEXES) && WARN_ON_ONCE(!in_task()))
> 		return 0;
> 
> This prevents an NMI panic() from executing the main body of
> __crash_kexec() which does the actual kexec into the kdump kernel.
> The warning and return are explained by:
> 
>   6ce47fd961fa ("rtmutex: Warn if trylock is called from hard/softirq context")
>   [...]
>   The reasons for this are:
> 
>       1) There is a potential deadlock in the slowpath
> 
>       2) Another cpu which blocks on the rtmutex will boost the task
> 	 which allegedly locked the rtmutex, but that cannot work
> 	 because the hard/softirq context borrows the task context.
> 
> Use a pair of barrier-ordered variables to serialize loading vs executing a
> crash kernel.
> 
> Tested by triggering NMI panics via:
> 
>   $ echo 1 > /proc/sys/kernel/panic_on_unrecovered_nmi
>   $ echo 1 > /proc/sys/kernel/unknown_nmi_panic
>   $ echo 1 > /proc/sys/kernel/panic
> 
>   $ ipmitool power diag
> 
> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
> ---
> Regarding the original explanation for the WARN & return:
> 
> I don't get why 2) is a problem - if the lock is acquired by the trylock
> then the critical section will be run without interruption since it
> cannot sleep, the interrupted task may get boosted but that will not
> have any actual impact AFAICT.
> Regardless, even if this doesn't sleep, the ->wait_lock in the slowpath
> isn't NMI safe so this needs changing.
> 
> I've thought about trying to defer the kexec out of an NMI (or IRQ)
> context, but that pretty much means deferring the panic() which I'm
> not sure is such a great idea.
> ---
>  include/linux/kexec.h |  2 ++
>  kernel/kexec.c        | 18 ++++++++++++++----
>  kernel/kexec_core.c   | 41 +++++++++++++++++++++++++----------------
>  kernel/kexec_file.c   | 14 ++++++++++++++
>  4 files changed, 55 insertions(+), 20 deletions(-)
> 
> diff --git a/include/linux/kexec.h b/include/linux/kexec.h
> index ce6536f1d269..89bbe150752e 100644
> --- a/include/linux/kexec.h
> +++ b/include/linux/kexec.h
> @@ -369,6 +369,8 @@ extern int kimage_crash_copy_vmcoreinfo(struct kimage *image);
>  
>  extern struct kimage *kexec_image;
>  extern struct kimage *kexec_crash_image;
> +extern bool panic_wants_kexec;
> +extern bool kexec_loading;
>  extern int kexec_load_disabled;
>  
>  #ifndef kexec_flush_icache_page
> diff --git a/kernel/kexec.c b/kernel/kexec.c
> index b5e40f069768..1253f4bb3079 100644
> --- a/kernel/kexec.c
> +++ b/kernel/kexec.c
> @@ -94,14 +94,23 @@ static int do_kexec_load(unsigned long entry, unsigned long nr_segments,
>  	/*
>  	 * Because we write directly to the reserved memory region when loading
>  	 * crash kernels we need a mutex here to prevent multiple crash kernels
> -	 * from attempting to load simultaneously, and to prevent a crash kernel
> -	 * from loading over the top of a in use crash kernel.
> -	 *
> -	 * KISS: always take the mutex.
> +	 * from attempting to load simultaneously.
>  	 */
>  	if (!mutex_trylock(&kexec_mutex))
>  		return -EBUSY;
>  
> +	/*
> +	 * Prevent loading a new crash kernel while one is in use.
> +	 *
> +	 * Pairs with smp_mb() in __crash_kexec().
> +	 */
> +	WRITE_ONCE(kexec_loading, true);
> +	smp_mb();
> +	if (READ_ONCE(panic_wants_kexec)) {
> +		ret = -EBUSY;
> +		goto out_unlock;
> +	}
> +
>  	if (flags & KEXEC_ON_CRASH) {
>  		dest_image = &kexec_crash_image;
>  		if (kexec_crash_image)
> @@ -165,6 +174,7 @@ static int do_kexec_load(unsigned long entry, unsigned long nr_segments,
>  
>  	kimage_free(image);
>  out_unlock:
> +	WRITE_ONCE(kexec_loading, false);
>  	mutex_unlock(&kexec_mutex);
>  	return ret;
>  }
> diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
> index 4d34c78334ce..932cc0d4daa3 100644
> --- a/kernel/kexec_core.c
> +++ b/kernel/kexec_core.c
> @@ -933,6 +933,8 @@ int kimage_load_segment(struct kimage *image,
>  
>  struct kimage *kexec_image;
>  struct kimage *kexec_crash_image;
> +bool panic_wants_kexec;
> +bool kexec_loading;
>  int kexec_load_disabled;
>  #ifdef CONFIG_SYSCTL
>  static struct ctl_table kexec_core_sysctls[] = {
> @@ -964,24 +966,31 @@ late_initcall(kexec_core_sysctl_init);
>   */
>  void __noclone __crash_kexec(struct pt_regs *regs)
>  {
> -	/* Take the kexec_mutex here to prevent sys_kexec_load
> -	 * running on one cpu from replacing the crash kernel
> -	 * we are using after a panic on a different cpu.
> +	/*
> +	 * This should be taking kexec_mutex before doing anything with the
> +	 * kexec_crash_image, but this code can be run in NMI context which
> +	 * means we can't even trylock.
>  	 *
> -	 * If the crash kernel was not located in a fixed area
> -	 * of memory the xchg(&kexec_crash_image) would be
> -	 * sufficient.  But since I reuse the memory...
> +	 * Pairs with smp_mb() in do_kexec_load() and sys_kexec_file_load()
>  	 */
> -	if (mutex_trylock(&kexec_mutex)) {
> -		if (kexec_crash_image) {
> -			struct pt_regs fixed_regs;
> -
> -			crash_setup_regs(&fixed_regs, regs);
> -			crash_save_vmcoreinfo();
> -			machine_crash_shutdown(&fixed_regs);
> -			machine_kexec(kexec_crash_image);
> -		}
> -		mutex_unlock(&kexec_mutex);
> +	WRITE_ONCE(panic_wants_kexec, true);
> +	smp_mb();
> +	/*
> +	 * If we're panic'ing while someone else is messing with the crash
> +	 * kernel, this isn't going to end well.
> +	 */
> +	if (READ_ONCE(kexec_loading)) {
> +		WRITE_ONCE(panic_wants_kexec, false);
> +		return;
> +	}

So this is called from NMI context. The mutex guarantees that only one of
kexec_file_load() or do_kexec_load() runs at a time, but NMIs can happen on
more than one CPU. That means cumulativity matters here, IMHO.

kexec_file_load()/                 NMI0                     NMI1..
do_kexec_load()

set kexec_loading=true
smp_mb()                set panic_wants_kexec=true
                        smp_mb()
                        see kexec_loading=true and
                          conditionally set
                          panic_wants_kexec=false;
                                                 set panic_wants_kexec=true
                                                 smp_mb()
see panic_wants_kexec=true
  conditionally set
  kexec_loading=false
                                                 see kexec_loading=false
                                                 do kexec nmi things.

You see, the conditional clearing of kexec_loading or panic_wants_kexec has
no barrier after it, so for the cumulativity to take effect there should be
an acquire-release pairing, if I am not wrong.

__crash_kexec():

WRITE_ONCE(panic_wants_kexec, true);
smp_mb();
/*
 * If we're panic'ing while someone else is messing with the crash
 * kernel, this isn't going to end well.
 */
if (READ_ONCE(kexec_loading)) {
	smp_store_release(&panic_wants_kexec, false);
	return;
}

kexec_file_load()/do_kexec_load():

WRITE_ONCE(kexec_loading, true);
smp_mb();
if (smp_load_acquire(&panic_wants_kexec)) {
    WRITE_ONCE(kexec_loading, false);
    ...
}

Given all those inputs, I'm sure I'm somewhat lost here..
I also thought about changing the pattern to load-store with the initial
values set, but failed to make it work.

Thanks,
Tao
> +	if (kexec_crash_image) {
> +		struct pt_regs fixed_regs;
> +
> +		crash_setup_regs(&fixed_regs, regs);
> +		crash_save_vmcoreinfo();
> +		machine_crash_shutdown(&fixed_regs);
> +		machine_kexec(kexec_crash_image);
>  	}
>  }
>  STACK_FRAME_NON_STANDARD(__crash_kexec);
> diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
> index 145321a5e798..4bb399e6623e 100644
> --- a/kernel/kexec_file.c
> +++ b/kernel/kexec_file.c
> @@ -337,6 +337,18 @@ SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, initrd_fd,
>  	if (!mutex_trylock(&kexec_mutex))
>  		return -EBUSY;
>  
> +	/*
> +	 * Prevent loading a new crash kernel while one is in use.
> +	 *
> +	 * Pairs with smp_mb() in __crash_kexec().
> +	 */
> +	WRITE_ONCE(kexec_loading, true);
> +	smp_mb();
> +	if (READ_ONCE(panic_wants_kexec)) {
> +		ret = -EBUSY;
> +		goto out_unlock;
> +	}
> +
>  	dest_image = &kexec_image;
>  	if (flags & KEXEC_FILE_ON_CRASH) {
>  		dest_image = &kexec_crash_image;
> @@ -406,6 +418,8 @@ SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, initrd_fd,
>  	if ((flags & KEXEC_FILE_ON_CRASH) && kexec_crash_image)
>  		arch_kexec_protect_crashkres();
>  
> +out_unlock:
> +	WRITE_ONCE(kexec_loading, false);
>  	mutex_unlock(&kexec_mutex);
>  	kimage_free(image);
>  	return ret;
> -- 
> 2.27.0
>
Valentin Schneider June 17, 2022, 11:52 a.m. UTC | #2
Hi Tao,

On 17/06/22 18:42, Tao Zhou wrote:
> Hi Valentin,
>
> On Thu, Jun 16, 2022 at 01:37:09PM +0100, Valentin Schneider wrote:
>> @@ -964,24 +966,31 @@ late_initcall(kexec_core_sysctl_init);
>>   */
>>  void __noclone __crash_kexec(struct pt_regs *regs)
>>  {
>> -	/* Take the kexec_mutex here to prevent sys_kexec_load
>> -	 * running on one cpu from replacing the crash kernel
>> -	 * we are using after a panic on a different cpu.
>> +	/*
>> +	 * This should be taking kexec_mutex before doing anything with the
>> +	 * kexec_crash_image, but this code can be run in NMI context which
>> +	 * means we can't even trylock.
>>  	 *
>> -	 * If the crash kernel was not located in a fixed area
>> -	 * of memory the xchg(&kexec_crash_image) would be
>> -	 * sufficient.  But since I reuse the memory...
>> +	 * Pairs with smp_mb() in do_kexec_load() and sys_kexec_file_load()
>>  	 */
>> -	if (mutex_trylock(&kexec_mutex)) {
>> -		if (kexec_crash_image) {
>> -			struct pt_regs fixed_regs;
>> -
>> -			crash_setup_regs(&fixed_regs, regs);
>> -			crash_save_vmcoreinfo();
>> -			machine_crash_shutdown(&fixed_regs);
>> -			machine_kexec(kexec_crash_image);
>> -		}
>> -		mutex_unlock(&kexec_mutex);
>> +	WRITE_ONCE(panic_wants_kexec, true);
>> +	smp_mb();
>> +	/*
>> +	 * If we're panic'ing while someone else is messing with the crash
>> +	 * kernel, this isn't going to end well.
>> +	 */
>> +	if (READ_ONCE(kexec_loading)) {
>> +		WRITE_ONCE(panic_wants_kexec, false);
>> +		return;
>> +	}
>
> So this is called from NMI context. The mutex guarantees that only one of
> kexec_file_load() or do_kexec_load() runs at a time, but NMIs can happen on
> more than one CPU. That means cumulativity matters here, IMHO.
>

If you look at __crash_kexec() in isolation yes, but if you look at panic()
and nmi_panic() only a single NMI can get in there. I think that is also
true for invocations via crash_kexec().


> kexec_file_load()/                 NMI0                     NMI1..
> do_kexec_load()
>
> set kexec_loading=true
> smp_mb()                set panic_wants_kexec=true
>                         smp_mb()
>                         see kexec_loading=true and
>                           conditionally set
>                           panic_wants_kexec=false;
>                                                  set panic_wants_kexec=true
>                                                  smp_mb()
> see panic_wants_kexec=true
>   conditionally set
>   kexec_loading=false
>                                                  see kexec_loading=false
>                                                  do kexec nmi things.
>
> You see, the conditional clearing of kexec_loading or panic_wants_kexec has
> no barrier after it, so for the cumulativity to take effect there should be
> an acquire-release pairing, if I am not wrong.
>
> __crash_kexec():
>
> WRITE_ONCE(panic_wants_kexec, true);
> smp_mb();
> /*
>  * If we're panic'ing while someone else is messing with the crash
>  * kernel, this isn't going to end well.
>  */
> if (READ_ONCE(kexec_loading)) {
> 	smp_store_release(&panic_wants_kexec, false);
> 	return;
> }
>
> kexec_file_load()/do_kexec_load():
>
> WRITE_ONCE(kexec_loading, true);
> smp_mb();
> if (smp_load_acquire(&panic_wants_kexec)) {
>     WRITE_ONCE(kexec_loading, false);
>     ...
> }
>
> Given all those inputs, I'm sure I'm somewhat lost here..
> I also thought about changing the pattern to load-store with the initial
> values set, but failed to make it work.
>

I'm not sure if further ordering is required here, the base case being

  WRITE_ONCE(panic_wants_kexec, true);         WRITE_ONCE(kexec_loading, true);
  smp_mb();                                    smp_mb();
  v0 = READ_ONCE(kexec_loading);               v1 = READ_ONCE(panic_wants_kexec);

  (see SB+fencembonceonces litmus test)

Which ensures (!v0 && !v1) is never true. If modified to:

  WRITE_ONCE(panic_wants_kexec, true);         WRITE_ONCE(kexec_loading, true);
  smp_mb();                                    smp_mb();
  v0 = READ_ONCE(kexec_loading);               v1 = READ_ONCE(panic_wants_kexec);
  if (v0)                                      if (v1)
          WRITE_ONCE(panic_wants_kexec, false);        WRITE_ONCE(kexec_loading, false);

then "(!v0 && !v1) is never true" still holds, so the exclusivity is
maintained AFAICT.
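
For completeness, this is roughly what I have in mind as a litmus test
(a sketch only, adapted from SB+fencembonceonces; the name and layout are
mine and it should be double-checked with herd7 + the LKMM). The exists
clause below should be reported as never satisfied:

  C crash-kexec-sb

  {}

  P0(int *panic_wants_kexec, int *kexec_loading)
  {
          int r0;

          /* __crash_kexec() side */
          WRITE_ONCE(*panic_wants_kexec, 1);
          smp_mb();
          r0 = READ_ONCE(*kexec_loading);
          if (r0)
                  WRITE_ONCE(*panic_wants_kexec, 0);
  }

  P1(int *panic_wants_kexec, int *kexec_loading)
  {
          int r1;

          /* do_kexec_load() / kexec_file_load() side */
          WRITE_ONCE(*kexec_loading, 1);
          smp_mb();
          r1 = READ_ONCE(*panic_wants_kexec);
          if (r1)
                  WRITE_ONCE(*kexec_loading, 0);
  }

  exists (0:r0=0 /\ 1:r1=0)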
Petr Mladek June 17, 2022, 1:52 p.m. UTC | #3
On Fri 2022-06-17 12:52:05, Valentin Schneider wrote:
> Hi Tao,
> 
> On 17/06/22 18:42, Tao Zhou wrote:
> > Hi Valentin,
> >
> > On Thu, Jun 16, 2022 at 01:37:09PM +0100, Valentin Schneider wrote:
> >> @@ -964,24 +966,31 @@ late_initcall(kexec_core_sysctl_init);
> >>   */
> >>  void __noclone __crash_kexec(struct pt_regs *regs)
> >>  {
> >> -	/* Take the kexec_mutex here to prevent sys_kexec_load
> >> -	 * running on one cpu from replacing the crash kernel
> >> -	 * we are using after a panic on a different cpu.
> >> +	/*
> >> +	 * This should be taking kexec_mutex before doing anything with the
> >> +	 * kexec_crash_image, but this code can be run in NMI context which
> >> +	 * means we can't even trylock.
> >>  	 *
> >> -	 * If the crash kernel was not located in a fixed area
> >> -	 * of memory the xchg(&kexec_crash_image) would be
> >> -	 * sufficient.  But since I reuse the memory...
> >> +	 * Pairs with smp_mb() in do_kexec_load() and sys_kexec_file_load()
> >>  	 */
> >> -	if (mutex_trylock(&kexec_mutex)) {
> >> -		if (kexec_crash_image) {
> >> -			struct pt_regs fixed_regs;
> >> -
> >> -			crash_setup_regs(&fixed_regs, regs);
> >> -			crash_save_vmcoreinfo();
> >> -			machine_crash_shutdown(&fixed_regs);
> >> -			machine_kexec(kexec_crash_image);
> >> -		}
> >> -		mutex_unlock(&kexec_mutex);
> >> +	WRITE_ONCE(panic_wants_kexec, true);
> >> +	smp_mb();
> >> +	/*
> >> +	 * If we're panic'ing while someone else is messing with the crash
> >> +	 * kernel, this isn't going to end well.
> >> +	 */
> >> +	if (READ_ONCE(kexec_loading)) {
> >> +		WRITE_ONCE(panic_wants_kexec, false);
> >> +		return;
> >> +	}
> >
> > So this is called from NMI context. The mutex guarantees that only one of
> > kexec_file_load() or do_kexec_load() runs at a time, but NMIs can happen on
> > more than one CPU. That means cumulativity matters here, IMHO.
> >
> 
> If you look at __crash_kexec() in isolation yes, but if you look at panic()
> and nmi_panic() only a single NMI can get in there. I think that is also
> true for invocations via crash_kexec().

It is true that panic() could be called only once, see this code
in panic():

	/*
	 * Only one CPU is allowed to execute the panic code from here. For
	 * multiple parallel invocations of panic, all other CPUs either
	 * stop themself or will wait until they are stopped by the 1st CPU
	 * with smp_send_stop().
	 */
	this_cpu = raw_smp_processor_id();
	old_cpu  = atomic_cmpxchg(&panic_cpu, PANIC_CPU_INVALID, this_cpu);

	if (old_cpu != PANIC_CPU_INVALID && old_cpu != this_cpu)
		panic_smp_self_stop();


On the other hand, the approach with two variables is strange
and invites exactly these questions.

If a trylock is enough, then the mutex can be replaced by two
simple atomic operations. The mutex would be needed only
when a user really had to wait for another one.


	atomic_t crash_kexec_lock;

	/* trylock part */
	if (atomic_cmpxchg_acquire(&crash_kexec_lock, 0, 1) != 0)
		return -EBUSY;

	/* do anything guarded by crash_kexec_lock */

	/* release lock */
	atomic_set_release(&crash_kexec_lock, 0);

The _acquire, _release variants will do the barriers correctly.

Best Regards,
Petr
Valentin Schneider June 17, 2022, 2:46 p.m. UTC | #4
On 17/06/22 15:52, Petr Mladek wrote:
> On Fri 2022-06-17 12:52:05, Valentin Schneider wrote:
>> If you look at __crash_kexec() in isolation yes, but if you look at panic()
>> and nmi_panic() only a single NMI can get in there. I think that is also
>> true for invocations via crash_kexec().
>
> It is true that panic() could be called only once, see this code
> in panic():
>
> 	 * Only one CPU is allowed to execute the panic code from here. For
> 	this_cpu = raw_smp_processor_id();
> 	old_cpu  = atomic_cmpxchg(&panic_cpu, PANIC_CPU_INVALID, this_cpu);
>
> 	if (old_cpu != PANIC_CPU_INVALID && old_cpu != this_cpu)
> 		panic_smp_self_stop();
>
>
> On the other hand, the approach with two variables is strange
> and invites exactly these questions.
>
> If a trylock is enough, then the mutex can be replaced by two
> simple atomic operations. The mutex would be needed only
> when a user really had to wait for another one.
>
>
> 	atomic_t crash_kexec_lock;
>
> 	/* trylock part */
> 	if (atomic_cmpxchg_acquire(&crash_kexec_lock, 0, 1) != 0)
> 		return -EBUSY;
>
> 	/* do anything guarded by crash_kexec_lock */
>
> 	/* release lock */
> 	atomic_set_release(&crash_kexec_lock, 0);
>
> The _acquire, _release variants will do the barriers correctly.
>

Looks saner! I can't think of anything wrong with it so I'll shove that in
v2. Thanks!
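
For the record, roughly what I'd expect v2's __crash_kexec() to look like
(an untested sketch on top of your suggestion, naming provisional; the
load paths would do the same cmpxchg and return -EBUSY on failure):

	atomic_t crash_kexec_lock = ATOMIC_INIT(0);

	void __noclone __crash_kexec(struct pt_regs *regs)
	{
		/*
		 * A bare cmpxchg: no owner, no boosting, no sleeping, so
		 * this is usable from NMI context. Pairs with the
		 * atomic_set_release() here and in the load paths.
		 */
		if (atomic_cmpxchg_acquire(&crash_kexec_lock, 0, 1) != 0)
			return;

		if (kexec_crash_image) {
			struct pt_regs fixed_regs;

			crash_setup_regs(&fixed_regs, regs);
			crash_save_vmcoreinfo();
			machine_crash_shutdown(&fixed_regs);
			machine_kexec(kexec_crash_image);
		}

		atomic_set_release(&crash_kexec_lock, 0);
	}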

> Best Regards,
> Petr
Sebastian Andrzej Siewior June 17, 2022, 3:13 p.m. UTC | #5
On 2022-06-16 13:37:09 [+0100], Valentin Schneider wrote:
> Regarding the original explanation for the WARN & return:
> 
> I don't get why 2) is a problem - if the lock is acquired by the trylock
> then the critical section will be run without interruption since it
> cannot sleep, the interrupted task may get boosted but that will not
> have any actual impact AFAICT.

boosting an unrelated task is considered wrong. I don't know how bad
it gets in terms of lock chains since a task is set as owner which did
not actually ask for the lock.

> Regardless, even if this doesn't sleep, the ->wait_lock in the slowpath
> isn't NMI safe so this needs changing.

This includes the unlock path which may wake a waiter and deboost.

> I've thought about trying to defer the kexec out of an NMI (or IRQ)
> context, but that pretty much means deferring the panic() which I'm
> not sure is such a great idea.

If we could defer it out of NMI on RT then it would work for non-RT, too.
But if the system is "stuck" and the NMI is the only thing left to respond,
then I guess deferring is not a great idea.

Sebastian
Valentin Schneider June 17, 2022, 4:09 p.m. UTC | #6
On 17/06/22 17:13, Sebastian Andrzej Siewior wrote:
> On 2022-06-16 13:37:09 [+0100], Valentin Schneider wrote:
>> Regarding the original explanation for the WARN & return:
>> 
>> I don't get why 2) is a problem - if the lock is acquired by the trylock
>> then the critical section will be run without interruption since it
>> cannot sleep, the interrupted task may get boosted but that will not
>> have any actual impact AFAICT.
>
> boosting an unrelated task is considered wrong. I don't know how bad
> it gets in terms of lock chains since a task is set as owner which did
> not actually ask for the lock.
>
>> Regardless, even if this doesn't sleep, the ->wait_lock in the slowpath
>> isn't NMI safe so this needs changing.
>
> This includes the unlock path which may wake a waiter and deboost.
>

Both are good points, thank you for lighting my lantern :)

>> I've thought about trying to defer the kexec out of an NMI (or IRQ)
>> context, but that pretty much means deferring the panic() which I'm
>> not sure is such a great idea.
>
> If we could defer it out of NMI on RT then it would work for non-RT, too.
> But if the system is "stuck" and the NMI is the only thing left to respond,
> then I guess deferring is not a great idea.
>

Those were pretty much my thoughts. I *think* panic() can be re-entrant on
the same CPU if the first entry was from NMI, but that still requires being
able to schedule a thread that panics which isn't a given after getting
that panic NMI. So for now actually doing the kexec in NMI (or IRQ) context
seems to be the less hazardous route. 

> Sebastian
Sebastian Andrzej Siewior June 17, 2022, 4:53 p.m. UTC | #7
On 2022-06-17 17:09:24 [+0100], Valentin Schneider wrote:
> Those were pretty much my thoughts. I *think* panic() can be re-entrant on
> the same CPU if the first entry was from NMI, but that still requires being
> able to schedule a thread that panics which isn't a given after getting
> that panic NMI. So for now actually doing the kexec in NMI (or IRQ) context
> seems to be the less hazardous route. 

most likely. Just get rid of the mutex and we should be good to go ;)

Sebastian
kernel test robot June 22, 2022, 3:34 p.m. UTC | #8
Hi Valentin,

I love your patch! Yet something to improve:

[auto build test ERROR on soc/for-next]
[also build test ERROR on linus/master v5.19-rc3 next-20220622]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/intel-lab-lkp/linux/commits/Valentin-Schneider/panic-kexec-Don-t-mutex_trylock-in-__crash_kexec/20220616-203915
base:   https://git.kernel.org/pub/scm/linux/kernel/git/soc/soc.git for-next
config: riscv-randconfig-c024-20220622 (https://download.01.org/0day-ci/archive/20220622/202206222322.WsTIFQGo-lkp@intel.com/config)
compiler: riscv64-linux-gcc (GCC) 11.3.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/d05873bf87d81eb3ddfa1fe20b3743cc4a1ab259
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Valentin-Schneider/panic-kexec-Don-t-mutex_trylock-in-__crash_kexec/20220616-203915
        git checkout d05873bf87d81eb3ddfa1fe20b3743cc4a1ab259
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.3.0 make.cross W=1 O=build_dir ARCH=riscv SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   riscv64-linux-ld: kernel/kexec_core.o: in function `kimage_crash_copy_vmcoreinfo':
   kernel/kexec_core.c:502: undefined reference to `machine_kexec_cleanup'
   riscv64-linux-ld: kernel/kexec_core.o: in function `clear_bit':
>> arch/riscv/include/asm/bitops.h:129: undefined reference to `machine_crash_shutdown'
   riscv64-linux-ld: kernel/kexec_core.o: in function `kimage_free_pages':
   kernel/kexec_core.c:337: undefined reference to `machine_kexec'
   riscv64-linux-ld: kernel/kexec_core.o: in function `kimage_free':
   kernel/kexec_core.c:651: undefined reference to `riscv_crash_save_regs'
   riscv64-linux-ld: kernel/kexec_core.o: in function `__nr_to_section':
   include/linux/mmzone.h:1403: undefined reference to `machine_shutdown'
   riscv64-linux-ld: include/linux/mmzone.h:1403: undefined reference to `machine_kexec'
   riscv64-linux-ld: kernel/kexec_file.o: in function `__section_mem_map_addr':
   include/linux/mmzone.h:1434: undefined reference to `machine_kexec_prepare'


vim +129 arch/riscv/include/asm/bitops.h

fab957c11efe2f Palmer Dabbelt 2017-07-10  117  
fab957c11efe2f Palmer Dabbelt 2017-07-10  118  /**
fab957c11efe2f Palmer Dabbelt 2017-07-10  119   * clear_bit - Clears a bit in memory
fab957c11efe2f Palmer Dabbelt 2017-07-10  120   * @nr: Bit to clear
fab957c11efe2f Palmer Dabbelt 2017-07-10  121   * @addr: Address to start counting from
fab957c11efe2f Palmer Dabbelt 2017-07-10  122   *
fab957c11efe2f Palmer Dabbelt 2017-07-10  123   * Note: there are no guarantees that this function will not be reordered
fab957c11efe2f Palmer Dabbelt 2017-07-10  124   * on non x86 architectures, so if you are writing portable code,
fab957c11efe2f Palmer Dabbelt 2017-07-10  125   * make sure not to rely on its reordering guarantees.
fab957c11efe2f Palmer Dabbelt 2017-07-10  126   */
fab957c11efe2f Palmer Dabbelt 2017-07-10  127  static inline void clear_bit(int nr, volatile unsigned long *addr)
fab957c11efe2f Palmer Dabbelt 2017-07-10  128  {
fab957c11efe2f Palmer Dabbelt 2017-07-10 @129  	__op_bit(and, __NOT, nr, addr);
fab957c11efe2f Palmer Dabbelt 2017-07-10  130  }
fab957c11efe2f Palmer Dabbelt 2017-07-10  131

Patch

diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index ce6536f1d269..89bbe150752e 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -369,6 +369,8 @@  extern int kimage_crash_copy_vmcoreinfo(struct kimage *image);
 
 extern struct kimage *kexec_image;
 extern struct kimage *kexec_crash_image;
+extern bool panic_wants_kexec;
+extern bool kexec_loading;
 extern int kexec_load_disabled;
 
 #ifndef kexec_flush_icache_page
diff --git a/kernel/kexec.c b/kernel/kexec.c
index b5e40f069768..1253f4bb3079 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -94,14 +94,23 @@  static int do_kexec_load(unsigned long entry, unsigned long nr_segments,
 	/*
 	 * Because we write directly to the reserved memory region when loading
 	 * crash kernels we need a mutex here to prevent multiple crash kernels
-	 * from attempting to load simultaneously, and to prevent a crash kernel
-	 * from loading over the top of a in use crash kernel.
-	 *
-	 * KISS: always take the mutex.
+	 * from attempting to load simultaneously.
 	 */
 	if (!mutex_trylock(&kexec_mutex))
 		return -EBUSY;
 
+	/*
+	 * Prevent loading a new crash kernel while one is in use.
+	 *
+	 * Pairs with smp_mb() in __crash_kexec().
+	 */
+	WRITE_ONCE(kexec_loading, true);
+	smp_mb();
+	if (READ_ONCE(panic_wants_kexec)) {
+		ret = -EBUSY;
+		goto out_unlock;
+	}
+
 	if (flags & KEXEC_ON_CRASH) {
 		dest_image = &kexec_crash_image;
 		if (kexec_crash_image)
@@ -165,6 +174,7 @@  static int do_kexec_load(unsigned long entry, unsigned long nr_segments,
 
 	kimage_free(image);
 out_unlock:
+	WRITE_ONCE(kexec_loading, false);
 	mutex_unlock(&kexec_mutex);
 	return ret;
 }
diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index 4d34c78334ce..932cc0d4daa3 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -933,6 +933,8 @@  int kimage_load_segment(struct kimage *image,
 
 struct kimage *kexec_image;
 struct kimage *kexec_crash_image;
+bool panic_wants_kexec;
+bool kexec_loading;
 int kexec_load_disabled;
 #ifdef CONFIG_SYSCTL
 static struct ctl_table kexec_core_sysctls[] = {
@@ -964,24 +966,31 @@  late_initcall(kexec_core_sysctl_init);
  */
 void __noclone __crash_kexec(struct pt_regs *regs)
 {
-	/* Take the kexec_mutex here to prevent sys_kexec_load
-	 * running on one cpu from replacing the crash kernel
-	 * we are using after a panic on a different cpu.
+	/*
+	 * This should be taking kexec_mutex before doing anything with the
+	 * kexec_crash_image, but this code can be run in NMI context which
+	 * means we can't even trylock.
 	 *
-	 * If the crash kernel was not located in a fixed area
-	 * of memory the xchg(&kexec_crash_image) would be
-	 * sufficient.  But since I reuse the memory...
+	 * Pairs with smp_mb() in do_kexec_load() and sys_kexec_file_load()
 	 */
-	if (mutex_trylock(&kexec_mutex)) {
-		if (kexec_crash_image) {
-			struct pt_regs fixed_regs;
-
-			crash_setup_regs(&fixed_regs, regs);
-			crash_save_vmcoreinfo();
-			machine_crash_shutdown(&fixed_regs);
-			machine_kexec(kexec_crash_image);
-		}
-		mutex_unlock(&kexec_mutex);
+	WRITE_ONCE(panic_wants_kexec, true);
+	smp_mb();
+	/*
+	 * If we're panic'ing while someone else is messing with the crash
+	 * kernel, this isn't going to end well.
+	 */
+	if (READ_ONCE(kexec_loading)) {
+		WRITE_ONCE(panic_wants_kexec, false);
+		return;
+	}
+
+	if (kexec_crash_image) {
+		struct pt_regs fixed_regs;
+
+		crash_setup_regs(&fixed_regs, regs);
+		crash_save_vmcoreinfo();
+		machine_crash_shutdown(&fixed_regs);
+		machine_kexec(kexec_crash_image);
 	}
 }
 STACK_FRAME_NON_STANDARD(__crash_kexec);
diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index 145321a5e798..4bb399e6623e 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -337,6 +337,18 @@  SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, initrd_fd,
 	if (!mutex_trylock(&kexec_mutex))
 		return -EBUSY;
 
+	/*
+	 * Prevent loading a new crash kernel while one is in use.
+	 *
+	 * Pairs with smp_mb() in __crash_kexec().
+	 */
+	WRITE_ONCE(kexec_loading, true);
+	smp_mb();
+	if (READ_ONCE(panic_wants_kexec)) {
+		ret = -EBUSY;
+		goto out_unlock;
+	}
+
 	dest_image = &kexec_image;
 	if (flags & KEXEC_FILE_ON_CRASH) {
 		dest_image = &kexec_crash_image;
@@ -406,6 +418,8 @@  SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, initrd_fd,
 	if ((flags & KEXEC_FILE_ON_CRASH) && kexec_crash_image)
 		arch_kexec_protect_crashkres();
 
+out_unlock:
+	WRITE_ONCE(kexec_loading, false);
 	mutex_unlock(&kexec_mutex);
 	kimage_free(image);
 	return ret;