
[v2] panic, kexec: Make __crash_kexec() NMI safe

Message ID 20220620111520.1039685-1-vschneid@redhat.com
State: New
Series: [v2] panic, kexec: Make __crash_kexec() NMI safe

Commit Message

Valentin Schneider June 20, 2022, 11:15 a.m. UTC
Attempting to get a crash dump out of a debug PREEMPT_RT kernel via an NMI
panic() doesn't work. The cause of that lies in the PREEMPT_RT definition
of mutex_trylock():

	if (IS_ENABLED(CONFIG_DEBUG_RT_MUTEXES) && WARN_ON_ONCE(!in_task()))
		return 0;

This prevents an NMI panic() from executing the main body of
__crash_kexec() which does the actual kexec into the kdump kernel.
The warning and return are explained by:

  6ce47fd961fa ("rtmutex: Warn if trylock is called from hard/softirq context")
  [...]
  The reasons for this are:

      1) There is a potential deadlock in the slowpath

      2) Another cpu which blocks on the rtmutex will boost the task
	 which allegedly locked the rtmutex, but that cannot work
	 because the hard/softirq context borrows the task context.

Furthermore, grabbing the lock isn't NMI safe, so do away with it and
use an atomic variable to serialize reads vs writes of
kexec_crash_image.
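
In essence, both the load side and the panic side end up bracketing
their access to kexec_crash_image like so (a sketch of the pattern
extracted from the diff below; the diff is the reference):

	if (atomic_cmpxchg_acquire(&crash_kexec_lock, 0, 1))
		return;	/* contended: someone else owns the image */
	/* ... read or replace kexec_crash_image ... */
	atomic_set_release(&crash_kexec_lock, 0);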

Tested by triggering NMI panics via:

  $ echo 1 > /proc/sys/kernel/panic_on_unrecovered_nmi
  $ echo 1 > /proc/sys/kernel/unknown_nmi_panic
  $ echo 1 > /proc/sys/kernel/panic

  $ ipmitool power diag

Fixes: 6ce47fd961fa ("rtmutex: Warn if trylock is called from hard/softirq context")
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
v1 -> v2
++++++++

o Changed from Peterson-like synchronization to simpler atomic_cmpxchg
  (Petr)
o Slightly reworded changelog
o Added Fixes: tag. Technically it should point even further back, since
  kexec can happen in an NMI, but that isn't such a clear target
---
 include/linux/kexec.h |  1 +
 kernel/kexec.c        | 16 ++++++++++++----
 kernel/kexec_core.c   | 36 +++++++++++++++++++-----------------
 kernel/kexec_file.c   | 11 +++++++++++
 4 files changed, 43 insertions(+), 21 deletions(-)

Comments

Sebastian Andrzej Siewior June 23, 2022, 9:31 a.m. UTC | #1
On 2022-06-20 12:15:20 [+0100], Valentin Schneider wrote:
> Attempting to get a crash dump out of a debug PREEMPT_RT kernel via an NMI
> panic() doesn't work. The cause of that lies in the PREEMPT_RT definition
> of mutex_trylock():
> 
> Fixes: 6ce47fd961fa ("rtmutex: Warn if trylock is called from hard/softirq context")
> Signed-off-by: Valentin Schneider <vschneid@redhat.com>

Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

> ---
> v1 -> v2
> ++++++++
> 
> o Changed from Peterson-like synchronization to simpler atomic_cmpxchg
>   (Petr)
> o Slightly reworded changelog
> o Added Fixes: tag. Technically it should point even further back, since
>   kexec can happen in an NMI, but that isn't such a clear target

RT-wise it would be needed for each release.
There is also a mutex_unlock() in case an image is missing. This can go
via the scheduler if there is a waiter, which does not look good with an
NMI in the picture.

Sebastian
Valentin Schneider June 23, 2022, 11:39 a.m. UTC | #2
On 23/06/22 11:31, Sebastian Andrzej Siewior wrote:
> On 2022-06-20 12:15:20 [+0100], Valentin Schneider wrote:
>> Attempting to get a crash dump out of a debug PREEMPT_RT kernel via an NMI
>> panic() doesn't work. The cause of that lies in the PREEMPT_RT definition
>> of mutex_trylock():
>> 
> …
>
>> Fixes: 6ce47fd961fa ("rtmutex: Warn if trylock is called from hard/softirq context")
>> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
>
> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
>
>> ---
>> v1 -> v2
>> ++++++++
>> 
>> o Changed from Peterson-like synchronization to simpler atomic_cmpxchg
>>   (Petr)
>> o Slightly reworded changelog
>> o Added Fixes: tag. Technically it should point even further back, since
>>   kexec can happen in an NMI, but that isn't such a clear target
>
> RT-wise it would be needed for each release.

So git tells me the Fixes: commit dates from v4.2; from [1] and [2] I get
that both current longterm stable and stable-rt trees go as far back as
v4.9, so I'm guessing if that gets picked up for the stable trees then it
should make its way into the stable -rt trees?

[1]: https://wiki.linuxfoundation.org/realtime/preempt_rt_versions
[2]: https://www.kernel.org/category/releases.html

> There is also a mutex_unlock() in case an image is missing. This can go
> via the scheduler if there is a waiter, which does not look good with an
> NMI in the picture.
>
> Sebastian
Sebastian Andrzej Siewior June 23, 2022, 1:35 p.m. UTC | #3
On 2022-06-23 12:39:57 [+0100], Valentin Schneider wrote:
> > RT-wise it would be needed for each release.
> 
> So git tells me the Fixes: commit dates from v4.2; from [1] and [2] I get
> that both current longterm stable and stable-rt trees go as far back as
> v4.9, so I'm guessing if that gets picked up for the stable trees then it
> should make its way into the stable -rt trees?

correct, if it goes -stable. Just pointing it out. But the
mutex_unlock() looks like it might also be relevant for non-RT kernels.

Sebastian
Baoquan He June 24, 2022, 1:30 a.m. UTC | #4
On 06/20/22 at 12:15pm, Valentin Schneider wrote:
> Attempting to get a crash dump out of a debug PREEMPT_RT kernel via an NMI
> panic() doesn't work. The cause of that lies in the PREEMPT_RT definition
> of mutex_trylock():
> 
> 	if (IS_ENABLED(CONFIG_DEBUG_RT_MUTEXES) && WARN_ON_ONCE(!in_task()))
> 		return 0;
> 
> This prevents an NMI panic() from executing the main body of
> __crash_kexec() which does the actual kexec into the kdump kernel.
> The warning and return are explained by:
> 
>   6ce47fd961fa ("rtmutex: Warn if trylock is called from hard/softirq context")
>   [...]
>   The reasons for this are:
> 
>       1) There is a potential deadlock in the slowpath
> 
>       2) Another cpu which blocks on the rtmutex will boost the task
> 	 which allegedly locked the rtmutex, but that cannot work
> 	 because the hard/softirq context borrows the task context.
> 
> Furthermore, grabbing the lock isn't NMI safe, so do away with it and
> use an atomic variable to serialize reads vs writes of
> kexec_crash_image.
> 
> Tested by triggering NMI panics via:
> 
>   $ echo 1 > /proc/sys/kernel/panic_on_unrecovered_nmi
>   $ echo 1 > /proc/sys/kernel/unknown_nmi_panic
>   $ echo 1 > /proc/sys/kernel/panic
> 
>   $ ipmitool power diag
> 
> Fixes: 6ce47fd961fa ("rtmutex: Warn if trylock is called from hard/softirq context")
> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
> ---
> v1 -> v2
> ++++++++
> 
> o Changed from Peterson-like synchronization to simpler atomic_cmpxchg
>   (Petr)
> o Slightly reworded changelog
> o Added Fixes: tag. Technically it should point even further back, since
>   kexec can happen in an NMI, but that isn't such a clear target
> ---
>  include/linux/kexec.h |  1 +
>  kernel/kexec.c        | 16 ++++++++++++----
>  kernel/kexec_core.c   | 36 +++++++++++++++++++-----------------
>  kernel/kexec_file.c   | 11 +++++++++++
>  4 files changed, 43 insertions(+), 21 deletions(-)
> 
> diff --git a/include/linux/kexec.h b/include/linux/kexec.h
> index ce6536f1d269..5849a15ae3dd 100644
> --- a/include/linux/kexec.h
> +++ b/include/linux/kexec.h
> @@ -369,6 +369,7 @@ extern int kimage_crash_copy_vmcoreinfo(struct kimage *image);
>  
>  extern struct kimage *kexec_image;
>  extern struct kimage *kexec_crash_image;
> +extern atomic_t crash_kexec_lock;
>  extern int kexec_load_disabled;
>  
>  #ifndef kexec_flush_icache_page
> diff --git a/kernel/kexec.c b/kernel/kexec.c
> index b5e40f069768..73e0df2c608f 100644
> --- a/kernel/kexec.c
> +++ b/kernel/kexec.c
> @@ -94,14 +94,20 @@ static int do_kexec_load(unsigned long entry, unsigned long nr_segments,
>  	/*
>  	 * Because we write directly to the reserved memory region when loading
>  	 * crash kernels we need a mutex here to prevent multiple crash kernels
> -	 * from attempting to load simultaneously, and to prevent a crash kernel
> -	 * from loading over the top of a in use crash kernel.
> -	 *
> -	 * KISS: always take the mutex.
> +	 * from attempting to load simultaneously.
>  	 */
>  	if (!mutex_trylock(&kexec_mutex))
>  		return -EBUSY;

So kexec_mutex has degenerated to only avoiding simultaneous loading;
should we rename it to reflect that, e.g. kexec_load_mutex?

>  
> +	/*
> +	 * Prevent loading a new crash kernel while one is in use.
> +	 * See associated comment in __crash_kexec().
> +	 */
> +	if (atomic_cmpxchg_acquire(&crash_kexec_lock, 0, 1)) {
> +		ret = -EBUSY;
> +		goto out_unlock_mutex;
> +	}
> +
>  	if (flags & KEXEC_ON_CRASH) {
>  		dest_image = &kexec_crash_image;
>  		if (kexec_crash_image)
> @@ -165,6 +171,8 @@ static int do_kexec_load(unsigned long entry, unsigned long nr_segments,
>  
>  	kimage_free(image);
>  out_unlock:
> +	atomic_set_release(&crash_kexec_lock, 0);
> +out_unlock_mutex:
>  	mutex_unlock(&kexec_mutex);
>  	return ret;
>  }
> diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
> index 4d34c78334ce..f957109a266c 100644
> --- a/kernel/kexec_core.c
> +++ b/kernel/kexec_core.c
> @@ -933,6 +933,7 @@ int kimage_load_segment(struct kimage *image,
>  
>  struct kimage *kexec_image;
>  struct kimage *kexec_crash_image;
> +atomic_t crash_kexec_lock = ATOMIC_INIT(0);
>  int kexec_load_disabled;
>  #ifdef CONFIG_SYSCTL
>  static struct ctl_table kexec_core_sysctls[] = {
> @@ -964,25 +965,26 @@ late_initcall(kexec_core_sysctl_init);
>   */
>  void __noclone __crash_kexec(struct pt_regs *regs)
>  {
> -	/* Take the kexec_mutex here to prevent sys_kexec_load
> -	 * running on one cpu from replacing the crash kernel
> -	 * we are using after a panic on a different cpu.
> -	 *
> -	 * If the crash kernel was not located in a fixed area
> -	 * of memory the xchg(&kexec_crash_image) would be
> -	 * sufficient.  But since I reuse the memory...
> +	/*
> +	 * This should be taking kexec_mutex before doing anything with the
> +	 * kexec_crash_image, but this code can be run in NMI context which
> +	 * means we can't even trylock. This is circumvented by using an
> +	 * atomic variable that is *also* used by the codepaths that take
> +	 * the mutex to modify kexec_crash_image.
>  	 */
> -	if (mutex_trylock(&kexec_mutex)) {
> -		if (kexec_crash_image) {
> -			struct pt_regs fixed_regs;
> -
> -			crash_setup_regs(&fixed_regs, regs);
> -			crash_save_vmcoreinfo();
> -			machine_crash_shutdown(&fixed_regs);
> -			machine_kexec(kexec_crash_image);
> -		}
> -		mutex_unlock(&kexec_mutex);
> +	if (atomic_cmpxchg_acquire(&crash_kexec_lock, 0, 1))
> +		return;
> +
> +	if (kexec_crash_image) {
> +		struct pt_regs fixed_regs;
> +
> +		crash_setup_regs(&fixed_regs, regs);
> +		crash_save_vmcoreinfo();
> +		machine_crash_shutdown(&fixed_regs);
> +		machine_kexec(kexec_crash_image);
>  	}
> +
> +	atomic_set_release(&crash_kexec_lock, 0);
>  }
>  STACK_FRAME_NON_STANDARD(__crash_kexec);
>  
> diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
> index 145321a5e798..3faec031cfc9 100644
> --- a/kernel/kexec_file.c
> +++ b/kernel/kexec_file.c
> @@ -337,6 +337,15 @@ SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, initrd_fd,
>  	if (!mutex_trylock(&kexec_mutex))
>  		return -EBUSY;
>  
> +	/*
> +	 * Prevent loading a new crash kernel while one is in use.
> +	 * See associated comment in __crash_kexec().
> +	 */
> +	if (atomic_cmpxchg_acquire(&crash_kexec_lock, 0, 1)) {
> +		ret = -EBUSY;
> +		goto out_mutex_unlock;
> +	}
> +
>  	dest_image = &kexec_image;
>  	if (flags & KEXEC_FILE_ON_CRASH) {
>  		dest_image = &kexec_crash_image;
> @@ -406,6 +415,8 @@ SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, initrd_fd,
>  	if ((flags & KEXEC_FILE_ON_CRASH) && kexec_crash_image)
>  		arch_kexec_protect_crashkres();
>  
> +	atomic_set_release(&crash_kexec_lock, 0);
> +out_mutex_unlock:
>  	mutex_unlock(&kexec_mutex);
>  	kimage_free(image);
>  	return ret;
> -- 
> 2.31.1
>
Valentin Schneider June 24, 2022, 1:37 p.m. UTC | #5
On 24/06/22 09:30, Baoquan He wrote:
> On 06/20/22 at 12:15pm, Valentin Schneider wrote:
>> @@ -94,14 +94,20 @@ static int do_kexec_load(unsigned long entry, unsigned long nr_segments,
>>  	/*
>>  	 * Because we write directly to the reserved memory region when loading
>>  	 * crash kernels we need a mutex here to prevent multiple crash kernels
>> -	 * from attempting to load simultaneously, and to prevent a crash kernel
>> -	 * from loading over the top of a in use crash kernel.
>> -	 *
>> -	 * KISS: always take the mutex.
>> +	 * from attempting to load simultaneously.
>>  	 */
>>  	if (!mutex_trylock(&kexec_mutex))
>>  		return -EBUSY;
>
> So kexec_mutex has degenerated to only avoiding simultaneous loading;
> should we rename it to reflect that, e.g. kexec_load_mutex?
>

It's also serializing crash_get_memory_size() and crash_shrink_memory();
more generally it should still be the preferred serialization mechanism,
as it's a "proper" lock visible to instrumentation; the atomic variable
is a side character for the NMI case.
Eric W. Biederman June 25, 2022, 5:04 p.m. UTC | #6
Valentin Schneider <vschneid@redhat.com> writes:

> Attempting to get a crash dump out of a debug PREEMPT_RT kernel via an NMI
> panic() doesn't work. The cause of that lies in the PREEMPT_RT definition
> of mutex_trylock():
>
> 	if (IS_ENABLED(CONFIG_DEBUG_RT_MUTEXES) && WARN_ON_ONCE(!in_task()))
> 		return 0;
>
> This prevents an NMI panic() from executing the main body of
> __crash_kexec() which does the actual kexec into the kdump kernel.
> The warning and return are explained by:
>
>   6ce47fd961fa ("rtmutex: Warn if trylock is called from hard/softirq context")
>   [...]
>   The reasons for this are:
>
>       1) There is a potential deadlock in the slowpath
>
>       2) Another cpu which blocks on the rtmutex will boost the task
> 	 which allegedly locked the rtmutex, but that cannot work
> 	 because the hard/softirq context borrows the task context.
>
> Furthermore, grabbing the lock isn't NMI safe, so do away with it and
> use an atomic variable to serialize reads vs writes of
> kexec_crash_image.
>
> Tested by triggering NMI panics via:
>
>   $ echo 1 > /proc/sys/kernel/panic_on_unrecovered_nmi
>   $ echo 1 > /proc/sys/kernel/unknown_nmi_panic
>   $ echo 1 > /proc/sys/kernel/panic
>
>   $ ipmitool power diag
>
> Fixes: 6ce47fd961fa ("rtmutex: Warn if trylock is called from hard/softirq context")
> Signed-off-by: Valentin Schneider <vschneid@redhat.com>

I am not particularly fond of this patch as it adds more complexity than
is necessary to solve the problem.

Calling a spade a spade, PREEMPT_RT's mutex_trylock implementation is
broken, as it cannot support the use cases of an ordinary mutex_trylock.
I have not seen (possibly I skimmed too quickly) anywhere in the
discussion why PREEMPT_RT is not being fixed.  Looking at the code,
there is enough going on in try_to_take_rt_mutex that I can imagine
some part of that code is not NMI safe.  So I can believe
PREEMPT_RT may be unfixably broken.


At this point I recommend going back to being ``unconventional'' with
the kexec locking and effectively reverting commit 8c5a1cf0ad3a ("kexec:
use a mutex for locking rather than xchg()").

That would also mean that we don't have to worry about the lockdep code
doing something weird in the future and breaking kexec.

Your change, starting to use atomic_cmpxchg, is almost halfway to a
revert of commit 8c5a1cf0ad3a ("kexec: use a mutex for locking rather
than xchg()").  So we might as well go the whole way and just document
that the kexec-on-panic code cannot use conventional kernel locking
primitives and has to dig deep and build its own.  At which point it
makes no sense for the rest of the kexec code to use anything different.

Eric
Baoquan He June 26, 2022, 10:37 a.m. UTC | #7
On 06/24/22 at 02:37pm, Valentin Schneider wrote:
> On 24/06/22 09:30, Baoquan He wrote:
> > On 06/20/22 at 12:15pm, Valentin Schneider wrote:
> >> @@ -94,14 +94,20 @@ static int do_kexec_load(unsigned long entry, unsigned long nr_segments,
> >>  	/*
> >>  	 * Because we write directly to the reserved memory region when loading
> >>  	 * crash kernels we need a mutex here to prevent multiple crash kernels
> >> -	 * from attempting to load simultaneously, and to prevent a crash kernel
> >> -	 * from loading over the top of a in use crash kernel.
> >> -	 *
> >> -	 * KISS: always take the mutex.
> >> +	 * from attempting to load simultaneously.
> >>  	 */
> >>  	if (!mutex_trylock(&kexec_mutex))
> >>  		return -EBUSY;
> >
> > So kexec_mutex has degenerated to only avoiding simultaneous loading;
> > should we rename it to reflect that, e.g. kexec_load_mutex?
> >
> 
> It's also serializing crash_get_memory_size() and crash_shrink_memory();
> more generally it should still be the preferred serialization mechanism,
> as it's a "proper" lock visible to instrumentation; the atomic variable
> is a side character for the NMI case.

You are right. I only checked the code comment in this place. Then this
patch looks good to me, thx.

Acked-by: Baoquan He <bhe@redhat.com>
Baoquan He June 26, 2022, 10:45 a.m. UTC | #8
On 06/26/22 at 06:37pm, Baoquan He wrote:
> On 06/24/22 at 02:37pm, Valentin Schneider wrote:
> > On 24/06/22 09:30, Baoquan He wrote:
> > > On 06/20/22 at 12:15pm, Valentin Schneider wrote:
> > >> @@ -94,14 +94,20 @@ static int do_kexec_load(unsigned long entry, unsigned long nr_segments,
> > >>  	/*
> > >>  	 * Because we write directly to the reserved memory region when loading
> > >>  	 * crash kernels we need a mutex here to prevent multiple crash kernels
> > >> -	 * from attempting to load simultaneously, and to prevent a crash kernel
> > >> -	 * from loading over the top of a in use crash kernel.
> > >> -	 *
> > >> -	 * KISS: always take the mutex.
> > >> +	 * from attempting to load simultaneously.
> > >>  	 */
> > >>  	if (!mutex_trylock(&kexec_mutex))
> > >>  		return -EBUSY;
> > >
> > > So kexec_mutex has degenerated to only avoiding simultaneous loading;
> > > should we rename it to reflect that, e.g. kexec_load_mutex?
> > >
> > 
> > It's also serializing crash_get_memory_size() and crash_shrink_memory();
> > more generally it should still be the preferred serialization mechanism,
> > as it's a "proper" lock visible to instrumentation; the atomic variable
> > is a side character for the NMI case.
> 
> You are right. I only checked the code comment in this place. Then this
> patch looks good to me, thx.
> 
> Acked-by: Baoquan He <bhe@redhat.com>

OK, just saw Eric's comment after I replied.
Valentin Schneider June 27, 2022, 12:42 p.m. UTC | #9
On 25/06/22 12:04, Eric W. Biederman wrote:
> I am not particularly fond of this patch as it adds more complexity than
> is necessary to solve the problem.
>
> Calling a spade a spade, PREEMPT_RT's mutex_trylock implementation is
> broken, as it cannot support the use cases of an ordinary mutex_trylock.
> I have not seen (possibly I skimmed too quickly) anywhere in the
> discussion why PREEMPT_RT is not being fixed.  Looking at the code,
> there is enough going on in try_to_take_rt_mutex that I can imagine
> some part of that code is not NMI safe.  So I can believe
> PREEMPT_RT may be unfixably broken.
>

AFAICT same goes for !PREEMPT_RT given the mutex_unlock(); it's a bit
convoluted but you can craft scenarios where the NMI ends up spinning on
mutex->wait_lock that is owned by the interrupted task, e.g.

  CPU0                    CPU1

                          crash_shrink_memory()
                            mutex_lock();
  crash_get_memory_size()
    mutex_lock()
      raw_spin_lock(&lock->wait_lock);
      // Lock acquired
  <NMI>
                            mutex_unlock()
                              <Release lock->owner>;

  // Owner is free at this point so this succeeds
  mutex_trylock();
  // No kexec_crash_image
  mutex_unlock()
    raw_spin_lock(&lock->wait_lock);

>
> At this point I recommend going back to being ``unconventional'' with
> the kexec locking and effectively reverting commit 8c5a1cf0ad3a ("kexec:
> use a mutex for locking rather than xchg()").
>
> That would also mean that we don't have to worry about the lockdep code
> doing something weird in the future and breaking kexec.
>
> Your change, starting to use atomic_cmpxchg, is almost halfway to a
> revert of commit 8c5a1cf0ad3a ("kexec: use a mutex for locking rather
> than xchg()").  So we might as well go the whole way and just document
> that the kexec-on-panic code cannot use conventional kernel locking
> primitives and has to dig deep and build its own.  At which point it
> makes no sense for the rest of the kexec code to use anything different.
>

Hm, I'm a bit torn about that one: ideally I'd prefer to keep "homegrown"
locking primitives to just where they are needed (loading & kexec'ing), but
I'm also not immensely fond of the "hybrid" mutex+cmpxchg approach.

> Eric
Valentin Schneider June 28, 2022, 5:33 p.m. UTC | #10
On 27/06/22 13:42, Valentin Schneider wrote:
> On 25/06/22 12:04, Eric W. Biederman wrote:
>> At this point I recommend going back to being ``unconventional'' with
>> the kexec locking and effectively reverting commit 8c5a1cf0ad3a ("kexec:
>> use a mutex for locking rather than xchg()").
>>
>> That would also mean that we don't have to worry about the lockdep code
>> doing something weird in the future and breaking kexec.
>>
>> Your change, starting to use atomic_cmpxchg, is almost halfway to a
>> revert of commit 8c5a1cf0ad3a ("kexec: use a mutex for locking rather
>> than xchg()").  So we might as well go the whole way and just document
>> that the kexec-on-panic code cannot use conventional kernel locking
>> primitives and has to dig deep and build its own.  At which point it
>> makes no sense for the rest of the kexec code to use anything different.
>>
>
> Hm, I'm a bit torn about that one: ideally I'd prefer to keep "homegrown"
> locking primitives to just where they are needed (loading & kexec'ing), but
> I'm also not immensely fond of the "hybrid" mutex+cmpxchg approach.
>

8c5a1cf0ad3a ("kexec: use a mutex for locking rather than xchg()") was
straightforward enough because it turned

        if (xchg(&lock, 1))
                return -EBUSY;

into

        if (!mutex_trylock(&lock))
                return -EBUSY;

Now, most of the kexec_mutex uses are trylocks, except for:
- crash_get_memory_size()
- crash_shrink_memory()

I really don't want to go down the route of turning those into cmpxchg
try-loops, would it be acceptable to make those use trylocks (i.e. return
-EBUSY if the cmpxchg fails)?
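
For instance, crash_get_memory_size() could become something like the
following (a rough sketch: kexec_lock here stands for the atomic that
would replace kexec_mutex, and the return type would have to become
signed to carry -EBUSY):

	ssize_t crash_get_memory_size(void)
	{
		ssize_t size = 0;

		/* trylock: report contention rather than block */
		if (atomic_cmpxchg_acquire(&kexec_lock, 0, 1))
			return -EBUSY;

		if (crashk_res.end != crashk_res.start)
			size = resource_size(&crashk_res);

		atomic_set_release(&kexec_lock, 0);
		return size;
	}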

Otherwise, we keep the mutexes for functions like those which go nowhere
near an NMI.
Petr Mladek June 29, 2022, 11:55 a.m. UTC | #11
On Tue 2022-06-28 18:33:08, Valentin Schneider wrote:
> On 27/06/22 13:42, Valentin Schneider wrote:
> > On 25/06/22 12:04, Eric W. Biederman wrote:
> >> At this point I recommend going back to being ``unconventional'' with
> >> the kexec locking and effectively reverting commit 8c5a1cf0ad3a ("kexec:
> >> use a mutex for locking rather than xchg()").
> >>
> >> That would also mean that we don't have to worry about the lockdep code
> >> doing something weird in the future and breaking kexec.
> >>
> >> Your change, starting to use atomic_cmpxchg, is almost halfway to a
> >> revert of commit 8c5a1cf0ad3a ("kexec: use a mutex for locking rather
> >> than xchg()").  So we might as well go the whole way and just document
> >> that the kexec-on-panic code cannot use conventional kernel locking
> >> primitives and has to dig deep and build its own.  At which point it
> >> makes no sense for the rest of the kexec code to use anything different.
> >>
> >
> > Hm, I'm a bit torn about that one: ideally I'd prefer to keep "homegrown"
> > locking primitives to just where they are needed (loading & kexec'ing), but
> > I'm also not immensely fond of the "hybrid" mutex+cmpxchg approach.
> >
> 
> 8c5a1cf0ad3a ("kexec: use a mutex for locking rather than xchg()") was
> straightforward enough because it turned
> 
>         if (xchg(&lock, 1))
>                 return -EBUSY;
> 
> into
> 
>         if (!mutex_trylock(&lock))
>                 return -EBUSY;
> 
> Now, most of the kexec_mutex uses are trylocks, except for:
> - crash_get_memory_size()
> - crash_shrink_memory()
> 
> I really don't want to go down the route of turning those into cmpxchg
> try-loops, would it be acceptable to make those use trylocks (i.e. return
> -EBUSY if the cmpxchg fails)?

IMHO, -EBUSY is acceptable for both crash_get_memory_size()
and crash_shrink_memory(). They are used in the sysfs interface.

> Otherwise, we keep the mutexes for functions like those which go nowhere
> near an NMI.

If we go this way, then I would hide the locking behind some wrappers,
like crash_kexec_trylock()/unlock(), that would do both the mutex
and the xchg. The xchg part might be hidden in a separate wrapper,
__crash_kexec_trylock()/unlock() or
crash_kexec_atomic_trylock()/unlock().
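
For example (just a sketch; the names are only the ones suggested above,
building on the crash_kexec_lock atomic from the patch):

	/* NMI-safe part, usable from __crash_kexec() itself: */
	static bool __crash_kexec_trylock(void)
	{
		return atomic_cmpxchg_acquire(&crash_kexec_lock, 0, 1) == 0;
	}

	static void __crash_kexec_unlock(void)
	{
		atomic_set_release(&crash_kexec_lock, 0);
	}

	/* Task-context wrapper: the mutex first, then the atomic: */
	static bool crash_kexec_trylock(void)
	{
		if (!mutex_trylock(&kexec_mutex))
			return false;
		if (!__crash_kexec_trylock()) {
			mutex_unlock(&kexec_mutex);
			return false;
		}
		return true;
	}

	static void crash_kexec_unlock(void)
	{
		__crash_kexec_unlock();
		mutex_unlock(&kexec_mutex);
	}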

Best Regards,
Petr
Valentin Schneider June 29, 2022, 12:23 p.m. UTC | #12
On 29/06/22 13:55, Petr Mladek wrote:
> On Tue 2022-06-28 18:33:08, Valentin Schneider wrote:
>>
>> 8c5a1cf0ad3a ("kexec: use a mutex for locking rather than xchg()") was
>> straightforward enough because it turned
>>
>>         if (xchg(&lock, 1))
>>                 return -EBUSY;
>>
>> into
>>
>>         if (!mutex_trylock(&lock))
>>                 return -EBUSY;
>>
>> Now, most of the kexec_mutex uses are trylocks, except for:
>> - crash_get_memory_size()
>> - crash_shrink_memory()
>>
>> I really don't want to go down the route of turning those into cmpxchg
>> try-loops, would it be acceptable to make those use trylocks (i.e. return
>> -EBUSY if the cmpxchg fails)?
>
> IMHO, -EBUSY is acceptable for both crash_get_memory_size()
> and crash_shrink_memory(). They are used in the sysfs interface.
>
>> Otherwise, we keep the mutexes for functions like those which go nowhere
>> near an NMI.
>
> If we go this way then I would hide the locking into some wrappers,
> like crash_kexec_trylock()/unlock() that would do both mutex
> and xchg. The xchg part might be hidden in a separate wrapper
> __crash_kexec_trylock()/unlock() or
> crash_kexec_atomic_trylock()/unlock().
>

Makes sense, thanks. I've started playing with the trylock/-EBUSY approach;
I'll toss it out if I don't end up hating it.

> Best Regards,
> Petr

Patch

diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index ce6536f1d269..5849a15ae3dd 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -369,6 +369,7 @@  extern int kimage_crash_copy_vmcoreinfo(struct kimage *image);
 
 extern struct kimage *kexec_image;
 extern struct kimage *kexec_crash_image;
+extern atomic_t crash_kexec_lock;
 extern int kexec_load_disabled;
 
 #ifndef kexec_flush_icache_page
diff --git a/kernel/kexec.c b/kernel/kexec.c
index b5e40f069768..73e0df2c608f 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -94,14 +94,20 @@  static int do_kexec_load(unsigned long entry, unsigned long nr_segments,
 	/*
 	 * Because we write directly to the reserved memory region when loading
 	 * crash kernels we need a mutex here to prevent multiple crash kernels
-	 * from attempting to load simultaneously, and to prevent a crash kernel
-	 * from loading over the top of a in use crash kernel.
-	 *
-	 * KISS: always take the mutex.
+	 * from attempting to load simultaneously.
 	 */
 	if (!mutex_trylock(&kexec_mutex))
 		return -EBUSY;
 
+	/*
+	 * Prevent loading a new crash kernel while one is in use.
+	 * See associated comment in __crash_kexec().
+	 */
+	if (atomic_cmpxchg_acquire(&crash_kexec_lock, 0, 1)) {
+		ret = -EBUSY;
+		goto out_unlock_mutex;
+	}
+
 	if (flags & KEXEC_ON_CRASH) {
 		dest_image = &kexec_crash_image;
 		if (kexec_crash_image)
@@ -165,6 +171,8 @@  static int do_kexec_load(unsigned long entry, unsigned long nr_segments,
 
 	kimage_free(image);
 out_unlock:
+	atomic_set_release(&crash_kexec_lock, 0);
+out_unlock_mutex:
 	mutex_unlock(&kexec_mutex);
 	return ret;
 }
diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index 4d34c78334ce..f957109a266c 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -933,6 +933,7 @@  int kimage_load_segment(struct kimage *image,
 
 struct kimage *kexec_image;
 struct kimage *kexec_crash_image;
+atomic_t crash_kexec_lock = ATOMIC_INIT(0);
 int kexec_load_disabled;
 #ifdef CONFIG_SYSCTL
 static struct ctl_table kexec_core_sysctls[] = {
@@ -964,25 +965,26 @@  late_initcall(kexec_core_sysctl_init);
  */
 void __noclone __crash_kexec(struct pt_regs *regs)
 {
-	/* Take the kexec_mutex here to prevent sys_kexec_load
-	 * running on one cpu from replacing the crash kernel
-	 * we are using after a panic on a different cpu.
-	 *
-	 * If the crash kernel was not located in a fixed area
-	 * of memory the xchg(&kexec_crash_image) would be
-	 * sufficient.  But since I reuse the memory...
+	/*
+	 * This should be taking kexec_mutex before doing anything with the
+	 * kexec_crash_image, but this code can be run in NMI context which
+	 * means we can't even trylock. This is circumvented by using an
+	 * atomic variable that is *also* used by the codepaths that take
+	 * the mutex to modify kexec_crash_image.
 	 */
-	if (mutex_trylock(&kexec_mutex)) {
-		if (kexec_crash_image) {
-			struct pt_regs fixed_regs;
-
-			crash_setup_regs(&fixed_regs, regs);
-			crash_save_vmcoreinfo();
-			machine_crash_shutdown(&fixed_regs);
-			machine_kexec(kexec_crash_image);
-		}
-		mutex_unlock(&kexec_mutex);
+	if (atomic_cmpxchg_acquire(&crash_kexec_lock, 0, 1))
+		return;
+
+	if (kexec_crash_image) {
+		struct pt_regs fixed_regs;
+
+		crash_setup_regs(&fixed_regs, regs);
+		crash_save_vmcoreinfo();
+		machine_crash_shutdown(&fixed_regs);
+		machine_kexec(kexec_crash_image);
 	}
+
+	atomic_set_release(&crash_kexec_lock, 0);
 }
 STACK_FRAME_NON_STANDARD(__crash_kexec);
 
diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index 145321a5e798..3faec031cfc9 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -337,6 +337,15 @@  SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, initrd_fd,
 	if (!mutex_trylock(&kexec_mutex))
 		return -EBUSY;
 
+	/*
+	 * Prevent loading a new crash kernel while one is in use.
+	 * See associated comment in __crash_kexec().
+	 */
+	if (atomic_cmpxchg_acquire(&crash_kexec_lock, 0, 1)) {
+		ret = -EBUSY;
+		goto out_mutex_unlock;
+	}
+
 	dest_image = &kexec_image;
 	if (flags & KEXEC_FILE_ON_CRASH) {
 		dest_image = &kexec_crash_image;
@@ -406,6 +415,8 @@  SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, initrd_fd,
 	if ((flags & KEXEC_FILE_ON_CRASH) && kexec_crash_image)
 		arch_kexec_protect_crashkres();
 
+	atomic_set_release(&crash_kexec_lock, 0);
+out_mutex_unlock:
 	mutex_unlock(&kexec_mutex);
 	kimage_free(image);
 	return ret;