diff mbox series

kgdb: Resolve races during kgdb_io_register/unregister_module

Message ID 20200629171529.558003-1-daniel.thompson@linaro.org
State New
Headers show
Series kgdb: Resolve races during kgdb_io_register/unregister_module | expand

Commit Message

Daniel Thompson June 29, 2020, 5:15 p.m. UTC
Currently kgdb_register_callbacks() and kgdb_unregister_callbacks()
are called outside the scope of the kgdb_registration_lock. This
allows them to race with each other. This could do all sorts of crazy
things up to and including dbg_io_ops becoming NULL partway through the
execution of the kgdb trap handler (which isn't allowed and would be
fatal).

Fix this by bringing the trap handler setup and teardown into the scope
of the registration lock.

Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>

---
 kernel/debug/debug_core.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)


base-commit: 9ebcfadb0610322ac537dd7aa5d9cbc2b2894c68
--
2.25.4

Comments

Doug Anderson June 29, 2020, 9:03 p.m. UTC | #1
Hi,

On Mon, Jun 29, 2020 at 10:15 AM Daniel Thompson
<daniel.thompson@linaro.org> wrote:
>

> Currently kgdb_register_callbacks() and kgdb_unregister_callbacks()

> are called outside the scope of the kgdb_registration_lock. This

> allows them to race with each other. This could do all sorts of crazy

> things up to and including dbg_io_ops becoming NULL partway through the

> execution of the kgdb trap handler (which isn't allowed and would be

> fatal).

>

> Fix this by bringing the trap handler setup and teardown into the scope

> of the registration lock.

>

> Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>

> ---

>  kernel/debug/debug_core.c | 8 +++++---

>  1 file changed, 5 insertions(+), 3 deletions(-)

>

> diff --git a/kernel/debug/debug_core.c b/kernel/debug/debug_core.c

> index 9e5934780f41..9799f2c6dc94 100644

> --- a/kernel/debug/debug_core.c

> +++ b/kernel/debug/debug_core.c

> @@ -1117,9 +1117,8 @@ int kgdb_register_io_module(struct kgdb_io *new_dbg_io_ops)

>

>         dbg_io_ops = new_dbg_io_ops;

>

> -       spin_unlock(&kgdb_registration_lock);

> -

>         if (old_dbg_io_ops) {

> +               spin_unlock(&kgdb_registration_lock);

>                 old_dbg_io_ops->deinit();

>                 return 0;

>         }

> @@ -1129,6 +1128,8 @@ int kgdb_register_io_module(struct kgdb_io *new_dbg_io_ops)

>         /* Arm KGDB now. */

>         kgdb_register_callbacks();

>

> +       spin_unlock(&kgdb_registration_lock);


From looking at code paths, I think this is illegal, isn't it?  You're
now calling kgdb_register_callbacks() while holding a spinlock, but:

kgdb_register_callbacks()
-> register_console()
--> console_lock()
---> might_sleep()
----> <boom!>


I'm a little curious about the exact race we're trying to solve.
Calling unregister on an IO module before register even finished seems
like an error on the caller, so I guess it would be calling register
from a 2nd thread for a different IO module while the first thread was
partway through unregistering?  Even that seems awfully sketchy since
you're risking registering a 2nd IO ops while the first is still there
and that's illegal enough that we do a pr_err() for it (though we
don't crash), but let's say we're trying to solve that one.

Looking at it closely, I _think_ the only race in this case is if the
one we're trying to unregister had a deinit() function and we going to
replace it?  If it didn't have a deinit function:

cpu1 (unregister)                 cpu2 (register):
-----------------                 ----------------------
kgdb_unregister_callbacks()
                                  spin_lock() <got>
spin_lock() <blocked>
                                  if (old_dbg_io_ops) <true>
                                    if (has dinit) <false>
                                      print error
                                      spin_unlock()
                                      return -EBUSY
<finish unregister>

The above is fine and is the same thing that would happen if the
whole register function ran before the unregister even started, right?

Also: if the unregister won the race that should also be fine.

So really the problem is this:

cpu1 (unregister)                 cpu2 (register):
-----------------                 ----------------------
kgdb_unregister_callbacks()
                                  spin_lock() <got>
spin_lock() <blocked>
                                  if (old_dbg_io_ops) <true>
                                    if (has dinit) <true>
                                      print Replacing
                                  init new IO ops
                                  spin_unlock()
                                  if (old_dbg_io_ops) <true>
                                    finish deinit of old
                                    return true
WARN_ON() <hits and shouts!>
dbg_io_ops = NULL
spin_unlock()
if (deinit) <true>
  double-call to deinit of old

So in this case we'll hit a WARN_ON(), incorrectly unregister the new
IO ops, and call deinit twice.

-Doug
Daniel Thompson June 30, 2020, 3:05 p.m. UTC | #2
On Mon, Jun 29, 2020 at 02:03:52PM -0700, Doug Anderson wrote:
> Hi,

> 

> On Mon, Jun 29, 2020 at 10:15 AM Daniel Thompson

> <daniel.thompson@linaro.org> wrote:

> >

> > Currently kgdb_register_callbacks() and kgdb_unregister_callbacks()

> > are called outside the scope of the kgdb_registration_lock. This

> > allows them to race with each other. This could do all sorts of crazy

> > things up to and including dbg_io_ops becoming NULL partway through the

> > execution of the kgdb trap handler (which isn't allowed and would be

> > fatal).

> >

> > Fix this by bringing the trap handler setup and teardown into the scope

> > of the registration lock.

> >

> > Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>

> > ---

> >  kernel/debug/debug_core.c | 8 +++++---

> >  1 file changed, 5 insertions(+), 3 deletions(-)

> >

> > diff --git a/kernel/debug/debug_core.c b/kernel/debug/debug_core.c

> > index 9e5934780f41..9799f2c6dc94 100644

> > --- a/kernel/debug/debug_core.c

> > +++ b/kernel/debug/debug_core.c

> > @@ -1117,9 +1117,8 @@ int kgdb_register_io_module(struct kgdb_io *new_dbg_io_ops)

> >

> >         dbg_io_ops = new_dbg_io_ops;

> >

> > -       spin_unlock(&kgdb_registration_lock);

> > -

> >         if (old_dbg_io_ops) {

> > +               spin_unlock(&kgdb_registration_lock);

> >                 old_dbg_io_ops->deinit();

> >                 return 0;

> >         }

> > @@ -1129,6 +1128,8 @@ int kgdb_register_io_module(struct kgdb_io *new_dbg_io_ops)

> >         /* Arm KGDB now. */

> >         kgdb_register_callbacks();

> >

> > +       spin_unlock(&kgdb_registration_lock);

> 

> From looking at code paths, I think this is illegal, isn't it?  You're

> now calling kgdb_register_callbacks() while holding a spinlock, but:

> 

> kgdb_register_callbacks()

> -> register_console()

> --> console_lock()

> ---> might_sleep()

> ----> <boom!>


Thanks.

I very nearly didn't press "Send" yesterday because I was worried I was
rushing it too much (in order to avoid forgetting it ;-) ). Should have
listened to myself!


> I'm a little curious about the exact race we're trying to solve.

> Calling unregister on an IO module before register even finished seems

> like an error on the caller, so I guess it would be calling register

> from a 2nd thread for a different IO module while the first thread was

> partway through unregistering?  Even that seems awfully sketchy since

> you're risking registering a 2nd IO ops while the first is still there

> and that's illegal enough that we do a pr_err() for it (though we

> don't crash), but let's say we're trying to solve that one.


I didn't follow all the possible paths. Utlimately the
(un)register_callbacks() functions use a flag variable without a lock
and that can interact in lots of different ways.

To be honest none are especially likely because the normal case is to
register once during boot and never unregister. However we can trigger
register/unregister from userspace so I think they can happen
in parallel.

Double unregister can lead to some especially nasty schedules...
although they still remain pretty unlikely since we need the double
unregister to coincide with a breakpoint:


kgdb_unregister_callbacks()	kgdb_unregister_callbacks()
  .				.	
  test flag			.
  set flag to 0			.
  .				test flag
  .				spin_lock()
*** kgdb trap ***		  .
  . paranoid dbg_io_ops check     .
  .				dbg_io_ops = NULL
  . stop other CPUs
  . try to use NULL dbg_io_ops


I have drawn the kgdb trap in the first column because otherwise things
get too wide but the trap could trigger on any CPU in the system and
provoke the problem.


> 

> Looking at it closely, I _think_ the only race in this case is if the

> one we're trying to unregister had a deinit() function and we going to

> replace it?  If it didn't have a deinit function:

> 

> cpu1 (unregister)                 cpu2 (register):

> -----------------                 ----------------------

> kgdb_unregister_callbacks()

>                                   spin_lock() <got>

> spin_lock() <blocked>

>                                   if (old_dbg_io_ops) <true>

>                                     if (has dinit) <false>

>                                       print error

>                                       spin_unlock()

>                                       return -EBUSY

> <finish unregister>

> 

> The above is fine and is the same thing that would happen if the

> whole register function ran before the unregister even started, right?

> 

> Also: if the unregister won the race that should also be fine.

> 

> So really the problem is this:

> 

> cpu1 (unregister)                 cpu2 (register):

> -----------------                 ----------------------

> kgdb_unregister_callbacks()

>                                   spin_lock() <got>

> spin_lock() <blocked>

>                                   if (old_dbg_io_ops) <true>

>                                     if (has dinit) <true>

>                                       print Replacing

>                                   init new IO ops

>                                   spin_unlock()

>                                   if (old_dbg_io_ops) <true>

>                                     finish deinit of old

>                                     return true

> WARN_ON() <hits and shouts!>

> dbg_io_ops = NULL

> spin_unlock()

> if (deinit) <true>

>   double-call to deinit of old

> 

> So in this case we'll hit a WARN_ON(), incorrectly unregister the new

> IO ops, and call deinit twice.


To be honest I was simply working on "it is racy" and "there's not a
good reason to allow that", especially as we start to develop tools to
bring races to the surfaces someone will yell at us about it sooner or
later ;-).

Of course, implementing it correctly would have been better...


Daniel.
Doug Anderson June 30, 2020, 10:26 p.m. UTC | #3
Hi,

On Tue, Jun 30, 2020 at 8:05 AM Daniel Thompson
<daniel.thompson@linaro.org> wrote:
>

> On Mon, Jun 29, 2020 at 02:03:52PM -0700, Doug Anderson wrote:

> > Hi,

> >

> > On Mon, Jun 29, 2020 at 10:15 AM Daniel Thompson

> > <daniel.thompson@linaro.org> wrote:

> > >

> > > Currently kgdb_register_callbacks() and kgdb_unregister_callbacks()

> > > are called outside the scope of the kgdb_registration_lock. This

> > > allows them to race with each other. This could do all sorts of crazy

> > > things up to and including dbg_io_ops becoming NULL partway through the

> > > execution of the kgdb trap handler (which isn't allowed and would be

> > > fatal).

> > >

> > > Fix this by bringing the trap handler setup and teardown into the scope

> > > of the registration lock.

> > >

> > > Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>

> > > ---

> > >  kernel/debug/debug_core.c | 8 +++++---

> > >  1 file changed, 5 insertions(+), 3 deletions(-)

> > >

> > > diff --git a/kernel/debug/debug_core.c b/kernel/debug/debug_core.c

> > > index 9e5934780f41..9799f2c6dc94 100644

> > > --- a/kernel/debug/debug_core.c

> > > +++ b/kernel/debug/debug_core.c

> > > @@ -1117,9 +1117,8 @@ int kgdb_register_io_module(struct kgdb_io *new_dbg_io_ops)

> > >

> > >         dbg_io_ops = new_dbg_io_ops;

> > >

> > > -       spin_unlock(&kgdb_registration_lock);

> > > -

> > >         if (old_dbg_io_ops) {

> > > +               spin_unlock(&kgdb_registration_lock);

> > >                 old_dbg_io_ops->deinit();

> > >                 return 0;

> > >         }

> > > @@ -1129,6 +1128,8 @@ int kgdb_register_io_module(struct kgdb_io *new_dbg_io_ops)

> > >         /* Arm KGDB now. */

> > >         kgdb_register_callbacks();

> > >

> > > +       spin_unlock(&kgdb_registration_lock);

> >

> > From looking at code paths, I think this is illegal, isn't it?  You're

> > now calling kgdb_register_callbacks() while holding a spinlock, but:

> >

> > kgdb_register_callbacks()

> > -> register_console()

> > --> console_lock()

> > ---> might_sleep()

> > ----> <boom!>

>

> Thanks.

>

> I very nearly didn't press "Send" yesterday because I was worried I was

> rushing it too much (in order to avoid forgetting it ;-) ). Should have

> listened to myself!

>

>

> > I'm a little curious about the exact race we're trying to solve.

> > Calling unregister on an IO module before register even finished seems

> > like an error on the caller, so I guess it would be calling register

> > from a 2nd thread for a different IO module while the first thread was

> > partway through unregistering?  Even that seems awfully sketchy since

> > you're risking registering a 2nd IO ops while the first is still there

> > and that's illegal enough that we do a pr_err() for it (though we

> > don't crash), but let's say we're trying to solve that one.

>

> I didn't follow all the possible paths. Utlimately the

> (un)register_callbacks() functions use a flag variable without a lock

> and that can interact in lots of different ways.

>

> To be honest none are especially likely because the normal case is to

> register once during boot and never unregister. However we can trigger

> register/unregister from userspace so I think they can happen

> in parallel.


This is for kgdboc or one of the other IO modules?  I do know that, at
least for kgdboc, we have the "config_mutex".  I won't promise that
there are no bugs there but in the very least it should mostly prevent
a host of these types of issues.  ...so I guess you'd have to in
parallel be spamming a register of a non kgdboc IO module together
with an unregister of kgdboc?


> Double unregister can lead to some especially nasty schedules...

> although they still remain pretty unlikely since we need the double

> unregister to coincide with a breakpoint:

>

>

> kgdb_unregister_callbacks()     kgdb_unregister_callbacks()

>   .                             .

>   test flag                     .

>   set flag to 0                 .

>   .                             test flag

>   .                             spin_lock()

> *** kgdb trap ***                 .

>   . paranoid dbg_io_ops check     .

>   .                             dbg_io_ops = NULL

>   . stop other CPUs

>   . try to use NULL dbg_io_ops

>

>

> I have drawn the kgdb trap in the first column because otherwise things

> get too wide but the trap could trigger on any CPU in the system and

> provoke the problem.

>

>

> >

> > Looking at it closely, I _think_ the only race in this case is if the

> > one we're trying to unregister had a deinit() function and we going to

> > replace it?  If it didn't have a deinit function:

> >

> > cpu1 (unregister)                 cpu2 (register):

> > -----------------                 ----------------------

> > kgdb_unregister_callbacks()

> >                                   spin_lock() <got>

> > spin_lock() <blocked>

> >                                   if (old_dbg_io_ops) <true>

> >                                     if (has dinit) <false>

> >                                       print error

> >                                       spin_unlock()

> >                                       return -EBUSY

> > <finish unregister>

> >

> > The above is fine and is the same thing that would happen if the

> > whole register function ran before the unregister even started, right?

> >

> > Also: if the unregister won the race that should also be fine.

> >

> > So really the problem is this:

> >

> > cpu1 (unregister)                 cpu2 (register):

> > -----------------                 ----------------------

> > kgdb_unregister_callbacks()

> >                                   spin_lock() <got>

> > spin_lock() <blocked>

> >                                   if (old_dbg_io_ops) <true>

> >                                     if (has dinit) <true>

> >                                       print Replacing

> >                                   init new IO ops

> >                                   spin_unlock()

> >                                   if (old_dbg_io_ops) <true>

> >                                     finish deinit of old

> >                                     return true

> > WARN_ON() <hits and shouts!>

> > dbg_io_ops = NULL

> > spin_unlock()

> > if (deinit) <true>

> >   double-call to deinit of old

> >

> > So in this case we'll hit a WARN_ON(), incorrectly unregister the new

> > IO ops, and call deinit twice.

>

> To be honest I was simply working on "it is racy" and "there's not a

> good reason to allow that", especially as we start to develop tools to

> bring races to the surfaces someone will yell at us about it sooner or

> later ;-).

>

> Of course, implementing it correctly would have been better...


Yeah, still wouldn't hurt to try to figure out how to make it cleaner.  :-)

-Doug
diff mbox series

Patch

diff --git a/kernel/debug/debug_core.c b/kernel/debug/debug_core.c
index 9e5934780f41..9799f2c6dc94 100644
--- a/kernel/debug/debug_core.c
+++ b/kernel/debug/debug_core.c
@@ -1117,9 +1117,8 @@  int kgdb_register_io_module(struct kgdb_io *new_dbg_io_ops)

 	dbg_io_ops = new_dbg_io_ops;

-	spin_unlock(&kgdb_registration_lock);
-
 	if (old_dbg_io_ops) {
+		spin_unlock(&kgdb_registration_lock);
 		old_dbg_io_ops->deinit();
 		return 0;
 	}
@@ -1129,6 +1128,8 @@  int kgdb_register_io_module(struct kgdb_io *new_dbg_io_ops)
 	/* Arm KGDB now. */
 	kgdb_register_callbacks();

+	spin_unlock(&kgdb_registration_lock);
+
 	if (kgdb_break_asap &&
 	    (!dbg_is_early || IS_ENABLED(CONFIG_ARCH_HAS_EARLY_DEBUG)))
 		kgdb_initial_breakpoint();
@@ -1147,13 +1148,14 @@  void kgdb_unregister_io_module(struct kgdb_io *old_dbg_io_ops)
 {
 	BUG_ON(kgdb_connected);

+	spin_lock(&kgdb_registration_lock);
+
 	/*
 	 * KGDB is no longer able to communicate out, so
 	 * unregister our callbacks and reset state.
 	 */
 	kgdb_unregister_callbacks();

-	spin_lock(&kgdb_registration_lock);

 	WARN_ON_ONCE(dbg_io_ops != old_dbg_io_ops);
 	dbg_io_ops = NULL;