[RFC,v3,1/9] sysrq: Implement __handle_sysrq_nolock to avoid recursive locking in kdb

Message ID 1398443370-12668-2-git-send-email-daniel.thompson@linaro.org
State New

Commit Message

Daniel Thompson April 25, 2014, 4:29 p.m. UTC
If kdb is triggered using SysRq-g then any use of the sr command results
in the SysRq key table lock being recursively acquired, killing the debug
session. This patch resolves the problem by introducing a _nolock
alternative for __handle_sysrq.

Strictly speaking, this approach risks racing on the key table when kdb is
triggered by something other than SysRq-g; however, in that case any other
CPU involved should release the spin lock before kgdb parks the slave
CPUs.

Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
---
 drivers/tty/sysrq.c         | 11 ++++++++---
 include/linux/sysrq.h       |  1 +
 kernel/debug/kdb/kdb_main.c |  2 +-
 3 files changed, 10 insertions(+), 4 deletions(-)
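
The locking problem described above can be sketched in userspace. This is only an illustrative model, not the kernel code: `toy_lock`, `handle_key()`, `handle_key_nolock()`, `debugger_sr_locked()`, `debugger_sr_nolock()` and `setup()` are hypothetical names, and the toy lock reports a recursive acquisition with -EDEADLK where a real spinlock would simply hang. The 'g' handler stands in for kdb running the sr command while the table lock is still held.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy lock (hypothetical): reports recursion instead of spinning forever. */
struct toy_lock {
	bool held;
};

static int toy_lock_acquire(struct toy_lock *l)
{
	if (l->held)
		return -EDEADLK;	/* a real spinlock would deadlock here */
	l->held = true;
	return 0;
}

static void toy_lock_release(struct toy_lock *l)
{
	l->held = false;
}

typedef int (*key_handler_t)(int key);

/* Stand-ins for sysrq_key_table_lock and sysrq_key_table. */
static struct toy_lock table_lock;
static key_handler_t key_table[256];

/* Looks up and runs the handler; the caller is responsible for exclusion. */
static int handle_key_nolock(int key)
{
	key_handler_t op = key_table[(unsigned char)key];

	return op ? op(key) : -ENOENT;
}

/* Locked wrapper, the same shape as __handle_sysrq() after the patch. */
static int handle_key(int key)
{
	int ret = toy_lock_acquire(&table_lock);

	if (ret)
		return ret;
	ret = handle_key_nolock(key);
	toy_lock_release(&table_lock);
	return ret;
}

static int help_handler(int key)
{
	(void)key;
	return 0;
}

/* Model of "sr h" issued from the debugger: runs with table_lock held. */
static int debugger_sr_locked(int key)
{
	(void)key;
	return handle_key('h');		/* old behaviour: takes the lock again */
}

static int debugger_sr_nolock(int key)
{
	(void)key;
	return handle_key_nolock('h');	/* patched behaviour: skips the lock */
}

/* Install the help handler and the chosen model of the 'g' handler. */
static void setup(key_handler_t g_handler)
{
	key_table['h'] = help_handler;
	key_table['g'] = g_handler;
}
```

With `setup(debugger_sr_locked)`, `handle_key('g')` fails with -EDEADLK because the nested call tries to take the lock it already holds; with `setup(debugger_sr_nolock)` the same call succeeds. The model is single-threaded, so it says nothing about the race on the table that the commit message accepts as a trade-off.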

Comments

Steven Rostedt April 25, 2014, 4:45 p.m. UTC | #1
On Fri, 25 Apr 2014 17:29:22 +0100
Daniel Thompson <daniel.thompson@linaro.org> wrote:

> If kdb is triggered using SysRq-g then any use of the sr command results
> in the SysRq key table lock being recursively acquired, killing the debug
> session. This patch resolves the problem by introducing a _nolock
> alternative for __handle_sysrq.
> 
> Strictly speaking this approach risks racing on the key table when kdb is
> triggered by something other than SysRq-g however in that case any other
> CPU involved should release the spin lock before kgdb parks the slave
> CPUs.

Is that case documented somewhere in the code comments?

-- Steve

Daniel Thompson April 28, 2014, 10:24 a.m. UTC | #2
On 25/04/14 17:45, Steven Rostedt wrote:
> On Fri, 25 Apr 2014 17:29:22 +0100
> Daniel Thompson <daniel.thompson@linaro.org> wrote:
> 
>> If kdb is triggered using SysRq-g then any use of the sr command results
>> in the SysRq key table lock being recursively acquired, killing the debug
>> session. This patch resolves the problem by introducing a _nolock
>> alternative for __handle_sysrq.
>>
>> Strictly speaking this approach risks racing on the key table when kdb is
>> triggered by something other than SysRq-g however in that case any other
>> CPU involved should release the spin lock before kgdb parks the slave
>> CPUs.
> 
> Is that case documented somewhere in the code comments?

Perhaps not near enough to the _nolock but the primary bit of comment is
here (and in same file as kdb_sr).
--- cut here ---
 * kdb_main_loop - After initial setup and assignment of the
 *	controlling cpu, all cpus are in this loop.  One cpu is in
 *	control and will issue the kdb prompt, the others will spin
 *	until 'go' or cpu switch.
--- cut here ---

The mechanism kgdb uses to quiesce other CPUs means other CPUs cannot be
in irqsave critical sections.

Colin Cross April 28, 2014, 5:44 p.m. UTC | #3
On Mon, Apr 28, 2014 at 3:24 AM, Daniel Thompson
<daniel.thompson@linaro.org> wrote:
> On 25/04/14 17:45, Steven Rostedt wrote:
>> On Fri, 25 Apr 2014 17:29:22 +0100
>> Daniel Thompson <daniel.thompson@linaro.org> wrote:
>>
>>> If kdb is triggered using SysRq-g then any use of the sr command results
>>> in the SysRq key table lock being recursively acquired, killing the debug
>>> session. This patch resolves the problem by introducing a _nolock
>>> alternative for __handle_sysrq.
>>>
>>> Strictly speaking this approach risks racing on the key table when kdb is
>>> triggered by something other than SysRq-g however in that case any other
>>> CPU involved should release the spin lock before kgdb parks the slave
>>> CPUs.
>>
>> Is that case documented somewhere in the code comments?
>
> Perhaps not near enough to the _nolock but the primary bit of comment is
> here (and in same file as kdb_sr).
> --- cut here ---
>  * kdb_main_loop - After initial setup and assignment of the
>  *      controlling cpu, all cpus are in this loop.  One cpu is in
>  *      control and will issue the kdb prompt, the others will spin
>  *      until 'go' or cpu switch.
> --- cut here ---
>
> The mechanism kgdb uses to quiesce other CPUs means other CPUs cannot be
> in irqsave critical sections.
>
>

One of the advantages of FIQ debugger is that it can be triggered from
an FIQ (NMI for those in x86 land), and Jason and I have discussed
using FIQs for kgdb to allow interrupting cpus stuck in critical
sections.  If that gets implemented the above assumption will no
longer be correct.

Daniel Thompson April 28, 2014, 8:12 p.m. UTC | #4
On 28/04/14 18:44, Colin Cross wrote:
>>> Is that case documented somewhere in the code comments?
>>
>> Perhaps not near enough to the _nolock but the primary bit of comment is
>> here (and in same file as kdb_sr).
>> --- cut here ---
>>  * kdb_main_loop - After initial setup and assignment of the
>>  *      controlling cpu, all cpus are in this loop.  One cpu is in
>>  *      control and will issue the kdb prompt, the others will spin
>>  *      until 'go' or cpu switch.
>> --- cut here ---
>>
>> The mechanism kgdb uses to quiesce other CPUs means other CPUs cannot be
>> in irqsave critical sections.
>>
>>
> 
> One of the advantages of FIQ debugger is that it can be triggered from
> an FIQ (NMI for those in x86 land), and Jason and I have discussed
> using FIQs for kgdb to allow interrupting cpus stuck in critical
> sections.  If that gets implemented the above assumption will no
> longer be correct.

Quite so (I've got Anton's old FIQ patches running on latest kernel and
am trying to port to a GICv2-without-trustzone qemu model I've written
in order to kick the idea about a bit on an ARM multi-arch kernel).

It has therefore pained me a little not to cover this case completely
in the patch. As posted, I deliberately ignore the problem. In
this particular case the SysRq table is so infrequently updated that the
chances of a badly timed NMI are vanishingly small and, at that point,
even if we did actually hit that tiny window it's *still* better to have
the new behaviour (risk of race) than the old behaviour (guaranteed
deadlock).

I'd very much welcome other ideas (I have tried out quite a few in my
head but none solve the problem of NMI "gratuitously" hitting critical
sections). However, when NMI/FIQ finally comes along I'd be tempted to
borrow the "bounce to normal interrupt mode" idea from FIQ debugger and
ensure commands like "sr" do not run from the NMI handler.


Daniel.

Daniel Thompson April 29, 2014, 8:59 a.m. UTC | #5
On 28/04/14 18:44, Colin Cross wrote:
>>> Is that case documented somewhere in the code comments?
>>
>> Perhaps not near enough to the _nolock but the primary bit of comment is
>> here (and in same file as kdb_sr).
>> --- cut here ---
>>  * kdb_main_loop - After initial setup and assignment of the
>>  *      controlling cpu, all cpus are in this loop.  One cpu is in
>>  *      control and will issue the kdb prompt, the others will spin
>>  *      until 'go' or cpu switch.
>> --- cut here ---
>>
>> The mechanism kgdb uses to quiesce other CPUs means other CPUs cannot be
>> in irqsave critical sections.
>>
>>
> 
> One of the advantages of FIQ debugger is that it can be triggered from
> an FIQ (NMI for those in x86 land), and Jason and I have discussed
> using FIQs for kgdb to allow interrupting cpus stuck in critical
> sections.  If that gets implemented the above assumption will no
> longer be correct.

Reviewing this I realized I missed one of the most critical points in
the above.

Today kdb, even if triggered by FIQ/NMI, would still be likely to wedge
waiting for the IPI interrupts to be delivered to other processors.

Did you and Jason discuss getting the active CPU to quiesce the other
processors using FIQ/NMI, or allowing the active CPU to time out while
waiting for them to stop?


Daniel.

Colin Cross April 29, 2014, 4:33 p.m. UTC | #6
On Tue, Apr 29, 2014 at 1:59 AM, Daniel Thompson
<daniel.thompson@linaro.org> wrote:
> On 28/04/14 18:44, Colin Cross wrote:
>>>> Is that case documented somewhere in the code comments?
>>>
>>> Perhaps not near enough to the _nolock but the primary bit of comment is
>>> here (and in same file as kdb_sr).
>>> --- cut here ---
>>>  * kdb_main_loop - After initial setup and assignment of the
>>>  *      controlling cpu, all cpus are in this loop.  One cpu is in
>>>  *      control and will issue the kdb prompt, the others will spin
>>>  *      until 'go' or cpu switch.
>>> --- cut here ---
>>>
>>> The mechanism kgdb uses to quiesce other CPUs means other CPUs cannot be
>>> in irqsave critical sections.
>>>
>>>
>>
>> One of the advantages of FIQ debugger is that it can be triggered from
>> an FIQ (NMI for those in x86 land), and Jason and I have discussed
>> using FIQs for kgdb to allow interrupting cpus stuck in critical
>> sections.  If that gets implemented the above assumption will no
>> longer be correct.
>
> Reviewing this I realized I missed one of the most critical points in
> the above.
>
> Today kdb, even if triggered by FIQ/NMI, would still be likely to wedge
> waiting for the IPI interrupts to be delivered to other processors.
>
> Did you and Jason discuss getting the active CPU to quiesce the other
> processors using FIQ/NMI, or allowing the active CPU to time out while
> waiting for them to stop?
>
>
> Daniel.

Yes, all cpus would have to get an FIQ/NMI.

Patch

diff --git a/drivers/tty/sysrq.c b/drivers/tty/sysrq.c
index ce396ec..7b47b2d 100644
--- a/drivers/tty/sysrq.c
+++ b/drivers/tty/sysrq.c
@@ -505,14 +505,12 @@  static void __sysrq_put_key_op(int key, struct sysrq_key_op *op_p)
                 sysrq_key_table[i] = op_p;
 }
 
-void __handle_sysrq(int key, bool check_mask)
+void __handle_sysrq_nolock(int key, bool check_mask)
 {
 	struct sysrq_key_op *op_p;
 	int orig_log_level;
 	int i;
-	unsigned long flags;
 
-	spin_lock_irqsave(&sysrq_key_table_lock, flags);
 	/*
 	 * Raise the apparent loglevel to maximum so that the sysrq header
 	 * is shown to provide the user with positive feedback.  We do not
@@ -554,6 +552,13 @@  void __handle_sysrq(int key, bool check_mask)
 		printk("\n");
 		console_loglevel = orig_log_level;
 	}
+}
+
+void __handle_sysrq(int key, bool check_mask)
+{
+	unsigned long flags;
+	spin_lock_irqsave(&sysrq_key_table_lock, flags);
+	__handle_sysrq_nolock(key, check_mask);
 	spin_unlock_irqrestore(&sysrq_key_table_lock, flags);
 }
 
diff --git a/include/linux/sysrq.h b/include/linux/sysrq.h
index 387fa7d..1d51d64 100644
--- a/include/linux/sysrq.h
+++ b/include/linux/sysrq.h
@@ -44,6 +44,7 @@  struct sysrq_key_op {
 
 void handle_sysrq(int key);
 void __handle_sysrq(int key, bool check_mask);
+void __handle_sysrq_nolock(int key, bool check_mask);
 int register_sysrq_key(int key, struct sysrq_key_op *op);
 int unregister_sysrq_key(int key, struct sysrq_key_op *op);
 struct sysrq_key_op *__sysrq_get_key_op(int key);
diff --git a/kernel/debug/kdb/kdb_main.c b/kernel/debug/kdb/kdb_main.c
index 0b097c8..f39f926 100644
--- a/kernel/debug/kdb/kdb_main.c
+++ b/kernel/debug/kdb/kdb_main.c
@@ -1924,7 +1924,7 @@  static int kdb_sr(int argc, const char **argv)
 	if (argc != 1)
 		return KDB_ARGCOUNT;
 	kdb_trap_printk++;
-	__handle_sysrq(*argv[1], false);
+	__handle_sysrq_nolock(*argv[1], false);
 	kdb_trap_printk--;
 
 	return 0;