diff mbox

[RFC,v2] arm64: kgdb: fix single stepping

Message ID 1411732453-5437-1-git-send-email-takahiro.akashi@linaro.org
State New
Headers show

Commit Message

AKASHI Takahiro Sept. 26, 2014, 11:54 a.m. UTC
I tried to verify kgdb in vanilla kernel on fast model, but it seems that
the single stepping with kgdb doesn't work correctly since its first
appearance at v3.15.

On v3.15, 'stepi' command after breaking the kernel at some breakpoint
steps forward to the next instruction, but the succeeding 'stepi' never
goes beyond that.
On v3.16, 'stepi' moves forward and stops at the next instruction just
after enable_dbg in el1_dbg, and never goes beyond that. This variance of
behavior seems to come in with the following patch in v3.16:

    commit 2a2830703a23 ("arm64: debug: avoid accessing mdscr_el1 on fault
    paths where possible")

This patch
(1) moves kgdb_disable_single_step() from 'c' command handling to single
    step handler.
    This makes sure that single stepping gets effective at every 's' command.
    Please note that, under the current implementation, single step bit in
    spsr, which is cleared by the first single stepping, will not be set
    again for the consecutive 's' commands because single step bit in mdscr
    is still kept on (that is, kernel_active_single_step() in
    kgdb_arch_handle_exception() is true).
(2) removes 'enable_dbg' in el1_dbg.
    Single step bit in mdscr is turned on in do_handle_exception()->
    kgdb_handle_expection() before returning to debugged context, and if
    debug exception is enabled in el1_dbg, we will see unexpected single-
    stepping in el1_dbg.
(3) masks interrupts while single-stepping one instruction.
    If an interrupt is caught during processing a single-stepping, debug
    exception is unintentionally enabled by el1_irq's 'enable_dbg' before
    returning to debugged context.
    Thus, like in (2), we will see unexpected single-stepping in el1_irq.

Basically (1) is for v3.15, (2) and (3) with (1) for v3.16.

With those changes, we will see another problem if a breakpoint is set
at interrupt-sensible places, like gic_handle_irq():

    KGDB: re-enter error: breakpoint removed ffffffc000081258
    ------------[ cut here ]------------
    WARNING: CPU: 0 PID: 650 at kernel/debug/debug_core.c:435
					kgdb_handle_exception+0x1dc/0x1f4()
    Modules linked in:
    CPU: 0 PID: 650 Comm: sh Not tainted 3.17.0-rc2+ #177
    Call trace:
    [<ffffffc000087fac>] dump_backtrace+0x0/0x130
    [<ffffffc0000880ec>] show_stack+0x10/0x1c
    [<ffffffc0004d683c>] dump_stack+0x74/0xb8
    [<ffffffc0000ab824>] warn_slowpath_common+0x8c/0xb4
    [<ffffffc0000ab90c>] warn_slowpath_null+0x14/0x20
    [<ffffffc000121bfc>] kgdb_handle_exception+0x1d8/0x1f4
    [<ffffffc000092ffc>] kgdb_brk_fn+0x18/0x28
    [<ffffffc0000821c8>] brk_handler+0x9c/0xe8
    [<ffffffc0000811e8>] do_debug_exception+0x3c/0xac
    Exception stack(0xffffffc07e027650 to 0xffffffc07e027770)
    ...
    [<ffffffc000083cac>] el1_dbg+0x14/0x68
    [<ffffffc00012178c>] kgdb_cpu_enter+0x464/0x5c0
    [<ffffffc000121bb4>] kgdb_handle_exception+0x190/0x1f4
    [<ffffffc000092ffc>] kgdb_brk_fn+0x18/0x28
    [<ffffffc0000821c8>] brk_handler+0x9c/0xe8
    [<ffffffc0000811e8>] do_debug_exception+0x3c/0xac
    Exception stack(0xffffffc07e027ac0 to 0xffffffc07e027be0)
    ...
    [<ffffffc000083cac>] el1_dbg+0x14/0x68
    [<ffffffc00032e4b4>] __handle_sysrq+0x11c/0x190
    [<ffffffc00032e93c>] write_sysrq_trigger+0x4c/0x60
    [<ffffffc0001e7d58>] proc_reg_write+0x54/0x84
    [<ffffffc000192fa4>] vfs_write+0x98/0x1c8
    [<ffffffc0001939b0>] SyS_write+0x40/0xa0

Once some interrupt occurs, a breakpoint at gic_handle_irq() triggers kgdb.
Kgdb then calls kgdb_roundup_cpus() to sync with other cpus.
Current kgdb_roundup_cpus() unmasks interrupts temporarily to
use smp_call_function().
This eventually allows another interrupt to occur and likely results in
hitting a breakpoint at gic_handle_irq() again since debug exception is
always enabled in el1_irq.

We can avoid this issue by specifying "nokgdbroundup" in kernel parameter,
but this will also leave other cpus be in unknown state in terms of kgdb,
and may result in interfering with kgdb activity.

Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
---
 arch/arm64/kernel/entry.S |    1 -
 arch/arm64/kernel/kgdb.c  |   29 +++++++++++++++++++----------
 2 files changed, 19 insertions(+), 11 deletions(-)

Comments

Will Deacon Oct. 3, 2014, 4:03 p.m. UTC | #1
Hi Akashi,

On Fri, Sep 26, 2014 at 12:54:13PM +0100, AKASHI Takahiro wrote:
> I tried to verify kgdb in vanilla kernel on fast model, but it seems that
> the single stepping with kgdb doesn't work correctly since its first
> appearance at v3.15.
> 
> On v3.15, 'stepi' command after breaking the kernel at some breakpoint
> steps forward to the next instruction, but the succeeding 'stepi' never
> goes beyond that.
> On v3.16, 'stepi' moves forward and stops at the next instruction just
> after enable_dbg in el1_dbg, and never goes beyond that. This variance of
> behavior seems to come in with the following patch in v3.16:
> 
>     commit 2a2830703a23 ("arm64: debug: avoid accessing mdscr_el1 on fault
>     paths where possible")
> 
> This patch
> (1) moves kgdb_disable_single_step() from 'c' command handling to single
>     step handler.
>     This makes sure that single stepping gets effective at every 's' command.
>     Please note that, under the current implementation, single step bit in
>     spsr, which is cleared by the first single stepping, will not be set
>     again for the consecutive 's' commands because single step bit in mdscr
>     is still kept on (that is, kernel_active_single_step() in
>     kgdb_arch_handle_exception() is true).
> (2) removes 'enable_dbg' in el1_dbg.
>     Single step bit in mdscr is turned on in do_handle_exception()->
>     kgdb_handle_expection() before returning to debugged context, and if
>     debug exception is enabled in el1_dbg, we will see unexpected single-
>     stepping in el1_dbg.
> (3) masks interrupts while single-stepping one instruction.
>     If an interrupt is caught during processing a single-stepping, debug
>     exception is unintentionally enabled by el1_irq's 'enable_dbg' before
>     returning to debugged context.
>     Thus, like in (2), we will see unexpected single-stepping in el1_irq.
> 
> Basically (1) is for v3.15, (2) and (3) with (1) for v3.16.
> 
> With those changes, we will see another problem if a breakpoint is set
> at interrupt-sensible places, like gic_handle_irq():

So it seems to me like kgdb is a complete mess in this area. The low-level
debug exception code for arm64 will single-step *into* interrupt handlers. I
believe that this is the correct behaviour, as otherwise we're artifically
restricting what you can and can't debug (for example, leaving debug
exceptions masked on the interrupt path means that you can't put breakpoints
in interrupt handlers).

>     KGDB: re-enter error: breakpoint removed ffffffc000081258
>     ------------[ cut here ]------------
>     WARNING: CPU: 0 PID: 650 at kernel/debug/debug_core.c:435
> 					kgdb_handle_exception+0x1dc/0x1f4()
>     Modules linked in:
>     CPU: 0 PID: 650 Comm: sh Not tainted 3.17.0-rc2+ #177
>     Call trace:
>     [<ffffffc000087fac>] dump_backtrace+0x0/0x130
>     [<ffffffc0000880ec>] show_stack+0x10/0x1c
>     [<ffffffc0004d683c>] dump_stack+0x74/0xb8
>     [<ffffffc0000ab824>] warn_slowpath_common+0x8c/0xb4
>     [<ffffffc0000ab90c>] warn_slowpath_null+0x14/0x20
>     [<ffffffc000121bfc>] kgdb_handle_exception+0x1d8/0x1f4
>     [<ffffffc000092ffc>] kgdb_brk_fn+0x18/0x28
>     [<ffffffc0000821c8>] brk_handler+0x9c/0xe8
>     [<ffffffc0000811e8>] do_debug_exception+0x3c/0xac
>     Exception stack(0xffffffc07e027650 to 0xffffffc07e027770)
>     ...
>     [<ffffffc000083cac>] el1_dbg+0x14/0x68
>     [<ffffffc00012178c>] kgdb_cpu_enter+0x464/0x5c0
>     [<ffffffc000121bb4>] kgdb_handle_exception+0x190/0x1f4
>     [<ffffffc000092ffc>] kgdb_brk_fn+0x18/0x28
>     [<ffffffc0000821c8>] brk_handler+0x9c/0xe8
>     [<ffffffc0000811e8>] do_debug_exception+0x3c/0xac
>     Exception stack(0xffffffc07e027ac0 to 0xffffffc07e027be0)
>     ...
>     [<ffffffc000083cac>] el1_dbg+0x14/0x68
>     [<ffffffc00032e4b4>] __handle_sysrq+0x11c/0x190
>     [<ffffffc00032e93c>] write_sysrq_trigger+0x4c/0x60
>     [<ffffffc0001e7d58>] proc_reg_write+0x54/0x84
>     [<ffffffc000192fa4>] vfs_write+0x98/0x1c8
>     [<ffffffc0001939b0>] SyS_write+0x40/0xa0
> 
> Once some interrupt occurs, a breakpoint at gic_handle_irq() triggers kgdb.
> Kgdb then calls kgdb_roundup_cpus() to sync with other cpus.
> Current kgdb_roundup_cpus() unmasks interrupts temporarily to
> use smp_call_function().
> This eventually allows another interrupt to occur and likely results in
> hitting a breakpoint at gic_handle_irq() again since debug exception is
> always enabled in el1_irq.
> 
> We can avoid this issue by specifying "nokgdbroundup" in kernel parameter,
> but this will also leave other cpus be in unknown state in terms of kgdb,
> and may result in interfering with kgdb activity.

Yuck. This really sounds like kgdb is broken in its SMP synchronisation
for arm64. On x86, they use a NMI and powerpc uses an IPI which can run
with irqs disabled. Since we don't have an NMI, how about we do the
following to avoid the panic?

  (1) Change our kgdb_roundup_cpus to use smp_call_function_single_async,
      which will avoid the need to enable interrupts

  (2) Introduce a timeout into the waiting loop in kgdb_cpu_enter, where
      we spin on &slaves_in_kgdb and warn if the timeout expires.

Will
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
AKASHI Takahiro Oct. 6, 2014, 4:59 a.m. UTC | #2
On 10/04/2014 01:03 AM, Will Deacon wrote:
> Hi Akashi,
>
> On Fri, Sep 26, 2014 at 12:54:13PM +0100, AKASHI Takahiro wrote:
>> I tried to verify kgdb in vanilla kernel on fast model, but it seems that
>> the single stepping with kgdb doesn't work correctly since its first
>> appearance at v3.15.
>>
>> On v3.15, 'stepi' command after breaking the kernel at some breakpoint
>> steps forward to the next instruction, but the succeeding 'stepi' never
>> goes beyond that.
>> On v3.16, 'stepi' moves forward and stops at the next instruction just
>> after enable_dbg in el1_dbg, and never goes beyond that. This variance of
>> behavior seems to come in with the following patch in v3.16:
>>
>>      commit 2a2830703a23 ("arm64: debug: avoid accessing mdscr_el1 on fault
>>      paths where possible")
>>
>> This patch
>> (1) moves kgdb_disable_single_step() from 'c' command handling to single
>>      step handler.
>>      This makes sure that single stepping gets effective at every 's' command.
>>      Please note that, under the current implementation, single step bit in
>>      spsr, which is cleared by the first single stepping, will not be set
>>      again for the consecutive 's' commands because single step bit in mdscr
>>      is still kept on (that is, kernel_active_single_step() in
>>      kgdb_arch_handle_exception() is true).
>> (2) removes 'enable_dbg' in el1_dbg.
>>      Single step bit in mdscr is turned on in do_handle_exception()->
>>      kgdb_handle_expection() before returning to debugged context, and if
>>      debug exception is enabled in el1_dbg, we will see unexpected single-
>>      stepping in el1_dbg.
>> (3) masks interrupts while single-stepping one instruction.
>>      If an interrupt is caught during processing a single-stepping, debug
>>      exception is unintentionally enabled by el1_irq's 'enable_dbg' before
>>      returning to debugged context.
>>      Thus, like in (2), we will see unexpected single-stepping in el1_irq.
>>
>> Basically (1) is for v3.15, (2) and (3) with (1) for v3.16.
>>
>> With those changes, we will see another problem if a breakpoint is set
>> at interrupt-sensible places, like gic_handle_irq():
>
> So it seems to me like kgdb is a complete mess in this area. The low-level
> debug exception code for arm64 will single-step *into* interrupt handlers. I
> believe that this is the correct behaviour, as otherwise we're artifically
> restricting what you can and can't debug (for example, leaving debug
> exceptions masked on the interrupt path means that you can't put breakpoints
> in interrupt handlers).

I agree that we should be able to debug even in an interrupt context.

>>      KGDB: re-enter error: breakpoint removed ffffffc000081258
>>      ------------[ cut here ]------------
>>      WARNING: CPU: 0 PID: 650 at kernel/debug/debug_core.c:435
>> 					kgdb_handle_exception+0x1dc/0x1f4()
>>      Modules linked in:
>>      CPU: 0 PID: 650 Comm: sh Not tainted 3.17.0-rc2+ #177
>>      Call trace:
>>      [<ffffffc000087fac>] dump_backtrace+0x0/0x130
>>      [<ffffffc0000880ec>] show_stack+0x10/0x1c
>>      [<ffffffc0004d683c>] dump_stack+0x74/0xb8
>>      [<ffffffc0000ab824>] warn_slowpath_common+0x8c/0xb4
>>      [<ffffffc0000ab90c>] warn_slowpath_null+0x14/0x20
>>      [<ffffffc000121bfc>] kgdb_handle_exception+0x1d8/0x1f4
>>      [<ffffffc000092ffc>] kgdb_brk_fn+0x18/0x28
>>      [<ffffffc0000821c8>] brk_handler+0x9c/0xe8
>>      [<ffffffc0000811e8>] do_debug_exception+0x3c/0xac
>>      Exception stack(0xffffffc07e027650 to 0xffffffc07e027770)
>>      ...
>>      [<ffffffc000083cac>] el1_dbg+0x14/0x68
>>      [<ffffffc00012178c>] kgdb_cpu_enter+0x464/0x5c0
>>      [<ffffffc000121bb4>] kgdb_handle_exception+0x190/0x1f4
>>      [<ffffffc000092ffc>] kgdb_brk_fn+0x18/0x28
>>      [<ffffffc0000821c8>] brk_handler+0x9c/0xe8
>>      [<ffffffc0000811e8>] do_debug_exception+0x3c/0xac
>>      Exception stack(0xffffffc07e027ac0 to 0xffffffc07e027be0)
>>      ...
>>      [<ffffffc000083cac>] el1_dbg+0x14/0x68
>>      [<ffffffc00032e4b4>] __handle_sysrq+0x11c/0x190
>>      [<ffffffc00032e93c>] write_sysrq_trigger+0x4c/0x60
>>      [<ffffffc0001e7d58>] proc_reg_write+0x54/0x84
>>      [<ffffffc000192fa4>] vfs_write+0x98/0x1c8
>>      [<ffffffc0001939b0>] SyS_write+0x40/0xa0
>>
>> Once some interrupt occurs, a breakpoint at gic_handle_irq() triggers kgdb.
>> Kgdb then calls kgdb_roundup_cpus() to sync with other cpus.
>> Current kgdb_roundup_cpus() unmasks interrupts temporarily to
>> use smp_call_function().
>> This eventually allows another interrupt to occur and likely results in
>> hitting a breakpoint at gic_handle_irq() again since debug exception is
>> always enabled in el1_irq.
>>
>> We can avoid this issue by specifying "nokgdbroundup" in kernel parameter,
>> but this will also leave other cpus be in unknown state in terms of kgdb,
>> and may result in interfering with kgdb activity.
>
> Yuck. This really sounds like kgdb is broken in its SMP synchronisation
> for arm64. On x86, they use a NMI and powerpc uses an IPI which can run
> with irqs disabled. Since we don't have an NMI, how about we do the
> following to avoid the panic?
>
>    (1) Change our kgdb_roundup_cpus to use smp_call_function_single_async,
>        which will avoid the need to enable interrupts

It seems that we will have to implement, some sort of, async-version of
smp_call_function() here.

>    (2) Introduce a timeout into the waiting loop in kgdb_cpu_enter, where
>        we spin on &slaves_in_kgdb and warn if the timeout expires.

Before trying this, I need understand more about smp_call_function(), especially
why we can't call it with interrupts disabled.

-Takahiro AKASHI

> Will
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
diff mbox

Patch

diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
index fdd6eae..a935d5f 100644
--- a/arch/arm64/kernel/entry.S
+++ b/arch/arm64/kernel/entry.S
@@ -325,7 +325,6 @@  el1_dbg:
 	mrs	x0, far_el1
 	mov	x2, sp				// struct pt_regs
 	bl	do_debug_exception
-	enable_dbg
 	kernel_exit 1
 el1_inv:
 	// TODO: add support for undefined instructions in kernel mode
diff --git a/arch/arm64/kernel/kgdb.c b/arch/arm64/kernel/kgdb.c
index 75c9cf1..f1fc1d8 100644
--- a/arch/arm64/kernel/kgdb.c
+++ b/arch/arm64/kernel/kgdb.c
@@ -22,6 +22,7 @@ 
 #include <linux/irq.h>
 #include <linux/kdebug.h>
 #include <linux/kgdb.h>
+#include <asm/percpu.h>
 #include <asm/traps.h>
 
 struct dbg_reg_def_t dbg_reg_def[DBG_MAX_REG_NUM] = {
@@ -95,6 +96,8 @@  struct dbg_reg_def_t dbg_reg_def[DBG_MAX_REG_NUM] = {
 	{ "fpcr", 4, -1 },
 };
 
+static DEFINE_PER_CPU(unsigned int, kgdb_pstate);
+
 char *dbg_get_reg(int regno, void *mem, struct pt_regs *regs)
 {
 	if (regno >= DBG_MAX_REG_NUM || regno < 0)
@@ -176,18 +179,14 @@  int kgdb_arch_handle_exception(int exception_vector, int signo,
 		 * over and over again.
 		 */
 		kgdb_arch_update_addr(linux_regs, remcom_in_buffer);
-		atomic_set(&kgdb_cpu_doing_single_step, -1);
-		kgdb_single_step =  0;
-
-		/*
-		 * Received continue command, disable single step
-		 */
-		if (kernel_active_single_step())
-			kernel_disable_single_step();
 
 		err = 0;
 		break;
 	case 's':
+		/* mask interrupts while single stepping */
+		__this_cpu_write(kgdb_pstate, linux_regs->pstate);
+		linux_regs->pstate |= (1 << 7);
+
 		/*
 		 * Update step address value with address passed
 		 * with step packet.
@@ -198,8 +197,6 @@  int kgdb_arch_handle_exception(int exception_vector, int signo,
 		 */
 		kgdb_arch_update_addr(linux_regs, remcom_in_buffer);
 		atomic_set(&kgdb_cpu_doing_single_step, raw_smp_processor_id());
-		kgdb_single_step =  1;
-
 		/*
 		 * Enable single step handling
 		 */
@@ -229,6 +226,18 @@  static int kgdb_compiled_brk_fn(struct pt_regs *regs, unsigned int esr)
 
 static int kgdb_step_brk_fn(struct pt_regs *regs, unsigned int esr)
 {
+	unsigned int pstate;
+
+	kernel_disable_single_step();
+	atomic_set(&kgdb_cpu_doing_single_step, -1);
+
+	/* restore interrupt mask status */
+	pstate = __this_cpu_read(kgdb_pstate);
+	if (pstate & (1 << 7))
+		regs->pstate |= (1 << 7);
+	else
+		regs->pstate &= ~(1 << 7);
+
 	kgdb_handle_exception(1, SIGTRAP, 0, regs);
 	return 0;
 }