[15/20] ARM/hw_breakpoint: Convert to hotplug state machine

Message ID 20170104135644.GA29212@leverpostej
State New

Commit Message

Mark Rutland Jan. 4, 2017, 1:56 p.m.
On Tue, Jan 03, 2017 at 09:33:36AM +0000, Mark Rutland wrote:
> Hi,
>
> On Mon, Jan 02, 2017 at 09:15:29PM +0100, Linus Walleij wrote:
> > On Mon, Jan 2, 2017 at 4:00 PM, Russell King - ARM Linux
> > <linux@armlinux.org.uk> wrote:
> > > On Mon, Jan 02, 2017 at 03:34:32PM +0100, Linus Walleij wrote:
> > >> in the first line of arch_hw_breakpoint_init() in
> > >> arch/arm/kernel/hw_breakpoint.c
> > >>
> > >> I suspect that is not an acceptable solution ...
> > >>
> > >> It hangs at PC is at write_wb_reg+0x20c/0x330
> > >> Which is c03101dc, and looks like this in objdump -d:
> > >>
> > >> c031020c:       ee001eba        mcr     14, 0, r1, cr0, cr10, {5}
> > >> c0310210:       eaffffb3        b       c03100e4 <write_wb_reg+0x114>
> > >
> > > ... and this is several instructions after the address you mention above.
> > > Presumably c03101dc is accessing a higher numbered register?
> >
> > Ah sorry. It looks like this:
> >
> > c03101dc:       ee001ed0        mcr     14, 0, r1, cr0, cr0, {6}
> > c03101e0:       eaffffbf        b       c03100e4 <write_wb_reg+0x114>
> > c03101e4:       ee001ebf        mcr     14, 0, r1, cr0, cr15, {5}
> > c03101e8:       eaffffbd        b       c03100e4 <write_wb_reg+0x114>
> > c03101ec:       ee001ebe        mcr     14, 0, r1, cr0, cr14, {5}
> > c03101f0:       eaffffbb        b       c03100e4 <write_wb_reg+0x114>
> > c03101f4:       ee001ebd        mcr     14, 0, r1, cr0, cr13, {5}
> > c03101f8:       eaffffb9        b       c03100e4 <write_wb_reg+0x114>
>
> FWIW, I was tracking an issue in this area before the holiday.
>
> It looked like DBGPRSR.SPD is set unexpectedly over the default idle
> path (i.e. WFI), causing the (otherwise valid) register accesses above
> to be handled as undefined.
>
> I haven't looked at the patch in detail, but I guess that it allows idle
> to occur between reset_ctrl_regs() and arch_hw_breakpoint_init().

I've just reproduced this locally on my dragonboard APQ8060.

It looks like the write_wb_reg() call that's exploding is from
get_max_wp_len(), which we call immediately after registering the
dbg_reset_online callback. Clearing DBGPRSR.SPD before the write_wb_reg() hides
the problem, so I suspect we're seeing the issue I mentioned above -- it just
so happens that we go idle in a new place.

The below hack allows boot to continue, but is not a real fix. I'm not
immediately sure what to do.
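
To make the suspected sequence concrete, here is a toy user-space model of
it. This is a sketch under assumptions, not kernel code: the struct and the
model_*() helpers are hypothetical, and "WFI sets SPD" encodes the behaviour
suspected on this board, not anything the architecture requires. The one
architectural fact it leans on is that SPD is bit 1 of DBGPRSR and is
cleared by a read of that register.

```c
#include <stdbool.h>
#include <stdint.h>

#define PRSR_SPD	(1u << 1)	/* DBGPRSR sticky power-down status bit */

/* Hypothetical model of the bits of debug state that matter here. */
struct dbg_model {
	uint32_t prsr;	/* stands in for DBGPRSR */
	uint32_t wvr0;	/* stands in for watchpoint value register 0 */
};

/* ASSUMPTION: models the suspected behaviour where a plain WFI sets SPD. */
static void model_wfi(struct dbg_model *d)
{
	d->prsr |= PRSR_SPD;
}

/* Reading DBGPRSR returns the current value and clears the sticky SPD bit. */
static uint32_t model_read_prsr(struct dbg_model *d)
{
	uint32_t val = d->prsr;

	d->prsr &= ~PRSR_SPD;
	return val;
}

/*
 * While SPD is set the access is refused; on the real CPU this is where
 * the MCR would be treated as undefined. Returns true if the write landed.
 */
static bool model_write_wvr0(struct dbg_model *d, uint32_t val)
{
	if (d->prsr & PRSR_SPD)
		return false;

	d->wvr0 = val;
	return true;
}
```

In this model, a write after model_wfi() fails until model_read_prsr() has
been called once, which is the ordering the hack forces in get_max_wp_len().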

Linus, I wasn't able to get ethernet working. Do I need anything on top
of v4.10-rc2 && multi_v7_defconfig?

Thanks,
Mark.

---->8----

Comments

Will Deacon Jan. 4, 2017, 2:32 p.m. | #1
On Wed, Jan 04, 2017 at 01:56:44PM +0000, Mark Rutland wrote:
> On Tue, Jan 03, 2017 at 09:33:36AM +0000, Mark Rutland wrote:
> > On Mon, Jan 02, 2017 at 09:15:29PM +0100, Linus Walleij wrote:
> > > On Mon, Jan 2, 2017 at 4:00 PM, Russell King - ARM Linux
> > > <linux@armlinux.org.uk> wrote:
> > > > On Mon, Jan 02, 2017 at 03:34:32PM +0100, Linus Walleij wrote:
> > > >> in the first line of arch_hw_breakpoint_init() in
> > > >> arch/arm/kernel/hw_breakpoint.c
> > > >>
> > > >> I suspect that is not an acceptable solution ...
> > > >>
> > > >> It hangs at PC is at write_wb_reg+0x20c/0x330
> > > >> Which is c03101dc, and looks like this in objdump -d:
> > > >>
> > > >> c031020c:       ee001eba        mcr     14, 0, r1, cr0, cr10, {5}
> > > >> c0310210:       eaffffb3        b       c03100e4 <write_wb_reg+0x114>
> > > >
> > > > ... and this is several instructions after the address you mention above.
> > > > Presumably c03101dc is accessing a higher numbered register?
> > >
> > > Ah sorry. It looks like this:
> > >
> > > c03101dc:       ee001ed0        mcr     14, 0, r1, cr0, cr0, {6}
> > > c03101e0:       eaffffbf        b       c03100e4 <write_wb_reg+0x114>
> > > c03101e4:       ee001ebf        mcr     14, 0, r1, cr0, cr15, {5}
> > > c03101e8:       eaffffbd        b       c03100e4 <write_wb_reg+0x114>
> > > c03101ec:       ee001ebe        mcr     14, 0, r1, cr0, cr14, {5}
> > > c03101f0:       eaffffbb        b       c03100e4 <write_wb_reg+0x114>
> > > c03101f4:       ee001ebd        mcr     14, 0, r1, cr0, cr13, {5}
> > > c03101f8:       eaffffb9        b       c03100e4 <write_wb_reg+0x114>
> >
> > FWIW, I was tracking an issue in this area before the holiday.
> >
> > It looked like DBGPRSR.SPD is set unexpectedly over the default idle
> > path (i.e. WFI), causing the (otherwise valid) register accesses above
> > to be handled as undefined.
> >
> > I haven't looked at the patch in detail, but I guess that it allows idle
> > to occur between reset_ctrl_regs() and arch_hw_breakpoint_init().
>
> I've just reproduced this locally on my dragonboard APQ8060.
>
> It looks like the write_wb_reg() call that's exploding is from
> get_max_wp_len(), which we call immediately after registering the
> dbg_reset_online callback. Clearing DBGPRSR.SPD before the write_wb_reg() hides
> the problem, so I suspect we're seeing the issue I mentioned above -- it just
> so happens that we go idle in a new place.


When you say "go idle", are we just executing a WFI, or is the power
controller coming into play and we're actually powering down the non-debug
logic? In the case of the latter, the PM notifier should clear SPD in
reset_ctrl_regs, so this sounds like a hardware bug where the SPD bit is
set unconditionally on WFI.

In that case, this code has always been dodgy -- what happens if you try
to use hardware breakpoints in GDB in the face of WFI-based idle?

> The below hack allows boot to continue, but is not a real fix. I'm not
> immediately sure what to do.


If it's never worked, I suggest we blacklist the MIDR until somebody from
Qualcomm can help us further.
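
As a rough illustration of what matching on the MIDR involves, here is a
user-space sketch. The table, struct and function names are hypothetical,
and the part number in the table is a placeholder, not the real ID of the
affected core. Only the MIDR field layout (implementer in bits 31:24, part
number in bits 15:4) and 0x51 as the Qualcomm implementer code are taken as
given.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* MIDR layout: implementer[31:24] variant[23:20] arch[19:16] part[15:4] rev[3:0] */
#define MIDR_IMPLEMENTER(midr)	(((midr) >> 24) & 0xffu)
#define MIDR_PARTNUM(midr)	(((midr) >> 4) & 0xfffu)

/* Hypothetical blacklist entry: match on implementer and part number only. */
struct midr_quirk {
	uint8_t  implementer;
	uint16_t partnum;
};

static const struct midr_quirk dbg_blacklist[] = {
	/* 0x51 is 'Q' (Qualcomm); 0x123 is a PLACEHOLDER part number. */
	{ 0x51, 0x123 },
};

/* True if hardware breakpoint support should be skipped for this MIDR value. */
static bool debug_arch_blacklisted(uint32_t midr)
{
	size_t i;

	for (i = 0; i < sizeof(dbg_blacklist) / sizeof(dbg_blacklist[0]); i++) {
		if (MIDR_IMPLEMENTER(midr) == dbg_blacklist[i].implementer &&
		    MIDR_PARTNUM(midr) == dbg_blacklist[i].partnum)
			return true;
	}
	return false;
}
```

Variant and revision are deliberately ignored in this sketch, so every
stepping of a listed part is treated the same way.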

Will
Linus Walleij Jan. 5, 2017, 3:26 p.m. | #2
On Wed, Jan 4, 2017 at 2:56 PM, Mark Rutland <mark.rutland@arm.com> wrote:

> I've just reproduced this locally on my dragonboard APQ8060.


Sweet!

> It looks like the write_wb_reg() call that's exploding is from
> get_max_wp_len(), which we call immediately after registering the
> dbg_reset_online callback. Clearing DBGPRSR.SPD before the write_wb_reg() hides
> the problem, so I suspect we're seeing the issue I mentioned above -- it just
> so happens that we go idle in a new place.
>
> The below hack allows boot to continue, but is not a real fix. I'm not
> immediately sure what to do.


Me neither. But Will's suggestion to simply blacklist this chip might be
best.

> Linus, I wasn't able to get ethernet working. Do I need anything on top
> of v4.10-rc2 && multi_v7_defconfig?


I haven't tried it with multi_v7 but I should probably try that and patch
up the defconfigs, those are probably the root of the problem.

I do this on top of qcom_defconfig:

scripts/config --file .config \
        --enable QCOM_EBI2 \
        --enable ETHERNET \
        --enable NET_VENDOR_SMSC \
        --enable SMSC911X

Maybe you are missing the EBI2 config?

Yours,
Linus Walleij
Mark Rutland Jan. 5, 2017, 3:57 p.m. | #3
On Wed, Jan 04, 2017 at 02:32:06PM +0000, Will Deacon wrote:
> On Wed, Jan 04, 2017 at 01:56:44PM +0000, Mark Rutland wrote:
> > On Tue, Jan 03, 2017 at 09:33:36AM +0000, Mark Rutland wrote:
> > > On Mon, Jan 02, 2017 at 09:15:29PM +0100, Linus Walleij wrote:
> > > > On Mon, Jan 2, 2017 at 4:00 PM, Russell King - ARM Linux
> > > > <linux@armlinux.org.uk> wrote:
> > > > > On Mon, Jan 02, 2017 at 03:34:32PM +0100, Linus Walleij wrote:
> > > > >> in the first line of arch_hw_breakpoint_init() in
> > > > >> arch/arm/kernel/hw_breakpoint.c
> > > > >>
> > > > >> I suspect that is not an acceptable solution ...
> > > > >>
> > > > >> It hangs at PC is at write_wb_reg+0x20c/0x330
> > > > >> Which is c03101dc, and looks like this in objdump -d:
> > > > >>
> > > > >> c031020c:       ee001eba        mcr     14, 0, r1, cr0, cr10, {5}
> > > > >> c0310210:       eaffffb3        b       c03100e4 <write_wb_reg+0x114>
> > > > >
> > > > > ... and this is several instructions after the address you mention above.
> > > > > Presumably c03101dc is accessing a higher numbered register?
> > > >
> > > > Ah sorry. It looks like this:
> > > >
> > > > c03101dc:       ee001ed0        mcr     14, 0, r1, cr0, cr0, {6}
> > > > c03101e0:       eaffffbf        b       c03100e4 <write_wb_reg+0x114>
> > > > c03101e4:       ee001ebf        mcr     14, 0, r1, cr0, cr15, {5}
> > > > c03101e8:       eaffffbd        b       c03100e4 <write_wb_reg+0x114>
> > > > c03101ec:       ee001ebe        mcr     14, 0, r1, cr0, cr14, {5}
> > > > c03101f0:       eaffffbb        b       c03100e4 <write_wb_reg+0x114>
> > > > c03101f4:       ee001ebd        mcr     14, 0, r1, cr0, cr13, {5}
> > > > c03101f8:       eaffffb9        b       c03100e4 <write_wb_reg+0x114>
> > >
> > > FWIW, I was tracking an issue in this area before the holiday.
> > >
> > > It looked like DBGPRSR.SPD is set unexpectedly over the default idle
> > > path (i.e. WFI), causing the (otherwise valid) register accesses above
> > > to be handled as undefined.
> > >
> > > I haven't looked at the patch in detail, but I guess that it allows idle
> > > to occur between reset_ctrl_regs() and arch_hw_breakpoint_init().
> >
> > I've just reproduced this locally on my dragonboard APQ8060.
> >
> > It looks like the write_wb_reg() call that's exploding is from
> > get_max_wp_len(), which we call immediately after registering the
> > dbg_reset_online callback. Clearing DBGPRSR.SPD before the write_wb_reg() hides
> > the problem, so I suspect we're seeing the issue I mentioned above -- it just
> > so happens that we go idle in a new place.
>
> When you say "go idle", are we just executing a WFI,


From my prior experiments, just executing a WFI as we happen to do in
the default cpu_v7_do_idle. I tried passing cpuidle.off=1, but that
didn't help. NOPing the WFI in cpu_v7_do_idle did mask the issue, from
what I recall.

> or is the power controller coming into play and we're actually
> powering down the non-debug logic?


As far as I can see, that isn't happening. We don't save/restore any
other CPU state in the default idle path, and the kernel is otherwise
happy.

> In the case of the latter, the PM notifier should clear SPD in
> reset_ctrl_regs, so this sounds like a hardware bug where the SPD bit is
> set unconditionally on WFI.
>
> In that case, this code has always been dodgy -- what happens if you try
> to use hardware breakpoints in GDB in the face of WFI-based idle?


The kernel blows up similarly to Linus's original report when it
tries to program the breakpoint registers.

I also believe this has always been dodgy.

> > The below hack allows boot to continue, but is not a real fix. I'm not
> > immediately sure what to do.
>
> If it's never worked, I suggest we blacklist the MIDR until somebody from
> Qualcomm can help us further.


I'll see about putting a patch together.

Thanks,
Mark.

>
> Will

Patch

diff --git a/arch/arm/kernel/hw_breakpoint.c b/arch/arm/kernel/hw_breakpoint.c
index 188180b..a0982ab 100644
--- a/arch/arm/kernel/hw_breakpoint.c
+++ b/arch/arm/kernel/hw_breakpoint.c
@@ -302,7 +302,7 @@  int hw_breakpoint_slots(int type)
  */
 static u8 get_max_wp_len(void)
 {
-       u32 ctrl_reg;
+       u32 ctrl_reg, val;
        struct arch_hw_breakpoint_ctrl ctrl;
        u8 size = 4;
 
@@ -313,6 +313,9 @@  static u8 get_max_wp_len(void)
        ctrl.len = ARM_BREAKPOINT_LEN_8;
        ctrl_reg = encode_ctrl_reg(ctrl);
 
+       /* HACK: CLEAR SPD */
+       ARM_DBG_READ(c1, c5, 4, val);
+
        write_wb_reg(ARM_BASE_WVR, 0);
        write_wb_reg(ARM_BASE_WCR, ctrl_reg);
        if ((read_wb_reg(ARM_BASE_WCR) & ctrl_reg) == ctrl_reg)