mbox series

[v3,0/5] coresight: enable debug module

Message ID 1488520809-31670-1-git-send-email-leo.yan@linaro.org
Headers show
Series coresight: enable debug module | expand

Message

Leo Yan March 3, 2017, 6 a.m. UTC
ARMv8 architecture reference manual (ARM DDI 0487A.k) Chapter H7 "The
Sample-based Profiling Extension" has description for sampling
registers, we can utilize these registers to check program counter
value with combined CPU exception level, secure state, etc. So this is
helpful for CPU lockup bugs, e.g. if one CPU has run into infinite loop
with IRQ disabled; the 'hang' CPU cannot switch context and handle any
interrupt, so it cannot handle SMP call for stack dump, etc.

This patch series is to enable coresight debug module with sample-based
registers and register call back notifier for PCSR register dumping
when panic happens, so we can see below dumping info for panic; and
this patch series has considered the conditions for access permission
for debug registers self, so this can avoid access debug registers when
CPU power domain is off; the driver also try to figure out the CPU is
in secure or non-secure state.

The last two patches in this series is to enable debug unit on 96boards
Hikey, the first patch is to add apb clock for debug unit and the second
patch is to add DT nodes for debug unit. As result we can below log
after input command: echo c > /proc/sysrq-trigger:

ARM external debug module:
CPU[0]:
 EDPRSR:  0000000b (Power:On DLK:Unlock)
 EDPCSR:  [<ffff00000808eb54>] handle_IPI+0xe4/0x150
 EDCIDSR: 00000000
 EDVIDSR: 90000000 (State:Non-secure Mode:EL1/0 Width:64bits VMID:0)
CPU[1]:
 EDPRSR:  0000000b (Power:On DLK:Unlock)
 EDPCSR:  [<ffff0000087a64c0>] debug_notifier_call+0x108/0x288
 EDCIDSR: 00000000
 EDVIDSR: 90000000 (State:Non-secure Mode:EL1/0 Width:64bits VMID:0)

[...]

Changes from v2:
* According to Mathieu Poirier suggestion, applied some minor fixes.
* Added two extra patches for enabling debug module on Hikey.

Changes from v1:
* According to Mike Leach suggestion, removed the binding for debug
  module clocks which have been directly provided by CPU clocks.
* According to Mathieu Poirier suggestion, added function
  of_coresight_get_cpu() and some minor refactors for debug module
  driver.

Changes from RFC:
* According to Mike Leach suggestion, added check for EDPRSR to avoid
  lockup; added supporting EDVIDSR and EDCIDSR registers.
* According to Mark Rutland and Mathieu Poirier suggestion, rewrote
  the documentation for DT binding.
* According to Mark and Mathieu suggestion, refined debug driver.


Leo Yan (5):
  coresight: bindings for debug module
  coresight: refactor with function of_coresight_get_cpu
  coresight: add support for debug module
  clk: hi6220: add debug APB clock
  arm64: dts: hi6220: register debug module

 .../devicetree/bindings/arm/coresight-debug.txt    |  40 +++
 arch/arm64/boot/dts/hisilicon/hi6220.dtsi          |  64 ++++
 drivers/clk/hisilicon/clk-hi6220.c                 |   1 +
 drivers/hwtracing/coresight/Kconfig                |  10 +
 drivers/hwtracing/coresight/Makefile               |   1 +
 drivers/hwtracing/coresight/coresight-debug.c      | 377 +++++++++++++++++++++
 drivers/hwtracing/coresight/of_coresight.c         |  37 +-
 include/dt-bindings/clock/hi6220-clock.h           |   5 +-
 include/linux/coresight.h                          |   2 +
 9 files changed, 524 insertions(+), 13 deletions(-)
 create mode 100644 Documentation/devicetree/bindings/arm/coresight-debug.txt
 create mode 100644 drivers/hwtracing/coresight/coresight-debug.c

-- 
2.7.4

Comments

Sudeep Holla March 21, 2017, 3:39 p.m. UTC | #1
On 03/03/17 06:00, Leo Yan wrote:
> Coresight includes debug module and usually the module connects with CPU

> debug logic. ARMv8 architecture reference manual (ARM DDI 0487A.k) has

> description for related info in "Part H: External Debug".

> 

> Chapter H7 "The Sample-based Profiling Extension" introduces several

> sampling registers, e.g. we can check program counter value with

> combined CPU exception level, secure state, etc. So this is helpful for

> analysis CPU lockup scenarios, e.g. if one CPU has run into infinite

> loop with IRQ disabled. In this case the CPU cannot switch context and

> handle any interrupt (including IPIs), as the result it cannot handle

> SMP call for stack dump.

> 

> This patch is to enable coresight debug module, so firstly this driver

> is to bind apb clock for debug module and this is to ensure the debug

> module can be accessed from program or external debugger. And the driver

> uses sample-based registers for debug purpose, e.g. when system detects

> the CPU lockup and trigger panic, the driver will dump program counter

> and combined context registers (EDCIDSR, EDVIDSR); by parsing context

> registers so can quickly get to know CPU secure state, exception level,

> etc.

> 

> Some of the debug module registers are located in CPU power domain, so

> in the driver it has checked the power state for CPU before accessing

> registers within CPU power domain. For most safe way to use this driver,

> it's suggested to disable CPU low power states, this can simply set

> "nohlt" in kernel command line.

> 


I disagree with this approach. One of the main usefulness of such self
hosted debug feature is to debug issues around features like cpuidle.
Adding constraints like "cpuidle needs to be disabled" is not good IMO.
There are ways to make it work with cpuidle enabled. Please explore
them. In particular refer H9.2.39 EDPRCR, External Debug Power/Reset
Control Register.

So, "nohlt" option is not an option. I prefer some sysfs option like
Suzuki suggested to enable this feature on demand if power saving in
normal usecase is the concern. Using "nohlt" just disables idle and
doesn't ensure the debug power domain is ON. Using the flag directly in
this driver to enable debug power domain also sounds misuse of that flag
for me.

-- 
Regards,
Sudeep
Leo Yan March 22, 2017, 4:01 p.m. UTC | #2
On Wed, Mar 22, 2017 at 02:07:47PM +0000, Sudeep Holla wrote:
> On 22/03/17 12:54, Mike Leach wrote:

> > On 21 March 2017 at 15:39, Sudeep Holla <sudeep.holla@arm.com 

> > <mailto:sudeep.holla@arm.com>> wrote:

> > 

> [...]

> 

> > I disagree with this approach. One of the main usefulness of such

> > self hosted debug feature is to debug issues around features like

> > cpuidle. Adding constraints like "cpuidle needs to be disabled" is

> > not good IMO. There are ways to make it work with cpuidle enabled.

> > Please explore them. In particular refer H9.2.39 EDPRCR, External

> > Debug Power/Reset Control Register.

> > 

> > So, "nohlt" option is not an option. I prefer some sysfs option like 

> > Suzuki suggested to enable this feature on demand if power saving in 

> > normal usecase is the concern. Using "nohlt" just disables idle and 

> > doesn't ensure the debug power domain is ON. Using the flag directly

> > in this driver to enable debug power domain also sounds misuse of

> > that flag for me.

> > 

> > I think the key issue to remember here is that experience with

> > external debug shows that CPU Idle means different things to

> > different SoC designs / power management schemes. (and we are using

> > external debug in a self hosted way here).

>

> Yes agreed on the point that meaning of "cpuidle" differs on each SoC.


Very appreciate for Mike's summary. It's shame for me this is one thing
I should do better :)

This good summary is quite important.

> > Some designs will power down an entire cluster if all CPUs on the 

> > cluster are powered down - including the parts of the debug

> > registers that should remain powered in the debug power domain.

> 

> Interesting, at-least ETMv4 or some other coresight specification

> clearly classify the power domains and the register access. The actual

> power domain itself may vary depending on implementation.

> 

> > The bits in EDPRCR are not respected in these cases - these designs

> > do not really support debug over power down in the way that the

> > CoreSight / Debug designers anticipated. This means that even

> > checking EDPRSR has the potential to cause a bus hang if the target

> > register is unpowered. (and if the debug power domain is unpowered

> > then the PC data is also lost).

> > 

> 

> Agreed, but can we start supporting the sane designs in sane way first.

> We can always add compatible and handle deviations. I agree we may need

> to support such deviations but starting with that seems setting a bad

> example.

> 

> > In these cases, accessing to the debug registers while they are not 

> > powered is a recipe for disaster - so preventing CPUIdle ​and the

> > subsequent cluster power down ​ allows investigation on this class of

> > system - ​and allowing the CPUs of interest be interrogated without

> > hanging the crash log process.​

> > 

> 

> Agreed. But my point is that many issues are around cpuidle and some

> usecase and just eliminating that use-case sounds bad. For me,

> core-sight was most useful to debug issues around cpu power management

> and lockups where we can't stop cores but examine these registers.

> There are other alternatives for other use-cases IMO.

> 

> > 

> > ​On systems that do behave correctly with respect to debug power 

> > domains, then disabling CPUIdle is unnecessary - these can be

> > controlled by ​EDPRCR - perhaps; per the specification it is

> > "implementation defined" if writing bits to this register have an

> > effect on the system anyway even if the debug domain is correctly

> > powered.

> > 

> 

> We can always do that unconditionally. If implementations don't honor

> those bits, it's different. If they hang on accessing something which is

> on debug power domain and not on core power domain, then you have much

> bigger issue to solve. How can you even trust and make any other

> register accesses that are in debug power domain then ?


So we can add below code before really access another other registers
are possible in CPU power domain:

        /*
         * Force to power on CPU power domain and assert
         * DBGPWRUPREQ signal
         */
        val = readl(drvdata->base + EDPRCR);
        val |= BIT(3);
        writel(val, drvdata->base + EDPRCR);

> > ​While it is true to say that disabling CPUIdle does not guarantee

> > that the debug power domain is on, it does in a certain class of

> > designs prevent it being powered off (Juno historically - not sure if

> > that is still the case.).

> > 

> 

> Again it's completely platform specific. All you need to care is that

> the debug power domain is on or not. Disabling CPUIdle to achieve that

> is simply wrong and may work only on few platforms.

> 

> > However, I do agree that the use of the driver should not be

> > triggered _only_ on the existence of /nohlt on the command line - ​

> > there is a class of designs where this will not be required.

> > 

> 

> Thanks

> 

> > When enabing the driver as a kernel config the user needs to

> > decide:- 1) do I need this to debug the issue I am seeing 2) does the

> > power management on my system require I use /nohlt as well.

> 

> Please don't *misuse* nohlt to disable idle. There are other ways to

> do the same either from the user-space or from the driver.

> 

> > 

> > I think that the use of /nohlt as an option, and the reasons why it 

> > might be needed should be part of the configuration help in this

> > case.

> > 

> > There is also a case for considering if there should be an option to 

> > configure it to be enabled or disabled at boot time. It is easy to 

> > imagine cases I want to have this running from the start as a crash 

> > happens early - and cases I can enable it on demand later.

> > 

> 

> Also consider with cpuidle enabled ;). I can help testing if needed.


I tried to digest these info and below are my understanding from your
suggestion:

### For boot time: add two command line flags

- coresight.cpu_debug: this flag is used to enable cpu debug module at
  boot time, and it relys on sane hardware design (like PRCR can works
  well) to access registers;

- coresight.cpu_debug_pwrup: this flag is used to enable cpu debug
  module at boot time, and it cannot relys on PRCR anymore so we need
  manually constraint CPU power states;

### For runtime: use one sysfs node

- Create sysfs node:
  /sys/kernel/debug/coresight_cpu_debug/enable_debug

  echo 1 > /sys/kernel/debug/coresight_cpu_debug/enable_debug: same
  functionality with boot time's 'coresight.cpu_debug';

  echo 2 > /sys/kernel/debug/coresight_cpu_debug/enable_debug: same
  functionality with boot time's 'coresight.cpu_debug_pwrup';

  echo 0 > /sys/kernel/debug/coresight_cpu_debug/enable_debug: disable
  debug functionality.

Does this make sense?

Thanks,
Leo Yan
Sudeep Holla March 22, 2017, 4:53 p.m. UTC | #3
On 22/03/17 16:01, Leo Yan wrote:
> On Wed, Mar 22, 2017 at 02:07:47PM +0000, Sudeep Holla wrote:


[...]

>>

>> We can always do that unconditionally. If implementations don't honor

>> those bits, it's different. If they hang on accessing something which is

>> on debug power domain and not on core power domain, then you have much

>> bigger issue to solve. How can you even trust and make any other

>> register accesses that are in debug power domain then ?

> 

> So we can add below code before really access another other registers

> are possible in CPU power domain:

> 

>         /*

>          * Force to power on CPU power domain and assert

>          * DBGPWRUPREQ signal

>          */

>         val = readl(drvdata->base + EDPRCR);

>         val |= BIT(3);

>         writel(val, drvdata->base + EDPRCR);

> 


Yes worth trying it out.

[...]

> 

> I tried to digest these info and below are my understanding from your

> suggestion:

> 

> ### For boot time: add two command line flags

> 


I am not really sure about boot flags as there are dependency on power
domains and expecting them to be powered on quite earlier is too much to
ask. I am not sure if we need special case for boot time. But that's
just my opinion. If someone has found it *really* useful and no other
alternative exists, then go for it.

[...]

> ### For runtime: use one sysfs node

> 

> - Create sysfs node:

>   /sys/kernel/debug/coresight_cpu_debug/enable_debug

> 

>   echo 1 > /sys/kernel/debug/coresight_cpu_debug/enable_debug: same

>   functionality with boot time's 'coresight.cpu_debug';

> 


My argument was this to be default without any need for flags.
We can skip it as and when we find broken implementation if required.

>   echo 2 > /sys/kernel/debug/coresight_cpu_debug/enable_debug: same

>   functionality with boot time's 'coresight.cpu_debug_pwrup';

> 

>   echo 0 > /sys/kernel/debug/coresight_cpu_debug/enable_debug: disable

>   debug functionality.

>


So it can be simple boolean to force setup the power domain requirements
for it to work whenever you need to activate it. I may be missing some
use-case, but IIUC simple boolean flag should be fine as suggested
initially.

-- 
Regards,
Sudeep
Leo Yan March 23, 2017, 5:43 a.m. UTC | #4
On Wed, Mar 22, 2017 at 05:25:50PM +0000, Sudeep Holla wrote:
> 

> 

> On 22/03/17 17:09, Suzuki K Poulose wrote:

> > On 22/03/17 16:17, Sudeep Holla wrote:

> 

> [...]

> 

> >> 

> >> Point taken. So we could just specify that all necessary power

> >> domains need to be on for proper functionality for this feature and

> >> that it's highly platform specific instead of mixing cpu/cluster

> >> idle details here.

> >> 

> >>> The key point is that the caveat in using this driver is that

> >>> the power management has to be considered on a platform specific

> >>> basis before it is configured; and appropriate actions may be

> >>> needed for it to work correctly. Without this then the driver

> >>> could cause more issues than it debugs. A user selecting this

> >>> _must_ be told about these issues

> >>> 

> > 

> > So given all the possible caveats, I think we :

> > 

> > 1) Shouldn't enable the driver by default at runtime even if it is 

> > built-in. 

> > 2) Should provide mechanisms to turn it on at boot (via

> > kernel commandline) or anytime later (via sysfs), which kind of puts

> > the responsibility back on the user : "You know what you are doing". 

> > 3) Shouldn't turn the driver on based on "nohlt" which the user

> > could use it for some other purposes, without explicit intention of

> > turning this driver on).

> > 4) Should document the fact that, on some

> > platforms, the user may have to disable CPUidle explicitly to get the

> > driver working. But let us not make it the default. The user with a

> > not so ideal platform could add "nohlt" and get it working.

> > 

> 

> Agreed on all points and well summarized.

> I would like to highlight (3) and (4) as it needs to be well understood.

> 

> "nohlt" has a *different* meaning already, so using that in this

> driver for something else is simple wrong as it affects the system in

> unintended ways. And yes if user (mis)uses it to get things working,

> it's fine but shouldn't be recommended way.


Understand this point.

I will try to use general way to constraint CPUIdle like other
drivers.

Thanks all for these good suggestions :)

Thanks,
Leo Yan