mbox series

[v4,0/5] Add support for the ARMv8.2 Statistical Profiling Extension

Message ID 1496676177-29356-1-git-send-email-will.deacon@arm.com
Headers show
Series Add support for the ARMv8.2 Statistical Profiling Extension | expand

Message

Will Deacon June 5, 2017, 3:22 p.m. UTC
Hi all,

This is the sixth posting of the patches previously posted here:

  rfcv1: http://lists.infradead.org/pipermail/linux-arm-kernel/2017-January/476450.html
  rfcv2: http://lists.infradead.org/pipermail/linux-arm-kernel/2017-January/479387.html
     v1: http://lists.infradead.org/pipermail/linux-arm-kernel/2017-January/483684.html
     v2: http://lists.infradead.org/pipermail/linux-arm-kernel/2017-April/499938.html
     v3: http://lists.infradead.org/pipermail/linux-arm-kernel/2017-May/507132.html

The main change since v3 is that I have reworked and fixed the CPU hotplug
and notifier bits, in light of review comments from tglx.

The architecture documentation is available here:

  https://developer.arm.com/products/architecture/a-profile/docs/ddi0586/latest/arm-architecture-reference-manual-supplement-statistical-profiling-extension-for-armv8-a

and there's a high-level overview on this official ARM blog:

  https://community.arm.com/processors/b/blog/posts/statistical-profiling-extension-for-armv8-a

All comments welcome,

Will

Will Deacon (5):
  genirq: export irq_get_percpu_devid_partition to modules
  perf/core: Export AUX buffer helpers to modules
  perf/core: Add PERF_AUX_FLAG_COLLISION to report colliding samples
  drivers/perf: Add support for ARMv8.2 Statistical Profiling Extension
  dt-bindings: Document devicetree binding for ARM SPE

 Documentation/devicetree/bindings/arm/spe-pmu.txt |   20 +
 drivers/perf/Kconfig                              |    8 +
 drivers/perf/Makefile                             |    1 +
 drivers/perf/arm_spe_pmu.c                        | 1243 +++++++++++++++++++++
 include/uapi/linux/perf_event.h                   |    1 +
 kernel/events/ring_buffer.c                       |    4 +
 kernel/irq/irqdesc.c                              |    1 +
 7 files changed, 1278 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/arm/spe-pmu.txt
 create mode 100644 drivers/perf/arm_spe_pmu.c

-- 
2.1.4

Comments

Mark Rutland June 12, 2017, 11:08 a.m. UTC | #1
On Mon, Jun 05, 2017 at 04:22:52PM +0100, Will Deacon wrote:
> Hi all,

>

> This is the sixth posting of the patches previously posted here:

>

>   rfcv1: http://lists.infradead.org/pipermail/linux-arm-kernel/2017-January/476450.html

>   rfcv2: http://lists.infradead.org/pipermail/linux-arm-kernel/2017-January/479387.html

>      v1: http://lists.infradead.org/pipermail/linux-arm-kernel/2017-January/483684.html

>      v2: http://lists.infradead.org/pipermail/linux-arm-kernel/2017-April/499938.html

>      v3: http://lists.infradead.org/pipermail/linux-arm-kernel/2017-May/507132.html

>

> The main change since v3 is that I have reworked and fixed the CPU hotplug

> and notifier bits, in light of review comments from tglx.

>

> The architecture documentation is available here:

>

>   https://developer.arm.com/products/architecture/a-profile/docs/ddi0586/latest/arm-architecture-reference-manual-supplement-statistical-profiling-extension-for-armv8-a


Kim, do you have any version of the userspace side that we could look
at?

For review, it would be really helpful to have something that can poke
the PMU, even if it's incomplete or lacking polish.

Thanks,
Mark.
IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.
Kim Phillips June 12, 2017, 4:20 p.m. UTC | #2
On Mon, 12 Jun 2017 12:08:23 +0100
Mark Rutland <mark.rutland@arm.com> wrote:

> On Mon, Jun 05, 2017 at 04:22:52PM +0100, Will Deacon wrote:

> > This is the sixth posting of the patches previously posted here:

> > 

> >   rfcv1: http://lists.infradead.org/pipermail/linux-arm-kernel/2017-January/476450.html

> >   rfcv2: http://lists.infradead.org/pipermail/linux-arm-kernel/2017-January/479387.html

> >      v1: http://lists.infradead.org/pipermail/linux-arm-kernel/2017-January/483684.html

> >      v2: http://lists.infradead.org/pipermail/linux-arm-kernel/2017-April/499938.html

> >      v3: http://lists.infradead.org/pipermail/linux-arm-kernel/2017-May/507132.html

> > 

> > The main change since v3 is that I have reworked and fixed the CPU hotplug

> > and notifier bits, in light of review comments from tglx.

> > 

> > The architecture documentation is available here:

> > 

> >   https://developer.arm.com/products/architecture/a-profile/docs/ddi0586/latest/arm-architecture-reference-manual-supplement-statistical-profiling-extension-for-armv8-a

> 

> Kim, do you have any version of the userspace side that we could look

> at?

> 

> For review, it would be really helpful to have something that can poke

> the PMU, even if it's incomplete or lacking polish.


Here's the latest push, based on a a couple of prior versions of this
driver:

http://linux-arm.org/git?p=linux-kp.git;a=shortlog;h=refs/heads/armspev0.1

I don't seem to be able to get any SPE data output after rebasing on
this version of the driver.  Still don't know why at the moment...

Kim
Kim Phillips June 15, 2017, 3:57 p.m. UTC | #3
On Mon, 12 Jun 2017 11:20:48 -0500
Kim Phillips <kim.phillips@arm.com> wrote:

> On Mon, 12 Jun 2017 12:08:23 +0100

> Mark Rutland <mark.rutland@arm.com> wrote:

> 

> > On Mon, Jun 05, 2017 at 04:22:52PM +0100, Will Deacon wrote:

> > > This is the sixth posting of the patches previously posted here:

...
> > Kim, do you have any version of the userspace side that we could look

> > at?

> > 

> > For review, it would be really helpful to have something that can poke

> > the PMU, even if it's incomplete or lacking polish.

> 

> Here's the latest push, based on a a couple of prior versions of this

> driver:

> 

> http://linux-arm.org/git?p=linux-kp.git;a=shortlog;h=refs/heads/armspev0.1

> 

> I don't seem to be able to get any SPE data output after rebasing on

> this version of the driver.  Still don't know why at the moment...


Bisected to commit e38ba76deef "perf tools: force uncore events to
system wide monitoring".  So, using record with specifying a -C
<cpu> explicitly now produces SPE data, but only a couple of valid
records at the beginning of each buffer; the rest is filled with
PADding (0's).

I see Mark's latest comments have found a possible issue in the perf
aux buffer handling code in the driver, and that the driver does some
memset of padding (0's) itself; could that be responsible for the above
behaviour?

Thanks,

Kim
Will Deacon June 21, 2017, 3:31 p.m. UTC | #4
On Thu, Jun 15, 2017 at 10:57:35AM -0500, Kim Phillips wrote:
> On Mon, 12 Jun 2017 11:20:48 -0500

> Kim Phillips <kim.phillips@arm.com> wrote:

> 

> > On Mon, 12 Jun 2017 12:08:23 +0100

> > Mark Rutland <mark.rutland@arm.com> wrote:

> > 

> > > On Mon, Jun 05, 2017 at 04:22:52PM +0100, Will Deacon wrote:

> > > > This is the sixth posting of the patches previously posted here:

> ...

> > > Kim, do you have any version of the userspace side that we could look

> > > at?

> > > 

> > > For review, it would be really helpful to have something that can poke

> > > the PMU, even if it's incomplete or lacking polish.

> > 

> > Here's the latest push, based on a a couple of prior versions of this

> > driver:

> > 

> > http://linux-arm.org/git?p=linux-kp.git;a=shortlog;h=refs/heads/armspev0.1

> > 

> > I don't seem to be able to get any SPE data output after rebasing on

> > this version of the driver.  Still don't know why at the moment...

> 

> Bisected to commit e38ba76deef "perf tools: force uncore events to

> system wide monitoring".  So, using record with specifying a -C

> <cpu> explicitly now produces SPE data, but only a couple of valid

> records at the beginning of each buffer; the rest is filled with

> PADding (0's).

> 

> I see Mark's latest comments have found a possible issue in the perf

> aux buffer handling code in the driver, and that the driver does some

> memset of padding (0's) itself; could that be responsible for the above

> behaviour?


Possibly. Do you know how big you're mapping the aux buffer and what (if
any) value you're passing as aux_watermark?

Will
Kim Phillips June 22, 2017, 3:56 p.m. UTC | #5
On Wed, 21 Jun 2017 16:31:09 +0100
Will Deacon <will.deacon@arm.com> wrote:

> On Thu, Jun 15, 2017 at 10:57:35AM -0500, Kim Phillips wrote:

> > On Mon, 12 Jun 2017 11:20:48 -0500

> > Kim Phillips <kim.phillips@arm.com> wrote:

> > 

> > > On Mon, 12 Jun 2017 12:08:23 +0100

> > > Mark Rutland <mark.rutland@arm.com> wrote:

> > > 

> > > > On Mon, Jun 05, 2017 at 04:22:52PM +0100, Will Deacon wrote:

> > > > > This is the sixth posting of the patches previously posted here:

> > ...

> > > > Kim, do you have any version of the userspace side that we could look

> > > > at?

> > > > 

> > > > For review, it would be really helpful to have something that can poke

> > > > the PMU, even if it's incomplete or lacking polish.

> > > 

> > > Here's the latest push, based on a a couple of prior versions of this

> > > driver:

> > > 

> > > http://linux-arm.org/git?p=linux-kp.git;a=shortlog;h=refs/heads/armspev0.1

> > > 

> > > I don't seem to be able to get any SPE data output after rebasing on

> > > this version of the driver.  Still don't know why at the moment...

> > 

> > Bisected to commit e38ba76deef "perf tools: force uncore events to

> > system wide monitoring".  So, using record with specifying a -C

> > <cpu> explicitly now produces SPE data, but only a couple of valid

> > records at the beginning of each buffer; the rest is filled with

> > PADding (0's).

> > 

> > I see Mark's latest comments have found a possible issue in the perf

> > aux buffer handling code in the driver, and that the driver does some

> > memset of padding (0's) itself; could that be responsible for the above

> > behaviour?

> 

> Possibly. Do you know how big you're mapping the aux buffer


4MiB.

> and what (if any) value you're passing as aux_watermark?


None passed, but it looks like 4KiB was used since the AUXTRACE size
was 4MiB - 4KiB.

I'm not seeing the issue with a simple bts-based version I'm
working on...yet.  We can revisit if I'm able to reproduce again; the
problem could have been on the userspace side.

Meanwhile, when using fvp-base.dtb, my model setup stops booting the
kernel after "smp: Bringing up secondary CPUs ...".  If I however take
the second SPE node from fvp-base.dts and add it to my working device
tree, I get this during the driver probe:

[    1.042063] arm_spe_pmu spe-pmu@0: probed for CPUs 0-7 [max_record_sz 64, align 1, features 0xf]
[    1.043582] arm_spe_pmu spe-pmu@1: probed for CPUs 0-7 [max_record_sz 64, align 1, features 0xf]
[    1.043631] genirq: Flags mismatch irq 6. 00004404 (arm_spe_pmu) vs. 00004404 (arm_spe_pmu)
[    1.043784] arm_spe_pmu: probe of spe-pmu@1 failed with error -16

spe-pmu@0 is useable, but doubt spe-pmu@1 is.  btw, that 16 is EBUSY
"Device or resource busy".

Kim
Will Deacon June 22, 2017, 6:36 p.m. UTC | #6
On Thu, Jun 22, 2017 at 10:56:40AM -0500, Kim Phillips wrote:
> On Wed, 21 Jun 2017 16:31:09 +0100

> Will Deacon <will.deacon@arm.com> wrote:

> 

> > On Thu, Jun 15, 2017 at 10:57:35AM -0500, Kim Phillips wrote:

> > > On Mon, 12 Jun 2017 11:20:48 -0500

> > > Kim Phillips <kim.phillips@arm.com> wrote:

> > > 

> > > > On Mon, 12 Jun 2017 12:08:23 +0100

> > > > Mark Rutland <mark.rutland@arm.com> wrote:

> > > > 

> > > > > On Mon, Jun 05, 2017 at 04:22:52PM +0100, Will Deacon wrote:

> > > > > > This is the sixth posting of the patches previously posted here:

> > > ...

> > > > > Kim, do you have any version of the userspace side that we could look

> > > > > at?

> > > > > 

> > > > > For review, it would be really helpful to have something that can poke

> > > > > the PMU, even if it's incomplete or lacking polish.

> > > > 

> > > > Here's the latest push, based on a a couple of prior versions of this

> > > > driver:

> > > > 

> > > > http://linux-arm.org/git?p=linux-kp.git;a=shortlog;h=refs/heads/armspev0.1

> > > > 

> > > > I don't seem to be able to get any SPE data output after rebasing on

> > > > this version of the driver.  Still don't know why at the moment...

> > > 

> > > Bisected to commit e38ba76deef "perf tools: force uncore events to

> > > system wide monitoring".  So, using record with specifying a -C

> > > <cpu> explicitly now produces SPE data, but only a couple of valid

> > > records at the beginning of each buffer; the rest is filled with

> > > PADding (0's).

> > > 

> > > I see Mark's latest comments have found a possible issue in the perf

> > > aux buffer handling code in the driver, and that the driver does some

> > > memset of padding (0's) itself; could that be responsible for the above

> > > behaviour?

> > 

> > Possibly. Do you know how big you're mapping the aux buffer

> 

> 4MiB.

> 

> > and what (if any) value you're passing as aux_watermark?

> 

> None passed, but it looks like 4KiB was used since the AUXTRACE size

> was 4MiB - 4KiB.

> 

> I'm not seeing the issue with a simple bts-based version I'm

> working on...yet.  We can revisit if I'm able to reproduce again; the

> problem could have been on the userspace side.

> 

> Meanwhile, when using fvp-base.dtb, my model setup stops booting the

> kernel after "smp: Bringing up secondary CPUs ...".  If I however take

> the second SPE node from fvp-base.dts and add it to my working device

> tree, I get this during the driver probe:

> 

> [    1.042063] arm_spe_pmu spe-pmu@0: probed for CPUs 0-7 [max_record_sz 64, align 1, features 0xf]

> [    1.043582] arm_spe_pmu spe-pmu@1: probed for CPUs 0-7 [max_record_sz 64, align 1, features 0xf]

> [    1.043631] genirq: Flags mismatch irq 6. 00004404 (arm_spe_pmu) vs. 00004404 (arm_spe_pmu)


Looks like you've screwed up your IRQ partitions, so you are effectively
registering the same device twice, which then blows up due to lack of shared
irqs.

Either remove one of the devices, or use IRQ partitions to restrict them
to unique sets of CPUs.

Will
Kim Phillips June 27, 2017, 9:07 p.m. UTC | #7
On Thu, 22 Jun 2017 19:36:20 +0100
Will Deacon <will.deacon@arm.com> wrote:

> On Thu, Jun 22, 2017 at 10:56:40AM -0500, Kim Phillips wrote:

> > On Wed, 21 Jun 2017 16:31:09 +0100

> > Will Deacon <will.deacon@arm.com> wrote:

> > 

> > > On Thu, Jun 15, 2017 at 10:57:35AM -0500, Kim Phillips wrote:

> > > > On Mon, 12 Jun 2017 11:20:48 -0500

> > > > Kim Phillips <kim.phillips@arm.com> wrote:

> > > > 

> > > > > On Mon, 12 Jun 2017 12:08:23 +0100

> > > > > Mark Rutland <mark.rutland@arm.com> wrote:

> > > > > 

> > > > > > On Mon, Jun 05, 2017 at 04:22:52PM +0100, Will Deacon wrote:

> > > > > > > This is the sixth posting of the patches previously posted here:

> > > > ...

> > > > > > Kim, do you have any version of the userspace side that we could look

> > > > > > at?

> > > > > > 

> > > > > > For review, it would be really helpful to have something that can poke

> > > > > > the PMU, even if it's incomplete or lacking polish.

> > > > > 

> > > > > Here's the latest push, based on a a couple of prior versions of this

> > > > > driver:

> > > > > 

> > > > > http://linux-arm.org/git?p=linux-kp.git;a=shortlog;h=refs/heads/armspev0.1

> > > > > 

> > > > > I don't seem to be able to get any SPE data output after rebasing on

> > > > > this version of the driver.  Still don't know why at the moment...

> > > > 

> > > > Bisected to commit e38ba76deef "perf tools: force uncore events to

> > > > system wide monitoring".  So, using record with specifying a -C

> > > > <cpu> explicitly now produces SPE data, but only a couple of valid

> > > > records at the beginning of each buffer; the rest is filled with

> > > > PADding (0's).

> > > > 

> > > > I see Mark's latest comments have found a possible issue in the perf

> > > > aux buffer handling code in the driver, and that the driver does some

> > > > memset of padding (0's) itself; could that be responsible for the above

> > > > behaviour?

> > > 

> > > Possibly. Do you know how big you're mapping the aux buffer

> > 

> > 4MiB.

> > 

> > > and what (if any) value you're passing as aux_watermark?

> > 

> > None passed, but it looks like 4KiB was used since the AUXTRACE size

> > was 4MiB - 4KiB.

> > 

> > I'm not seeing the issue with a simple bts-based version I'm

> > working on...yet.  We can revisit if I'm able to reproduce again; the

> > problem could have been on the userspace side.


I'm close to finishing the bts version of userspace, and have been
testing a bit more thoroughly, so now I consistently see the excessive
PADding when recording a CPU that's idle. I.e., when I taskset the perf
record to the same CPU I specify to record's -C (taskset -c n perf
record -C n), I get max. twenty-odd number of PAD bytes at the end of
the AUX buffers in the perf.data file.  If, OTOH, I taskset -c n perf
record -C m, where m != n, I get a couple of valid event records in the
buffer, and the rest of the buffer is filled with PADding.

It wouldn't be a problem except that it's wastes too much space
sometimes.  Here is a good output buffer sample from a --mmap-pages=,12
run, with only 4 PADs tacked onto the end:

0xd190 [0x30]: PERF_RECORD_AUXTRACE size: 0x48  offset: 0  ref: 0xe914f7e3ce  idx: 0  tid: -1  cpu: 2
.
. ... ARM SPE data: size 72 bytes
.  00000000:  4a 01                                           B COND
.  00000002:  b1 00 00 00 00 00 00 00 c0                      TGT 0 el2 ns=1
.  0000000b:  42 42                                           RETIRED NOT-TAKEN 
.  0000000d:  b0 f4 4e 10 08 00 00 ff ff                      PC ff000008104ef4 el3 ns=1
.  00000016:  98 00 00                                        LAT 0 TOT
.  00000019:  71 a5 39 e1 14 e9 00 00 00                      TS 1001077684645
.  00000022:  4a 02                                           B IND
.  00000024:  b1 54 51 11 08 00 00 ff ff                      TGT ff000008115154 el3 ns=1
.  0000002d:  42 02                                           RETIRED 
.  0000002f:  b0 68 36 11 08 00 00 ff ff                      PC ff000008113668 el3 ns=1
.  00000038:  98 00 00                                        LAT 0 TOT
.  0000003b:  71 a5 39 e1 14 e9 00 00 00                      TS 1001077684645
.  00000044:  00                                              PAD
.  00000045:  00                                              PAD
.  00000046:  00                                              PAD
.  00000047:  00                                              PAD

whereas this one - from later on in the same run - is over 99% PADs: 

0xd250 [0x30]: PERF_RECORD_AUXTRACE size: 0x5fc0  offset: 0xfffff4ae0044  ref: 0xe91cead1dd  idx: 0  tid: -1  cpu: 2
.
. ... ARM SPE data: size 24512 bytes
.  00000000:  4a 00                                           B
.  00000002:  b1 cc 4e 10 08 00 00 ff ff                      TGT ff000008104ecc el3 ns=1
.  0000000b:  42 02                                           RETIRED 
.  0000000d:  b0 90 4e 10 08 00 00 ff ff                      PC ff000008104e90 el3 ns=1
.  00000016:  98 00 00                                        LAT 0 TOT
.  00000019:  71 a5 39 e1 14 e9 00 00 00                      TS 1001077684645
.  00000022:  49 01                                           ST
.  00000024:  b2 e0 2e f5 7d 00 80 ff ff                      VA ffff80007df52ee0
.  0000002d:  b3 e0 2e f5 fd 00 00 00 80                      PA fdf52ee0 ns=1
.  00000036:  9a 00 00                                        LAT 0 XLAT
.  00000039:  42 16                                           RETIRED L1D-ACCESS TLB-ACCESS 
.  0000003b:  b0 e8 41 39 08 00 00 ff ff                      PC ff0000083941e8 el3 ns=1
.  00000044:  98 00 00                                        LAT 0 TOT
.  00000047:  71 a5 39 e1 14 e9 00 00 00                      TS 1001077684645
.  00000050:  4a 00                                           B
.  00000052:  b1 58 f2 0f 08 00 00 ff ff                      TGT ff0000080ff258 el3 ns=1
.  0000005b:  42 02                                           RETIRED 
.  0000005d:  b0 90 de 0d 08 00 00 ff ff                      PC ff0000080dde90 el3 ns=1
.  00000066:  98 00 00                                        LAT 0 TOT
.  00000069:  71 8f 4e e1 14 e9 00 00 00                      TS 1001077689999
.  00000072:  48 00                                           INSN-OTHER
.  00000074:  42 02                                           RETIRED 
.  00000076:  b0 f8 16 61 08 00 00 ff ff                      PC ff0000086116f8 el3 ns=1
.  0000007f:  98 00 00                                        LAT 0 TOT
.  00000082:  71 8f 4e e1 14 e9 00 00 00                      TS 1001077689999
.  0000008b:  49 00                                           LD
.  0000008d:  b2 10 34 ba 7b 00 80 ff ff                      VA ffff80007bba3410
.  00000096:  b3 10 34 ba fb 00 00 00 80                      PA fbba3410 ns=1
.  0000009f:  9a 00 00                                        LAT 0 XLAT
.  000000a2:  42 16                                           RETIRED L1D-ACCESS TLB-ACCESS 
.  000000a4:  b0 8c be 7a 08 00 00 ff ff                      PC ff0000087abe8c el3 ns=1
.  000000ad:  98 00 00                                        LAT 0 TOT
.  000000b0:  71 8f 4e e1 14 e9 00 00 00                      TS 1001077689999
.  000000b9:  00                                              PAD
...ALL PADs...ALL PADs...ALL PADs...ALL PADs...ALL PADs...ALL PADs...
.  00005fbf:  00                                              PAD

So maybe there's an offset counter that isn't being reset properly;
hopefully the parallel discussion with Mark will help find the problem.

FWIW, there is also this one I saw with mmap-pages set to 5
(pages), which gets rounded up to 8 pages: the driver started
memsetting places it shouldn't?:

$ sudo ./perf record -c 512 -C 0 -e arm_spe/branch_filter=0,ts_enable=1,pa_enable=1,event_filter=0,load_filter=0,jitter=1,store_filter=0,min_latency=0/ --mmap-pages=,5 dd if=/dev/urandom of=/dev/null count=10000
rounding mmap pages size to 32K (8 pages)
10000+0 records in
10000+0 records out
5120000 bytes (5.1 MB) copied, 1.3391 s, 3.8 MB/s
[ 1885.042803] Unable to handle kernel paging request at virtual address ffff00000adac000
[ 1885.042873] pgd = ffff00000ad48000
[ 1885.042899] [ffff00000adac000] *pgd=00000000fdffe003, *pud=00000000fdffd003, *pmd=00000000fdff8003, *pte=0000000000000000
[ 1885.043083] Internal error: Oops: 96000047 [#1] PREEMPT SMP
[ 1885.043131] Modules linked in:
[ 1885.043200] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G        W       4.12.0-rc5-00039-gf1d4a187881e #34
[ 1885.043299] Hardware name: FVP_Base_AEMv8A-AEMv8A (DT)
[ 1885.043364] task: ffff000008c21a80 task.stack: ffff000008c10000
[ 1885.043436] PC is at __memset+0x1ac/0x1d0
[ 1885.043499] LR is at __arm_spe_pmu_next_off+0x6c/0xd8
[ 1885.043600] pc : [<ffff00000837dbac>] lr : [<ffff0000086771e4>] pstate: 204001c9
[ 1885.043600] sp : ffff80007df22d10
[ 1885.043733] x29: ffff80007df22d10 x28: ffff000008c21a80 
[ 1885.043819] x27: ffff000008c37768 x26: ffff80007df30280 
[ 1885.043910] x25: ffff80007a109800 x24: 0000001d507d1906 
[ 1885.044012] x23: ffff80007a601018 x22: ffff80007a3ebb00 
[ 1885.044102] x21: ffff80007df36ab0 x20: ffff80007a3ebb00 
[ 1885.044196] x19: ffff80007df36ab0 x18: 0000000000000000 
[ 1885.044253] x17: 0000000000000000 x16: 0000000000000000 
[ 1885.044339] x15: 0000000000000000 x14: ffff000008c21a80 
[ 1885.044456] x13: 000080007532d000 x12: 000000003445d91d 
[ 1885.044557] x11: 0000000000000000 x10: 0000000000001000 
[ 1885.044600] x9 : 0000000000000000 x8 : ffff00000adac000 
[ 1885.044729] x7 : 0000000000000000 x6 : 00000000000003ff 
[ 1885.044800] x5 : 0000000000000400 x4 : 0000000000000000 
[ 1885.044911] x3 : 0000000000000008 x2 : 0000000000003bff 
[ 1885.045000] x1 : 0000000000000000 x0 : ffff00000ada8000 
[ 1885.045100] Process swapper/0 (pid: 0, stack limit = 0xffff000008c10000)
[ 1885.045179] Stack: (0xffff80007df22d10 to 0xffff000008c14000)
[ 1885.045239] Call trace:
[ 1885.045300] Exception stack(0xffff80007df22b40 to 0xffff80007df22c70)
[ 1885.045400] 2b40: ffff80007df36ab0 0001000000000000 ffff80007df22d10 ffff00000837dbac
[ 1885.045505] 2b60: 0000000000000000 0000000000000000 ffff80007bb4b520 ffff00000837eac0
[ 1885.045605] 2b80: ffff80007df22d10 ffff0000080d6b58 0000000100060b21 ffff80007bb4afa8
[ 1885.045712] 2ba0: ffff80007bb4af20 ffff80007bb4af00 0000000000000000 ffff000008c19f04
[ 1885.045815] 2bc0: ffff000008bff000 ffff000008c17000 ffff80007bb53f00 000000000002fe89
[ 1885.045916] 2be0: ffff00000ada8000 0000000000000000 0000000000003bff 0000000000000008
[ 1885.046013] 2c00: 0000000000000000 0000000000000400 00000000000003ff 0000000000000000
[ 1885.046126] 2c20: ffff00000adac000 0000000000000000 0000000000001000 0000000000000000
[ 1885.046224] 2c40: 000000003445d91d 000080007532d000 ffff000008c21a80 0000000000000000
[ 1885.046326] 2c60: 0000000000000000 0000000000000000
[ 1885.046408] [<ffff00000837dbac>] __memset+0x1ac/0x1d0
[ 1885.046499] [<ffff00000867729c>] arm_spe_perf_aux_output_begin+0x4c/0x1b8
[ 1885.046599] [<ffff000008678424>] arm_spe_pmu_start+0x34/0xf0
[ 1885.046695] [<ffff000008678548>] arm_spe_pmu_add+0x68/0x98
[ 1885.046733] [<ffff00000814da54>] event_sched_in.isra.61+0xcc/0x218
[ 1885.046879] [<ffff00000814dc08>] group_sched_in+0x68/0x1a0
[ 1885.046981] [<ffff00000814dfd0>] ctx_sched_in+0x290/0x468
[ 1885.047080] [<ffff00000814e23c>] perf_event_sched_in+0x94/0xa8
[ 1885.047179] [<ffff00000814e2b4>] ctx_resched+0x64/0xb0
[ 1885.047268] [<ffff00000814e500>] __perf_event_enable+0x200/0x238
[ 1885.047366] [<ffff000008147118>] event_function+0x90/0xf0
[ 1885.047452] [<ffff0000081499e8>] remote_function+0x60/0x70
[ 1885.047514] [<ffff0000081194fc>] flush_smp_call_function_queue+0x9c/0x168
[ 1885.047637] [<ffff00000811a2a0>] generic_smp_call_function_single_interrupt+0x10/0x18
[ 1885.047733] [<ffff00000808e928>] handle_IPI+0xc0/0x1d8
[ 1885.047799] [<ffff000008081700>] gic_handle_irq+0x80/0x178
[ 1885.047799] Exception stack(0xffff000008c13d80 to 0xffff000008c13eb0)
[ 1885.047984] 3d80: 0000000000000000 ffff000008c21a80 00000000000003e8 ffff000008651430
[ 1885.048087] 3da0: 000000001999999a 0000000000000020 0000002bedec501b 0000000000000000
[ 1885.048190] 3dc0: 000001b2b5103510 ffff000008081800 0000000000001000 0000000000000000
[ 1885.048300] 3de0: 000000003445d91d 000080007532d000 ffff000008c21a80 0000000000000000
[ 1885.048400] 3e00: 0000000000000000 0000000000000000 0000000000000000 000001b6e54dc796
[ 1885.048505] 3e20: 0000000000000002 ffff80007a983c00 0000000000000002 ffff000008cdc130
[ 1885.048600] 3e40: 000001b6e5424132 ffff000008c21a80 0000000000000000 00000000fef7c684
[ 1885.048716] 3e60: 0000000080b10018 ffff000008c13eb0 ffff00000861014c ffff000008c13eb0
[ 1885.048817] 3e80: ffff00000861018c 0000000040c00149 00000000fef7c684 0000000000000002
[ 1885.048900] 3ea0: ffffffffffffffff ffff00000861014c
[ 1885.048900] [<ffff0000080827f4>] el1_irq+0xb4/0x128
[ 1885.049009] [<ffff00000861018c>] cpuidle_enter_state+0x154/0x200
[ 1885.049126] [<ffff000008610270>] cpuidle_enter+0x18/0x20
[ 1885.049207] [<ffff0000080ddd08>] call_cpuidle+0x18/0x30
[ 1885.049332] [<ffff0000080ddf44>] do_idle+0x19c/0x1d8
[ 1885.049400] [<ffff0000080de114>] cpu_startup_entry+0x24/0x28
[ 1885.049453] [<ffff0000087a6b30>] rest_init+0x80/0x90
[ 1885.049500] [<ffff000008b10b3c>] start_kernel+0x374/0x388
[ 1885.049617] [<ffff000008b101e0>] __primary_switched+0x64/0x6c
[ 1885.049699] Code: 91010108 54ffff4a 8b040108 cb050042 (d50b7428) 
[ 1885.049800] ---[ end trace 31b9a9f27da95f58 ]---
[ 1885.049900] Kernel panic - not syncing: Fatal exception in interrupt
[ 1885.050000] SMP: stopping secondary CPUs
[ 1885.050204] Kernel Offset: disabled
[ 1885.050240] Memory Limit: none
[ 1885.050327] ---[ end Kernel panic - not syncing: Fatal exception in interrupt

It's not easily reproduced.

> > Meanwhile, when using fvp-base.dtb, my model setup stops booting the

> > kernel after "smp: Bringing up secondary CPUs ...".  If I however take

> > the second SPE node from fvp-base.dts and add it to my working device

> > tree, I get this during the driver probe:

> > 

> > [    1.042063] arm_spe_pmu spe-pmu@0: probed for CPUs 0-7 [max_record_sz 64, align 1, features 0xf]

> > [    1.043582] arm_spe_pmu spe-pmu@1: probed for CPUs 0-7 [max_record_sz 64, align 1, features 0xf]

> > [    1.043631] genirq: Flags mismatch irq 6. 00004404 (arm_spe_pmu) vs. 00004404 (arm_spe_pmu)

> 

> Looks like you've screwed up your IRQ partitions, so you are effectively

> registering the same device twice, which then blows up due to lack of shared

> irqs.

> 

> Either remove one of the devices, or use IRQ partitions to restrict them

> to unique sets of CPUs.


Right, but since I want to get parity with what you're running -
fvp_base.dtb - I tried to debug the hang after "smp: Bringing up
secondary CPUs ..." problem, and could only debug it to the PSCI driver
hitting one of these cases:

case PSCI_RET_INVALID_PARAMS:
case PSCI_RET_INVALID_ADDRESS:

Note: it's yet another place I have to manually instrument the error
path in a kernel driver in lieu of it being more naturally verbose by
itself; I *implore* you to reconsider adding proper user messaging to
arm_spe_pmu_event_init().

I can't tell which part of the fvp-base device tree is not liked by the
firmware; I tried different combinations of the PSCI node, different CPU
enumerations (cpu@100 vs cpu@1), removing idle-states properties...any
hints appreciated.

Thanks,

Kim
Mark Rutland June 28, 2017, 11:26 a.m. UTC | #8
On Tue, Jun 27, 2017 at 04:07:58PM -0500, Kim Phillips wrote:
> I'm close to finishing the bts version of userspace, and have been

> testing a bit more thoroughly, so now I consistently see the excessive

> PADding when recording a CPU that's idle. I.e., when I taskset the perf

> record to the same CPU I specify to record's -C (taskset -c n perf

> record -C n), I get max. twenty-odd number of PAD bytes at the end of

> the AUX buffers in the perf.data file.  If, OTOH, I taskset -c n perf

> record -C m, where m != n, I get a couple of valid event records in the

> buffer, and the rest of the buffer is filled with PADding.

> 

> It wouldn't be a problem except that it's wastes too much space

> sometimes.  Here is a good output buffer sample from a --mmap-pages=,12

> run, with only 4 PADs tacked onto the end:

> 

> 0xd190 [0x30]: PERF_RECORD_AUXTRACE size: 0x48  offset: 0  ref: 0xe914f7e3ce  idx: 0  tid: -1  cpu: 2

> .

> . ... ARM SPE data: size 72 bytes

> .  00000000:  4a 01                                           B COND


[...]

> .  0000003b:  71 a5 39 e1 14 e9 00 00 00                      TS 1001077684645

> .  00000044:  00                                              PAD

> .  00000045:  00                                              PAD

> .  00000046:  00                                              PAD

> .  00000047:  00                                              PAD

> 

> whereas this one - from later on in the same run - is over 99% PADs: 

> 

> 0xd250 [0x30]: PERF_RECORD_AUXTRACE size: 0x5fc0  offset: 0xfffff4ae0044  ref: 0xe91cead1dd  idx: 0  tid: -1  cpu: 2

> .

> . ... ARM SPE data: size 24512 bytes

> .  00000000:  4a 00                                           B


[...]

> .  000000b0:  71 8f 4e e1 14 e9 00 00 00                      TS 1001077689999

> .  000000b9:  00                                              PAD

> ...ALL PADs...ALL PADs...ALL PADs...ALL PADs...ALL PADs...ALL PADs...

> .  00005fbf:  00                                              PAD


Interesting.

If you cat /proc/interrupts, do you see many more SPE interrupts on CPU
n than on m?

Otherwise, I wonder if this is some odd interaction with idle. Can you
try to forcefully load that other CPU?

e.g. run something like:

	taskset -c <n> sh -c 'while true; do done'

... in parallel with the tracer.

For reference, what was your event sample period (i.e. the value of
perf_event_attr::sample_period)?

Did you modify that at all with PERF_EVENT_IOC_PERIOD?

> So maybe there's an offset counter that isn't being reset properly;

> hopefully the parallel discussion with Mark will help find the problem.

> 

> FWIW, there is also this one I saw with mmap-pages set to 5

> (pages), which gets rounded up to 8 pages:


Sorry, *what* does the rounding upwards? Userspace, perf core, or the
driver? Where?

> the driver started memsetting places it shouldn't?:

>

> $ sudo ./perf record -c 512 -C 0 -e arm_spe/branch_filter=0,ts_enable=1,pa_enable=1,event_filter=0,load_filter=0,jitter=1,store_filter=0,min_latency=0/ --mmap-pages=,5 dd if=/dev/urandom of=/dev/null count=10000

> rounding mmap pages size to 32K (8 pages)

> 10000+0 records in

> 10000+0 records out

> 5120000 bytes (5.1 MB) copied, 1.3391 s, 3.8 MB/s

> [ 1885.042803] Unable to handle kernel paging request at virtual address ffff00000adac000

> [ 1885.042873] pgd = ffff00000ad48000

> [ 1885.042899] [ffff00000adac000] *pgd=00000000fdffe003, *pud=00000000fdffd003, *pmd=00000000fdff8003, *pte=0000000000000000

> [ 1885.043083] Internal error: Oops: 96000047 [#1] PREEMPT SMP

> [ 1885.043131] Modules linked in:

> [ 1885.043200] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G        W       4.12.0-rc5-00039-gf1d4a187881e #34

> [ 1885.043299] Hardware name: FVP_Base_AEMv8A-AEMv8A (DT)

> [ 1885.043364] task: ffff000008c21a80 task.stack: ffff000008c10000

> [ 1885.043436] PC is at __memset+0x1ac/0x1d0

> [ 1885.043499] LR is at __arm_spe_pmu_next_off+0x6c/0xd8

> [ 1885.043600] pc : [<ffff00000837dbac>] lr : [<ffff0000086771e4>] pstate: 204001c9

> [ 1885.043600] sp : ffff80007df22d10

> [ 1885.043733] x29: ffff80007df22d10 x28: ffff000008c21a80 

> [ 1885.043819] x27: ffff000008c37768 x26: ffff80007df30280 

> [ 1885.043910] x25: ffff80007a109800 x24: 0000001d507d1906 

> [ 1885.044012] x23: ffff80007a601018 x22: ffff80007a3ebb00 

> [ 1885.044102] x21: ffff80007df36ab0 x20: ffff80007a3ebb00 

> [ 1885.044196] x19: ffff80007df36ab0 x18: 0000000000000000 

> [ 1885.044253] x17: 0000000000000000 x16: 0000000000000000 

> [ 1885.044339] x15: 0000000000000000 x14: ffff000008c21a80 

> [ 1885.044456] x13: 000080007532d000 x12: 000000003445d91d 

> [ 1885.044557] x11: 0000000000000000 x10: 0000000000001000 

> [ 1885.044600] x9 : 0000000000000000 x8 : ffff00000adac000 

> [ 1885.044729] x7 : 0000000000000000 x6 : 00000000000003ff 

> [ 1885.044800] x5 : 0000000000000400 x4 : 0000000000000000 

> [ 1885.044911] x3 : 0000000000000008 x2 : 0000000000003bff 

> [ 1885.045000] x1 : 0000000000000000 x0 : ffff00000ada8000 

> [ 1885.045100] Process swapper/0 (pid: 0, stack limit = 0xffff000008c10000)

> [ 1885.045179] Stack: (0xffff80007df22d10 to 0xffff000008c14000)

> [ 1885.045239] Call trace:


> [ 1885.046408] [<ffff00000837dbac>] __memset+0x1ac/0x1d0

> [ 1885.046499] [<ffff00000867729c>] arm_spe_perf_aux_output_begin+0x4c/0x1b8

> [ 1885.046599] [<ffff000008678424>] arm_spe_pmu_start+0x34/0xf0

> [ 1885.046695] [<ffff000008678548>] arm_spe_pmu_add+0x68/0x98

> [ 1885.046733] [<ffff00000814da54>] event_sched_in.isra.61+0xcc/0x218

> [ 1885.046879] [<ffff00000814dc08>] group_sched_in+0x68/0x1a0

> [ 1885.046981] [<ffff00000814dfd0>] ctx_sched_in+0x290/0x468

> [ 1885.047080] [<ffff00000814e23c>] perf_event_sched_in+0x94/0xa8

> [ 1885.047179] [<ffff00000814e2b4>] ctx_resched+0x64/0xb0

> [ 1885.047268] [<ffff00000814e500>] __perf_event_enable+0x200/0x238

> [ 1885.047366] [<ffff000008147118>] event_function+0x90/0xf0

> [ 1885.047452] [<ffff0000081499e8>] remote_function+0x60/0x70

> [ 1885.047514] [<ffff0000081194fc>] flush_smp_call_function_queue+0x9c/0x168

> [ 1885.047637] [<ffff00000811a2a0>] generic_smp_call_function_single_interrupt+0x10/0x18

> [ 1885.047733] [<ffff00000808e928>] handle_IPI+0xc0/0x1d8

> [ 1885.047799] [<ffff000008081700>] gic_handle_irq+0x80/0x178


> [ 1885.048900] [<ffff0000080827f4>] el1_irq+0xb4/0x128

> [ 1885.049009] [<ffff00000861018c>] cpuidle_enter_state+0x154/0x200

> [ 1885.049126] [<ffff000008610270>] cpuidle_enter+0x18/0x20

> [ 1885.049207] [<ffff0000080ddd08>] call_cpuidle+0x18/0x30

> [ 1885.049332] [<ffff0000080ddf44>] do_idle+0x19c/0x1d8

> [ 1885.049400] [<ffff0000080de114>] cpu_startup_entry+0x24/0x28

> [ 1885.049453] [<ffff0000087a6b30>] rest_init+0x80/0x90

> [ 1885.049500] [<ffff000008b10b3c>] start_kernel+0x374/0x388

> [ 1885.049617] [<ffff000008b101e0>] __primary_switched+0x64/0x6c

> [ 1885.049699] Code: 91010108 54ffff4a 8b040108 cb050042 (d50b7428) 

> [ 1885.049800] ---[ end trace 31b9a9f27da95f58 ]---

> [ 1885.049900] Kernel panic - not syncing: Fatal exception in interrupt

> [ 1885.050000] SMP: stopping secondary CPUs

> [ 1885.050204] Kernel Offset: disabled

> [ 1885.050240] Memory Limit: none

> [ 1885.050327] ---[ end Kernel panic - not syncing: Fatal exception in interrupt


That's worrying. I'll see if I can reproduce this.

> It's not easily reproduced.

> 

> > > Meanwhile, when using fvp-base.dtb, my model setup stops booting the

> > > kernel after "smp: Bringing up secondary CPUs ...".  If I however take

> > > the second SPE node from fvp-base.dts and add it to my working device

> > > tree, I get this during the driver probe:

> > > 

> > > [    1.042063] arm_spe_pmu spe-pmu@0: probed for CPUs 0-7 [max_record_sz 64, align 1, features 0xf]

> > > [    1.043582] arm_spe_pmu spe-pmu@1: probed for CPUs 0-7 [max_record_sz 64, align 1, features 0xf]

> > > [    1.043631] genirq: Flags mismatch irq 6. 00004404 (arm_spe_pmu) vs. 00004404 (arm_spe_pmu)

> > 

> > Looks like you've screwed up your IRQ partitions, so you are effectively

> > registering the same device twice, which then blows up due to lack of shared

> > irqs.

> > 

> > Either remove one of the devices, or use IRQ partitions to restrict them

> > to unique sets of CPUs.

> 

> Right, but since I want to get parity with what you're running -

> fvp_base.dtb - I tried to debug the hang after "smp: Bringing up

> secondary CPUs ..." problem, and could only debug it to the PSCI driver

> hitting one of these cases:

> 

> case PSCI_RET_INVALID_PARAMS:

> case PSCI_RET_INVALID_ADDRESS:


Sounds like your DT is describing CPUs that don't exist (or perhaps the
same CPU several times). Certainly, PSCI and the kernel disagree on
which CPUS exist.

What exact DT are you using?

Are you using the bootwrapper, or ATF? I'm guessing you're using the
bootwrapper.

Which version of the bootwrapepr are you using? If it doesn't have
commit:

  ccdc936924b3682d ("Dynamically determine the set of CPUs")

... have you configured it appropriately with --with-cpu-ids?

How is your model configured? Which CPU IDs does it think exist?

> Note: it's yet another place I have to manually instrument the error

> path in a kernel driver in lieu of it being more naturally verbose by

> itself; I *implore* you to reconsider adding proper user messaging to

> arm_spe_pmu_event_init().


Given this is a FW configuration issue (i.e. a system-level error), I'm
more than happy to make the PSCI driver messages more helpful where
possible.

That's completely orthogonal to the SPE debug messages for requests made
by the user.

> I can't tell which part of the fvp-base device tree is not liked by the

> firmware; I tried different combinations of the PSCI node, different CPU

> enumerations (cpu@100 vs cpu@1), removing idle-states properties...any

> hints appreciated.


The bootwrapper doesn't support idle. So no idle-states should be in the
DT.

If you can share your DT, bootwrapper configuration, and model
configuration, I can try to debug this with you.

Thanks,
Mark.
Mark Rutland June 28, 2017, 11:32 a.m. UTC | #9
On Wed, Jun 28, 2017 at 12:26:02PM +0100, Mark Rutland wrote:
> On Tue, Jun 27, 2017 at 04:07:58PM -0500, Kim Phillips wrote:

> > FWIW, there is also this one I saw with mmap-pages set to 5

> > (pages), which gets rounded up to 8 pages:

> 

> Sorry, *what* does the rounding upwards? Userspace, perf core, or the

> driver? Where?

> 

> > the driver started memsetting places it shouldn't?:

> >

> > $ sudo ./perf record -c 512 -C 0 -e arm_spe/branch_filter=0,ts_enable=1,pa_enable=1,event_filter=0,load_filter=0,jitter=1,store_filter=0,min_latency=0/ --mmap-pages=,5 dd if=/dev/urandom of=/dev/null count=10000

> > rounding mmap pages size to 32K (8 pages)

> > 10000+0 records in

> > 10000+0 records out

> > 5120000 bytes (5.1 MB) copied, 1.3391 s, 3.8 MB/s

> > [ 1885.042803] Unable to handle kernel paging request at virtual address ffff00000adac000

> > [ 1885.042873] pgd = ffff00000ad48000

> > [ 1885.042899] [ffff00000adac000] *pgd=00000000fdffe003, *pud=00000000fdffd003, *pmd=00000000fdff8003, *pte=0000000000000000

> > [ 1885.043083] Internal error: Oops: 96000047 [#1] PREEMPT SMP

> > [ 1885.043131] Modules linked in:

> > [ 1885.043200] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G        W       4.12.0-rc5-00039-gf1d4a187881e #34

> > [ 1885.043299] Hardware name: FVP_Base_AEMv8A-AEMv8A (DT)

> > [ 1885.043364] task: ffff000008c21a80 task.stack: ffff000008c10000

> > [ 1885.043436] PC is at __memset+0x1ac/0x1d0

> > [ 1885.043499] LR is at __arm_spe_pmu_next_off+0x6c/0xd8

> > [ 1885.043600] pc : [<ffff00000837dbac>] lr : [<ffff0000086771e4>] pstate: 204001c9

> > [ 1885.043600] sp : ffff80007df22d10

> > [ 1885.043733] x29: ffff80007df22d10 x28: ffff000008c21a80 

> > [ 1885.043819] x27: ffff000008c37768 x26: ffff80007df30280 

> > [ 1885.043910] x25: ffff80007a109800 x24: 0000001d507d1906 

> > [ 1885.044012] x23: ffff80007a601018 x22: ffff80007a3ebb00 

> > [ 1885.044102] x21: ffff80007df36ab0 x20: ffff80007a3ebb00 

> > [ 1885.044196] x19: ffff80007df36ab0 x18: 0000000000000000 

> > [ 1885.044253] x17: 0000000000000000 x16: 0000000000000000 

> > [ 1885.044339] x15: 0000000000000000 x14: ffff000008c21a80 

> > [ 1885.044456] x13: 000080007532d000 x12: 000000003445d91d 

> > [ 1885.044557] x11: 0000000000000000 x10: 0000000000001000 

> > [ 1885.044600] x9 : 0000000000000000 x8 : ffff00000adac000 

> > [ 1885.044729] x7 : 0000000000000000 x6 : 00000000000003ff 

> > [ 1885.044800] x5 : 0000000000000400 x4 : 0000000000000000 

> > [ 1885.044911] x3 : 0000000000000008 x2 : 0000000000003bff 

> > [ 1885.045000] x1 : 0000000000000000 x0 : ffff00000ada8000 

> > [ 1885.045100] Process swapper/0 (pid: 0, stack limit = 0xffff000008c10000)

> > [ 1885.045179] Stack: (0xffff80007df22d10 to 0xffff000008c14000)

> > [ 1885.045239] Call trace:

> 

> > [ 1885.046408] [<ffff00000837dbac>] __memset+0x1ac/0x1d0

> > [ 1885.046499] [<ffff00000867729c>] arm_spe_perf_aux_output_begin+0x4c/0x1b8

> > [ 1885.046599] [<ffff000008678424>] arm_spe_pmu_start+0x34/0xf0

> > [ 1885.046695] [<ffff000008678548>] arm_spe_pmu_add+0x68/0x98

> > [ 1885.046733] [<ffff00000814da54>] event_sched_in.isra.61+0xcc/0x218

> > [ 1885.046879] [<ffff00000814dc08>] group_sched_in+0x68/0x1a0

> > [ 1885.046981] [<ffff00000814dfd0>] ctx_sched_in+0x290/0x468

> > [ 1885.047080] [<ffff00000814e23c>] perf_event_sched_in+0x94/0xa8

> > [ 1885.047179] [<ffff00000814e2b4>] ctx_resched+0x64/0xb0

> > [ 1885.047268] [<ffff00000814e500>] __perf_event_enable+0x200/0x238

> > [ 1885.047366] [<ffff000008147118>] event_function+0x90/0xf0

> > [ 1885.047452] [<ffff0000081499e8>] remote_function+0x60/0x70

> > [ 1885.047514] [<ffff0000081194fc>] flush_smp_call_function_queue+0x9c/0x168

> > [ 1885.047637] [<ffff00000811a2a0>] generic_smp_call_function_single_interrupt+0x10/0x18

> > [ 1885.047733] [<ffff00000808e928>] handle_IPI+0xc0/0x1d8

> > [ 1885.047799] [<ffff000008081700>] gic_handle_irq+0x80/0x178

> 

> > [ 1885.048900] [<ffff0000080827f4>] el1_irq+0xb4/0x128

> > [ 1885.049009] [<ffff00000861018c>] cpuidle_enter_state+0x154/0x200

> > [ 1885.049126] [<ffff000008610270>] cpuidle_enter+0x18/0x20

> > [ 1885.049207] [<ffff0000080ddd08>] call_cpuidle+0x18/0x30

> > [ 1885.049332] [<ffff0000080ddf44>] do_idle+0x19c/0x1d8

> > [ 1885.049400] [<ffff0000080de114>] cpu_startup_entry+0x24/0x28

> > [ 1885.049453] [<ffff0000087a6b30>] rest_init+0x80/0x90

> > [ 1885.049500] [<ffff000008b10b3c>] start_kernel+0x374/0x388

> > [ 1885.049617] [<ffff000008b101e0>] __primary_switched+0x64/0x6c

> > [ 1885.049699] Code: 91010108 54ffff4a 8b040108 cb050042 (d50b7428) 

> > [ 1885.049800] ---[ end trace 31b9a9f27da95f58 ]---

> > [ 1885.049900] Kernel panic - not syncing: Fatal exception in interrupt

> > [ 1885.050000] SMP: stopping secondary CPUs

> > [ 1885.050204] Kernel Offset: disabled

> > [ 1885.050240] Memory Limit: none

> > [ 1885.050327] ---[ end Kernel panic - not syncing: Fatal exception in interrupt

> 

> That's worrying. I'll see if I can reproduce this.


Actually, this might be down to the IDX2OFF() macro being borked for non
power-of-two buffer sizes.

Do you have Will's latest fixes? In his tree there's a commit:

  4f331cd62531dce2 ("squash! drivers/perf: Add support for ARMv8.2 Statistical Profiling Extension")

... which should fix the IDX2OFF() bug.

It's be good to reproduce the issue if we can, regardless.

Thanks,
Mark.
Kim Phillips June 29, 2017, 12:59 a.m. UTC | #10
On Wed, 28 Jun 2017 12:26:02 +0100
Mark Rutland <mark.rutland@arm.com> wrote:

> On Tue, Jun 27, 2017 at 04:07:58PM -0500, Kim Phillips wrote:

> > I'm close to finishing the bts version of userspace, and have been

> > testing a bit more thoroughly, so now I consistently see the excessive

> > PADding when recording a CPU that's idle. I.e., when I taskset the perf

> > record to the same CPU I specify to record's -C (taskset -c n perf

> > record -C n), I get max. twenty-odd number of PAD bytes at the end of

> > the AUX buffers in the perf.data file.  If, OTOH, I taskset -c n perf

> > record -C m, where m != n, I get a couple of valid event records in the

> > buffer, and the rest of the buffer is filled with PADding.

> > 

> > It wouldn't be a problem except that it's wastes too much space

> > sometimes.  Here is a good output buffer sample from a --mmap-pages=,12

> > run, with only 4 PADs tacked onto the end:

> > 

> > 0xd190 [0x30]: PERF_RECORD_AUXTRACE size: 0x48  offset: 0  ref: 0xe914f7e3ce  idx: 0  tid: -1  cpu: 2

> > .

> > . ... ARM SPE data: size 72 bytes

> > .  00000000:  4a 01                                           B COND

> 

> [...]

> 

> > .  0000003b:  71 a5 39 e1 14 e9 00 00 00                      TS 1001077684645

> > .  00000044:  00                                              PAD

> > .  00000045:  00                                              PAD

> > .  00000046:  00                                              PAD

> > .  00000047:  00                                              PAD

> > 

> > whereas this one - from later on in the same run - is over 99% PADs: 

> > 

> > 0xd250 [0x30]: PERF_RECORD_AUXTRACE size: 0x5fc0  offset: 0xfffff4ae0044  ref: 0xe91cead1dd  idx: 0  tid: -1  cpu: 2

> > .

> > . ... ARM SPE data: size 24512 bytes

> > .  00000000:  4a 00                                           B

> 

> [...]

> 

> > .  000000b0:  71 8f 4e e1 14 e9 00 00 00                      TS 1001077689999

> > .  000000b9:  00                                              PAD

> > ...ALL PADs...ALL PADs...ALL PADs...ALL PADs...ALL PADs...ALL PADs...

> > .  00005fbf:  00                                              PAD

> 

> Interesting.

> 

> If you cat /proc/interrupts, do you see many more SPE interrupts on CPU

> n than on m?


When n == m, I see approx. 1 IRQ per SPE buffer full.

When n != m, I see neither CPU n or m incur SPE interrupts; the
workload ran but didn't get recorded, or, rather, 'idleness' got
recorded instead.

> Otherwise, I wonder if this is some odd interaction with idle. Can you

> try to forcefully load that other CPU?

> 

> e.g. run something like:

> 

> 	taskset -c <n> sh -c 'while true; do done'

> 

> ... in parallel with the tracer.


If I do a:

taskset -c 1 sh -c 'while true; do echo blah > /dev/null' & 
taskset -c 0 perf record -C 1 ...

then non-idleness and non-PADdingness get recorded.

> For reference, what was your event sample period (i.e. the value of

> perf_event_attr::sample_period)?

> 

> Did you modify that at all with PERF_EVENT_IOC_PERIOD?


If that's the same as 'perf record -c <period>', then, yes, I set
the period to values such as 512, 1024.

> > > > Meanwhile, when using fvp-base.dtb, my model setup stops booting the

> > > > kernel after "smp: Bringing up secondary CPUs ...".  If I however take

> > > > the second SPE node from fvp-base.dts and add it to my working device

> > > > tree, I get this during the driver probe:

> > > > 

> > > > [    1.042063] arm_spe_pmu spe-pmu@0: probed for CPUs 0-7 [max_record_sz 64, align 1, features 0xf]

> > > > [    1.043582] arm_spe_pmu spe-pmu@1: probed for CPUs 0-7 [max_record_sz 64, align 1, features 0xf]

> > > > [    1.043631] genirq: Flags mismatch irq 6. 00004404 (arm_spe_pmu) vs. 00004404 (arm_spe_pmu)

> > > 

> > > Looks like you've screwed up your IRQ partitions, so you are effectively

> > > registering the same device twice, which then blows up due to lack of shared

> > > irqs.

> > > 

> > > Either remove one of the devices, or use IRQ partitions to restrict them

> > > to unique sets of CPUs.

> > 

> > Right, but since I want to get parity with what you're running -

> > fvp_base.dtb - I tried to debug the hang after "smp: Bringing up

> > secondary CPUs ..." problem, and could only debug it to the PSCI driver

> > hitting one of these cases:

> > 

> > case PSCI_RET_INVALID_PARAMS:

> > case PSCI_RET_INVALID_ADDRESS:

> 

> Sounds like your DT is describing CPUs that don't exist (or perhaps the

> same CPU several times). Certainly, PSCI and the kernel disagree on

> which CPUS exist.

> 

> What exact DT are you using?


the one this commit to linux-will's perf/spe branch provides:

commit 2a73de57eaf61cdfd61be1e20a46e4a2c326775f
Author: Marc Zyngier <marc.zyngier@arm.com>
Date:   Tue Mar 11 18:14:45 2014 +0000

    arm64: dts: add model device-tree
    
    Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>

    Signed-off-by: Will Deacon <will.deacon@arm.com>


> Are you using the bootwrapper, or ATF? I'm guessing you're using the

> bootwrapper.


I'm using the wrapper to wrap arm-trusted-firmware (ATF?) objects, so,
both?  I noticed the wrapper I was using was pretty old, so I updated
it.

arm-trusted-firmware, btw, has just been updated to enable SPE at lower
ELs, so I don't have to use a hacked-up version anymore.

I also updated my BL33 to the latest upstream u-boot
vexpress_aemv8a_dram_defconfig, and at least now the kernel continues
to boot, even though it can't bring up 6 of the 7 secondary CPUs.

> Which version of the bootwrapepr are you using? If it doesn't have

> commit:

> 

>   ccdc936924b3682d ("Dynamically determine the set of CPUs")

> 

> ... have you configured it appropriately with --with-cpu-ids?

> 

> How is your model configured?


CLUSTER0_NUM_CORES=4
CLUSTER1_NUM_CORES=4

> Which CPU IDs does it think exist?


1,2,3,4,0x100,0x101,0x102,0x103

...which are different from the above device tree!:

0,0x100,0x200,0x300,0x10000,0x10100,0x10200,0x10300

So I imagine that's the problem, thanks!

I don't see how to tell the model to put the CPUs at different
addresses, only a lot of GICv3 redistributor switches?  btw, where can
I get updates to the run-model.sh scripts?  Answer off-list?

> > Note: it's yet another place I have to manually instrument the error

> > path in a kernel driver in lieu of it being more naturally verbose by

> > itself; I *implore* you to reconsider adding proper user messaging to

> > arm_spe_pmu_event_init().

> 

> Given this is a FW configuration issue (i.e. a system-level error), I'm

> more than happy to make the PSCI driver messages more helpful where

> possible.

> 

> That's completely orthogonal to the SPE debug messages for requests made

> by the user.


I respectfully disagree, given the current state of the interfaces
involved.

> > I can't tell which part of the fvp-base device tree is not liked by the

> > firmware; I tried different combinations of the PSCI node, different CPU

> > enumerations (cpu@100 vs cpu@1), removing idle-states properties...any

> > hints appreciated.

> 

> The bootwrapper doesn't support idle. So no idle-states should be in the

> DT.

> 

> If you can share your DT, bootwrapper configuration, and model

> configuration, I can try to debug this with you.


I reverted the wrapper's ccdc936924b3682d ("Dynamically determine the
set of CPUs") commit you mentioned above, and specified the cpu-ids
manually, and am now running with 8 CPUs, although linux enumerates
them as 0,1,8,9,10,11,12,13?

Thanks for your continued support,

Kim
Kim Phillips June 29, 2017, 1:16 a.m. UTC | #11
On Wed, 28 Jun 2017 12:32:50 +0100
Mark Rutland <mark.rutland@arm.com> wrote:

> On Wed, Jun 28, 2017 at 12:26:02PM +0100, Mark Rutland wrote:

> > On Tue, Jun 27, 2017 at 04:07:58PM -0500, Kim Phillips wrote:

> > > FWIW, there is also this one I saw with mmap-pages set to 5

> > > (pages), which gets rounded up to 8 pages:

> > 

> > Sorry, *what* does the rounding upwards? Userspace, perf core, or the

> > driver? Where?


SPE implementations may vary from the minimum buffer alignment of the
smallest available page size, so I left the bts userspace tool code's
upwards-rounding code intact for now.  

I'll take this opportunity to submit the SPE perf tool patch in the
form of a reply to this email:  Look for the rounding code in
tools/perf/arch/arm64/util/arm-spe.c:arm_spe_recording_options().

> > That's worrying. I'll see if I can reproduce this.

> 

> Actually, this might be down to the IDX2OFF() macro being borked for non

> power-of-two buffer sizes.

> 

> Do you have Will's latest fixes? In his tree there's a commit:

> 

>   4f331cd62531dce2 ("squash! drivers/perf: Add support for ARMv8.2 Statistical Profiling Extension")

> 

> ... which should fix the IDX2OFF() bug.


yes, I've been running with that squash! commit since a couple of days
after I noticed it over a week ago.

> It's be good to reproduce the issue if we can, regardless.


FWIW, I couldn't the little I tried today.

Kim
Mark Rutland June 29, 2017, 11:11 a.m. UTC | #12
On Wed, Jun 28, 2017 at 07:59:53PM -0500, Kim Phillips wrote:
> On Wed, 28 Jun 2017 12:26:02 +0100

> Mark Rutland <mark.rutland@arm.com> wrote:

> > On Tue, Jun 27, 2017 at 04:07:58PM -0500, Kim Phillips wrote:


> > > > > Meanwhile, when using fvp-base.dtb, my model setup stops booting the

> > > > > kernel after "smp: Bringing up secondary CPUs ...".  If I however take

> > > > > the second SPE node from fvp-base.dts and add it to my working device

> > > > > tree, I get this during the driver probe:

> > > > > 

> > > > > [    1.042063] arm_spe_pmu spe-pmu@0: probed for CPUs 0-7 [max_record_sz 64, align 1, features 0xf]

> > > > > [    1.043582] arm_spe_pmu spe-pmu@1: probed for CPUs 0-7 [max_record_sz 64, align 1, features 0xf]

> > > > > [    1.043631] genirq: Flags mismatch irq 6. 00004404 (arm_spe_pmu) vs. 00004404 (arm_spe_pmu)

> > > > 

> > > > Looks like you've screwed up your IRQ partitions, so you are effectively

> > > > registering the same device twice, which then blows up due to lack of shared

> > > > irqs.

> > > > 

> > > > Either remove one of the devices, or use IRQ partitions to restrict them

> > > > to unique sets of CPUs.

> > > 

> > > Right, but since I want to get parity with what you're running -

> > > fvp_base.dtb - I tried to debug the hang after "smp: Bringing up

> > > secondary CPUs ..." problem, and could only debug it to the PSCI driver

> > > hitting one of these cases:

> > > 

> > > case PSCI_RET_INVALID_PARAMS:

> > > case PSCI_RET_INVALID_ADDRESS:

> > 

> > Sounds like your DT is describing CPUs that don't exist (or perhaps the

> > same CPU several times). Certainly, PSCI and the kernel disagree on

> > which CPUS exist.

> > 

> > What exact DT are you using?

> 

> the one this commit to linux-will's perf/spe branch provides:

> 

> commit 2a73de57eaf61cdfd61be1e20a46e4a2c326775f

> Author: Marc Zyngier <marc.zyngier@arm.com>

> Date:   Tue Mar 11 18:14:45 2014 +0000

> 

>     arm64: dts: add model device-tree

>     

>     Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>

>     Signed-off-by: Will Deacon <will.deacon@arm.com>

> 

> > Are you using the bootwrapper, or ATF? I'm guessing you're using the

> > bootwrapper.

> 

> I'm using the wrapper to wrap arm-trusted-firmware (ATF?) objects, so,

> both?  I noticed the wrapper I was using was pretty old, so I updated

> it.


Ok. So what's likely happening is that ATF and the bootwrapper's FDT
disagree as to the set of CPUs. You're using ATF's PSCI implementation,
and not the boot-wrapper's.

I don't know how ATF enumerates CPUs on the model, so I can't offer much
guidance here other than fixing your DT to match whatever ATF believes
exists.

> arm-trusted-firmware, btw, has just been updated to enable SPE at lower

> ELs, so I don't have to use a hacked-up version anymore.

> 

> I also updated my BL33 to the latest upstream u-boot

> vexpress_aemv8a_dram_defconfig, and at least now the kernel continues

> to boot, even though it can't bring up 6 of the 7 secondary CPUs.


Do you mean that you replaced the bootwrapper with u-boot?

I'm a little confused here.

Regardless, it sounds like whatever DT is passed to the kernel still
doesn't match the real model configuration.

> > Which version of the bootwrapepr are you using? If it doesn't have

> > commit:

> > 

> >   ccdc936924b3682d ("Dynamically determine the set of CPUs")

> > 

> > ... have you configured it appropriately with --with-cpu-ids?

> > 

> > How is your model configured?

> 

> CLUSTER0_NUM_CORES=4

> CLUSTER1_NUM_CORES=4

> 

> > Which CPU IDs does it think exist?

> 

> 1,2,3,4,0x100,0x101,0x102,0x103

> 

> ...which are different from the above device tree!:

> 

> 0,0x100,0x200,0x300,0x10000,0x10100,0x10200,0x10300

> 

> So I imagine that's the problem, thanks!


Sounds like it, yes.

> I don't see how to tell the model to put the CPUs at different

> addresses, only a lot of GICv3 redistributor switches? 


I don't know how to do this, sorry.

> btw, where can I get updates to the run-model.sh scripts?  Answer

> off-list?


I don't know which script you're referring to. Contact whoever you got
it from initially?

[...]

> > > I can't tell which part of the fvp-base device tree is not liked by the

> > > firmware; I tried different combinations of the PSCI node, different CPU

> > > enumerations (cpu@100 vs cpu@1), removing idle-states properties...any

> > > hints appreciated.

> > 

> > The bootwrapper doesn't support idle. So no idle-states should be in the

> > DT.

> > 

> > If you can share your DT, bootwrapper configuration, and model

> > configuration, I can try to debug this with you.

> 

> I reverted the wrapper's ccdc936924b3682d ("Dynamically determine the

> set of CPUs") commit you mentioned above, and specified the cpu-ids

> manually, and am now running with 8 CPUs, although linux enumerates

> them as 0,1,8,9,10,11,12,13?


The --with-cpu-ids option *adds* CPU nodes, but leaves the broken ones,
and your CPU phandles (and PPI partitions for the SPE node(s)) will all
be wrong. Linux is still seeing those erroneous CPU nodes (presumably
taking Linux CPU ids 2-7).

Generally, --with-cpu-ids doesn't work as you'd expect, which is why it
got removed in favour of assuming an initally correct DT.

Please fix the DT instead. With a fixed DT, and commit ccdc936924b3682d,
the bootwrapper won't further mangle your DT.

Thanks,
Mark.
Mark Rutland June 30, 2017, 2:02 p.m. UTC | #13
Hi Kim,

On Wed, Jun 28, 2017 at 08:43:10PM -0500, Kim Phillips wrote:
> 'perf record' and 'perf report --dump-raw-trace' supported in this

> release.

> 

> Example usage:

> 

> taskset -c 2 ./perf record -C 2 -c 1024 -e arm_spe_0/ts_enable=1,pa_enable=1/ \

> 		dd if=/dev/zero of=/dev/null count=10000

> 

> perf report --dump-raw-trace

> 

> Note that the perf.data file is portable, so the report can be run on another

> architecture host if necessary.

> 

> Output will contain raw SPE data and its textual representation, such as:

> 

> 0xc7d0 [0x30]: PERF_RECORD_AUXTRACE size: 0x82f70  offset: 0  ref: 0x1e947e88189  idx: 0  tid: -1  cpu: 2

> .

> . ... ARM SPE data: size 536432 bytes

> .  00000000:  4a 01                                           B COND

> .  00000002:  b1 00 00 00 00 00 00 00 80                      TGT 0 el0 ns=1

> .  0000000b:  42 42                                           RETIRED NOT-TAKEN

> .  0000000d:  b0 20 41 c0 ad ff ff 00 80                      PC ffffadc04120 el0 ns=1

> .  00000016:  98 00 00                                        LAT 0 TOT

> .  00000019:  71 80 3e f7 46 e9 01 00 00                      TS 2101429616256

> .  00000022:  49 01                                           ST

> .  00000024:  b2 50 bd ba 73 00 80 ff ff                      VA ffff800073babd50

> .  0000002d:  b3 50 bd ba f3 00 00 00 80                      PA f3babd50 ns=1

> .  00000036:  9a 00 00                                        LAT 0 XLAT

> .  00000039:  42 16                                           RETIRED L1D-ACCESS TLB-ACCESS

> .  0000003b:  b0 8c b4 1e 08 00 00 ff ff                      PC ff0000081eb48c el3 ns=1

> .  00000044:  98 00 00                                        LAT 0 TOT

> .  00000047:  71 cc 44 f7 46 e9 01 00 00                      TS 2101429617868

> .  00000050:  48 00                                           INSN-OTHER

> .  00000052:  42 02                                           RETIRED

> .  00000054:  b0 58 54 1f 08 00 00 ff ff                      PC ff0000081f5458 el3 ns=1

> .  0000005d:  98 00 00                                        LAT 0 TOT

> .  00000060:  71 cc 44 f7 46 e9 01 00 00                      TS 2101429617868

> ...

> 

> Other release notes:

> 

> - applies to acme's perf/{core,urgent} branches, likely elsewhere

> 

> - Record requires Will's SPE driver, currently undergoing upstream review

> 

> - the intel-bts implementation was used as a starting point; its

>   min/default/max buffer sizes and power of 2 pages granularity need to be

>   revisited for ARM SPE

> 

> - not been tested on platforms with multiple SPE clusters/domains

> 

> - snapshot support (record -S), and conversion to native perf events

>   (e.g., via 'perf inject --itrace'), are still in development

> 

> - technically both cs-etm and spe can be used simultaneously, however

>   disabled for simplicity in this release

> 

> Signed-off-by: Kim Phillips <kim.phillips@arm.com>

> ---

>  tools/perf/arch/arm/util/auxtrace.c   |  20 +-

>  tools/perf/arch/arm/util/pmu.c        |   3 +

>  tools/perf/arch/arm64/util/Build      |   3 +-

>  tools/perf/arch/arm64/util/arm-spe.c  | 210 ++++++++++++++++

>  tools/perf/util/Build                 |   2 +

>  tools/perf/util/arm-spe-pkt-decoder.c | 448 ++++++++++++++++++++++++++++++++++

>  tools/perf/util/arm-spe-pkt-decoder.h |  52 ++++

>  tools/perf/util/arm-spe.c             | 318 ++++++++++++++++++++++++

>  tools/perf/util/arm-spe.h             |  39 +++

>  tools/perf/util/auxtrace.c            |   3 +

>  tools/perf/util/auxtrace.h            |   1 +

>  11 files changed, 1095 insertions(+), 4 deletions(-)

>  create mode 100644 tools/perf/arch/arm64/util/arm-spe.c

>  create mode 100644 tools/perf/util/arm-spe-pkt-decoder.c

>  create mode 100644 tools/perf/util/arm-spe-pkt-decoder.h

>  create mode 100644 tools/perf/util/arm-spe.c

>  create mode 100644 tools/perf/util/arm-spe.h

> 

> diff --git a/tools/perf/arch/arm/util/auxtrace.c b/tools/perf/arch/arm/util/auxtrace.c

> index 8edf2cb71564..ec071609e8ac 100644

> --- a/tools/perf/arch/arm/util/auxtrace.c

> +++ b/tools/perf/arch/arm/util/auxtrace.c

> @@ -22,29 +22,43 @@

>  #include "../../util/evlist.h"

>  #include "../../util/pmu.h"

>  #include "cs-etm.h"

> +#include "arm-spe.h"

>  

>  struct auxtrace_record

>  *auxtrace_record__init(struct perf_evlist *evlist, int *err)

>  {

> -	struct perf_pmu	*cs_etm_pmu;

> +	struct perf_pmu	*cs_etm_pmu, *arm_spe_pmu;

>  	struct perf_evsel *evsel;

> -	bool found_etm = false;

> +	bool found_etm = false, found_spe = false;

>  

>  	cs_etm_pmu = perf_pmu__find(CORESIGHT_ETM_PMU_NAME);

> +	arm_spe_pmu = perf_pmu__find(ARM_SPE_PMU_NAME);

>  

>  	if (evlist) {

>  		evlist__for_each_entry(evlist, evsel) {

>  			if (cs_etm_pmu &&

>  			    evsel->attr.type == cs_etm_pmu->type)

>  				found_etm = true;

> +			if (arm_spe_pmu &&

> +			    evsel->attr.type == arm_spe_pmu->type)

> +				found_spe = true;


Given ARM_SPE_PMU_NAME is defined as "arm_spe_0", this won't detect all
SPE PMUs in heterogeneous setups (e.g. this'll fail to match "arm_spe_1"
and so on).

Can we not find all PMUs with a "arm_spe_" prefix?

... or, taking a step back, do we need some sysfs "class" attribute to
identify multi-instance PMUs?

>  		}

>  	}

>  

> +	if (found_etm && found_spe) {

> +		pr_err("Concurrent ARM Coresight ETM and SPE operation not currently supported\n");

> +		*err = -EOPNOTSUPP;

> +		return NULL;

> +	}

> +

>  	if (found_etm)

>  		return cs_etm_record_init(err);

>  

> +	if (found_spe)

> +		return arm_spe_recording_init(err);


... so given the above, this will fail.

AFAICT, this means that perf record opens the event, but doesn't touch
the aux buffer at all.

> +

>  	/*

> -	 * Clear 'err' even if we haven't found a cs_etm event - that way perf

> +	 * Clear 'err' even if we haven't found an event - that way perf

>  	 * record can still be used even if tracers aren't present.  The NULL

>  	 * return value will take care of telling the infrastructure HW tracing

>  	 * isn't available.

> diff --git a/tools/perf/arch/arm/util/pmu.c b/tools/perf/arch/arm/util/pmu.c

> index 98d67399a0d6..71fb8f13b40a 100644

> --- a/tools/perf/arch/arm/util/pmu.c

> +++ b/tools/perf/arch/arm/util/pmu.c

> @@ -20,6 +20,7 @@

>  #include <linux/perf_event.h>

>  

>  #include "cs-etm.h"

> +#include "arm-spe.h"

>  #include "../../util/pmu.h"

>  

>  struct perf_event_attr

> @@ -31,6 +32,8 @@ struct perf_event_attr

>  		pmu->selectable = true;

>  		pmu->set_drv_config = cs_etm_set_drv_config;

>  	}

> +	if (!strcmp(pmu->name, ARM_SPE_PMU_NAME))

> +		pmu->selectable = true;


... likewise I here.

I guess we need an is_arm_spe_pmu() helper for both cases, iterating over
all PMUs.

>  #endif

>  	return NULL;

>  }

> diff --git a/tools/perf/arch/arm64/util/Build b/tools/perf/arch/arm64/util/Build

> index cef6fb38d17e..f9969bb88ccb 100644

> --- a/tools/perf/arch/arm64/util/Build

> +++ b/tools/perf/arch/arm64/util/Build

> @@ -3,4 +3,5 @@ libperf-$(CONFIG_LOCAL_LIBUNWIND) += unwind-libunwind.o

>  

>  libperf-$(CONFIG_AUXTRACE) += ../../arm/util/pmu.o \

>  			      ../../arm/util/auxtrace.o \

> -			      ../../arm/util/cs-etm.o

> +			      ../../arm/util/cs-etm.o \

> +			      arm-spe.o

> diff --git a/tools/perf/arch/arm64/util/arm-spe.c b/tools/perf/arch/arm64/util/arm-spe.c

> new file mode 100644

> index 000000000000..07172764881c

> --- /dev/null

> +++ b/tools/perf/arch/arm64/util/arm-spe.c

> @@ -0,0 +1,210 @@

> +/*

> + * ARM Statistical Profiling Extensions (SPE) support

> + * Copyright (c) 2017, ARM Ltd.

> + *

> + * This program is free software; you can redistribute it and/or modify it

> + * under the terms and conditions of the GNU General Public License,

> + * version 2, as published by the Free Software Foundation.

> + *

> + * This program is distributed in the hope it will be useful, but WITHOUT

> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or

> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for

> + * more details.

> + *

> + */

> +

> +#include <linux/kernel.h>

> +#include <linux/types.h>

> +#include <linux/bitops.h>

> +#include <linux/log2.h>

> +

> +#include "../../util/cpumap.h"

> +#include "../../util/evsel.h"

> +#include "../../util/evlist.h"

> +#include "../../util/session.h"

> +#include "../../util/util.h"

> +#include "../../util/pmu.h"

> +#include "../../util/debug.h"

> +#include "../../util/tsc.h"

> +#include "../../util/auxtrace.h"

> +#include "../../util/arm-spe.h"

> +

> +#define KiB(x) ((x) * 1024)

> +#define MiB(x) ((x) * 1024 * 1024)

> +#define KiB_MASK(x) (KiB(x) - 1)

> +#define MiB_MASK(x) (MiB(x) - 1)


It's a shame we don't have a UAPI version of <linux/sizes.h>, though I
guess it's good to do the same thing as the x86 code.

I was a little worried that the *_MASK() helpers might be used with a
non power-of-two x, but it seems they're unused even for x86. Can we
please delete them?

> +struct arm_spe_recording {

> +	struct auxtrace_record		itr;

> +	struct perf_pmu			*arm_spe_pmu;

> +	struct perf_evlist		*evlist;

> +};


A user may wich to record trace on separate uarches simultaneously, so
having a singleton arm_spe_pmu for the entire evlist doesn't seem right.

e.g. I don't see why we should allow a user to do:

./perf record -c 1024 \
	-e arm_spe_0/ts_enable=1,pa_enable=0/ \
	-e arm_spe_1/ts_enable=1,pa_enable=0/ \
	${WORKLOAD}

... which perf-record seems to accept today, but I don't seem to get any
aux trace, regardless of whether I taskset the entire thing to any
particular CPU.

It also seems that the events aren't task bound, judging by -vv output:

------------------------------------------------------------
perf_event_attr:
  type                             7
  size                             112
  config                           0x1
  { sample_period, sample_freq }   1024
  sample_type                      IP|TID|TIME|CPU|IDENTIFIER
  read_format                      ID
  disabled                         1
  inherit                          1
  enable_on_exec                   1
  sample_id_all                    1
  exclude_guest                    1
------------------------------------------------------------
sys_perf_event_open: pid -1  cpu 0  group_fd -1  flags 0x8 = 4
sys_perf_event_open: pid -1  cpu 1  group_fd -1  flags 0x8 = 5
sys_perf_event_open: pid -1  cpu 2  group_fd -1  flags 0x8 = 6
sys_perf_event_open: pid -1  cpu 3  group_fd -1  flags 0x8 = 8
------------------------------------------------------------
perf_event_attr:
  type                             8
  size                             112
  config                           0x1
  { sample_period, sample_freq }   1024
  sample_type                      IP|TID|TIME|IDENTIFIER
  read_format                      ID
  disabled                         1
  inherit                          1
  enable_on_exec                   1
  sample_id_all                    1
  exclude_guest                    1
------------------------------------------------------------
sys_perf_event_open: pid -1  cpu 4  group_fd -1  flags 0x8 = 9
sys_perf_event_open: pid -1  cpu 5  group_fd -1  flags 0x8 = 10
sys_perf_event_open: pid -1  cpu 6  group_fd -1  flags 0x8 = 11
sys_perf_event_open: pid -1  cpu 7  group_fd -1  flags 0x8 = 12
------------------------------------------------------------
perf_event_attr:
  type                             1
  size                             112
  config                           0x9
  { sample_period, sample_freq }   1024
  sample_type                      IP|TID|TIME|IDENTIFIER
  read_format                      ID
  disabled                         1
  inherit                          1
  exclude_kernel                   1
  exclude_hv                       1
  mmap                             1
  comm                             1
  enable_on_exec                   1
  task                             1
  sample_id_all                    1
  mmap2                            1
  comm_exec                        1
------------------------------------------------------------
sys_perf_event_open: pid 2181  cpu 0  group_fd -1  flags 0x8 = 13
sys_perf_event_open: pid 2181  cpu 1  group_fd -1  flags 0x8 = 14
sys_perf_event_open: pid 2181  cpu 2  group_fd -1  flags 0x8 = 15
sys_perf_event_open: pid 2181  cpu 3  group_fd -1  flags 0x8 = 16
sys_perf_event_open: pid 2181  cpu 4  group_fd -1  flags 0x8 = 17
sys_perf_event_open: pid 2181  cpu 5  group_fd -1  flags 0x8 = 18
sys_perf_event_open: pid 2181  cpu 6  group_fd -1  flags 0x8 = 19
sys_perf_event_open: pid 2181  cpu 7  group_fd -1  flags 0x8 = 20



I see something similar (i.e. perf doesn't try to bind the events to the
workload PID) when I try to record with only a single PMU. In that case,
perf-record blows up because it can't handle events on a subset of CPUs
(though it should be able to):

nanook@torsk:~$ ./perf record -vv -c 1024 -e arm_spe_0/ts_enable=1,pa_enable=0/ true
WARNING: Kernel address maps (/proc/{kallsyms,modules}) are restricted,
check /proc/sys/kernel/kptr_restrict.

Samples in kernel functions may not be resolved if a suitable vmlinux
file is not found in the buildid cache or in the vmlinux path.

Samples in kernel modules won't be resolved at all.

If some relocation was applied (e.g. kexec) symbols may be misresolved
even with a suitable vmlinux or kallsyms file.

Problems creating module maps, continuing anyway...
------------------------------------------------------------
perf_event_attr:
  type                             7
  size                             112
  config                           0x1
  { sample_period, sample_freq }   1024
  sample_type                      IP|TID|TIME|CPU|IDENTIFIER
  read_format                      ID
  disabled                         1
  inherit                          1
  enable_on_exec                   1
  sample_id_all                    1
  exclude_guest                    1
------------------------------------------------------------
sys_perf_event_open: pid -1  cpu 0  group_fd -1  flags 0x8 = 4
sys_perf_event_open: pid -1  cpu 1  group_fd -1  flags 0x8 = 5
sys_perf_event_open: pid -1  cpu 2  group_fd -1  flags 0x8 = 6
sys_perf_event_open: pid -1  cpu 3  group_fd -1  flags 0x8 = 8
------------------------------------------------------------
perf_event_attr:
  type                             1
  size                             112
  config                           0x9
  { sample_period, sample_freq }   1024
  sample_type                      IP|TID|TIME|IDENTIFIER
  read_format                      ID
  disabled                         1
  inherit                          1
  exclude_kernel                   1
  exclude_hv                       1
  mmap                             1
  comm                             1
  enable_on_exec                   1
  task                             1
  sample_id_all                    1
  mmap2                            1
  comm_exec                        1
------------------------------------------------------------
sys_perf_event_open: pid 2185  cpu 0  group_fd -1  flags 0x8 = 9
sys_perf_event_open: pid 2185  cpu 1  group_fd -1  flags 0x8 = 10
sys_perf_event_open: pid 2185  cpu 2  group_fd -1  flags 0x8 = 11
sys_perf_event_open: pid 2185  cpu 3  group_fd -1  flags 0x8 = 12
sys_perf_event_open: pid 2185  cpu 4  group_fd -1  flags 0x8 = 13
sys_perf_event_open: pid 2185  cpu 5  group_fd -1  flags 0x8 = 14
sys_perf_event_open: pid 2185  cpu 6  group_fd -1  flags 0x8 = 15
sys_perf_event_open: pid 2185  cpu 7  group_fd -1  flags 0x8 = 16
mmap size 266240B
AUX area mmap length 131072
perf event ring buffer mmapped per cpu
failed to mmap AUX area
failed to mmap with 524 (INTERNAL ERROR: strerror_r(524, 0xffffc8596038, 512)=22)



... with a SW event, this works as expected, being bound to the workload PID:

nanook@torsk:~$ ./perf record -vvv -e context-switches true
WARNING: Kernel address maps (/proc/{kallsyms,modules}) are restricted,
check /proc/sys/kernel/kptr_restrict.

Samples in kernel functions may not be resolved if a suitable vmlinux
file is not found in the buildid cache or in the vmlinux path.

Samples in kernel modules won't be resolved at all.

If some relocation was applied (e.g. kexec) symbols may be misresolved
even with a suitable vmlinux or kallsyms file.

Problems creating module maps, continuing anyway...
------------------------------------------------------------
perf_event_attr:
  type                             1
  size                             112
  config                           0x3
  { sample_period, sample_freq }   4000
  sample_type                      IP|TID|TIME|PERIOD
  disabled                         1
  inherit                          1
  mmap                             1
  comm                             1
  freq                             1
  enable_on_exec                   1
  task                             1
  sample_id_all                    1
  exclude_guest                    1
  mmap2                            1
  comm_exec                        1
------------------------------------------------------------
sys_perf_event_open: pid 2220  cpu 0  group_fd -1  flags 0x8 = 4
sys_perf_event_open: pid 2220  cpu 1  group_fd -1  flags 0x8 = 5
sys_perf_event_open: pid 2220  cpu 2  group_fd -1  flags 0x8 = 6
sys_perf_event_open: pid 2220  cpu 3  group_fd -1  flags 0x8 = 8
sys_perf_event_open: pid 2220  cpu 4  group_fd -1  flags 0x8 = 9
sys_perf_event_open: pid 2220  cpu 5  group_fd -1  flags 0x8 = 10
sys_perf_event_open: pid 2220  cpu 6  group_fd -1  flags 0x8 = 11
sys_perf_event_open: pid 2220  cpu 7  group_fd -1  flags 0x8 = 12
mmap size 528384B
perf event ring buffer mmapped per cpu
Couldn't record kernel reference relocation symbol
Symbol resolution may be skewed if relocation was used (e.g. kexec).
Check /proc/kallsyms permission or run as root.
[ perf record: Woken up 1 times to write data ]
overlapping maps in /lib/aarch64-linux-gnu/ld-2.19.so (disable tui for more info)
overlapping maps in [vdso] (disable tui for more info)
overlapping maps in /tmp/perf-2220.map (disable tui for more info)
Looking at the vmlinux_path (8 entries long)
No kallsyms or vmlinux with build-id cc083c873190ff1254624d3137142c6841c118c3 was found
[kernel.kallsyms] with build id cc083c873190ff1254624d3137142c6841c118c3 not found, continuing without symbols
overlapping maps in /etc/ld.so.cache (disable tui for more info)
overlapping maps in /lib/aarch64-linux-gnu/libc-2.19.so (disable tui for more info)
overlapping maps in /tmp/perf-2220.map (disable tui for more info)
overlapping maps in /tmp/perf-2220.map (disable tui for more info)
overlapping maps in /tmp/perf-2220.map (disable tui for more info)
overlapping maps in /tmp/perf-2220.map (disable tui for more info)
overlapping maps in /lib/aarch64-linux-gnu/libc-2.19.so (disable tui for more info)
overlapping maps in /bin/true (disable tui for more info)
overlapping maps in /lib/aarch64-linux-gnu/ld-2.19.so (disable tui for more info)
failed to write feature HEADER_CPUDESC
failed to write feature HEADER_CPUID
[ perf record: Captured and wrote 0.002 MB perf.data (4 samples) ]



... so I guess this has something to do with the way the tool tries to
use the cpumask, maknig the wrong assumption that this implies
system-wide collection is mandatory / expected.

> +

> +static size_t

> +arm_spe_info_priv_size(struct auxtrace_record *itr __maybe_unused,

> +			 struct perf_evlist *evlist __maybe_unused)

> +{

> +	return ARM_SPE_AUXTRACE_PRIV_SIZE;

> +}

> +

> +static int arm_spe_info_fill(struct auxtrace_record *itr,

> +			       struct perf_session *session,

> +			       struct auxtrace_info_event *auxtrace_info,

> +			       size_t priv_size)

> +{

> +	struct arm_spe_recording *sper =

> +			container_of(itr, struct arm_spe_recording, itr);

> +	struct perf_pmu *arm_spe_pmu = sper->arm_spe_pmu;

> +

> +	if (priv_size != ARM_SPE_AUXTRACE_PRIV_SIZE)

> +		return -EINVAL;

> +

> +	if (!session->evlist->nr_mmaps)

> +		return -EINVAL;

> +

> +	auxtrace_info->type = PERF_AUXTRACE_ARM_SPE;

> +	auxtrace_info->priv[ARM_SPE_PMU_TYPE] = arm_spe_pmu->type;

> +

> +	return 0;

> +}

> +

> +static int arm_spe_recording_options(struct auxtrace_record *itr,

> +				       struct perf_evlist *evlist,

> +				       struct record_opts *opts)

> +{

> +	struct arm_spe_recording *sper =

> +			container_of(itr, struct arm_spe_recording, itr);

> +	struct perf_pmu *arm_spe_pmu = sper->arm_spe_pmu;

> +	struct perf_evsel *evsel, *arm_spe_evsel = NULL;

> +	const struct cpu_map *cpus = evlist->cpus;

> +	bool privileged = geteuid() == 0 || perf_event_paranoid() < 0;

> +	struct perf_evsel *tracking_evsel;

> +	int err;

> +

> +	sper->evlist = evlist;

> +

> +	evlist__for_each_entry(evlist, evsel) {

> +		if (evsel->attr.type == arm_spe_pmu->type) {

> +			if (arm_spe_evsel) {

> +				pr_err("There may be only one " ARM_SPE_PMU_NAME " event\n");

> +				return -EINVAL;

> +			}

> +			evsel->attr.freq = 0;

> +			evsel->attr.sample_period = 1;

> +			arm_spe_evsel = evsel;

> +			opts->full_auxtrace = true;

> +		}

> +	}


Theoretically, we could ask for different events on different CPUs, but
otehrwise, this looks sane.

> +

> +	if (!opts->full_auxtrace)

> +		return 0;

> +

> +	/* We are in full trace mode but '-m,xyz' wasn't specified */

> +	if (opts->full_auxtrace && !opts->auxtrace_mmap_pages) {

> +		if (privileged) {

> +			opts->auxtrace_mmap_pages = MiB(4) / page_size;

> +		} else {

> +			opts->auxtrace_mmap_pages = KiB(128) / page_size;

> +			if (opts->mmap_pages == UINT_MAX)

> +				opts->mmap_pages = KiB(256) / page_size;

> +		}

> +	}

> +

> +	/* Validate auxtrace_mmap_pages */

> +	if (opts->auxtrace_mmap_pages) {

> +		size_t sz = opts->auxtrace_mmap_pages * (size_t)page_size;

> +		size_t min_sz = KiB(8);

> +

> +		if (sz < min_sz || !is_power_of_2(sz)) {

> +			pr_err("Invalid mmap size for ARM SPE: must be at least %zuKiB and a power of 2\n",

> +			       min_sz / 1024);

> +			return -EINVAL;

> +		}

> +	}

> +

> +	/*

> +	 * To obtain the auxtrace buffer file descriptor, the auxtrace event

> +	 * must come first.

> +	 */

> +	perf_evlist__to_front(evlist, arm_spe_evsel);


Huh? *what* needs the auxtrace buffer fd?

This seems really fragile. Can't we store this elsewhere?

> +

> +	/*

> +	 * In the case of per-cpu mmaps, we need the CPU on the

> +	 * AUX event.

> +	 */

> +	if (!cpu_map__empty(cpus))

> +		perf_evsel__set_sample_bit(arm_spe_evsel, CPU);

> +

> +	/* Add dummy event to keep tracking */

> +	err = parse_events(evlist, "dummy:u", NULL);

> +	if (err)

> +		return err;

> +

> +	tracking_evsel = perf_evlist__last(evlist);

> +	perf_evlist__set_tracking_event(evlist, tracking_evsel);

> +

> +	tracking_evsel->attr.freq = 0;

> +	tracking_evsel->attr.sample_period = 1;

> +

> +	/* In per-cpu case, always need the time of mmap events etc */

> +	if (!cpu_map__empty(cpus))

> +		perf_evsel__set_sample_bit(tracking_evsel, TIME);

> +

> +	return 0;

> +}

> +

> +static u64 arm_spe_reference(struct auxtrace_record *itr __maybe_unused)

> +{

> +	u64 ts;

> +

> +	asm volatile ("isb; mrs %0, cntvct_el0" : "=r" (ts));

> +

> +	return ts;

> +}


I do not think it's a good idea to read the counter directly like this.

What is this "reference" intended to be meaningful relative to?

Why do we need to do this in userspace?

Can we not ask the kernel to output timestamps instead?

> +

> +static void arm_spe_recording_free(struct auxtrace_record *itr)

> +{

> +	struct arm_spe_recording *sper =

> +			container_of(itr, struct arm_spe_recording, itr);

> +

> +       free(sper);

> +}

> +

> +static int arm_spe_read_finish(struct auxtrace_record *itr, int idx)

> +{

> +	struct arm_spe_recording *sper =

> +			container_of(itr, struct arm_spe_recording, itr);

> +	struct perf_evsel *evsel;

> +

> +	evlist__for_each_entry(sper->evlist, evsel) {

> +		if (evsel->attr.type == sper->arm_spe_pmu->type)

> +			return perf_evlist__enable_event_idx(sper->evlist,

> +							     evsel, idx);

> +	}

> +	return -EINVAL;

> +}

> +

> +struct auxtrace_record *arm_spe_recording_init(int *err)

> +{

> +	struct perf_pmu *arm_spe_pmu = perf_pmu__find(ARM_SPE_PMU_NAME);

> +	struct arm_spe_recording *sper;

> +

> +	if (!arm_spe_pmu)

> +		return NULL;


No need to set *err here?

> +

> +	sper = zalloc(sizeof(struct arm_spe_recording));

> +	if (!sper) {

> +		*err = -ENOMEM;

> +		return NULL;

> +	}


... as we do here?

[...]

> +

> +	sper->arm_spe_pmu = arm_spe_pmu;

> +	sper->itr.recording_options = arm_spe_recording_options;

> +	sper->itr.info_priv_size = arm_spe_info_priv_size;

> +	sper->itr.info_fill = arm_spe_info_fill;

> +	sper->itr.free = arm_spe_recording_free;

> +	sper->itr.reference = arm_spe_reference;

> +	sper->itr.read_finish = arm_spe_read_finish;

> +	sper->itr.alignment = 0;

> +	return &sper->itr;

> +}

> diff --git a/tools/perf/util/Build b/tools/perf/util/Build

> index 79dea95a7f68..4ed31e88b8ee 100644

> --- a/tools/perf/util/Build

> +++ b/tools/perf/util/Build

> @@ -82,6 +82,8 @@ libperf-$(CONFIG_AUXTRACE) += auxtrace.o

>  libperf-$(CONFIG_AUXTRACE) += intel-pt-decoder/

>  libperf-$(CONFIG_AUXTRACE) += intel-pt.o

>  libperf-$(CONFIG_AUXTRACE) += intel-bts.o

> +libperf-$(CONFIG_AUXTRACE) += arm-spe.o

> +libperf-$(CONFIG_AUXTRACE) += arm-spe-pkt-decoder.o

>  libperf-y += parse-branch-options.o

>  libperf-y += dump-insn.o

>  libperf-y += parse-regs-options.o

> diff --git a/tools/perf/util/arm-spe-pkt-decoder.c b/tools/perf/util/arm-spe-pkt-decoder.c

> new file mode 100644

> index 000000000000..ca3813d5b91a

> --- /dev/null

> +++ b/tools/perf/util/arm-spe-pkt-decoder.c

> @@ -0,0 +1,448 @@

> +/*

> + * ARM Statistical Profiling Extensions (SPE) support

> + * Copyright (c) 2017, ARM Ltd.

> + *

> + * This program is free software; you can redistribute it and/or modify it

> + * under the terms and conditions of the GNU General Public License,

> + * version 2, as published by the Free Software Foundation.

> + *

> + * This program is distributed in the hope it will be useful, but WITHOUT

> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or

> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for

> + * more details.

> + *

> + */

> +

> +#include <stdio.h>

> +#include <string.h>

> +#include <endian.h>

> +#include <byteswap.h>

> +

> +#include "arm-spe-pkt-decoder.h"

> +

> +#define BIT(n)		(1 << (n))

> +

> +#define BIT61		((uint64_t)1 << 61)

> +#define BIT62		((uint64_t)1 << 62)

> +#define BIT63		((uint64_t)1 << 63)

> +

> +#define NS_FLAG		BIT63

> +#define EL_FLAG		(BIT62 | BIT61)

> +

> +#if __BYTE_ORDER == __BIG_ENDIAN

> +#define le16_to_cpu bswap_16

> +#define le32_to_cpu bswap_32

> +#define le64_to_cpu bswap_64

> +#define memcpy_le64(d, s, n) do { \

> +	memcpy((d), (s), (n));    \

> +	*(d) = le64_to_cpu(*(d)); \

> +} while (0)

> +#else

> +#define le16_to_cpu

> +#define le32_to_cpu

> +#define le64_to_cpu

> +#define memcpy_le64 memcpy

> +#endif

> +

> +static const char * const arm_spe_packet_name[] = {

> +	[ARM_SPE_PAD]		= "PAD",

> +	[ARM_SPE_END]		= "END",

> +	[ARM_SPE_TIMESTAMP]	= "TS",

> +	[ARM_SPE_ADDRESS]	= "ADDR",

> +	[ARM_SPE_COUNTER]	= "LAT",

> +	[ARM_SPE_CONTEXT]	= "CONTEXT",

> +	[ARM_SPE_INSN_TYPE]	= "INSN-TYPE",

> +	[ARM_SPE_EVENTS]	= "EVENTS",

> +	[ARM_SPE_DATA_SOURCE]	= "DATA-SOURCE",

> +};

> +

> +const char *arm_spe_pkt_name(enum arm_spe_pkt_type type)

> +{

> +	return arm_spe_packet_name[type];

> +}

> +

> +/* return ARM SPE payload size from its encoding:

> + * 00 : byte

> + * 01 : halfword (2)

> + * 10 : word (4)

> + * 11 : doubleword (8)

> + */

> +static int payloadlen(unsigned char byte)

> +{

> +	return 1 << ((byte & 0x30) >> 4);

> +}

> +

> +static int arm_spe_get_pad(struct arm_spe_pkt *packet)

> +{

> +	packet->type = ARM_SPE_PAD;

> +	return 1;

> +}

> +

> +static int arm_spe_get_alignment(const unsigned char *buf, size_t len,

> +				 struct arm_spe_pkt *packet)

> +{

> +	unsigned int alignment = 1 << ((buf[0] & 0xf) + 1);

> +

> +	if (len < alignment)

> +		return ARM_SPE_NEED_MORE_BYTES;

> +

> +	packet->type = ARM_SPE_PAD;

> +	return alignment - (((uint64_t)buf) & (alignment - 1));

> +}

> +

> +static int arm_spe_get_end(struct arm_spe_pkt *packet)

> +{

> +	packet->type = ARM_SPE_END;

> +	return 1;

> +}

> +

> +static int arm_spe_get_timestamp(const unsigned char *buf, size_t len,

> +				 struct arm_spe_pkt *packet)

> +{

> +	if (len < 8)

> +		return ARM_SPE_NEED_MORE_BYTES;

> +

> +	packet->type = ARM_SPE_TIMESTAMP;

> +	memcpy_le64(&packet->payload, buf + 1, 8);

> +

> +	return 1 + 8;

> +}

> +

> +static int arm_spe_get_events(const unsigned char *buf, size_t len,

> +			      struct arm_spe_pkt *packet)

> +{

> +	unsigned int events_len = payloadlen(buf[0]);

> +

> +	if (len < events_len)

> +		return ARM_SPE_NEED_MORE_BYTES;


Isn't len the size of the whole buffer? So isn't this failing to account
for the header byte?

> +

> +	packet->type = ARM_SPE_EVENTS;

> +	packet->index = events_len;


Huh? The events packet has no "index" field, so why do we need this?

> +	switch (events_len) {

> +	case 1: packet->payload = *(uint8_t *)(buf + 1); break;

> +	case 2: packet->payload = le16_to_cpu(*(uint16_t *)(buf + 1)); break;

> +	case 4: packet->payload = le32_to_cpu(*(uint32_t *)(buf + 1)); break;

> +	case 8: packet->payload = le64_to_cpu(*(uint64_t *)(buf + 1)); break;

> +	default: return ARM_SPE_BAD_PACKET;

> +	}

> +

> +	return 1 + events_len;

> +}

> +

> +static int arm_spe_get_data_source(const unsigned char *buf,

> +				   struct arm_spe_pkt *packet)

> +{

> +	int len = payloadlen(buf[0]);

> +

> +	packet->type = ARM_SPE_DATA_SOURCE;

> +	if (len == 1)

> +		packet->payload = buf[1];

> +	else if (len == 2)

> +		packet->payload = le16_to_cpu(*(uint16_t *)(buf + 1));

> +

> +	return 1 + len;

> +}


For those packets with a payload, the header has a uniform format
describing the payload size. Given that, can't we make the payload
extraction generic, regardless of the packet type?

e.g. something like:

static int arm_spe_get_payload(const unsigned char *buf, size_t len,
			       struct arm_spe_pkt *packet)
{
	<determine paylaod size>
	<length check>
	<switch>
	<return nr consumed bytes (inc header), or error>
}

static int arm_spe_get_events(const unsigned char *buf, size_t len,
			      struct arm_spe_pkt *packet)
{
	packet->type = ARM_SPE_EVENTS;
	return arm_spe_get_payload(buf, len, packet);
}

static int arm_spe_get_data_source(const unsigned char *buf,
				   struct arm_spe_pkt *packet)
{
	packet->type = ARM_SPE_DATA_SOURCE;
	return arm_spe_get_payload(buf, len, packet);
}

... and so on for the other packets with a payload.

> +static int arm_spe_do_get_packet(const unsigned char *buf, size_t len,

> +				 struct arm_spe_pkt *packet)

> +{

> +	unsigned int byte;

> +

> +	memset(packet, 0, sizeof(struct arm_spe_pkt));

> +

> +	if (!len)

> +		return ARM_SPE_NEED_MORE_BYTES;

> +

> +	byte = buf[0];

> +	if (byte == 0)

> +		return arm_spe_get_pad(packet);

> +	else if (byte == 1) /* no timestamp at end of record */

> +		return arm_spe_get_end(packet);

> +	else if (byte & 0xc0 /* 0y11000000 */) {

> +		if (byte & 0x80) {

> +			/* 0x38 is 0y00111000 */

> +			if ((byte & 0x38) == 0x30) /* address packet (short) */

> +				return arm_spe_get_addr(buf, len, 0, packet);

> +			if ((byte & 0x38) == 0x18) /* counter packet (short) */

> +				return arm_spe_get_counter(buf, len, 0, packet);

> +		} else

> +			if (byte == 0x71)

> +				return arm_spe_get_timestamp(buf, len, packet);

> +			else if ((byte & 0xf) == 0x2)

> +				return arm_spe_get_events(buf, len, packet);

> +			else if ((byte & 0xf) == 0x3)

> +				return arm_spe_get_data_source(buf, packet);

> +			else if ((byte & 0x3c) == 0x24)

> +				return arm_spe_get_context(buf, len, packet);

> +			else if ((byte & 0x3c) == 0x8)

> +				return arm_spe_get_insn_type(buf, packet);


Could we have some mnemonics for these?

e.g.

#define SPE_HEADER0_PAD		0x0
#define SPE_HEADER0_END		0x1

#define SPE_HEADER0_EVENTS	0x42
#define SPE_HEADER0_EVENTS_MASK	0xcf

if (byte == SPE_HEADER0_PAD) { 
	...
} else if (byte == SPE_HEADER0_END) {
	...
} else if ((byte & SPE_HEADER0_EVENTS_MASK) == SPE_HEADER0_EVENTS) {
	...
}

... which could even be turned into something table-driven.

> +	} else if ((byte & 0xe0) == 0x20 /* 0y00100000 */) {

> +		/* 16-bit header */

> +		byte = buf[1];

> +		if (byte == 0)

> +			return arm_spe_get_alignment(buf, len, packet);

> +		else if ((byte & 0xf8) == 0xb0)

> +			return arm_spe_get_addr(buf, len, 1, packet);

> +		else if ((byte & 0xf8) == 0x98)

> +			return arm_spe_get_counter(buf, len, 1, packet);

> +	}

> +

> +	return ARM_SPE_BAD_PACKET;

> +}

> +

> +int arm_spe_get_packet(const unsigned char *buf, size_t len,

> +		       struct arm_spe_pkt *packet)

> +{

> +	int ret;

> +

> +	ret = arm_spe_do_get_packet(buf, len, packet);

> +	if (ret > 0) {

> +		while (ret < 1 && len > (size_t)ret && !buf[ret])

> +			ret += 1;

> +	}


What is this trying to do?

> +	return ret;

> +}

> +

> +int arm_spe_pkt_desc(const struct arm_spe_pkt *packet, char *buf,

> +		     size_t buf_len)

> +{

> +	int ret, ns, el, index = packet->index;

> +	unsigned long long payload = packet->payload;

> +	const char *name = arm_spe_pkt_name(packet->type);

> +

> +	switch (packet->type) {

> +	case ARM_SPE_BAD:

> +	case ARM_SPE_PAD:

> +	case ARM_SPE_END:

> +		return snprintf(buf, buf_len, "%s", name);

> +	case ARM_SPE_EVENTS: {


[...]

> +	case ARM_SPE_DATA_SOURCE:

> +	case ARM_SPE_TIMESTAMP:

> +		return snprintf(buf, buf_len, "%s %lld", name, payload);

> +	case ARM_SPE_ADDRESS:

> +		switch (index) {

> +		case 0:

> +		case 1: ns = !!(packet->payload & NS_FLAG);

> +			el = (packet->payload & EL_FLAG) >> 61;

> +			payload &= ~(0xffULL << 56);

> +			return snprintf(buf, buf_len, "%s %llx el%d ns=%d",

> +				        (index == 1) ? "TGT" : "PC", payload, el, ns);


For TTBR1 addresses, this ends up losing the leading 0xff, giving us
invalid addresses, which look odd.

Can we please sign-extend bit 55 so that this gives us valid addresses
regardless of TTBR?

Could we please add a '0x' prefix to hex numbers, and use 0x%016llx so
that things get padded consistently?

Thanks,
Mark.
Kim Phillips July 6, 2017, 5:08 p.m. UTC | #14
On Thu, 29 Jun 2017 12:11:02 +0100
Mark Rutland <mark.rutland@arm.com> wrote:

> On Wed, Jun 28, 2017 at 07:59:53PM -0500, Kim Phillips wrote:

> > arm-trusted-firmware, btw, has just been updated to enable SPE at lower

> > ELs, so I don't have to use a hacked-up version anymore.

> > 

> > I also updated my BL33 to the latest upstream u-boot

> > vexpress_aemv8a_dram_defconfig, and at least now the kernel continues

> > to boot, even though it can't bring up 6 of the 7 secondary CPUs.

> 

> Do you mean that you replaced the bootwrapper with u-boot?


no, sorry, arm-trusted-firmware wants a BL33 image, which u-boot
provides.

Sorry but I guess I'm not using the bootwrapper, and we are launching
the model in completely different manners.

The bootwrapper input is a kernel and a dtb, and it emits a dtb and a
linux-system.axf file, the latter of which I don't see how to launch
the model with:  The model script I'm using uses a kernel, dtb, and an
fip.bin and bl1.bin.

Can you share how you invoke the model, presumably with the .axf file?  

> The --with-cpu-ids option *adds* CPU nodes, but leaves the broken ones,

> and your CPU phandles (and PPI partitions for the SPE node(s)) will all

> be wrong. Linux is still seeing those erroneous CPU nodes (presumably

> taking Linux CPU ids 2-7).

> 

> Generally, --with-cpu-ids doesn't work as you'd expect, which is why it

> got removed in favour of assuming an initally correct DT.

> 

> Please fix the DT instead. With a fixed DT, and commit ccdc936924b3682d,

> the bootwrapper won't further mangle your DT.


OK, changing the CPU IDs alone didn't work (kernel didn't even say hi),
but taking what commit ccdc936924b3682d does to the cpu_on/off
properties makes it work for my arm-trusted-firmware (non-boot-wrapper)
invocation, so I have to use the wrapper if I change my DT CPUs for the
time being.

So I'm OK now for at least the two-partition, four CPUs each setup, but
for topologies as described in Marc/Will's fvp-base.dts commit, I don't
see how to run without knowing how to make the axf file work with the
model, i.e., solely with the boot-wrapper.

Thanks,

Kim
Kim Phillips July 18, 2017, 12:48 a.m. UTC | #15
On Fri, 30 Jun 2017 15:02:41 +0100
Mark Rutland <mark.rutland@arm.com> wrote:

> Hi Kim,


Hi Mark,

> On Wed, Jun 28, 2017 at 08:43:10PM -0500, Kim Phillips wrote:

<snip>
> >  	if (evlist) {

> >  		evlist__for_each_entry(evlist, evsel) {

> >  			if (cs_etm_pmu &&

> >  			    evsel->attr.type == cs_etm_pmu->type)

> >  				found_etm = true;

> > +			if (arm_spe_pmu &&

> > +			    evsel->attr.type == arm_spe_pmu->type)

> > +				found_spe = true;

> 

> Given ARM_SPE_PMU_NAME is defined as "arm_spe_0", this won't detect all

> SPE PMUs in heterogeneous setups (e.g. this'll fail to match "arm_spe_1"

> and so on).

> 

> Can we not find all PMUs with a "arm_spe_" prefix?

> 

> ... or, taking a step back, do we need some sysfs "class" attribute to

> identify multi-instance PMUs?


Since there is one SPE per core, and it looks like the buffer full
interrupt line is the only difference between the SPE device node
specification in the device tree, I guess I don't understand why the
driver doesn't accept a singular "arm_spe" from the tool, and manage
interrupt handling accordingly.  Also, if a set of CPUs are missing SPE
support, and the user doesn't explicitly define a CPU affinity to
outside that partition, then decline to run, stating why.

This would also obviously help a lot from an ease-of-use perspective.

<snip>

> > +struct arm_spe_recording {

> > +	struct auxtrace_record		itr;

> > +	struct perf_pmu			*arm_spe_pmu;

> > +	struct perf_evlist		*evlist;

> > +};

> 

> A user may wich to record trace on separate uarches simultaneously, so

> having a singleton arm_spe_pmu for the entire evlist doesn't seem right.

> 

> e.g. I don't see why we should allow a user to do:

> 

> ./perf record -c 1024 \

> 	-e arm_spe_0/ts_enable=1,pa_enable=0/ \

> 	-e arm_spe_1/ts_enable=1,pa_enable=0/ \

> 	${WORKLOAD}

> 

> ... which perf-record seems to accept today, but I don't seem to get any

> aux trace, regardless of whether I taskset the entire thing to any

> particular CPU.


The implementation-defined components of the SPE don't affect a
'record'/capture operation, so a single arm_spe should be fine with
separate uarch setups.

> It also seems that the events aren't task bound, judging by -vv output:


<snip>

> I see something similar (i.e. perf doesn't try to bind the events to the

> workload PID) when I try to record with only a single PMU. In that case,

> perf-record blows up because it can't handle events on a subset of CPUs

> (though it should be able to):

> 

> nanook@torsk:~$ ./perf record -vv -c 1024 -e arm_spe_0/ts_enable=1,pa_enable=0/ true


<snip>

> mmap size 266240B

> AUX area mmap length 131072

> perf event ring buffer mmapped per cpu

> failed to mmap AUX area

> failed to mmap with 524 (INTERNAL ERROR: strerror_r(524, 0xffffc8596038, 512)=22)


FWIW, that INTERNAL ERROR is fixed by this commit, btw:

commit 8a1898db51a3390241cd5fae267dc8aaa9db0f8b
Author: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Date:   Tue Jun 20 12:26:39 2017 +0200

    perf/aux: Correct return code of rb_alloc_aux() if !has_aux(ev)

So now it should return:

failed to mmap with 95 (Operation not supported)

> ... with a SW event, this works as expected, being bound to the workload PID:


<snip>

> ... so I guess this has something to do with the way the tool tries to

> use the cpumask, maknig the wrong assumption that this implies

> system-wide collection is mandatory / expected.


Right, I'll take a look at it.

> > +	if (!opts->full_auxtrace)

> > +		return 0;

> > +

> > +	/* We are in full trace mode but '-m,xyz' wasn't specified */

> > +	if (opts->full_auxtrace && !opts->auxtrace_mmap_pages) {

> > +		if (privileged) {

> > +			opts->auxtrace_mmap_pages = MiB(4) / page_size;

> > +		} else {

> > +			opts->auxtrace_mmap_pages = KiB(128) / page_size;

> > +			if (opts->mmap_pages == UINT_MAX)

> > +				opts->mmap_pages = KiB(256) / page_size;

> > +		}

> > +	}

> > +

> > +	/* Validate auxtrace_mmap_pages */

> > +	if (opts->auxtrace_mmap_pages) {

> > +		size_t sz = opts->auxtrace_mmap_pages * (size_t)page_size;

> > +		size_t min_sz = KiB(8);

> > +

> > +		if (sz < min_sz || !is_power_of_2(sz)) {

> > +			pr_err("Invalid mmap size for ARM SPE: must be at least %zuKiB and a power of 2\n",

> > +			       min_sz / 1024);

> > +			return -EINVAL;

> > +		}

> > +	}

> > +

> > +	/*

> > +	 * To obtain the auxtrace buffer file descriptor, the auxtrace event

> > +	 * must come first.

> > +	 */

> > +	perf_evlist__to_front(evlist, arm_spe_evsel);

> 

> Huh? *what* needs the auxtrace buffer fd?

> 

> This seems really fragile. Can't we store this elsewhere?


It's copied from the bts code, and the other auxtrace record users do
the same; it looks like auxtrace record has implicit dependencies on it?

> > +static u64 arm_spe_reference(struct auxtrace_record *itr __maybe_unused)

> > +{

> > +	u64 ts;

> > +

> > +	asm volatile ("isb; mrs %0, cntvct_el0" : "=r" (ts));

> > +

> > +	return ts;

> > +}

> 

> I do not think it's a good idea to read the counter directly like this.

> 

> What is this "reference" intended to be meaningful relative to?


AFAICT, it's just a nonce the perf tool uses to track unique events,
and I thought this better than the ETM driver's heavier get-random
implementation.

> Why do we need to do this in userspace?

> 

> Can we not ask the kernel to output timestamps instead?


Why?  This gets the job done faster.

> > +static int arm_spe_get_events(const unsigned char *buf, size_t len,

> > +			      struct arm_spe_pkt *packet)

> > +{

> > +	unsigned int events_len = payloadlen(buf[0]);

> > +

> > +	if (len < events_len)

> > +		return ARM_SPE_NEED_MORE_BYTES;

> 

> Isn't len the size of the whole buffer? So isn't this failing to account

> for the header byte?


well spotted; I changed /events_len/1 + events_len/.

> > +	packet->type = ARM_SPE_EVENTS;

> > +	packet->index = events_len;

> 

> Huh? The events packet has no "index" field, so why do we need this?


To identify Events with a less number of comparisons in arm_spe_pkt_desc():
E.g., the LLC-ACCESS, LLC-REFILL, and REMOTE-ACCESS events are
identified iff index > 1.

> > +	switch (events_len) {

> > +	case 1: packet->payload = *(uint8_t *)(buf + 1); break;

> > +	case 2: packet->payload = le16_to_cpu(*(uint16_t *)(buf + 1)); break;

> > +	case 4: packet->payload = le32_to_cpu(*(uint32_t *)(buf + 1)); break;

> > +	case 8: packet->payload = le64_to_cpu(*(uint64_t *)(buf + 1)); break;

> > +	default: return ARM_SPE_BAD_PACKET;

> > +	}

> > +

> > +	return 1 + events_len;

> > +}

> > +

> > +static int arm_spe_get_data_source(const unsigned char *buf,

> > +				   struct arm_spe_pkt *packet)

> > +{

> > +	int len = payloadlen(buf[0]);

> > +

> > +	packet->type = ARM_SPE_DATA_SOURCE;

> > +	if (len == 1)

> > +		packet->payload = buf[1];

> > +	else if (len == 2)

> > +		packet->payload = le16_to_cpu(*(uint16_t *)(buf + 1));

> > +

> > +	return 1 + len;

> > +}

> 

> For those packets with a payload, the header has a uniform format

> describing the payload size. Given that, can't we make the payload

> extraction generic, regardless of the packet type?

> 

> e.g. something like:

> 

> static int arm_spe_get_payload(const unsigned char *buf, size_t len,

> 			       struct arm_spe_pkt *packet)

> {

> 	<determine paylaod size>

> 	<length check>

> 	<switch>

> 	<return nr consumed bytes (inc header), or error>

> }

> 

> static int arm_spe_get_events(const unsigned char *buf, size_t len,

> 			      struct arm_spe_pkt *packet)

> {

> 	packet->type = ARM_SPE_EVENTS;

> 	return arm_spe_get_payload(buf, len, packet);

> }

> 

> static int arm_spe_get_data_source(const unsigned char *buf,

> 				   struct arm_spe_pkt *packet)

> {

> 	packet->type = ARM_SPE_DATA_SOURCE;

> 	return arm_spe_get_payload(buf, len, packet);

> }

> 

> ... and so on for the other packets with a payload.


done for TIMESTAMP, EVENTS, DATA_SOURCE, CONTEXT, INSN_TYPE.  It
wouldn't fit ADDR and COUNTER well since they can occur in an
extended-header, and their lengths are encoded differently, and are
fixed anyway.

> > +static int arm_spe_do_get_packet(const unsigned char *buf, size_t len,

> > +				 struct arm_spe_pkt *packet)

> > +{

> > +	unsigned int byte;

> > +

> > +	memset(packet, 0, sizeof(struct arm_spe_pkt));

> > +

> > +	if (!len)

> > +		return ARM_SPE_NEED_MORE_BYTES;

> > +

> > +	byte = buf[0];

> > +	if (byte == 0)

> > +		return arm_spe_get_pad(packet);

> > +	else if (byte == 1) /* no timestamp at end of record */

> > +		return arm_spe_get_end(packet);

> > +	else if (byte & 0xc0 /* 0y11000000 */) {

> > +		if (byte & 0x80) {

> > +			/* 0x38 is 0y00111000 */

> > +			if ((byte & 0x38) == 0x30) /* address packet (short) */

> > +				return arm_spe_get_addr(buf, len, 0, packet);

> > +			if ((byte & 0x38) == 0x18) /* counter packet (short) */

> > +				return arm_spe_get_counter(buf, len, 0, packet);

> > +		} else

> > +			if (byte == 0x71)

> > +				return arm_spe_get_timestamp(buf, len, packet);

> > +			else if ((byte & 0xf) == 0x2)

> > +				return arm_spe_get_events(buf, len, packet);

> > +			else if ((byte & 0xf) == 0x3)

> > +				return arm_spe_get_data_source(buf, packet);

> > +			else if ((byte & 0x3c) == 0x24)

> > +				return arm_spe_get_context(buf, len, packet);

> > +			else if ((byte & 0x3c) == 0x8)

> > +				return arm_spe_get_insn_type(buf, packet);

> 

> Could we have some mnemonics for these?

> 

> e.g.

> 

> #define SPE_HEADER0_PAD		0x0

> #define SPE_HEADER0_END		0x1

> 

> #define SPE_HEADER0_EVENTS	0x42

> #define SPE_HEADER0_EVENTS_MASK	0xcf

> 

> if (byte == SPE_HEADER0_PAD) { 

> 	...

> } else if (byte == SPE_HEADER0_END) {

> 	...

> } else if ((byte & SPE_HEADER0_EVENTS_MASK) == SPE_HEADER0_EVENTS) {

> 	...

> }

> 

> ... which could even be turned into something table-driven.


It'd be a pretty sparse table, so I doubt it'd be faster, but if it is,
I'd just as soon leave that type of space tradeoff decision to the
compiler, given its optimization directives.

I'll take a look at replacing the constants that have named equivalents
with their named versions, even though it was pretty clear already what
they denoted, given the name of the function each branch was calling,
and the comments.

> > +	} else if ((byte & 0xe0) == 0x20 /* 0y00100000 */) {

> > +		/* 16-bit header */

> > +		byte = buf[1];

> > +		if (byte == 0)

> > +			return arm_spe_get_alignment(buf, len, packet);

> > +		else if ((byte & 0xf8) == 0xb0)

> > +			return arm_spe_get_addr(buf, len, 1, packet);

> > +		else if ((byte & 0xf8) == 0x98)

> > +			return arm_spe_get_counter(buf, len, 1, packet);

> > +	}

> > +

> > +	return ARM_SPE_BAD_PACKET;

> > +}

> > +

> > +int arm_spe_get_packet(const unsigned char *buf, size_t len,

> > +		       struct arm_spe_pkt *packet)

> > +{

> > +	int ret;

> > +

> > +	ret = arm_spe_do_get_packet(buf, len, packet);

> > +	if (ret > 0) {

> > +		while (ret < 1 && len > (size_t)ret && !buf[ret])

> > +			ret += 1;

> > +	}

> 

> What is this trying to do?


Nothing!  I've since fixed it to prevent multiple contiguous
PADs from coming out on their own lines, and rather accumulate up to 16
(the width of the raw dump format) on one PAD-labeled line, like so:

.  00007ec9:  00 00 00 00 00 00 00 00 00 00                   PAD

instead of this:

.  00007ec9:  00                                              PAD
.  00007eca:  00                                              PAD
.  00007ecb:  00                                              PAD
.  00007ecc:  00                                              PAD
.  00007ecd:  00                                              PAD
.  00007ece:  00                                              PAD
.  00007ecf:  00                                              PAD
.  00007ed0:  00                                              PAD
.  00007ed1:  00                                              PAD
.  00007ed2:  00                                              PAD

thanks for pointing it out.

> > +int arm_spe_pkt_desc(const struct arm_spe_pkt *packet, char *buf,

> > +		     size_t buf_len)

> > +{

> > +	int ret, ns, el, index = packet->index;

> > +	unsigned long long payload = packet->payload;

> > +	const char *name = arm_spe_pkt_name(packet->type);

> > +

> > +	switch (packet->type) {

> > +	case ARM_SPE_BAD:

> > +	case ARM_SPE_PAD:

> > +	case ARM_SPE_END:

> > +		return snprintf(buf, buf_len, "%s", name);

> > +	case ARM_SPE_EVENTS: {

> 

> [...]

> 

> > +	case ARM_SPE_DATA_SOURCE:

> > +	case ARM_SPE_TIMESTAMP:

> > +		return snprintf(buf, buf_len, "%s %lld", name, payload);

> > +	case ARM_SPE_ADDRESS:

> > +		switch (index) {

> > +		case 0:

> > +		case 1: ns = !!(packet->payload & NS_FLAG);

> > +			el = (packet->payload & EL_FLAG) >> 61;

> > +			payload &= ~(0xffULL << 56);

> > +			return snprintf(buf, buf_len, "%s %llx el%d ns=%d",

> > +				        (index == 1) ? "TGT" : "PC", payload, el, ns);

> 

> For TTBR1 addresses, this ends up losing the leading 0xff, giving us

> invalid addresses, which look odd.

> 

> Can we please sign-extend bit 55 so that this gives us valid addresses

> regardless of TTBR?


I'll take a look at doing this once I get consistent output from an
implementation.

> Could we please add a '0x' prefix to hex numbers, and use 0x%016llx so

> that things get padded consistently?


I've added the 0x prefix, but prefer to not fix the length to 016: I
don't see any direct benefit, rather see benefits to having the length
vary, for output size control and less obvious reasons, e.g., sorting
address lines by their length to get a sense of address groups caught
during the run.  FWIW, Intel doesn't do the 016 either.

If I've omitted a response to the other comments, it's because they are
being addressed.

Thanks!

Kim
Mark Rutland Aug. 18, 2017, 4:59 p.m. UTC | #16
Hi Kim,

Sorry for the late reply. I see you've send an updated version. I'm
replying here first so as to answer your queries, and I intend to look
at the updated patch shortly.

On Mon, Jul 17, 2017 at 07:48:22PM -0500, Kim Phillips wrote:
> On Fri, 30 Jun 2017 15:02:41 +0100 Mark Rutland <mark.rutland@arm.com> wrote:

> > On Wed, Jun 28, 2017 at 08:43:10PM -0500, Kim Phillips wrote:

> <snip>

> > >  	if (evlist) {

> > >  		evlist__for_each_entry(evlist, evsel) {

> > >  			if (cs_etm_pmu &&

> > >  			    evsel->attr.type == cs_etm_pmu->type)

> > >  				found_etm = true;

> > > +			if (arm_spe_pmu &&

> > > +			    evsel->attr.type == arm_spe_pmu->type)

> > > +				found_spe = true;

> > 

> > Given ARM_SPE_PMU_NAME is defined as "arm_spe_0", this won't detect all

> > SPE PMUs in heterogeneous setups (e.g. this'll fail to match "arm_spe_1"

> > and so on).

> > 

> > Can we not find all PMUs with a "arm_spe_" prefix?

> > 

> > ... or, taking a step back, do we need some sysfs "class" attribute to

> > identify multi-instance PMUs?

> 

> Since there is one SPE per core, and it looks like the buffer full

> interrupt line is the only difference between the SPE device node

> specification in the device tree, I guess I don't understand why the

> driver doesn't accept a singular "arm_spe" from the tool, and manage

> interrupt handling accordingly.


There are also differences which can be probed from the device, which
are not described in the DT (but are described in sysfs). Some of these
are exposed under sysfs.

There may be further differences in subsequent revisions of the
architecture, too. So the safest bet is to expose them separately, as we
do for other CPU-affine PMUs in heterogeneous systems.

> Also, if a set of CPUs are missing SPE support, and the user doesn't

> explicitly define a CPU affinity to outside that partition, then

> decline to run, stating why.


It's possible for userspace to do this regardless; look for the set of
SPE PMUs, and then look at their masks.

[...]

> > > +	/*

> > > +	 * To obtain the auxtrace buffer file descriptor, the auxtrace event

> > > +	 * must come first.

> > > +	 */

> > > +	perf_evlist__to_front(evlist, arm_spe_evsel);

> > 

> > Huh? *what* needs the auxtrace buffer fd?

> > 

> > This seems really fragile. Can't we store this elsewhere?

> 

> It's copied from the bts code, and the other auxtrace record users do

> the same; it looks like auxtrace record has implicit dependencies on it?


Is it definitely required? What happens if this isn't done?

> > > +static u64 arm_spe_reference(struct auxtrace_record *itr __maybe_unused)

> > > +{

> > > +	u64 ts;

> > > +

> > > +	asm volatile ("isb; mrs %0, cntvct_el0" : "=r" (ts));

> > > +

> > > +	return ts;

> > > +}

> > 

> > I do not think it's a good idea to read the counter directly like this.

> > 

> > What is this "reference" intended to be meaningful relative to?

> 

> AFAICT, it's just a nonce the perf tool uses to track unique events,

> and I thought this better than the ETM driver's heavier get-random

> implementation.

> 

> > Why do we need to do this in userspace?

> > 

> > Can we not ask the kernel to output timestamps instead?

> 

> Why?  This gets the job done faster.


I had assumed that this needed to be correlated with the timestamps in
the event.

If this is a nonce, please don't read the counter directly in this way.
It may be trapped/emulated by a higher EL, making it very heavyweight.
The counter is only exposed so that the VDSO can use it, and that will
avoid using it in cases where it is unsafe.

[...]

> > > +static int arm_spe_get_events(const unsigned char *buf, size_t len,

> > > +			      struct arm_spe_pkt *packet)

> > > +{

> > > +	unsigned int events_len = payloadlen(buf[0]);

> > > +

> > > +	if (len < events_len)

> > > +		return ARM_SPE_NEED_MORE_BYTES;

> > 

> > Isn't len the size of the whole buffer? So isn't this failing to account

> > for the header byte?

> 

> well spotted; I changed /events_len/1 + events_len/.

> 

> > > +	packet->type = ARM_SPE_EVENTS;

> > > +	packet->index = events_len;

> > 

> > Huh? The events packet has no "index" field, so why do we need this?

> 

> To identify Events with a less number of comparisons in arm_spe_pkt_desc():

> E.g., the LLC-ACCESS, LLC-REFILL, and REMOTE-ACCESS events are

> identified iff index > 1.


It would be clearer to do the additional comparisons there.

Does this make a measureable difference in practice?

> > > +	switch (events_len) {

> > > +	case 1: packet->payload = *(uint8_t *)(buf + 1); break;

> > > +	case 2: packet->payload = le16_to_cpu(*(uint16_t *)(buf + 1)); break;

> > > +	case 4: packet->payload = le32_to_cpu(*(uint32_t *)(buf + 1)); break;

> > > +	case 8: packet->payload = le64_to_cpu(*(uint64_t *)(buf + 1)); break;

> > > +	default: return ARM_SPE_BAD_PACKET;

> > > +	}

> > > +

> > > +	return 1 + events_len;

> > > +}

> > > +

> > > +static int arm_spe_get_data_source(const unsigned char *buf,

> > > +				   struct arm_spe_pkt *packet)

> > > +{

> > > +	int len = payloadlen(buf[0]);

> > > +

> > > +	packet->type = ARM_SPE_DATA_SOURCE;

> > > +	if (len == 1)

> > > +		packet->payload = buf[1];

> > > +	else if (len == 2)

> > > +		packet->payload = le16_to_cpu(*(uint16_t *)(buf + 1));

> > > +

> > > +	return 1 + len;

> > > +}

> > 

> > For those packets with a payload, the header has a uniform format

> > describing the payload size. Given that, can't we make the payload

> > extraction generic, regardless of the packet type?

> > 

> > e.g. something like:

> > 

> > static int arm_spe_get_payload(const unsigned char *buf, size_t len,

> > 			       struct arm_spe_pkt *packet)

> > {

> > 	<determine paylaod size>

> > 	<length check>

> > 	<switch>

> > 	<return nr consumed bytes (inc header), or error>

> > }

> > 

> > static int arm_spe_get_events(const unsigned char *buf, size_t len,

> > 			      struct arm_spe_pkt *packet)

> > {

> > 	packet->type = ARM_SPE_EVENTS;

> > 	return arm_spe_get_payload(buf, len, packet);

> > }

> > 

> > static int arm_spe_get_data_source(const unsigned char *buf,

> > 				   struct arm_spe_pkt *packet)

> > {

> > 	packet->type = ARM_SPE_DATA_SOURCE;

> > 	return arm_spe_get_payload(buf, len, packet);

> > }

> > 

> > ... and so on for the other packets with a payload.

> 

> done for TIMESTAMP, EVENTS, DATA_SOURCE, CONTEXT, INSN_TYPE.  It

> wouldn't fit ADDR and COUNTER well since they can occur in an

> extended-header, and their lengths are encoded differently, and are

> fixed anyway.


Ok. That sounds good to me.

[...]

> > > +int arm_spe_pkt_desc(const struct arm_spe_pkt *packet, char *buf,

> > > +		     size_t buf_len)

> > > +{

> > > +	int ret, ns, el, index = packet->index;

> > > +	unsigned long long payload = packet->payload;

> > > +	const char *name = arm_spe_pkt_name(packet->type);

> > > +

> > > +	switch (packet->type) {


> > > +	case ARM_SPE_ADDRESS:

> > > +		switch (index) {

> > > +		case 0:

> > > +		case 1: ns = !!(packet->payload & NS_FLAG);

> > > +			el = (packet->payload & EL_FLAG) >> 61;

> > > +			payload &= ~(0xffULL << 56);

> > > +			return snprintf(buf, buf_len, "%s %llx el%d ns=%d",

> > > +				        (index == 1) ? "TGT" : "PC", payload, el, ns);


> > Could we please add a '0x' prefix to hex numbers, and use 0x%016llx so

> > that things get padded consistently?

> 

> I've added the 0x prefix, but prefer to not fix the length to 016: I

> don't see any direct benefit, rather see benefits to having the length

> vary, for output size control and less obvious reasons, e.g., sorting

> address lines by their length to get a sense of address groups caught

> during the run.  FWIW, Intel doesn't do the 016 either.


With padding, sorting will also place address groups together, so I'm
not sure I follow.

Padding makes it *much* easier to scan over the output by eye, as
columns of event data will always share the same alignment.

> If I've omitted a response to the other comments, it's because they

> are being addressed.


Cool!

Thanks,
Mark.
Mark Rutland Aug. 18, 2017, 5:36 p.m. UTC | #17
Hi Kim,

On Thu, Aug 17, 2017 at 10:11:50PM -0500, Kim Phillips wrote:
> Hi Mark, I've tried to proceed as much as possible without your

> response, so if you still have comments to my above comments, please

> comment in-line above, otherwise review the v2 patch below?


Apologies again for the late response, and thanks for the updated patch!

[...]

> From 464d943dcac15d946863399001174e4dc4e00594 Mon Sep 17 00:00:00 2001

> From: Kim Phillips <kim.phillips@arm.com>

> Date: Wed, 8 Feb 2017 17:11:57 -0600

> Subject: [PATCH v2] perf tools: Add ARM Statistical Profiling Extensions

>  (SPE) support

> 

> 'perf record' and 'perf report --dump-raw-trace' supported in this release

> 

> Example usage:

> 

> taskset -c 2 ./perf record -C 2 -c 1024 -e arm_spe_0/ts_enable=1,pa_enable=1/ \

> 		dd if=/dev/zero of=/dev/null count=10000

> 

> perf report --dump-raw-trace

> 

> Note that the perf.data file is portable, so the report can be run on another

> architecture host if necessary.

> 

> Output will contain raw SPE data and its textual representation, such as:

> 

> 0xc7d0 [0x30]: PERF_RECORD_AUXTRACE size: 0x82f70  offset: 0  ref: 0x1e947e88189  idx: 0  tid: -1  cpu: 2

> .

> . ... ARM SPE data: size 536432 bytes

> .  00000000:  4a 01                                           B COND

> .  00000002:  b1 00 00 00 00 00 00 00 80                      TGT 0 el0 ns=1

> .  0000000b:  42 42                                           RETIRED NOT-TAKEN

> .  0000000d:  b0 20 41 c0 ad ff ff 00 80                      PC ffffadc04120 el0 ns=1

> .  00000016:  98 00 00                                        LAT 0 TOT

> .  00000019:  71 80 3e f7 46 e9 01 00 00                      TS 2101429616256

> .  00000022:  49 01                                           ST

> .  00000024:  b2 50 bd ba 73 00 80 ff ff                      VA ffff800073babd50

> .  0000002d:  b3 50 bd ba f3 00 00 00 80                      PA f3babd50 ns=1

> .  00000036:  9a 00 00                                        LAT 0 XLAT

> .  00000039:  42 16                                           RETIRED L1D-ACCESS TLB-ACCESS

> .  0000003b:  b0 8c b4 1e 08 00 00 ff ff                      PC ff0000081eb48c el3 ns=1

> .  00000044:  98 00 00                                        LAT 0 TOT

> .  00000047:  71 cc 44 f7 46 e9 01 00 00                      TS 2101429617868

> .  00000050:  48 00                                           INSN-OTHER

> .  00000052:  42 02                                           RETIRED

> .  00000054:  b0 58 54 1f 08 00 00 ff ff                      PC ff0000081f5458 el3 ns=1

> .  0000005d:  98 00 00                                        LAT 0 TOT

> .  00000060:  71 cc 44 f7 46 e9 01 00 00                      TS 2101429617868


So FWIW, I think this is a good example of why that padding I requested
last time round matters.

For the first PC packet, I had to count the number of characters to see
that it was a TTBR0 address, which is made much clearer with leading
padding, as 0000ffffadc04120. With the addresses padded, the EL and NS
fields would also be aligned, making it *much* easier to scan by eye.

[...]

> - multiple SPE clusters/domains support pending potential driver changes?


As covered in my other reply, I don't believe that the driver is going
to change in this regard. Userspace will need to handle multiple SPE
instances.

I'll ignore that in the code below for now.

> - CPU mask / new record behaviour bisected to commit e3ba76deef23064 "perf

>   tools: Force uncore events to system wide monitoring".  Waiting to hear back

>   on why driver can't do system wide monitoring, even across PPIs, by e.g.,

>   sharing the SPE interrupts in one handler (SPE's don't differ in this record

>   regard).


Could you elaborate on this? I don't follow the interrupt handler
comments.

[...]

> +static u64 arm_spe_reference(struct auxtrace_record *itr __maybe_unused)

> +{

> +	u64 ts;

> +

> +	asm volatile ("isb; mrs %0, cntvct_el0" : "=r" (ts));

> +

> +	return ts;

> +}


As covered in my other reply, please don't use the counter for this.

It sounds like we need a simple/generic function to get a nonce, that
we could share with the ETM code.

[...]

> +#define BIT(n)		(1 << (n))

> +

> +#define BIT61		((uint64_t)1 << 61)

> +#define BIT62		((uint64_t)1 << 62)

> +#define BIT63		((uint64_t)1 << 63)

> +

> +#define NS_FLAG		BIT63

> +#define EL_FLAG		(BIT62 | BIT61)


This would be far simpler as:

#define	BIT(n)	(1UL << (n))

#define NS_FLAG		BIT(63)
#define EL_FLAG		(BIT(62) | BIT(61))

[...]

> +/* return ARM SPE payload size from its encoding:

> + * 00 : byte

> + * 01 : halfword (2)

> + * 10 : word (4)

> + * 11 : doubleword (8)

> + */

> +static int payloadlen(unsigned char byte)

> +{

> +	return 1 << ((byte & 0x30) >> 4);

> +}


It might be worth stating in the comment that this is encoded in bits
5:4 of the byte, since otherwise it looks odd.

> +

> +static int arm_spe_get_payload(const unsigned char *buf, size_t len,

> +			       struct arm_spe_pkt *packet)

> +{

> +	size_t payload_len = payloadlen(buf[0]);

> +

> +	if (len < 1 + payload_len)

> +		return ARM_SPE_NEED_MORE_BYTES;


If you did `buf++` here, you could avoid the `+ 1` in all the cases below.

> +

> +	switch (payload_len) {

> +	case 1: packet->payload = *(uint8_t *)(buf + 1); break;

> +	case 2: packet->payload = le16_to_cpu(*(uint16_t *)(buf + 1)); break;

> +	case 4: packet->payload = le32_to_cpu(*(uint32_t *)(buf + 1)); break;

> +	case 8: packet->payload = le64_to_cpu(*(uint64_t *)(buf + 1)); break;

> +	default: return ARM_SPE_BAD_PACKET;

> +	}

> +

> +	return 1 + payload_len;

> +}


[...]

> +int arm_spe_get_packet(const unsigned char *buf, size_t len,

> +		       struct arm_spe_pkt *packet)

> +{

> +	int ret;

> +

> +	ret = arm_spe_do_get_packet(buf, len, packet);

> +	if (ret > 0 && packet->type == ARM_SPE_PAD) {

> +		while (ret < 16 && len > (size_t)ret && !buf[ret])

> +			ret += 1;

> +	}

> +	return ret;

> +}


What's this doing? Skipping padding? What's the significance of 16?

> +int arm_spe_pkt_desc(const struct arm_spe_pkt *packet, char *buf,

> +		     size_t buf_len)

> +{

> +	int ret, ns, el, index = packet->index;

> +	unsigned long long payload = packet->payload;

> +	const char *name = arm_spe_pkt_name(packet->type);

> +

> +	switch (packet->type) {

> +	case ARM_SPE_BAD:

> +	case ARM_SPE_PAD:

> +	case ARM_SPE_END:

> +		return snprintf(buf, buf_len, "%s", name);

> +	case ARM_SPE_EVENTS: {

> +		size_t blen = buf_len;

> +

> +		ret = 0;

> +		ret = snprintf(buf, buf_len, "EV");

> +		buf += ret;

> +		blen -= ret;

> +		if (payload & 0x1) {

> +			ret = snprintf(buf, buf_len, " EXCEPTION-GEN");

> +			buf += ret;

> +			blen -= ret;

> +		}

> +		if (payload & 0x2) {

> +			ret = snprintf(buf, buf_len, " RETIRED");

> +			buf += ret;

> +			blen -= ret;

> +		}

> +		if (payload & 0x4) {

> +			ret = snprintf(buf, buf_len, " L1D-ACCESS");

> +			buf += ret;

> +			blen -= ret;

> +		}

> +		if (payload & 0x8) {

> +			ret = snprintf(buf, buf_len, " L1D-REFILL");

> +			buf += ret;

> +			blen -= ret;

> +		}

> +		if (payload & 0x10) {

> +			ret = snprintf(buf, buf_len, " TLB-ACCESS");

> +			buf += ret;

> +			blen -= ret;

> +		}

> +		if (payload & 0x20) {

> +			ret = snprintf(buf, buf_len, " TLB-REFILL");

> +			buf += ret;

> +			blen -= ret;

> +		}

> +		if (payload & 0x40) {

> +			ret = snprintf(buf, buf_len, " NOT-TAKEN");

> +			buf += ret;

> +			blen -= ret;

> +		}

> +		if (payload & 0x80) {

> +			ret = snprintf(buf, buf_len, " MISPRED");

> +			buf += ret;

> +			blen -= ret;

> +		}

> +		if (index > 1) {

> +			if (payload & 0x100) {

> +				ret = snprintf(buf, buf_len, " LLC-ACCESS");

> +				buf += ret;

> +				blen -= ret;

> +			}

> +			if (payload & 0x200) {

> +				ret = snprintf(buf, buf_len, " LLC-REFILL");

> +				buf += ret;

> +				blen -= ret;

> +			}

> +			if (payload & 0x400) {

> +				ret = snprintf(buf, buf_len, " REMOTE-ACCESS");

> +				buf += ret;

> +				blen -= ret;

> +			}

> +		}

> +		if (ret < 0)

> +			return ret;

> +		blen -= ret;

> +		return buf_len - blen;

> +	}


This looks like it could be turned into another switch, sharing the
repeated logic.

Thanks,
Mark.
Kim Phillips Aug. 18, 2017, 10:22 p.m. UTC | #18
On Fri, 18 Aug 2017 17:59:25 +0100
Mark Rutland <mark.rutland@arm.com> wrote:

> On Mon, Jul 17, 2017 at 07:48:22PM -0500, Kim Phillips wrote:

> > On Fri, 30 Jun 2017 15:02:41 +0100 Mark Rutland <mark.rutland@arm.com> wrote:

> > > On Wed, Jun 28, 2017 at 08:43:10PM -0500, Kim Phillips wrote:

> > <snip>

> > > >  	if (evlist) {

> > > >  		evlist__for_each_entry(evlist, evsel) {

> > > >  			if (cs_etm_pmu &&

> > > >  			    evsel->attr.type == cs_etm_pmu->type)

> > > >  				found_etm = true;

> > > > +			if (arm_spe_pmu &&

> > > > +			    evsel->attr.type == arm_spe_pmu->type)

> > > > +				found_spe = true;

> > > 

> > > Given ARM_SPE_PMU_NAME is defined as "arm_spe_0", this won't detect all

> > > SPE PMUs in heterogeneous setups (e.g. this'll fail to match "arm_spe_1"

> > > and so on).

> > > 

> > > Can we not find all PMUs with a "arm_spe_" prefix?

> > > 

> > > ... or, taking a step back, do we need some sysfs "class" attribute to

> > > identify multi-instance PMUs?

> > 

> > Since there is one SPE per core, and it looks like the buffer full

> > interrupt line is the only difference between the SPE device node

> > specification in the device tree, I guess I don't understand why the

> > driver doesn't accept a singular "arm_spe" from the tool, and manage

> > interrupt handling accordingly.

> 

> There are also differences which can be probed from the device, which


The only thing I see is PMSIDR fields describing things like minimum
recommended sampling interval.  So if CPU A's SPE has that as 256, and
CPU B's is 512, then deny the user asking for a 256 interval across the
two CPUs.  Or, better yet, issue a warning stating the driver has raised
the interval to the lowest common denominator of all CPU SPEs involved
(512 in the above case).

> are not described in the DT (but are described in sysfs). Some of these

> are exposed under sysfs.

> 

> There may be further differences in subsequent revisions of the

> architecture, too.


Future SPE lowest common denominator rules can be amended
according to the capabilities of the new system.

> So the safest bet is to expose them separately, as we

> do for other CPU-affine PMUs in heterogeneous systems.


Yes, perf is very hard to use on heterogeneous systems for this reason.
Cycles are cycles, it doesn't matter whether they're on an A53 or an
A72.

Meanwhile, this type of driver behaviour - and the fact that the drivers
are mute - hurts usability in heterogeneous environments, and can
easily be avoided.

> > Also, if a set of CPUs are missing SPE support, and the user doesn't

> > explicitly define a CPU affinity to outside that partition, then

> > decline to run, stating why.

> 

> It's possible for userspace to do this regardless; look for the set of

> SPE PMUs, and then look at their masks.


The driver still has to check if what the user is asking for, is
doable.  They also may not be using the perf tool.

> > > > +	/*

> > > > +	 * To obtain the auxtrace buffer file descriptor, the auxtrace event

> > > > +	 * must come first.

> > > > +	 */

> > > > +	perf_evlist__to_front(evlist, arm_spe_evsel);

> > > 

> > > Huh? *what* needs the auxtrace buffer fd?

> > > 

> > > This seems really fragile. Can't we store this elsewhere?

> > 

> > It's copied from the bts code, and the other auxtrace record users do

> > the same; it looks like auxtrace record has implicit dependencies on it?

> 

> Is it definitely required? What happens if this isn't done?


It says it wouldn't obtain the auxtrace buffer file descriptor.

> > > > +static u64 arm_spe_reference(struct auxtrace_record *itr __maybe_unused)

> > > > +{

> > > > +	u64 ts;

> > > > +

> > > > +	asm volatile ("isb; mrs %0, cntvct_el0" : "=r" (ts));

> > > > +

> > > > +	return ts;

> > > > +}

> > > 

> > > I do not think it's a good idea to read the counter directly like this.

> > > 

> > > What is this "reference" intended to be meaningful relative to?

> > 

> > AFAICT, it's just a nonce the perf tool uses to track unique events,

> > and I thought this better than the ETM driver's heavier get-random

> > implementation.

> > 

> > > Why do we need to do this in userspace?

> > > 

> > > Can we not ask the kernel to output timestamps instead?

> > 

> > Why?  This gets the job done faster.

> 

> I had assumed that this needed to be correlated with the timestamps in

> the event.

> 

> If this is a nonce, please don't read the counter directly in this way.

> It may be trapped/emulated by a higher EL, making it very heavyweight.

> The counter is only exposed so that the VDSO can use it, and that will

> avoid using it in cases where it is unsafe.


Got it, thanks.

> > > > +	packet->type = ARM_SPE_EVENTS;

> > > > +	packet->index = events_len;

> > > 

> > > Huh? The events packet has no "index" field, so why do we need this?

> > 

> > To identify Events with a less number of comparisons in arm_spe_pkt_desc():

> > E.g., the LLC-ACCESS, LLC-REFILL, and REMOTE-ACCESS events are

> > identified iff index > 1.

> 

> It would be clearer to do the additional comparisons there.

> 

> Does this make a measureable difference in practice?


It should - I'll add a comment.

> > > > +int arm_spe_pkt_desc(const struct arm_spe_pkt *packet, char *buf,

> > > > +		     size_t buf_len)

> > > > +{

> > > > +	int ret, ns, el, index = packet->index;

> > > > +	unsigned long long payload = packet->payload;

> > > > +	const char *name = arm_spe_pkt_name(packet->type);

> > > > +

> > > > +	switch (packet->type) {

> 

> > > > +	case ARM_SPE_ADDRESS:

> > > > +		switch (index) {

> > > > +		case 0:

> > > > +		case 1: ns = !!(packet->payload & NS_FLAG);

> > > > +			el = (packet->payload & EL_FLAG) >> 61;

> > > > +			payload &= ~(0xffULL << 56);

> > > > +			return snprintf(buf, buf_len, "%s %llx el%d ns=%d",

> > > > +				        (index == 1) ? "TGT" : "PC", payload, el, ns);

> 

> > > Could we please add a '0x' prefix to hex numbers, and use 0x%016llx so

> > > that things get padded consistently?

> > 

> > I've added the 0x prefix, but prefer to not fix the length to 016: I

> > don't see any direct benefit, rather see benefits to having the length

> > vary, for output size control and less obvious reasons, e.g., sorting

> > address lines by their length to get a sense of address groups caught

> > during the run.  FWIW, Intel doesn't do the 016 either.

> 

> With padding, sorting will also place address groups together, so I'm

> not sure I follow.


sorting by *line length* can be done to easily assess the address
groups in a dump:

$ grep -w PC dump | awk '{ print length, $0 }' | sort -nu
77 .  00000080:  b0 00 00 00 00 00 00 00 a0                      PC 0x0 el1 ns=1
82 .  00000000:  b0 94 61 43 00 00 00 00 80                      PC 0x436194 el0 ns=1
88 .  00000000:  b0 50 20 ac a7 ff ff 00 80                      PC 0xffffa7ac2050 el0 ns=1
89 .  00000040:  b0 80 2d 08 08 00 00 01 a0                      PC 0x1000008082d80 el1 ns=1

> Padding makes it *much* easier to scan over the output by eye, as

> columns of event data will always share the same alignment.


Addresses are already technically misaligned by virtue of their being
prepended with "PC" (2 chars) vs. "TGT" (3 chars):

82 .  00000000:  b0 94 61 43 00 00 00 00 80                      PC 0x436194 el0 ns=1
83 .  0000001e:  b1 68 61 43 00 00 00 00 80                      TGT 0x436168 el0 ns=1

89 .  00000040:  b0 80 2d 08 08 00 00 01 a0                      PC 0x1000008082d80 el1 ns=1
91 .  000005de:  b1 ec 9a 92 08 00 00 ff a0                      TGT 0xff000008929aec el1 ns=1

If you're talking about the postpended "elX ns=Y", well, that less
significant given the variable length is more quickly detected by the
eye - giving the astute reader hints of which execution level the
address is in - and can be parsed using variable length delimeters.

OTOH, we can rename the tokens, e.g., 

current PC  -> {NS,SE}EL{0,1,2,3}PC 0xAAAA
current TGT -> {NS,SE}EL{0,1,2,3}BT 0xAAAA

Where "BT" -> "Branch Target", which admittedly is less obvious to the
uninitiated.

So the last sample above would be:

89 .  00000040:  b0 80 2d 08 08 00 00 01 a0                      NSEL1PC 0x1000008082d80
91 .  000005de:  b1 ec 9a 92 08 00 00 ff a0                      NSEL1BT 0xff000008929aec

Is that better though?

Are there others opinionated here?

I'll get to the v2 review comments later.

Thanks for your feedback!

Kim
Kim Phillips Aug. 21, 2017, 11:18 p.m. UTC | #19
On Fri, 18 Aug 2017 18:36:09 +0100
Mark Rutland <mark.rutland@arm.com> wrote:

> Hi Kim,


Hi Mark,

> On Thu, Aug 17, 2017 at 10:11:50PM -0500, Kim Phillips wrote:

> > Hi Mark, I've tried to proceed as much as possible without your

> > response, so if you still have comments to my above comments, please

> > comment in-line above, otherwise review the v2 patch below?

> 

> Apologies again for the late response, and thanks for the updated patch!


Thanks for your prompt response this time around.

> > .

> > . ... ARM SPE data: size 536432 bytes

> > .  00000000:  4a 01                                           B COND

> > .  00000002:  b1 00 00 00 00 00 00 00 80                      TGT 0 el0 ns=1

> > .  0000000b:  42 42                                           RETIRED NOT-TAKEN

> > .  0000000d:  b0 20 41 c0 ad ff ff 00 80                      PC ffffadc04120 el0 ns=1

> > .  00000016:  98 00 00                                        LAT 0 TOT

> > .  00000019:  71 80 3e f7 46 e9 01 00 00                      TS 2101429616256

> > .  00000022:  49 01                                           ST

> > .  00000024:  b2 50 bd ba 73 00 80 ff ff                      VA ffff800073babd50

> > .  0000002d:  b3 50 bd ba f3 00 00 00 80                      PA f3babd50 ns=1

> > .  00000036:  9a 00 00                                        LAT 0 XLAT

> > .  00000039:  42 16                                           RETIRED L1D-ACCESS TLB-ACCESS

> > .  0000003b:  b0 8c b4 1e 08 00 00 ff ff                      PC ff0000081eb48c el3 ns=1

> > .  00000044:  98 00 00                                        LAT 0 TOT

> > .  00000047:  71 cc 44 f7 46 e9 01 00 00                      TS 2101429617868

> > .  00000050:  48 00                                           INSN-OTHER

> > .  00000052:  42 02                                           RETIRED

> > .  00000054:  b0 58 54 1f 08 00 00 ff ff                      PC ff0000081f5458 el3 ns=1

> > .  0000005d:  98 00 00                                        LAT 0 TOT

> > .  00000060:  71 cc 44 f7 46 e9 01 00 00                      TS 2101429617868

> 

> So FWIW, I think this is a good example of why that padding I requested

> last time round matters.

> 

> For the first PC packet, I had to count the number of characters to see

> that it was a TTBR0 address, which is made much clearer with leading

> padding, as 0000ffffadc04120. With the addresses padded, the EL and NS

> fields would also be aligned, making it *much* easier to scan by eye.


See my response in my prior email.

> > - multiple SPE clusters/domains support pending potential driver changes?

> 

> As covered in my other reply, I don't believe that the driver is going

> to change in this regard. Userspace will need to handle multiple SPE

> instances.

> 

> I'll ignore that in the code below for now.


Please let's continue the discussion in one place, and again in this
case, in the last email.

> > - CPU mask / new record behaviour bisected to commit e3ba76deef23064 "perf

> >   tools: Force uncore events to system wide monitoring".  Waiting to hear back

> >   on why driver can't do system wide monitoring, even across PPIs, by e.g.,

> >   sharing the SPE interrupts in one handler (SPE's don't differ in this record

> >   regard).

> 

> Could you elaborate on this? I don't follow the interrupt handler

> comments.


Would it be possible for the driver to request the IRQs with
IRQF_SHARED, in order to be able to operate across the multiple PPIs?

> > +static u64 arm_spe_reference(struct auxtrace_record *itr __maybe_unused)

> > +{

> > +	u64 ts;

> > +

> > +	asm volatile ("isb; mrs %0, cntvct_el0" : "=r" (ts));

> > +

> > +	return ts;

> > +}

> 

> As covered in my other reply, please don't use the counter for this.

> 

> It sounds like we need a simple/generic function to get a nonce, that

> we could share with the ETM code.


I've switched to using clock_gettime(CLOCK_MONOTONIC_RAW, ...).  The
ETM code uses two rand() calls, which, according to some minor
benchmarking on Juno, is almost twice as slow as clock_gettime. It's
three lines still, so I'll update the ETM code in-place independently
of this patch, and after the gettime implementation is reviewed.

> > +int arm_spe_get_packet(const unsigned char *buf, size_t len,

> > +		       struct arm_spe_pkt *packet)

> > +{

> > +	int ret;

> > +

> > +	ret = arm_spe_do_get_packet(buf, len, packet);

> > +	if (ret > 0 && packet->type == ARM_SPE_PAD) {

> > +		while (ret < 16 && len > (size_t)ret && !buf[ret])

> > +			ret += 1;

> > +	}

> > +	return ret;

> > +}

> 

> What's this doing? Skipping padding? What's the significance of 16?


I'll repeat the relevant part of the v2 changelog here:

- do_get_packet fixed to handle excessive, successive PADding from a new source
  of raw SPE data, so instead of:

	.  000011ae:  00                                              PAD
	.  000011af:  00                                              PAD
	.  000011b0:  00                                              PAD
	.  000011b1:  00                                              PAD
	.  000011b2:  00                                              PAD
	.  000011b3:  00                                              PAD
	.  000011b4:  00                                              PAD
	.  000011b5:  00                                              PAD
	.  000011b6:  00                                              PAD

  we now get:

	.  000011ae:  00 00 00 00 00 00 00 00 00                      PAD

...the 16 is the width of the dump format: max. 16 byte being displayed
per line: I'll add a comment as such.

> > +int arm_spe_pkt_desc(const struct arm_spe_pkt *packet, char *buf,

> > +		     size_t buf_len)

> > +{

> > +	int ret, ns, el, index = packet->index;

> > +	unsigned long long payload = packet->payload;

> > +	const char *name = arm_spe_pkt_name(packet->type);

> > +

> > +	switch (packet->type) {

> > +	case ARM_SPE_BAD:

> > +	case ARM_SPE_PAD:

> > +	case ARM_SPE_END:

> > +		return snprintf(buf, buf_len, "%s", name);

> > +	case ARM_SPE_EVENTS: {

> > +		size_t blen = buf_len;

> > +

> > +		ret = 0;

> > +		ret = snprintf(buf, buf_len, "EV");

> > +		buf += ret;

> > +		blen -= ret;

> > +		if (payload & 0x1) {

> > +			ret = snprintf(buf, buf_len, " EXCEPTION-GEN");

> > +			buf += ret;

> > +			blen -= ret;

> > +		}

> > +		if (payload & 0x2) {

> > +			ret = snprintf(buf, buf_len, " RETIRED");

> > +			buf += ret;

> > +			blen -= ret;

> > +		}

> > +		if (payload & 0x4) {

> > +			ret = snprintf(buf, buf_len, " L1D-ACCESS");

> > +			buf += ret;

> > +			blen -= ret;

> > +		}

> > +		if (payload & 0x8) {

> > +			ret = snprintf(buf, buf_len, " L1D-REFILL");

> > +			buf += ret;

> > +			blen -= ret;

> > +		}

> > +		if (payload & 0x10) {

> > +			ret = snprintf(buf, buf_len, " TLB-ACCESS");

> > +			buf += ret;

> > +			blen -= ret;

> > +		}

> > +		if (payload & 0x20) {

> > +			ret = snprintf(buf, buf_len, " TLB-REFILL");

> > +			buf += ret;

> > +			blen -= ret;

> > +		}

> > +		if (payload & 0x40) {

> > +			ret = snprintf(buf, buf_len, " NOT-TAKEN");

> > +			buf += ret;

> > +			blen -= ret;

> > +		}

> > +		if (payload & 0x80) {

> > +			ret = snprintf(buf, buf_len, " MISPRED");

> > +			buf += ret;

> > +			blen -= ret;

> > +		}

> > +		if (index > 1) {

> > +			if (payload & 0x100) {

> > +				ret = snprintf(buf, buf_len, " LLC-ACCESS");

> > +				buf += ret;

> > +				blen -= ret;

> > +			}

> > +			if (payload & 0x200) {

> > +				ret = snprintf(buf, buf_len, " LLC-REFILL");

> > +				buf += ret;

> > +				blen -= ret;

> > +			}

> > +			if (payload & 0x400) {

> > +				ret = snprintf(buf, buf_len, " REMOTE-ACCESS");

> > +				buf += ret;

> > +				blen -= ret;

> > +			}

> > +		}

> > +		if (ret < 0)

> > +			return ret;

> > +		blen -= ret;

> > +		return buf_len - blen;

> > +	}

> 

> This looks like it could be turned into another switch, sharing the

> repeated logic.


How, if the payload may have multiple bits set?

I've addressed the rest of your comments and therefore trimmed them
out.  I can post a v3, but would rather shake out the pending issues
first, so please reply to my comments on this and Friday's email (Date:
Fri, 18 Aug 2017 17:22:48 -0500).

Thanks,

Kim