[v4,3/3] perf: Sample additional clock value

Message ID 1415292718-19785-4-git-send-email-pawel.moll@arm.com
State New
Headers show

Commit Message

Pawel Moll Nov. 6, 2014, 4:51 p.m.
This patch adds an option to sample value of an additional clock with
any perf event, with the the aim of allowing time correlation between
data coming from perf and other sources like hardware trace which is
timestamped with an external clock.

The idea is to generate periodic perf record containing timestamps from
two different sources, sampled as close to each other as possible. This
allows performing simple linear approximation:

                      other event
       -----O--------------+-------------O------> t_other
            :              |             :
            :              V             :
       -----O----------------------------O------> t_perf
        perf event

User can request such samples for any standard perf event, with the most
obvious examples of cpu-clock (hrtimer) at selected frequency or other
periodic events like sched:sched_switch.

In order to do this, PERF_SAMPLE_CLOCK has to be set in struct
perf_event_attr.sample_type and a type of the clock to be sampled
selected by setting perf_event_attr.clock to a value corresponding to
a POSIX clock clk_id (see "man 2 clock_gettime")

Currently three clocks are implemented: CLOCK_REALITME = 0,
CLOCK_MONOTONIC = 1 and CLOCK_MONOTONIC_RAW = 2. The clock field is
5 bits wide to allow for future extension to custom, non-POSIX clock
sources(MAX_CLOCK for those is 16, see include/uapi/linux/time.h) like
ARM CoreSight (hardware trace) timestamp generator.

Signed-off-by: Pawel Moll <pawel.moll@arm.com>
---

Changes since v3:

- none

 include/linux/perf_event.h      |  2 ++
 include/uapi/linux/perf_event.h | 16 ++++++++++++++--
 kernel/events/core.c            | 30 ++++++++++++++++++++++++++++++
 3 files changed, 46 insertions(+), 2 deletions(-)

Comments

John Stultz Jan. 5, 2015, 7:17 p.m. | #1
On Mon, Jan 5, 2015 at 5:45 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, Nov 06, 2014 at 04:51:58PM +0000, Pawel Moll wrote:
>> Currently three clocks are implemented: CLOCK_REALITME = 0,
>> CLOCK_MONOTONIC = 1 and CLOCK_MONOTONIC_RAW = 2. The clock field is
>> 5 bits wide to allow for future extension to custom, non-POSIX clock
>> sources(MAX_CLOCK for those is 16, see include/uapi/linux/time.h) like
>> ARM CoreSight (hardware trace) timestamp generator.
>
>> @@ -304,7 +305,16 @@ struct perf_event_attr {
>>                               mmap2          :  1, /* include mmap with inode data     */
>>                               comm_exec      :  1, /* flag comm events that are due to an exec */
>>                               uevents        :  1, /* allow uevents into the buffer */
>> -                             __reserved_1   : 38;
>> +
>> +                             /*
>> +                              * clock: one of the POSIX clock IDs:
>> +                              *
>> +                              * 0 - CLOCK_REALTIME
>> +                              * 1 - CLOCK_MONOTONIC
>> +                              * 4 - CLOCK_MONOTONIC_RAW
>> +                              */
>> +                             clock          :  5, /* clock type */
>> +                             __reserved_1   : 33;
>>
>>       union {
>>               __u32           wakeup_events;    /* wakeup every n events */
>
> This would put a constraint on actually changing MAX_CLOCKS, are the
> time people OK with that? Thomas, John?

Yea, I'd not want to enshrine MAX_CLOCKS if we can avoid it. We're
getting a bit close to it these days.

thanks
-john
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
John Stultz Jan. 21, 2015, 5:44 p.m. | #2
On Wed, Jan 21, 2015 at 9:12 AM, Pawel Moll <pawel.moll@arm.com> wrote:
> On Mon, 2015-01-05 at 13:45 +0000, Peter Zijlstra wrote:
>> On Thu, Nov 06, 2014 at 04:51:58PM +0000, Pawel Moll wrote:
>> > Currently three clocks are implemented: CLOCK_REALITME = 0,
>> > CLOCK_MONOTONIC = 1 and CLOCK_MONOTONIC_RAW = 2. The clock field is
>> > 5 bits wide to allow for future extension to custom, non-POSIX clock
>> > sources(MAX_CLOCK for those is 16, see include/uapi/linux/time.h) like
>> > ARM CoreSight (hardware trace) timestamp generator.
>>
>> > @@ -304,7 +305,16 @@ struct perf_event_attr {
>> >                             mmap2          :  1, /* include mmap with inode data     */
>> >                             comm_exec      :  1, /* flag comm events that are due to an exec */
>> >                             uevents        :  1, /* allow uevents into the buffer */
>> > -                           __reserved_1   : 38;
>> > +
>> > +                           /*
>> > +                            * clock: one of the POSIX clock IDs:
>> > +                            *
>> > +                            * 0 - CLOCK_REALTIME
>> > +                            * 1 - CLOCK_MONOTONIC
>> > +                            * 4 - CLOCK_MONOTONIC_RAW
>> > +                            */
>> > +                           clock          :  5, /* clock type */
>> > +                           __reserved_1   : 33;
>> >
>> >     union {
>> >             __u32           wakeup_events;    /* wakeup every n events */
>>
>> This would put a constraint on actually changing MAX_CLOCKS, are the
>> time people OK with that? Thomas, John?
>
> I admit I have some doubts about it myself. But I don't think we can
> afford full int for the clock id, can we?
>
>> I'm also not quite sure of using the >MAX_CLOCKS space for 'special'
>> clocks, preferably those would register themselves with the POSIX clock
>> interface.
>
> That may be a hard sell - John never was particularly keen on extending
> the hard-coded clocks with random sources. And the hardware trace clock
> I had on mind would be probably one of them - its meaning depends a lot
> on the . Of course I'm looking forward to being surprised :-)

Yea. I'm definitely still wanting to be cautious about adding new
clockids. Basically if there's a new well defined time domain, then
I'm open to it,  (for example, I'm expecting there will be a smeared
leapsecond time domain at some point in the future), but we've already
grown more then I'm comfortable with given the existing MAX_CLOCKS
limit.

For example, I regret adding the _ALARM clockids. Those are basically
duplicative time domains from the readers perspective, and only have
unique value for setting timers, which should have been handled via a
flag to the timer interfaces.

I suspect we'll have to bump MAX_CLOCKS that at some point and hope it
doesn't break anyone.

That said, there is the dynamic posix clockids. I'm not sure if it
would make sense, but even if we don't bump MAX_CLOCKS, might there
be some case where someone wants to use a dynamic posix clock for the
perf reference?

thanks
-john
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
John Stultz Jan. 21, 2015, 6:05 p.m. | #3
On Wed, Jan 21, 2015 at 9:54 AM, Pawel Moll <pawel.moll@arm.com> wrote:
> On Wed, 2015-01-21 at 17:44 +0000, John Stultz wrote:
>> That said, there is the dynamic posix clockids. I'm not sure if it
>> would make sense, but even if we don't bump MAX_CLOCKS, might there
>> be some case where someone wants to use a dynamic posix clock for the
>> perf reference?
>
> If I remember correctly, last time I tried to use dynamic posix clocks
> in the perf context, one needed to open a ptp character device in order
> to get a file descriptor, than translated into a clockid_t value -
> correct me if I'm wrong. But here you get the fd from the
> sys_perf_open() and clock_*() doesn't know anything about such
> descriptor.

Sorry for losing context here then. Yes, the dynamic clockid has to be
exported via some other interface to userland (likely via the driver
that provides it), but once the id is known it can be used via the
clock_*() functions.   I was thinking here since the perf_event_attr
wants to associate with a clockid, including the possibility for
dynamic clockids would be wise, but I didn't read closely enough to
see how that clockid was specified.

> I was looking into a way of associating a random clock with a random fd,
> so that perf could "attach" itself to the clock API at will, but it
> turned out not to be trivial (I'd have to dig through old threads to
> remember all the nasty details).
>
> The good thing is that it looks like the immediate need for this was no
> more, with perf using monotonic clock as the clock source. It will come
> back when we get into hardware trace correlation, but one step at a
> time...

Ok. I'm eager to see this settled (and the current approach I don't
have any objections to, although I'm not paying super close attention
now that its all in the perf core).   I know this has taken far longer
then you'd have liked, so thanks for your persistence!

-john
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Patch hide | download patch | download mbox

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 680767d..b690a0d 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -607,6 +607,8 @@  struct perf_sample_data {
 	 * Transaction flags for abort events:
 	 */
 	u64				txn;
+	/* Clock value (additional timestamp for time correlation) */
+	u64				clock;
 };
 
 /* default value for data source */
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 9a64eb1..53a7a72 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -137,8 +137,9 @@  enum perf_event_sample_format {
 	PERF_SAMPLE_DATA_SRC			= 1U << 15,
 	PERF_SAMPLE_IDENTIFIER			= 1U << 16,
 	PERF_SAMPLE_TRANSACTION			= 1U << 17,
+	PERF_SAMPLE_CLOCK			= 1U << 18,
 
-	PERF_SAMPLE_MAX = 1U << 18,		/* non-ABI */
+	PERF_SAMPLE_MAX = 1U << 19,		/* non-ABI */
 };
 
 /*
@@ -304,7 +305,16 @@  struct perf_event_attr {
 				mmap2          :  1, /* include mmap with inode data     */
 				comm_exec      :  1, /* flag comm events that are due to an exec */
 				uevents        :  1, /* allow uevents into the buffer */
-				__reserved_1   : 38;
+
+				/*
+				 * clock: one of the POSIX clock IDs:
+				 *
+				 * 0 - CLOCK_REALTIME
+				 * 1 - CLOCK_MONOTONIC
+				 * 4 - CLOCK_MONOTONIC_RAW
+				 */
+				clock          :  5, /* clock type */
+				__reserved_1   : 33;
 
 	union {
 		__u32		wakeup_events;	  /* wakeup every n events */
@@ -544,6 +554,7 @@  enum perf_event_type {
 	 * 	{ u64			id;       } && PERF_SAMPLE_ID
 	 * 	{ u64			stream_id;} && PERF_SAMPLE_STREAM_ID
 	 * 	{ u32			cpu, res; } && PERF_SAMPLE_CPU
+	 *	{ u64			clock;    } && PERF_SAMPLE_CLOCK
 	 *	{ u64			id;	  } && PERF_SAMPLE_IDENTIFIER
 	 * } && perf_event_attr::sample_id_all
 	 *
@@ -687,6 +698,7 @@  enum perf_event_type {
 	 *	{ u64			weight;   } && PERF_SAMPLE_WEIGHT
 	 *	{ u64			data_src; } && PERF_SAMPLE_DATA_SRC
 	 *	{ u64			transaction; } && PERF_SAMPLE_TRANSACTION
+	 *	{ u64			clock;    } && PERF_SAMPLE_CLOCK
 	 * };
 	 */
 	PERF_RECORD_SAMPLE			= 9,
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 9a2d082..816baa6 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1264,6 +1264,9 @@  static void perf_event__header_size(struct perf_event *event)
 	if (sample_type & PERF_SAMPLE_TRANSACTION)
 		size += sizeof(data->txn);
 
+	if (sample_type & PERF_SAMPLE_CLOCK)
+		size += sizeof(data->clock);
+
 	event->header_size = size;
 }
 
@@ -1291,6 +1294,9 @@  static void perf_event__id_header_size(struct perf_event *event)
 	if (sample_type & PERF_SAMPLE_CPU)
 		size += sizeof(data->cpu_entry);
 
+	if (sample_type & PERF_SAMPLE_CLOCK)
+		size += sizeof(data->clock);
+
 	event->id_header_size = size;
 }
 
@@ -4631,6 +4637,24 @@  static void __perf_event_header__init_id(struct perf_event_header *header,
 		data->cpu_entry.cpu	 = raw_smp_processor_id();
 		data->cpu_entry.reserved = 0;
 	}
+
+	if (sample_type & PERF_SAMPLE_CLOCK) {
+		switch (event->attr.clock) {
+		case CLOCK_REALTIME:
+			data->clock = ktime_get_real_ns();
+			break;
+		case CLOCK_MONOTONIC:
+			data->clock = ktime_get_mono_fast_ns();
+			break;
+		case CLOCK_MONOTONIC_RAW:
+			data->clock = ktime_get_raw_ns();
+			break;
+		default:
+			data->clock = 0;
+			break;
+		}
+	}
+
 }
 
 void perf_event_header__init_id(struct perf_event_header *header,
@@ -4661,6 +4685,9 @@  static void __perf_event__output_id_sample(struct perf_output_handle *handle,
 	if (sample_type & PERF_SAMPLE_CPU)
 		perf_output_put(handle, data->cpu_entry);
 
+	if (sample_type & PERF_SAMPLE_CLOCK)
+		perf_output_put(handle, data->clock);
+
 	if (sample_type & PERF_SAMPLE_IDENTIFIER)
 		perf_output_put(handle, data->id);
 }
@@ -4889,6 +4916,9 @@  void perf_output_sample(struct perf_output_handle *handle,
 	if (sample_type & PERF_SAMPLE_TRANSACTION)
 		perf_output_put(handle, data->txn);
 
+	if (sample_type & PERF_SAMPLE_CLOCK)
+		perf_output_put(handle, data->clock);
+
 	if (!event->attr.watermark) {
 		int wakeup_events = event->attr.wakeup_events;