[27/53] perf/core: Put size of a sample at the end of it by PERF_SAMPLE_TAILSIZE

Message ID 1452520124-2073-28-git-send-email-wangnan0@huawei.com
State New

Commit Message

Wang Nan Jan. 11, 2016, 1:48 p.m. UTC
This patch introduces a PERF_SAMPLE_TAILSIZE flag which appends a size
field to the end of a sample. The idea comes from [1]: with the size at
the tail of each event, a user program reading from the ring buffer can
parse events backward.

For example:

   head
    |
    V
 +--+---+-------+----------+------+---+
 |E6|...|   B  8|   C    11|  D  7|E..|
 +--+---+-------+----------+------+---+

In this case, starting from the 'head' pointer provided by the kernel, a
user program first reads the size '6' via *(head - sizeof(u64)) and from
it finds the start of record 'E'; it can then read the tail sizes to
locate records D, C and B in the same way.
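
For illustration, a minimal user-space sketch of this backward walk could
look like the following (not part of this patch; the function and the
'oldest' bound are hypothetical, and wrap-around handling is omitted):

    #include <linux/types.h>

    static void walk_backward(unsigned char *base, __u64 head, __u64 oldest)
    {
        __u64 offset = head;

        while (offset > oldest) {
            /* The tail size is the last u64 of each record. */
            __u64 evt_size = *(__u64 *)(base + offset - sizeof(__u64));

            if (evt_size == 0 || evt_size > offset - oldest)
                break;  /* corrupted or truncated record */
            /* evt_size bytes back is the start of this record,
             * i.e. its struct perf_event_header. */
            offset -= evt_size;
        }
    }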

The implementation is simple: when the PERF_SAMPLE_TAILSIZE flag is set,
perf_output_sample() outputs the size at the end of each sample.

The following is done to ensure the ring buffer is safe for
backward parsing:

 - Don't allow two events with different PERF_SAMPLE_TAILSIZE settings
   to redirect their output to each other;

 - For non-sample events, also output the tail size if required.

This patch has a limitation for perf:

Before reading such a ring buffer, perf must ensure all events which may
output to it are already stopped, so the 'head' pointer it gets is the
end of the last record.
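
For instance, perf could do something like the following before each read
(a sketch assuming a single event fd and its mmap'ed control page; this
is not code from this patch):

    #include <sys/ioctl.h>
    #include <linux/perf_event.h>

    static __u64 stop_and_read_head(int fd, struct perf_event_mmap_page *pg)
    {
        /* Stop the event so 'head' stops moving ... */
        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
        /* ... then data_head is the end of the last complete record. */
        return *(volatile __u64 *)&pg->data_head;
    }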

[1] http://lkml.kernel.org/g/1449063499-236703-1-git-send-email-wangnan0@huawei.com

Signed-off-by: Wang Nan <wangnan0@huawei.com>

Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Yunlong Song <yunlong.song@huawei.com>
---
 include/linux/perf_event.h      | 17 ++++++---
 include/uapi/linux/perf_event.h |  3 +-
 kernel/events/core.c            | 82 +++++++++++++++++++++++++++++------------
 kernel/events/ring_buffer.c     |  7 ++--
 4 files changed, 75 insertions(+), 34 deletions(-)

-- 
1.8.3.4

Comments

Wang Nan Jan. 12, 2016, 5:33 a.m. UTC | #1
On 2016/1/12 2:09, Alexei Starovoitov wrote:
> On Mon, Jan 11, 2016 at 01:48:18PM +0000, Wang Nan wrote:
>> This patch introduces a PERF_SAMPLE_TAILSIZE flag which appends a size
>> field to the end of a sample. The idea comes from [1]: with the size at
>> the tail of each event, a user program reading from the ring buffer can
>> parse events backward.
>>
>> For example:
>>
>>     head
>>      |
>>      V
>>   +--+---+-------+----------+------+---+
>>   |E6|...|   B  8|   C    11|  D  7|E..|
>>   +--+---+-------+----------+------+---+
>>
>> In this case, starting from the 'head' pointer provided by the kernel,
>> a user program first reads the size '6' via *(head - sizeof(u64)) and
>> from it finds the start of record 'E'; it can then read the tail sizes
>> to locate records D, C and B in the same way.
> adding extra 8 bytes for every sample is quite unfortunate.
> How about another idea:
> . update data_tail pointer when head is about to overwrite it
>
> Ex:
>     head   data_tail
>      |       |
>      V       V
>   +--+-------+-------+---+----+---+
>   |E |  ...  |   B   | C |  D | E |
>   +--+-------+-------+---+----+---+
>
> if new sample F is about to overwrite B, the kernel would need
> to read the size of B from B's header and update data_tail to point to C.
> Or even further.
> Comparing to the TAILSIZE approach, now the kernel will be doing both
> reads and writes into the ring-buffer, and there is a concern that reads
> may be hitting cold data, but if the records are small they may actually
> be on the same cache line brought in by the previous "read A's header,
> write E record" cycle. So I think we shouldn't see cache misses.


After the ring buffer wraps, we need a read before nearly
every write operation. The performance penalty depends on the
cache's write-allocate configuration. In addition, another data
dependency is introduced: we must wait until the size of
event B is retrieved before overwriting it.

Even in the very first attempt in 2013 [1], reading from the ring
buffer was avoided. I don't think Peter has changed his mind now.

> Another concern is validity of records stored. If user space messes
> with ring-buffer, kernel won't be able to move data_tail properly
> and would need to indicate that to userspace somehow.
> But memory saving of 8 bytes per record could be sizable


Yes. But I have already discussed this with Peter in [2].
Last month I suggested:

<quote>
  1. If PERF_SAMPLE_SIZE is selected, we can avoid outputting the event
     size in header. Which eliminate extra space cost;
</quote>

However:

<quote>
That would mandate you always parse the stream backwards. Which seems
rather unfortunate. Also, no you cannot recoup the extra space, see the
alignment and size requirement.
</quote>

>   and
> user space wouldn't need to walk the whole buffer backwards and
> can just start from valid data_tail, so the dumps of overwrite
> ring-buffer will be faster too.
> Thoughts?

Please also refer to [3]. In that patch we introduced a userspace
ring buffer in perf and made it continuously collect data from the
normal ring buffers. Since we have to wake up perf to read the data,
the cost is even higher.

[1] http://lkml.kernel.org/r/20130708121557.GA17211@twins.programming.kicks-ass.net
[2] http://lkml.kernel.org/r/20151203100801.GV3816@twins.programming.kicks-ass.net
[3] http://lkml.kernel.org/r/1448373632-8806-1-git-send-email-yunlong.song@huawei.com
Wang Nan Jan. 12, 2016, 12:36 p.m. UTC | #2
On 2016/1/12 14:11, Alexei Starovoitov wrote:
> On Tue, Jan 12, 2016 at 01:33:28PM +0800, Wangnan (F) wrote:
>> On 2016/1/12 2:09, Alexei Starovoitov wrote:
>>> On Mon, Jan 11, 2016 at 01:48:18PM +0000, Wang Nan wrote:
>>>> This patch introduces a PERF_SAMPLE_TAILSIZE flag which appends a size
>>>> field to the end of a sample. The idea comes from [1]: with the size at
>>>> the tail of each event, a user program reading from the ring buffer can
>>>> parse events backward.
>>>>
>>>> For example:
>>>>
>>>>     head
>>>>      |
>>>>      V
>>>>   +--+---+-------+----------+------+---+
>>>>   |E6|...|   B  8|   C    11|  D  7|E..|
>>>>   +--+---+-------+----------+------+---+
>>>>
>>>> In this case, starting from the 'head' pointer provided by the kernel,
>>>> a user program first reads the size '6' via *(head - sizeof(u64)) and
>>>> from it finds the start of record 'E'; it can then read the tail sizes
>>>> to locate records D, C and B in the same way.
>>> adding extra 8 bytes for every sample is quite unfortunate.
>>> How about another idea:
>>> . update data_tail pointer when head is about to overwrite it
>>>
>>> Ex:
>>>     head   data_tail
>>>      |       |
>>>      V       V
>>>   +--+-------+-------+---+----+---+
>>>   |E |  ...  |   B   | C |  D | E |
>>>   +--+-------+-------+---+----+---+
>>>
>>> if new sample F is about to overwrite B, the kernel would need
>>> to read the size of B from B's header and update data_tail to point to C.
>>> Or even further.
>>> Comparing to the TAILSIZE approach, now the kernel will be doing both
>>> reads and writes into the ring-buffer, and there is a concern that reads
>>> may be hitting cold data, but if the records are small they may actually
>>> be on the same cache line brought in by the previous "read A's header,
>>> write E record" cycle. So I think we shouldn't see cache misses.
>> After the ring buffer wraps, we need a read before nearly
>> every write operation. The performance penalty depends on the
>> cache's write-allocate configuration. In addition, another data
>> dependency is introduced: we must wait until the size of
>> event B is retrieved before overwriting it.
>>
>> Even in the very first attempt in 2013 [1], reading from the ring
>> buffer was avoided. I don't think Peter has changed his mind now.
>>
>>> Another concern is validity of records stored. If user space messes
>>> with ring-buffer, kernel won't be able to move data_tail properly
>>> and would need to indicate that to userspace somehow.
>>> But memory saving of 8 bytes per record could be sizable
>> Yes. But I have already discussed this with Peter in [2].
>> Last month I suggested:
>>
>> <quote>
>>   1. If PERF_SAMPLE_SIZE is selected, we can avoid outputting the event
>>      size in header. Which eliminate extra space cost;
>> </quote>
>>
>> However:
>>
>> <quote>
>> That would mandate you always parse the stream backwards. Which seems
>> rather unfortunate. Also, no you cannot recoup the extra space, see the
>> alignment and size requirement.

> hmm, in this kernel patch I see that you're adding 8 bytes for
> every record via this extra TAILSIZE flag, and in perf you're
> walking the ring buffer backwards by reading these 8-byte
> sizes, comparing header sizes and so on until reaching the beginning,
> where you start dumping it as normal.
> So for this 'signal to perf' approach to work the ring buffer
> will contain tail sizes everywhere just so that user space can
> find the beginning. That's not very pretty. imo if the kernel
> can do a header read to adjust data_tail it would make the user
> space side clean. Maybe there are other solutions.
> Adding tailsize seems like a brute force hack.
> There must be some nicer way.

Hi Peter,

  What's your opinion? Should we reconsider moving the size field from
the header to the end? Or moving the whole header to the end of a record?

Thank you.
Wang Nan Jan. 13, 2016, 4:34 a.m. UTC | #3
On 2016/1/13 3:56, Alexei Starovoitov wrote:
> On Tue, Jan 12, 2016 at 08:36:23PM +0800, Wangnan (F) wrote:
>>> hmm, in this kernel patch I see that you're adding 8 bytes for
>>> every record via this extra TAILSIZE flag, and in perf you're
>>> walking the ring buffer backwards by reading these 8-byte
>>> sizes, comparing header sizes and so on until reaching the beginning,
>>> where you start dumping it as normal.
>>> So for this 'signal to perf' approach to work the ring buffer
>>> will contain tail sizes everywhere just so that user space can
>>> find the beginning. That's not very pretty. imo if the kernel
>>> can do a header read to adjust data_tail it would make the user
>>> space side clean. Maybe there are other solutions.
>>> Adding tailsize seems like a brute force hack.
>>> There must be some nicer way.
>> Hi Peter,
>>
>>   What's your opinion? Should we reconsider moving the size field from
>> the header to the end? Or moving the whole header to the end of a record?
> I think moving the whole header under a new TAILHEADER flag is
> actually a very good idea. The ring buffer will be fully utilized
> and no extra bytes are necessary. User space would need to parse it
> backwards, but for this use case it fits well.


I have another crazy suggestion: can we make the kernel write to
the ring buffer from the end to the beginning? For example:

This is the initial state of the ring buffer; the head pointer
points to the end of it:

       -------------> Address increase

                                     head
                                       |
                                       V
  +--+---+-------+----------+------+---+
  |                                    |
  +--+---+-------+----------+------+---+


Write the first event at the end of the ring buffer, and *decrease*
the head pointer:

                                 head
                                   |
                                   V
  +--+---+-------+----------+------+---+
  |                                | A |
  +--+---+-------+----------+------+---+


Another record:
                           head
                            |
                            V
  +--+---+-------+----------+------+---+
  |                         |   B  | A |
  +--+---+-------+----------+------+---+


After the ring buffer wraps around, A is fully overwritten and B is broken:

                                head
                                  |
                                  V
  +--+---+-------+----------+-----+----+
  |F | E |   D   | C        | ... | F  |
  +--+---+-------+----------+-----+----+

At this point the user can parse the ring buffer normally from
F to C. From the timestamps, it can tell which record is the
oldest.

This way perf doesn't need much extra work: there is no
performance penalty at all, and the 8 bytes are saved.
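
Purely as an illustration (names are made up; 'base' and 'buf_size'
describe the mmap'ed data area, and wrap-around handling is omitted),
the user-space walk could be:

    #include <linux/perf_event.h>

    /*
     * Each record still starts with a normal perf_event_header at its
     * lowest address, so from 'head' we simply step forward by
     * header->size and visit records newest-first.
     */
    static void walk_newest_first(unsigned char *base, __u64 head,
                                  __u64 buf_size)
    {
        __u64 offset = head;

        while (offset < buf_size) {
            struct perf_event_header *hdr =
                (struct perf_event_header *)(base + offset);

            if (hdr->size == 0 || offset + hdr->size > buf_size)
                break;              /* broken record (B above) */
            offset += hdr->size;    /* next (older) record */
        }
    }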

Thoughts?

Thank you.

Patch

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index f9828a4..c5df1e82 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -835,13 +835,13 @@  extern void perf_event_output(struct perf_event *event,
 				struct pt_regs *regs);
 
 extern void
-perf_event_header__init_id(struct perf_event_header *header,
-			   struct perf_sample_data *data,
-			   struct perf_event *event);
+perf_event_header__init_extra(struct perf_event_header *header,
+			      struct perf_sample_data *data,
+			      struct perf_event *event);
 extern void
-perf_event__output_id_sample(struct perf_event *event,
-			     struct perf_output_handle *handle,
-			     struct perf_sample_data *sample);
+perf_event__output_extra(struct perf_event *event, u64 evt_size,
+			 struct perf_output_handle *handle,
+			 struct perf_sample_data *sample);
 
 extern void
 perf_log_lost_samples(struct perf_event *event, u64 lost);
@@ -1032,6 +1032,11 @@  static inline bool has_aux(struct perf_event *event)
 	return event->pmu->setup_aux;
 }
 
+static inline bool has_tailsize(struct perf_event *event)
+{
+	return event->attr.sample_type & PERF_SAMPLE_TAILSIZE;
+}
+
 extern int perf_output_begin(struct perf_output_handle *handle,
 			     struct perf_event *event, unsigned int size);
 extern void perf_output_end(struct perf_output_handle *handle);
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 1afe962..4e8dde8 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -139,8 +139,9 @@  enum perf_event_sample_format {
 	PERF_SAMPLE_IDENTIFIER			= 1U << 16,
 	PERF_SAMPLE_TRANSACTION			= 1U << 17,
 	PERF_SAMPLE_REGS_INTR			= 1U << 18,
+	PERF_SAMPLE_TAILSIZE			= 1U << 19,
 
-	PERF_SAMPLE_MAX = 1U << 19,		/* non-ABI */
+	PERF_SAMPLE_MAX = 1U << 20,		/* non-ABI */
 };
 
 /*
diff --git a/kernel/events/core.c b/kernel/events/core.c
index bf82441..2d59b59 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5128,12 +5128,14 @@  static void __perf_event_header__init_id(struct perf_event_header *header,
 	}
 }
 
-void perf_event_header__init_id(struct perf_event_header *header,
-				struct perf_sample_data *data,
-				struct perf_event *event)
+void perf_event_header__init_extra(struct perf_event_header *header,
+				   struct perf_sample_data *data,
+				   struct perf_event *event)
 {
 	if (event->attr.sample_id_all)
 		__perf_event_header__init_id(header, data, event);
+	if (has_tailsize(event))
+		header->size += sizeof(u64);
 }
 
 static void __perf_event__output_id_sample(struct perf_output_handle *handle,
@@ -5160,12 +5162,14 @@  static void __perf_event__output_id_sample(struct perf_output_handle *handle,
 		perf_output_put(handle, data->id);
 }
 
-void perf_event__output_id_sample(struct perf_event *event,
-				  struct perf_output_handle *handle,
-				  struct perf_sample_data *sample)
+void perf_event__output_extra(struct perf_event *event, u64 evt_size,
+			      struct perf_output_handle *handle,
+			      struct perf_sample_data *sample)
 {
 	if (event->attr.sample_id_all)
 		__perf_event__output_id_sample(handle, sample);
+	if (has_tailsize(event))
+		perf_output_put(handle, evt_size);
 }
 
 static void perf_output_read_one(struct perf_output_handle *handle,
@@ -5407,6 +5411,13 @@  void perf_output_sample(struct perf_output_handle *handle,
 		}
 	}
 
+	/* Should be the last one */
+	if (sample_type & PERF_SAMPLE_TAILSIZE) {
+		u64 evt_size = header->size;
+
+		perf_output_put(handle, evt_size);
+	}
+
 	if (!event->attr.watermark) {
 		int wakeup_events = event->attr.wakeup_events;
 
@@ -5526,6 +5537,9 @@  void perf_prepare_sample(struct perf_event_header *header,
 
 		header->size += size;
 	}
+
+	if (sample_type & PERF_SAMPLE_TAILSIZE)
+		header->size += sizeof(u64);
 }
 
 void perf_event_output(struct perf_event *event,
@@ -5579,14 +5593,15 @@  perf_event_read_event(struct perf_event *event,
 	};
 	int ret;
 
-	perf_event_header__init_id(&read_event.header, &sample, event);
+	perf_event_header__init_extra(&read_event.header, &sample, event);
 	ret = perf_output_begin(&handle, event, read_event.header.size);
 	if (ret)
 		return;
 
 	perf_output_put(&handle, read_event);
 	perf_output_read(&handle, event);
-	perf_event__output_id_sample(event, &handle, &sample);
+	perf_event__output_extra(event, read_event.header.size,
+				 &handle, &sample);
 
 	perf_output_end(&handle);
 }
@@ -5698,7 +5713,7 @@  static void perf_event_task_output(struct perf_event *event,
 	if (!perf_event_task_match(event))
 		return;
 
-	perf_event_header__init_id(&task_event->event_id.header, &sample, event);
+	perf_event_header__init_extra(&task_event->event_id.header, &sample, event);
 
 	ret = perf_output_begin(&handle, event,
 				task_event->event_id.header.size);
@@ -5715,7 +5730,9 @@  static void perf_event_task_output(struct perf_event *event,
 
 	perf_output_put(&handle, task_event->event_id);
 
-	perf_event__output_id_sample(event, &handle, &sample);
+	perf_event__output_extra(event,
+				 task_event->event_id.header.size,
+				 &handle, &sample);
 
 	perf_output_end(&handle);
 out:
@@ -5794,7 +5811,7 @@  static void perf_event_comm_output(struct perf_event *event,
 	if (!perf_event_comm_match(event))
 		return;
 
-	perf_event_header__init_id(&comm_event->event_id.header, &sample, event);
+	perf_event_header__init_extra(&comm_event->event_id.header, &sample, event);
 	ret = perf_output_begin(&handle, event,
 				comm_event->event_id.header.size);
 
@@ -5808,7 +5825,8 @@  static void perf_event_comm_output(struct perf_event *event,
 	__output_copy(&handle, comm_event->comm,
 				   comm_event->comm_size);
 
-	perf_event__output_id_sample(event, &handle, &sample);
+	perf_event__output_extra(event, comm_event->event_id.header.size,
+				 &handle, &sample);
 
 	perf_output_end(&handle);
 out:
@@ -5917,7 +5935,7 @@  static void perf_event_mmap_output(struct perf_event *event,
 		mmap_event->event_id.header.size += sizeof(mmap_event->flags);
 	}
 
-	perf_event_header__init_id(&mmap_event->event_id.header, &sample, event);
+	perf_event_header__init_extra(&mmap_event->event_id.header, &sample, event);
 	ret = perf_output_begin(&handle, event,
 				mmap_event->event_id.header.size);
 	if (ret)
@@ -5940,7 +5958,8 @@  static void perf_event_mmap_output(struct perf_event *event,
 	__output_copy(&handle, mmap_event->file_name,
 				   mmap_event->file_size);
 
-	perf_event__output_id_sample(event, &handle, &sample);
+	perf_event__output_extra(event, mmap_event->event_id.header.size,
+				 &handle, &sample);
 
 	perf_output_end(&handle);
 out:
@@ -6123,14 +6142,15 @@  void perf_event_aux_event(struct perf_event *event, unsigned long head,
 	};
 	int ret;
 
-	perf_event_header__init_id(&rec.header, &sample, event);
+	perf_event_header__init_extra(&rec.header, &sample, event);
 	ret = perf_output_begin(&handle, event, rec.header.size);
 
 	if (ret)
 		return;
 
 	perf_output_put(&handle, rec);
-	perf_event__output_id_sample(event, &handle, &sample);
+	perf_event__output_extra(event, rec.header.size,
+				 &handle, &sample);
 
 	perf_output_end(&handle);
 }
@@ -6156,7 +6176,7 @@  void perf_log_lost_samples(struct perf_event *event, u64 lost)
 		.lost		= lost,
 	};
 
-	perf_event_header__init_id(&lost_samples_event.header, &sample, event);
+	perf_event_header__init_extra(&lost_samples_event.header, &sample, event);
 
 	ret = perf_output_begin(&handle, event,
 				lost_samples_event.header.size);
@@ -6164,7 +6184,8 @@  void perf_log_lost_samples(struct perf_event *event, u64 lost)
 		return;
 
 	perf_output_put(&handle, lost_samples_event);
-	perf_event__output_id_sample(event, &handle, &sample);
+	perf_event__output_extra(event, lost_samples_event.header.size,
+				 &handle, &sample);
 	perf_output_end(&handle);
 }
 
@@ -6211,7 +6232,7 @@  static void perf_event_switch_output(struct perf_event *event, void *data)
 					perf_event_tid(event, se->next_prev);
 	}
 
-	perf_event_header__init_id(&se->event_id.header, &sample, event);
+	perf_event_header__init_extra(&se->event_id.header, &sample, event);
 
 	ret = perf_output_begin(&handle, event, se->event_id.header.size);
 	if (ret)
@@ -6222,7 +6243,8 @@  static void perf_event_switch_output(struct perf_event *event, void *data)
 	else
 		perf_output_put(&handle, se->event_id);
 
-	perf_event__output_id_sample(event, &handle, &sample);
+	perf_event__output_extra(event, se->event_id.header.size,
+				 &handle, &sample);
 
 	perf_output_end(&handle);
 }
@@ -6282,7 +6304,7 @@  static void perf_log_throttle(struct perf_event *event, int enable)
 	if (enable)
 		throttle_event.header.type = PERF_RECORD_UNTHROTTLE;
 
-	perf_event_header__init_id(&throttle_event.header, &sample, event);
+	perf_event_header__init_extra(&throttle_event.header, &sample, event);
 
 	ret = perf_output_begin(&handle, event,
 				throttle_event.header.size);
@@ -6290,7 +6312,8 @@  static void perf_log_throttle(struct perf_event *event, int enable)
 		return;
 
 	perf_output_put(&handle, throttle_event);
-	perf_event__output_id_sample(event, &handle, &sample);
+	perf_event__output_extra(event, throttle_event.header.size,
+				 &handle, &sample);
 	perf_output_end(&handle);
 }
 
@@ -6318,14 +6341,15 @@  static void perf_log_itrace_start(struct perf_event *event)
 	rec.pid	= perf_event_pid(event, current);
 	rec.tid	= perf_event_tid(event, current);
 
-	perf_event_header__init_id(&rec.header, &sample, event);
+	perf_event_header__init_extra(&rec.header, &sample, event);
 	ret = perf_output_begin(&handle, event, rec.header.size);
 
 	if (ret)
 		return;
 
 	perf_output_put(&handle, rec);
-	perf_event__output_id_sample(event, &handle, &sample);
+	perf_event__output_extra(event, rec.header.size,
+				 &handle, &sample);
 
 	perf_output_end(&handle);
 }
@@ -8098,6 +8122,16 @@  perf_event_set_output(struct perf_event *event, struct perf_event *output_event)
 	    event->pmu != output_event->pmu)
 		goto out;
 
+	/*
+	 * Don't allow mixed tailsize settings since the resulting
+	 * ring buffer could not be parsed backward.
+	 *
+	 * '!=' is safe because has_tailsize() returns bool; two different
+	 * non-zero values would be treated as equal (both true).
+	 */
+	if (has_tailsize(event) != has_tailsize(output_event))
+		goto out;
+
 set:
 	mutex_lock(&event->mmap_mutex);
 	/* Can't redirect output if we've got an active mmap() */
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index adfdc05..5f8bd89 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -186,10 +186,11 @@  int perf_output_begin(struct perf_output_handle *handle,
 		lost_event.id          = event->id;
 		lost_event.lost        = local_xchg(&rb->lost, 0);
 
-		perf_event_header__init_id(&lost_event.header,
-					   &sample_data, event);
+		perf_event_header__init_extra(&lost_event.header,
+					      &sample_data, event);
 		perf_output_put(handle, lost_event);
-		perf_event__output_id_sample(event, handle, &sample_data);
+		perf_event__output_extra(event, lost_event.header.size,
+					 handle, &sample_data);
 	}
 
 	return 0;