diff mbox

[RFC] perf/core: Put size of a sample at the end of it

Message ID 1449063499-236703-1-git-send-email-wangnan0@huawei.com
State Superseded
Headers show

Commit Message

Wang Nan Dec. 2, 2015, 1:38 p.m. UTC
This is an RFC patch which is for overwrite mode ring buffer. I'd like
to discuss the correctness of this new idea for retriving as many
events as possible from overwrite mode ring buffer. If there's no
fundamental problem, I'll start perf side work.

The biggest problem for overwrite ring buffer is that it is hard to find
the start position of valid record. [1] and [2] tries to solve this
problem by introducing 'tail' and 'next_tail' into metadata page, and
update them each time the ring buffer is half full. Which adds more
instructions to event output code path, hurt performance. In addition,
even with them we still unable to recover all possible records. For
example:

             data_tail          head
                |                 |
                V                 V
 +------+-------+----------+------+---+
 |  A   |   B   |   C      |  D   |   |
 +------+-------+----------+------+---+

If a record written at head pointer and it overwrites record A:

   head      data_tail
    |           |
    V           V
 +--+---+-------+----------+------+---+
 |E |...|   B   |   C      |  D   | E |
 +--+---+-------+----------+------+---+

Record B is still valid but we can't get it through data_tail.

This patch suggests a different solution for this problem that, by
appending the length of a record at the end of it, user program is
possible to get every possible record in a backward manner, don't
need saving tail pointer.

For example:

   head
    |
    V
 +--+---+-------+----------+------+---+
 |E6|...|   B  8|   C    11|  D  7|E..|
 +--+---+-------+----------+------+---+

In this case, from the 'head' pointer provided by kernel, user program
can first see '6' by (*(head - sizeof(u16))), then it can get the start
pointer of record 'E', then it can read size and find start position
of record D, C, B in similar way.

Kernel side implementation is easy: simply adding a PERF_SAMPLE_SIZE
for the size output.

This sloution requires user program (perf) do more things. At least
following things and limitations should be considered:

 1. Before reading such ring buffer, perf must ensure all events which
    may output to it is already stopped, so the 'head' pointer it get
    is the end of the last record.

 2. We must ensure all events attached this ring buffer has
    'PERF_SAMPLE_SIZE' selected.

 3. There must no tracking events output to this ring buffer.

 4. 2 bytes extra space is required for each record.

Further improvement can be taken:

 1. If PERF_SAMPLE_SIZE is selected, we can avoid outputting the event
    size in header. Which eliminate extra space cose;

 2. We can find a way to append size information for tracking events
    also.

[1] http://lkml.kernel.org/r/20130708121557.GA17211@twins.programming.kicks-ass.net
[2] http://lkml.kernel.org/r/20151023151205.GW11639@twins.programming.kicks-ass.net

Signed-off-by: Wang Nan <wangnan0@huawei.com>

Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Yunlong Song <yunlong.song@huawei.com>
---
 include/uapi/linux/perf_event.h | 3 ++-
 kernel/events/core.c            | 6 ++++++
 2 files changed, 8 insertions(+), 1 deletion(-)

-- 
1.8.3.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Comments

Wang Nan Dec. 3, 2015, 10:31 a.m. UTC | #1
On 2015/12/3 18:08, Peter Zijlstra wrote:
> On Wed, Dec 02, 2015 at 01:38:19PM +0000, Wang Nan wrote:

>> This sloution requires user program (perf) do more things. At least

>> following things and limitations should be considered:

>>

>>   1. Before reading such ring buffer, perf must ensure all events which

>>      may output to it is already stopped, so the 'head' pointer it get

>>      is the end of the last record.

> Right, this is tricky, this would not allow two snapshots to happen back

> to back since that would then result in a bunch of missed events.

>

> Aside from this issue its a rather nice idea.


Thank you for your attitude. We can start consider it seriously.

Now I'm working on perf side code to make a workable prototype.

>>   2. We must ensure all events attached this ring buffer has

>>      'PERF_SAMPLE_SIZE' selected.

> That can be easily enforced.


Yes.

>>   3. There must no tracking events output to this ring buffer.

> That is rather unfortunate, we'd best fix that up.

>

>>   4. 2 bytes extra space is required for each record.

> 8, perf records must be 8 byte aligned and sized.


So does it means we need to pad before outputing size?

>> Further improvement can be taken:

>>

>>   1. If PERF_SAMPLE_SIZE is selected, we can avoid outputting the event

>>      size in header. Which eliminate extra space cose;

> That would mandate you always parse the stream backwards. Which seems

> rather unfortunate. Also, no you cannot recoup the extra space, see the

> alignment and size requirement.


That's good. We don't need to consider this :)

Before receiving your comment I'm thinking about modifying
DEFINE_OUTPUT_COPY() to write first sizeof(header) bytes at the
end of reserved area, so it would work automatically for every
possible events.

>>   2. We can find a way to append size information for tracking events

>>      also.

> The !sample records you mean? Yes those had better have them too.


Thank you.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Wang Nan Dec. 7, 2015, 1:28 p.m. UTC | #2
This patch set explores the idea shown in [1], which puts size of every
events at the end of them in ring buffer so user space tool like perf
can parse the ring buffer backward and find the oldest event in it.

In this version:

 - Kernel side, rename PERF_SAMPLE_SIZE to PERF_SAMPLE_SIZE_AT_END,
   output 8 bytes to meet the alignment requirement.

 - Perf side, provide a prototype utilize this feature. With this
   prototype users are allowed to capture the last events in an
   overwrite ring buffer using:

   # ./perf record -e dummy -e syscalls:*/overwrite/ ...

[1] http://lkml.kernel.org/g/1449063499-236703-1-git-send-email-wangnan0@huawei.com

Wang Nan (3):
  perf/core: Put size of a sample at the end of it
  perf tools: Enable overwrite settings
  perf record: Find tail pointer through size at end of event

 include/uapi/linux/perf_event.h |  3 +-
 kernel/events/core.c            |  9 +++++
 tools/perf/builtin-record.c     | 76 +++++++++++++++++++++++++++++++++++++++--
 tools/perf/perf.h               |  2 ++
 tools/perf/util/evlist.c        | 42 +++++++++++++++++------
 tools/perf/util/evsel.c         |  7 ++++
 tools/perf/util/evsel.h         |  3 ++
 tools/perf/util/parse-events.c  | 14 ++++++++
 tools/perf/util/parse-events.h  |  4 ++-
 tools/perf/util/parse-events.l  |  2 ++
 10 files changed, 147 insertions(+), 15 deletions(-)

-- 
1.8.3.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
diff mbox

Patch

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 1afe962..c4066da 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -139,8 +139,9 @@  enum perf_event_sample_format {
 	PERF_SAMPLE_IDENTIFIER			= 1U << 16,
 	PERF_SAMPLE_TRANSACTION			= 1U << 17,
 	PERF_SAMPLE_REGS_INTR			= 1U << 18,
+	PERF_SAMPLE_SIZE			= 1U << 19,
 
-	PERF_SAMPLE_MAX = 1U << 19,		/* non-ABI */
+	PERF_SAMPLE_MAX = 1U << 20,		/* non-ABI */
 };
 
 /*
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 5854fcf..bbbacec 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5473,6 +5473,9 @@  void perf_output_sample(struct perf_output_handle *handle,
 		}
 	}
 
+	if (sample_type & PERF_SAMPLE_SIZE)
+		perf_output_put(handle, header->size);
+
 	if (!event->attr.watermark) {
 		int wakeup_events = event->attr.wakeup_events;
 
@@ -5592,6 +5595,9 @@  void perf_prepare_sample(struct perf_event_header *header,
 
 		header->size += size;
 	}
+
+	if (sample_type & PERF_SAMPLE_SIZE)
+		header->size += sizeof(u16);
 }
 
 void perf_event_output(struct perf_event *event,