[4/4] perf core: Add backward attribute to perf event

This patch introduces 'write_backward' bit to perf_event_attr, which
controls the direction of a ring buffer. After set, the corresponding
ring buffer is written from end to beginning. This feature is design to
support reading from overwritable ring buffer.

Ring buffer can be created by mapping a perf event fd. Kernel puts event
records into ring buffer, user programs like perf fetch them from
address returned by mmap(). To prevent racing between kernel and perf,
they communicate to each other through 'head' and 'tail' pointers.
Kernel maintains 'head' pointer, points it to the next free area (tail
of the last record). Perf maintains 'tail' pointer, points it to the
tail of last consumed record (record has already been fetched). Kernel
determines the available space in a ring buffer using these two
pointers, prevents to overwrite unfetched records.

By mapping without 'PROT_WRITE', an overwritable ring buffer is created.
Different from normal ring buffer, perf is unable to maintain 'tail'
pointer because writing is forbidden. Therefore, for this type of ring
buffers, kernel overwrite old records unconditionally, works like flight
recorder. This feature would be useful if reading from overwritable ring
buffer were as easy as reading from normal ring buffer. However,
there's an obscure problem.

The following figure demonstrates the state of an overwritable ring
buffer which is nearly full. In this figure, the 'head' pointer points
to the end of last record, and a long record 'E' is pending. For a
normal ring buffer, a 'tail' pointer would have pointed to position (X),
so kernel knows there's no more space in the ring buffer. However, for
an overwritable ring buffer, kernel doesn't care the 'tail' pointer.

   (X)                              head
    .                                |
    .                                V
    +------+-------+----------+------+---+
    |A....A|B.....B|C........C|D....D|   |
    +------+-------+----------+------+---+

After writing record 'E', record 'A' is overwritten.

      head
       |
       V
    +--+---+-------+----------+------+---+
    |.E|..A|B.....B|C........C|D....D|E..|
    +--+---+-------+----------+------+---+

Now perf decides to read from this ring buffer. However, none of the
the two natural positions, 'head' and the start of this ring buffer,
are pointing to the head of a record. Even perf can read the full ring
buffer, it is unable to find the position to start decoding.

The first attempt tries to solve this problem AFAIK can be found from
[1]. It makes kernel to maintain 'tail' pointer: updates it when ring
buffer is half full. However, this approach introduces overhead to
fast path. Test result shows a 1% overhead [2]. In addition, this method
utilizes no more tham 50% records.

Another attempt can be found from [3], which allow putting the size of
an event at the end of each record. This approach allows perf to find
records in a backword manner from 'head' pointer by reading size of a
record from its tail. However, because of alignment requirement, it
needs 8 bytes to record the size of a record, which is a huge waste. Its
performance is also not good, because more data need to be written.
This approach also introduces some extra branch instructions to fast
path.

'write_backward' is a better solution to this problem.

Following figure demonstrates the state of the overwritable ring buffer
when 'write_backward' is set before overwriting:

       head
        |
        V
    +---+------+----------+-------+------+
    |   |D....D|C........C|B.....B|A....A|
    +---+------+----------+-------+------+

and after overwriting:
                                     head
                                      |
                                      V
    +---+------+----------+-------+---+--+
    |..E|D....D|C........C|B.....B|A..|E.|
    +---+------+----------+-------+---+--+

In each situation, 'head' points to the beginning of the newest record.
From this record, perf can iterate over the full ring buffer, fetching
as mush records as possible one by one.

The only limitation needs to consider is back-to-back reading. Due to
the non-deterministic of user program, it is impossible to ensure the
ring buffer keeps stable during reading. Consider an extreme situation:
perf is scheduled out after reading record 'D', then a burst of events
come, eat up the whole ring buffer (one or multiple rounds), but 'head'
pointer happends to be at the same position when perf comes back.
Continue reading after 'D' is incorrect now.

To prevent this problem, we need to find a way to ensure the ring buffer
is stable during reading. ioctl(PERF_EVENT_IOC_PAUSE_OUTPUT) is
suggested because its overhead is lower than
ioctl(PERF_EVENT_IOC_ENABLE).

This patch utilizes event's default overflow_handler introduced
previously. perf_event_output_backward() is created as the default
overflow handler for backward ring buffers. To avoid extra overhead to
fast path, original perf_event_output() becomes __perf_event_output()
and marked '__always_inline'. In theory, there's no extra overhead
introduced to fast path.

Performance result:

Calling 3000000 times of 'close(-1)', use gettimeofday() to check
duration.  Use 'perf record -o /dev/null -e raw_syscalls:*' to capture
system calls. In ns.

Testing environment:

 CPU    : Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
 Kernel : v4.5.0
                   MEAN         STDVAR
BASE            800214.950    2853.083
PRE1           2253846.700    9997.014
PRE2           2257495.540    8516.293
POST           2250896.100    8933.921

Where 'BASE' is pure performance without capturing. 'PRE1' is test
result of pure 'v4.5.0' kernel. 'PRE2' is test result before this
patch. 'POST' is test result after this patch. See [4] for detail
experimental setup.

Considering the stdvar, this patch doesn't introduce performance
overhead to fast path.

[1] http://lkml.iu.edu/hypermail/linux/kernel/1304.1/04584.html
[2] http://lkml.iu.edu/hypermail/linux/kernel/1307.1/00535.html
[3] http://lkml.iu.edu/hypermail/linux/kernel/1512.0/01265.html
[4] http://lkml.kernel.org/g/56F89DCD.1040202@huawei.com

Signed-off-by: Wang Nan <wangnan0@huawei.com>

Cc: He Kuang <hekuang@huawei.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Zefan Li <lizefan@huawei.com>
Cc: pi3orama@163.com
---
 include/linux/perf_event.h      | 28 +++++++++++++++++++++---
 include/uapi/linux/perf_event.h |  3 ++-
 kernel/events/core.c            | 48 ++++++++++++++++++++++++++++++++++++-----
 kernel/events/ring_buffer.c     | 14 ++++++++++++
 4 files changed, 84 insertions(+), 9 deletions(-)

-- 
1.8.3.4

[4/4] perf core: Add backward attribute to perf event

Commit Message

Comments

Patch