diff mbox

[V5,1/1] bpf: control events stored in PERF_EVENT_ARRAY maps trace data output when perf sampling

Message ID 1445325735-121694-2-git-send-email-xiakaixu@huawei.com
State New
Headers show

Commit Message

Kaixu Xia Oct. 20, 2015, 7:22 a.m. UTC
This patch adds the flag soft_enable to control the trace data
output process during perf sampling. By setting this flag and
integrating with ebpf, we can control the data output process and
get the samples we are most interested in.

The bpf helper bpf_perf_event_control() can control either the perf
event on the current cpu or all the perf events stored in the maps,
depending on the third parameter 'flags'.

Signed-off-by: Kaixu Xia <xiakaixu@huawei.com>
---
 include/linux/perf_event.h      |  1 +
 include/uapi/linux/bpf.h        | 11 ++++++++
 include/uapi/linux/perf_event.h |  3 +-
 kernel/bpf/verifier.c           |  3 +-
 kernel/events/core.c            | 13 +++++++++
 kernel/trace/bpf_trace.c        | 62 +++++++++++++++++++++++++++++++++++++++++
 6 files changed, 91 insertions(+), 2 deletions(-)
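
For illustration only, a BPF program using the proposed helper might look
roughly like the sketch below (samples/bpf conventions). The helper
signature, the BPF_FUNC_* enum name and the flag values are assumptions
drawn from the commit message, not copied from the patch body:

#include <linux/ptrace.h>
#include <uapi/linux/bpf.h>
#include "bpf_helpers.h"

/* hypothetical wrapper; assumes the patch adds BPF_FUNC_perf_event_control */
static int (*bpf_perf_event_control)(void *map, int index, int flags) =
	(void *) BPF_FUNC_perf_event_control;

struct bpf_map_def SEC("maps") pmu_events = {
	.type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
	.key_size = sizeof(int),
	.value_size = sizeof(unsigned int),
	.max_entries = 64,		/* one slot per CPU, filled in by perf */
};

SEC("kprobe/sys_write")
int on_write_entry(struct pt_regs *ctx)
{
	/* hypothetical flag value 1: enable sample output */
	bpf_perf_event_control(&pmu_events, 0, 1);
	return 0;
}

SEC("kretprobe/sys_write")
int on_write_exit(struct pt_regs *ctx)
{
	/* hypothetical flag value 0: disable sample output again */
	bpf_perf_event_control(&pmu_events, 0, 0);
	return 0;
}

char _license[] SEC("license") = "GPL";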

Comments

Alexei Starovoitov Oct. 20, 2015, 10:53 p.m. UTC | #1
On 10/20/15 12:22 AM, Kaixu Xia wrote:
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index b11756f..5219635 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -6337,6 +6337,9 @@ static int __perf_event_overflow(struct perf_event *event,
>   		irq_work_queue(&event->pending);
>   	}
>
> +	if (unlikely(!atomic_read(&event->soft_enable)))
> +		return 0;
> +
>   	if (event->overflow_handler)
>   		event->overflow_handler(event, data, regs);
>   	else

Peter,
does this part look right or it should be moved right after
if (unlikely(!is_sampling_event(event)))
                 return 0;
or even to other function?

It feels to me that it should be moved, since we probably don't
want to activate throttling, period adjustment and event_limit for
events that are in the soft_disabled state.
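
Concretely, the suggested placement would be roughly this (illustrative
hunk only, not a tested patch; context abbreviated):

@@ static int __perf_event_overflow(struct perf_event *event,
 	if (unlikely(!is_sampling_event(event)))
 		return 0;
 
+	/* skip throttling, period adjustment and event_limit accounting
+	 * for soft-disabled events */
+	if (unlikely(!atomic_read(&event->soft_enable)))
+		return 0;
+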

Peter Zijlstra Oct. 21, 2015, 9:12 a.m. UTC | #2
On Tue, Oct 20, 2015 at 03:53:02PM -0700, Alexei Starovoitov wrote:
> On 10/20/15 12:22 AM, Kaixu Xia wrote:
> >diff --git a/kernel/events/core.c b/kernel/events/core.c
> >index b11756f..5219635 100644
> >--- a/kernel/events/core.c
> >+++ b/kernel/events/core.c
> >@@ -6337,6 +6337,9 @@ static int __perf_event_overflow(struct perf_event *event,
> >  		irq_work_queue(&event->pending);
> >  	}
> >
> >+	if (unlikely(!atomic_read(&event->soft_enable)))
> >+		return 0;
> >+
> >  	if (event->overflow_handler)
> >  		event->overflow_handler(event, data, regs);
> >  	else
> 
> Peter,
> does this part look right or it should be moved right after
> if (unlikely(!is_sampling_event(event)))
>                 return 0;
> or even to other function?
> 
> It feels to me that it should be moved, since we probably don't
> want to active throttling, period adjust and event_limit for events
> that are in soft_disabled state.

Depends on what it's meant to do. As long as you let the interrupt
happen, I think we should in fact do those things (maybe not the
event_limit), but period adjustment and esp. throttling are important
when the event is enabled.

If you want to actually disable the event: pmu->stop() will make it
stop, and you can restart using pmu->start().

And I suppose you can wrap that with a counter if you need nesting.
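
In code, that might look roughly like the following sketch
(soft_disable_count is a made-up field, not something that exists today,
and both calls must run on the CPU the event lives on):

static void perf_event_soft_disable(struct perf_event *event)
{
	/* first disable request actually stops the event */
	if (atomic_inc_return(&event->soft_disable_count) == 1)
		event->pmu->stop(event, PERF_EF_UPDATE);
}

static void perf_event_soft_enable(struct perf_event *event)
{
	/* last enable request restarts it */
	if (atomic_dec_and_test(&event->soft_disable_count))
		event->pmu->start(event, PERF_EF_RELOAD);
}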

I'm not sure if any of that is a viable solution, because the patch
description is somewhat short on the problem statement.

As is, I'm not too charmed with the patch, but lacking a better
understanding of what exactly we're trying to achieve I'm struggling
with proposing alternatives.
Kaixu Xia Oct. 21, 2015, 10:31 a.m. UTC | #3
On 2015/10/21 17:12, Peter Zijlstra wrote:
> On Tue, Oct 20, 2015 at 03:53:02PM -0700, Alexei Starovoitov wrote:
>> On 10/20/15 12:22 AM, Kaixu Xia wrote:
>>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>>> index b11756f..5219635 100644
>>> --- a/kernel/events/core.c
>>> +++ b/kernel/events/core.c
>>> @@ -6337,6 +6337,9 @@ static int __perf_event_overflow(struct perf_event *event,
>>>  		irq_work_queue(&event->pending);
>>>  	}
>>>
>>> +	if (unlikely(!atomic_read(&event->soft_enable)))
>>> +		return 0;
>>> +
>>>  	if (event->overflow_handler)
>>>  		event->overflow_handler(event, data, regs);
>>>  	else
>>
>> Peter,
>> does this part look right or it should be moved right after
>> if (unlikely(!is_sampling_event(event)))
>>                 return 0;
>> or even to other function?
>>
>> It feels to me that it should be moved, since we probably don't
>> want to active throttling, period adjust and event_limit for events
>> that are in soft_disabled state.
> 
> Depends on what its meant to do. As long as you let the interrupt
> happen, I think we should in fact do those things (maybe not the
> event_limit), but period adjustment and esp. throttling are important
> when the event is enabled.
> 
> If you want to actually disable the event: pmu->stop() will make it
> stop, and you can restart using pmu->start().
> 
> And I suppose you can wrap that with a counter if you need nesting.
> 
> I'm not sure if any of that is a viable solution, because the patch
> description is somewhat short on the problem statement.
> 
> As is, I'm not too charmed with the patch, but lacking a better
> understanding of what exactly we're trying to achieve I'm struggling
> with proposing alternatives.

Thanks for your comments!
The RFC patch set contains the necessary commit log [1].

In some scenarios we don't want to output trace data when perf is
sampling, in order to reduce overhead. For example, perf can be run
as a daemon that dumps trace data only when necessary, such as when
system performance goes down. Just like the example given in the cover
letter, we only receive the samples within the sys_write() syscall.

The helper bpf_perf_event_control() in this patch set can control the
data output process so we get the samples we are most interested in.
A cpu_function_call is probably too much to do from a bpf program, so
I chose the current 'soft_disable'-like design.

[1] https://lkml.org/lkml/2015/10/12/135
Peter Zijlstra Oct. 21, 2015, 11:33 a.m. UTC | #4
On Wed, Oct 21, 2015 at 06:31:04PM +0800, xiakaixu wrote:

> The RFC patch set contains the necessary commit log [1].

That's of course the wrong place, this should be in the patch's
Changelog. It doesn't become less relevant.

> In some scenarios we don't want to output trace data when perf sampling
> in order to reduce overhead. For example, perf can be run as daemon to
> dump trace data when necessary, such as the system performance goes down.
> Just like the example given in the cover letter, we only receive the
> samples within sys_write() syscall.
> 
> The helper bpf_perf_event_control() in this patch set can control the
> data output process and get the samples we are most interested in.
> The cpu_function_call is probably too much to do from bpf program, so
> I choose current design that like 'soft_disable'.

So, IIRC, we already require eBPF perf events to be CPU-local, which
obviates the entire need for IPIs.

So calling pmu->stop() seems entirely possible (it's even NMI safe).
This, however, does not explain whether you need nesting; your patch
seemed to have a counter, which suggests you do.

In any case, you could add perf_event_{stop,start}_local() to mirror the
existing perf_event_read_local(), no? That would stop the entire thing
and reduce even more overhead than simply skipping the overflow handler.
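
Such a perf_event_stop_local() could be a thin wrapper along these lines
(rough sketch only, assuming the same "local, non-inherited event"
constraints that perf_event_read_local() enforces):

void perf_event_stop_local(struct perf_event *event)
{
	unsigned long flags;

	local_irq_save(flags);
	WARN_ON_ONCE(event->attr.inherit);
	/* only touch the event if it is active on this CPU */
	if (event->oncpu == smp_processor_id())
		event->pmu->stop(event, PERF_EF_UPDATE);
	local_irq_restore(flags);
}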

> [1] https://lkml.org/lkml/2015/10/12/135

Blergh, vger should auto drop emails with lkml.org links in, that site
is getting ridiculously unreliable. (It did show the email after a
second try -- this time)

Proper links are of the form:

  http://lkml.kernel.org/r/$MSGID

Those have the bonus of actually including the msgid which helps with
finding the email in local archives/mailers.
Wang Nan Oct. 21, 2015, 11:34 a.m. UTC | #5
On 2015/10/21 17:12, Peter Zijlstra wrote:
> On Tue, Oct 20, 2015 at 03:53:02PM -0700, Alexei Starovoitov wrote:
>> On 10/20/15 12:22 AM, Kaixu Xia wrote:
>>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>>> index b11756f..5219635 100644
>>> --- a/kernel/events/core.c
>>> +++ b/kernel/events/core.c
>>> @@ -6337,6 +6337,9 @@ static int __perf_event_overflow(struct perf_event *event,
>>>   		irq_work_queue(&event->pending);
>>>   	}
>>>
>>> +	if (unlikely(!atomic_read(&event->soft_enable)))
>>> +		return 0;
>>> +
>>>   	if (event->overflow_handler)
>>>   		event->overflow_handler(event, data, regs);
>>>   	else
>> Peter,
>> does this part look right or it should be moved right after
>> if (unlikely(!is_sampling_event(event)))
>>                  return 0;
>> or even to other function?
>>
>> It feels to me that it should be moved, since we probably don't
>> want to active throttling, period adjust and event_limit for events
>> that are in soft_disabled state.
> Depends on what its meant to do. As long as you let the interrupt
> happen, I think we should in fact do those things (maybe not the
> event_limit), but period adjustment and esp. throttling are important
> when the event is enabled.
>
> If you want to actually disable the event: pmu->stop() will make it
> stop, and you can restart using pmu->start().

I also prefer totally disabling the event because our goal is to reduce
sampling overhead as much as possible. However, events in perf are
CPU-bound: one event on the perf cmdline becomes multiple 'perf_event's
in the kernel on a multi-core system. Disabling/enabling events on all
CPUs from a BPF program is a hard task due to racing, NMI, ...

Think about an example scenario: we want to sample cycles in a
system-wide way to see what the whole system does while a smartphone
refreshes its display, and we don't want other samples while the display
is not refreshing. We probe the entry and exit points of Display.refresh()
(a fictional user function), then let two BPF programs enable 'cycles'
sampling at the entry point and disable it at the exit point.

For this task, we need to start all 'cycles' perf_events when the display
starts refreshing, and disable all of those events when refreshing is
finished. Enabling the event only on the core that executes the entry
point of Display.refresh() is not enough because the real workers are
running on other cores; we need them to do the computation cooperatively.
Also, the scheduler may schedule the exit point of Display.refresh()
onto another core, so we can't simply disable the perf_event on that core
and let the other cores keep sampling after refreshing finishes.

I have thought of a method which can disable sampling in a safe way:
we can call pmu->stop() inside the PMU IRQ handler, so we can ensure
that pmu->stop() is always called on the core its event resides on.
However, I don't know how to re-enable them safely. Maybe we need
something in the scheduler?

Thank you.

> And I suppose you can wrap that with a counter if you need nesting.
>
> I'm not sure if any of that is a viable solution, because the patch
> description is somewhat short on the problem statement.
>
> As is, I'm not too charmed with the patch, but lacking a better
> understanding of what exactly we're trying to achieve I'm struggling
> with proposing alternatives.


Wang Nan Oct. 21, 2015, 11:49 a.m. UTC | #6
On 2015/10/21 19:33, Peter Zijlstra wrote:
> On Wed, Oct 21, 2015 at 06:31:04PM +0800, xiakaixu wrote:
>
>> The RFC patch set contains the necessary commit log [1].
> That's of course the wrong place, this should be in the patch's
> Changelog. It doesn't become less relevant.
>
>> In some scenarios we don't want to output trace data when perf sampling
>> in order to reduce overhead. For example, perf can be run as daemon to
>> dump trace data when necessary, such as the system performance goes down.
>> Just like the example given in the cover letter, we only receive the
>> samples within sys_write() syscall.
>>
>> The helper bpf_perf_event_control() in this patch set can control the
>> data output process and get the samples we are most interested in.
>> The cpu_function_call is probably too much to do from bpf program, so
>> I choose current design that like 'soft_disable'.
> So, IIRC, we already require eBPF perf events to be CPU-local, which
> obviates the entire need for IPIs.

But soft disable/enable doesn't require an IPI because it is only
a memory store operation.

> So calling pmu->stop() seems entirely possible (its even NMI safe).

But we need to turn off sampling across CPUs. Please have a look
at my other email.

> This, however, does not explain if you need nesting, your patch seemed
> to have a counter, which suggest you do.
To avoid racing.

If our task is to sample cycle events while a function is running,
and executions of that function on two cores overlap:

Time:   ...................A
Core 0: sys_write----\
                       \
                        \
Core 1:             sys_write%return
Core 2: ................sys_write

Then, without the counter, at time A it is highly possible that the
BPF programs on core 1 and core 2 conflict with each other. The final
result is that some of those events end up turned on and others turned
off. Using an atomic counter avoids this problem.
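
A sketch of the counter semantics (names illustrative; the point is that
balanced enter/exit pairs from different cores cannot leave the flag in a
wrong state, whatever the interleaving):

/* entry path, on whichever core hits it */
static void soft_enable(struct perf_event *event)
{
	atomic_inc(&event->soft_enable);	/* > 0 means "output samples" */
}

/* exit path, possibly on a different core */
static void soft_disable(struct perf_event *event)
{
	atomic_dec(&event->soft_enable);	/* back to 0 once all pairs balance */
}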

Thank you.

>
> In any case, you could add perf_event_{stop,start}_local() to mirror the
> existing perf_event_read_local(), no? That would stop the entire thing
> and reduce even more overhead than simply skipping the overflow handler.
>


Peter Zijlstra Oct. 21, 2015, 11:56 a.m. UTC | #7
On Wed, Oct 21, 2015 at 07:34:28PM +0800, Wangnan (F) wrote:
> >If you want to actually disable the event: pmu->stop() will make it
> >stop, and you can restart using pmu->start().
> 
> I also prefer totally disabling event because our goal is to reduce
> sampling overhead as mush as possible. However, events in perf is
> CPU bounded, one event in perf cmdline becomes multiple 'perf_event'
> in kernel in multi-core system. Disabling/enabling events on all CPUs
> by a BPF program a hard task due to racing, NMI, ...

But eBPF perf events must already be local afaik. Look at the
constraints perf_event_read_local() places on the events.
Wang Nan Oct. 21, 2015, 12:03 p.m. UTC | #8
On 2015/10/21 19:56, Peter Zijlstra wrote:
> On Wed, Oct 21, 2015 at 07:34:28PM +0800, Wangnan (F) wrote:
>>> If you want to actually disable the event: pmu->stop() will make it
>>> stop, and you can restart using pmu->start().
>> I also prefer totally disabling event because our goal is to reduce
>> sampling overhead as mush as possible. However, events in perf is
>> CPU bounded, one event in perf cmdline becomes multiple 'perf_event'
>> in kernel in multi-core system. Disabling/enabling events on all CPUs
>> by a BPF program a hard task due to racing, NMI, ...
> But eBPF perf events must already be local afaik. Look at the
> constraints perf_event_read_local() places on the events.

I think soft disabling/enabling is free of this constraint,
because technically speaking a soft-disabled perf event is still
running.

What we want to disable is only the sampling action, to avoid
being overwhelmed by samples and to reduce the overhead of
writing unneeded sample data to perf.data. I don't care much
whether the PMU counter is stopped or still running. Even if it
still generates interrupts, I think that should be acceptable
because the interrupt handling can be fast.

Thank you.

Peter Zijlstra Oct. 21, 2015, 12:17 p.m. UTC | #9
On Wed, Oct 21, 2015 at 07:49:34PM +0800, Wangnan (F) wrote:
> If our task is sampling cycle events during a function is running,
> and if two cores start that function overlap:
> 
> Time:   ...................A
> Core 0: sys_write----\
>                       \
>                        \
> Core 1:             sys_write%return
> Core 2: ................sys_write
> 
> Then without counter at time A it is highly possible that
> BPF program on core 1 and core 2 get conflict with each other.
> The final result is we make some of those events be turned on
> and others turned off. Using atomic counter can avoid this
> problem.

But but, how and why can an eBPF program access a !local event? I
thought we had hard restrictions on that.

Wang Nan Oct. 21, 2015, 1:42 p.m. UTC | #10
On 2015/10/21 20:17, Peter Zijlstra wrote:
> On Wed, Oct 21, 2015 at 07:49:34PM +0800, Wangnan (F) wrote:
>> If our task is sampling cycle events during a function is running,
>> and if two cores start that function overlap:
>>
>> Time:   ...................A
>> Core 0: sys_write----\
>>                        \
>>                         \
>> Core 1:             sys_write%return
>> Core 2: ................sys_write
>>
>> Then without counter at time A it is highly possible that
>> BPF program on core 1 and core 2 get conflict with each other.
>> The final result is we make some of those events be turned on
>> and others turned off. Using atomic counter can avoid this
>> problem.
> But but, how and why can an eBPF program access a !local event? I
> thought we had hard restrictions on that.

How can an eBPF program access a !local event:

When creating the perf event array we don't care which perf event
is for which CPU, so a BPF program can access any perf event in
that array, which is straightforward. And for soft
disabling/enabling, all that needs to be done is an atomic
operation on a field in the 'perf_event' structure, which is safe
enough.

Why we need an eBPF program to access a !local event:

I think I have explained why we need an eBPF program to access
a !local event. In summary, without this ability we can't
speak the user's language, because users focus on higher-level
concepts (display refreshing, application booting, http
processing...) and 'on which CPU' is not in their dictionary most
of the time. Without cross-core soft enabling/disabling it is hard
to translate requirements like "start sampling when display refreshing
begins" and "stop sampling when the application has booted" into eBPF
programs and a perf cmdline. Don't you think this is useful for reducing
sampling data and needs to be considered?

One alternative solution I can imagine is to attach a BPF program
to sampling, like a kprobe, and return 0 if we don't want the sample
to be taken. Thoughts? Actually I don't like it very much
because the principle of soft-disable is much simpler and safer, but
if you really like it I think we can try.

Do you think soft-disabling/enabling perf events on other cores causes
any real problem?

Thank you.

Peter Zijlstra Oct. 21, 2015, 1:49 p.m. UTC | #11
On Wed, Oct 21, 2015 at 09:42:12PM +0800, Wangnan (F) wrote:
> How can an eBPF program access a !local event:
> 
> when creating perf event array we don't care which perf event
> is for which CPU, so perf program can access any perf event in
> that array.

So what is stopping the eBPF thing from calling perf_event_read_local()
on a !local event and triggering a kernel splat?

pi3orama Oct. 21, 2015, 2:01 p.m. UTC | #12
Sent from my iPhone

> On Oct 21, 2015, at 9:49 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
>> On Wed, Oct 21, 2015 at 09:42:12PM +0800, Wangnan (F) wrote:
>> How can an eBPF program access a !local event:
>> 
>> when creating perf event array we don't care which perf event
>> is for which CPU, so perf program can access any perf event in
>> that array.
> 
> So what is stopping the eBPF thing from calling perf_event_read_local()
> on a !local event and triggering a kernel splat?

I can understand the perf_event_read_local() case, but I really can't understand
what stops us from writing to an atomic field belonging to a !local perf event.
Could you please explain further?

Thank you.
> 

Peter Zijlstra Oct. 21, 2015, 2:09 p.m. UTC | #13
On Wed, Oct 21, 2015 at 10:01:46PM +0800, pi3orama wrote:
> > On Oct 21, 2015, at 9:49 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> >> On Wed, Oct 21, 2015 at 09:42:12PM +0800, Wangnan (F) wrote:
> >> How can an eBPF program access a !local event:
> >> 
> >> when creating perf event array we don't care which perf event
> >> is for which CPU, so perf program can access any perf event in
> >> that array.
> > 
> > So what is stopping the eBPF thing from calling perf_event_read_local()
> > on a !local event and triggering a kernel splat?
> 
> I can understand the perf_event_read_local() case, but I really can't understand
> what is stopping us to write to an atomic field belong to a !local perf event.
> Could you please give a further explanation?

I simply do not get how this eBPF stuff works.

Either I have access to !local events and I can hand one to
perf_event_read_local() and cause badness, or I do not have access to
!local events and the whole 'soft enable/disable' thing is simple.

They cannot be both true.

So explain; how does this eBPF stuff work.
pi3orama Oct. 21, 2015, 3:06 p.m. UTC | #14
Sent from my iPhone

> On Oct 21, 2015, at 10:09 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Wed, Oct 21, 2015 at 10:01:46PM +0800, pi3orama wrote:
>>> On Oct 21, 2015, at 9:49 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>>> 
>>>> On Wed, Oct 21, 2015 at 09:42:12PM +0800, Wangnan (F) wrote:
>>>> How can an eBPF program access a !local event:
>>>> 
>>>> when creating perf event array we don't care which perf event
>>>> is for which CPU, so perf program can access any perf event in
>>>> that array.
>>> 
>>> So what is stopping the eBPF thing from calling perf_event_read_local()
>>> on a !local event and triggering a kernel splat?
>> 
>> I can understand the perf_event_read_local() case, but I really can't understand
>> what is stopping us to write to an atomic field belong to a !local perf event.
>> Could you please give a further explanation?
> 
> I simply do not get how this eBPF stuff works.
> 
> Either I have access to !local events and I can hand one to
> perf_event_read_local() and cause badness, or I do not have access to
> !local events and the whole 'soft enable/disable' thing is simple.
> 
> They cannot be both true.
> 
> So explain; how does this eBPF stuff work.

I think I get your point this time, so let me explain the eBPF stuff to you.

You think that a BPF programmer could break the system in this way:

A=get_non_local_perf_event()
perf_event_read_local(A)
BOOM!

However, the above logic is impossible because a BPF program can't work this
way.

First of all, it is impossible for a BPF program to directly invoke a kernel
function. Unlike a kernel module, a BPF program can only invoke functions
designed for it, like what this patch does. So the ability of BPF programs is
strictly restricted by the kernel. If we don't allow a BPF program to call
perf_event_read_local() across cores, we can check this and return an error
from the function we provide for it.

Second: there's no way for a BPF program to directly access a perf event. All
perf events have to be wrapped by a map and accessed through the BPF functions
described above. We don't allow a BPF program to fetch array elements from that
map. So pointers to perf events are safely protected from BPF programs.

In summary, your either-or logic doesn't hold in the BPF world. A BPF program
can only access a perf event in a highly restricted way. We don't allow it to
call perf_event_read_local() across cores, so it can't.

Thank you.
Peter Zijlstra Oct. 21, 2015, 4:57 p.m. UTC | #15
On Wed, Oct 21, 2015 at 11:06:47PM +0800, pi3orama wrote:
> > So explain; how does this eBPF stuff work.
> 
> I think I get your point this time, and let me explain the eBPF stuff to you.
> 
> You are aware that BPF programmer can break the system in this way:
> 
> A=get_non_local_perf_event()
> perf_event_read_local(A)
> BOOM!
> 
> However the above logic is impossible because BPF program can't work this
> way.
> 
> First of all, it is impossible for a BPF program directly invoke a
> kernel function.  Doesn't like kernel module, BPF program can only
> invoke functions designed for them, like what this patch does. So the
> ability of BPF programs is strictly restricted by kernel. If we don't
> allow BPF program call perf_event_read_local() across core, we can
> check this and return error in function we provide for them.
> 
> Second: there's no way for a BPF program directly access a perf event.
> All perf events have to be wrapped by a map and be accessed by BPF
> functions described above. We don't allow BPF program fetch array
> element from that map. So pointers of perf event is safely protected
> from BPF program.
> 
> In summary, your either-or logic doesn't hold in BPF world. A BPF
> program can only access perf event in a highly restricted way. We
> don't allow it calling perf_event_read_local() across core, so it
> can't.

Urgh, that's still horridly inconsistent. Can we please come up with a
consistent interface to perf?
Alexei Starovoitov Oct. 21, 2015, 9:19 p.m. UTC | #16
On 10/21/15 9:57 AM, Peter Zijlstra wrote:
>> In summary, your either-or logic doesn't hold in BPF world. A BPF
>> >program can only access perf event in a highly restricted way. We
>> >don't allow it calling perf_event_read_local() across core, so it
>> >can't.

That's actually broken. My fault as well, since I didn't review that
patch carefully enough. Will send a fix in a second.
No matter what a bpf program does, there should never be a kernel splat.

> Urgh, that's still horridly inconsistent. Can we please come up with a
> consistent interface to perf?

I had the same concerns during v1-v4 series of this set.
My suggestion was to do ioctl(enable/disable) of events from userspace
after receiving notification from kernel via my bpf_perf_event_output()
helper.
Wangnan's argument was that display refresh happens often and it's fast,
so the time taken by user space to enable events on all cpus is too
slow and ioctl does ipi and disturbs other cpus even more.
So soft_disable done by the program to enable/disable particular events
on all cpus kinda makes sense.

Wang Nan Oct. 22, 2015, 1:56 a.m. UTC | #17
Hi Alexei,

On 2015/10/21 21:42, Wangnan (F) wrote:
>
>
> One alternative solution I can image is to attach a BPF program
> at sampling like kprobe, and return 0 if we don't want sampling
> take action. Thought?

Do you think attaching BPF programs to sampling is an acceptable idea?

Thank you.

> Actually speaking I don't like it very much
> because the principle of soft-disable is much simpler and safe, but
> if you really like it I think we can try.


Wang Nan Oct. 22, 2015, 2:46 a.m. UTC | #18
On 2015/10/22 0:57, Peter Zijlstra wrote:
> On Wed, Oct 21, 2015 at 11:06:47PM +0800, pi3orama wrote:
>>> So explain; how does this eBPF stuff work.
>> I think I get your point this time, and let me explain the eBPF stuff to you.
>>
>> You are aware that BPF programmer can break the system in this way:
>>
>> A=get_non_local_perf_event()
>> perf_event_read_local(A)
>> BOOM!
>>
>> However the above logic is impossible because BPF program can't work this
>> way.
>>
>> First of all, it is impossible for a BPF program directly invoke a
>> kernel function.  Doesn't like kernel module, BPF program can only
>> invoke functions designed for them, like what this patch does. So the
>> ability of BPF programs is strictly restricted by kernel. If we don't
>> allow BPF program call perf_event_read_local() across core, we can
>> check this and return error in function we provide for them.
>>
>> Second: there's no way for a BPF program directly access a perf event.
>> All perf events have to be wrapped by a map and be accessed by BPF
>> functions described above. We don't allow BPF program fetch array
>> element from that map. So pointers of perf event is safely protected
>> from BPF program.
>>
>> In summary, your either-or logic doesn't hold in BPF world. A BPF
>> program can only access perf event in a highly restricted way. We
>> don't allow it calling perf_event_read_local() across core, so it
>> can't.
> Urgh, that's still horridly inconsistent. Can we please come up with a
> consistent interface to perf?

BPF programs and kernel modules are two different worlds, as I said before.

I don't think making them share a common interface is a good idea,
because such sharing would give BPF programs more freedom than they
really need, and then it would be hard to prevent them from doing
something bad. If we really need a kernel interface, I think what we
need is a kernel module, not a BPF program.

Thank you.

Alexei Starovoitov Oct. 22, 2015, 3:09 a.m. UTC | #19
On 10/21/15 6:56 PM, Wangnan (F) wrote:
>> One alternative solution I can image is to attach a BPF program
>> at sampling like kprobe, and return 0 if we don't want sampling
>> take action. Thought?
>
> Do you think attaching BPF programs to sampling is an acceptable idea?

If you mean to extend 'filter' concept to sampling events?
So instead of soft_disable of non-local events, you'll attach bpf
program to sampling events and use map lookup to decide whether
to filter out or not such sampling event?
What pt_regs would be in such case?

Wang Nan Oct. 22, 2015, 3:12 a.m. UTC | #20
On 2015/10/22 11:09, Alexei Starovoitov wrote:
> On 10/21/15 6:56 PM, Wangnan (F) wrote:
>>> One alternative solution I can image is to attach a BPF program
>>> at sampling like kprobe, and return 0 if we don't want sampling
>>> take action. Thought?
>>
>> Do you think attaching BPF programs to sampling is an acceptable idea?
>
> If you mean to extend 'filter' concept to sampling events?
> So instead of soft_disable of non-local events, you'll attach bpf
> program to sampling events and use map lookup to decide whether
> to filter out or not such sampling event?

Yes.

> What pt_regs would be in such case?
>

Sampling is based on interrupts. We can use the pt_regs captured by the
IRQ handler, or we can simply pass NULL to those BPF programs.

Thank you.

Alexei Starovoitov Oct. 22, 2015, 3:26 a.m. UTC | #21
On 10/21/15 8:12 PM, Wangnan (F) wrote:
>
>
> On 2015/10/22 11:09, Alexei Starovoitov wrote:
>> On 10/21/15 6:56 PM, Wangnan (F) wrote:
>>>> One alternative solution I can image is to attach a BPF program
>>>> at sampling like kprobe, and return 0 if we don't want sampling
>>>> take action. Thought?
>>>
>>> Do you think attaching BPF programs to sampling is an acceptable idea?
>>
>> If you mean to extend 'filter' concept to sampling events?
>> So instead of soft_disable of non-local events, you'll attach bpf
>> program to sampling events and use map lookup to decide whether
>> to filter out or not such sampling event?
>
> Yes.
>
>> What pt_regs would be in such case?
>>
>
> Sampling is based on interruption. We can use pt_reg captured by the IRQ
> handler,
> or we can simply pass NULL to those BPF program.

NULL is obviously not ok. Try to answer yourself 'why it's not'.
A clean implementation should add a single 'if (..->prog)' to the event
sampling critical path. Please don't rush it and think it through.
But first things first, please see my fix for bpf_perf_event_read:
http://patchwork.ozlabs.org/patch/534120/
and ack it if it makes sense.
we need to make sure existing stuff is solid, before going further.
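
The shape of that fix is roughly a locality check in the helper before
touching the event (sketch only; see the patchwork link above for the
actual change):

/* inside the bpf_perf_event_read() helper, before reading the counter */
if (event->oncpu != smp_processor_id() &&
    event->cpu != smp_processor_id())
	return -EINVAL;		/* refuse to touch a !local event */

return perf_event_read_local(event);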

Ingo Molnar Oct. 22, 2015, 7:39 a.m. UTC | #22
* Wangnan (F) <wangnan0@huawei.com> wrote:

> 
> 
> On 2015/10/22 0:57, Peter Zijlstra wrote:
> >On Wed, Oct 21, 2015 at 11:06:47PM +0800, pi3orama wrote:
> >>>So explain; how does this eBPF stuff work.
> >>I think I get your point this time, and let me explain the eBPF stuff to you.
> >>
> >>You are aware that BPF programmer can break the system in this way:
> >>
> >>A=get_non_local_perf_event()
> >>perf_event_read_local(A)
> >>BOOM!
> >>
> >>However the above logic is impossible because BPF program can't work this
> >>way.
> >>
> >>First of all, it is impossible for a BPF program directly invoke a
> >>kernel function.  Doesn't like kernel module, BPF program can only
> >>invoke functions designed for them, like what this patch does. So the
> >>ability of BPF programs is strictly restricted by kernel. If we don't
> >>allow BPF program call perf_event_read_local() across core, we can
> >>check this and return error in function we provide for them.
> >>
> >>Second: there's no way for a BPF program directly access a perf event.
> >>All perf events have to be wrapped by a map and be accessed by BPF
> >>functions described above. We don't allow BPF program fetch array
> >>element from that map. So pointers of perf event is safely protected
> >>from BPF program.
> >>
> >>In summary, your either-or logic doesn't hold in BPF world. A BPF
> >>program can only access perf event in a highly restricted way. We
> >>don't allow it calling perf_event_read_local() across core, so it
> >>can't.
> >Urgh, that's still horridly inconsistent. Can we please come up with a
> >consistent interface to perf?
> 
> BPF program and kernel module are two different worlds as I said before.
> 
> I don't think making them to share a common interface is a good idea because 
> such sharing will give BPF programs too much freedom than it really need, then 
> it will be hard prevent them to do something bad. If we really need kernel 
> interface, I think what we need is kernel module, not BPF program.

What do you mean, as this does not parse for me.

We obviously can (and very likely should) make certain perf functionality 
available to BPF programs.

It should still be a well defined yet flexible interface, with safe behavior, 
obviously - all in line with existing BPF sandboxing principles.

'Kernel modules' don't enter this consideration at all, not sure why you mention 
them - all this functionality is also available if CONFIG_MODULES is turned off 
completely.

Thanks,

	Ingo

Wang Nan Oct. 22, 2015, 7:51 a.m. UTC | #23
On 2015/10/22 15:39, Ingo Molnar wrote:
> * Wangnan (F) <wangnan0@huawei.com> wrote:
>
[SNIP]
>>
>> In summary, your either-or logic doesn't hold in BPF world. A BPF
>> program can only access perf event in a highly restricted way. We
>> don't allow it calling perf_event_read_local() across core, so it
>> can't.
>>> Urgh, that's still horridly inconsistent. Can we please come up with a
>>> consistent interface to perf?
>> BPF program and kernel module are two different worlds as I said before.
>>
>> I don't think making them to share a common interface is a good idea because
>> such sharing will give BPF programs too much freedom than it really need, then
>> it will be hard prevent them to do something bad. If we really need kernel
>> interface, I think what we need is kernel module, not BPF program.
> What do you mean, as this does not parse for me.

Because I'm not very sure what "inconsistent" means in Peter's
words...

I think what Peter wants us to do is to provide a similar (consistent)
interface between the kernel and eBPF: if the kernel reads from a
perf_event through perf_event_read_local(struct perf_event *), a BPF
program should do this work with similar code, or at least similar
logic, so we would need to create a handler for a perf event, provide a
BPF function called BPF_FUNC_perf_event_read_local, and then pass such
a handler to it.

I don't like this, because if we want a kernel interface we'd better
use a kernel module, not eBPF; that is why I mentioned kernel modules
here.

Ingo, do you think the BPF interface should be *consistent* with anything?

Thank you.

> We obviously can (and very likely should) make certain perf functionality
> available to BPF programs.
>
> It should still be a well defined yet flexible iterface, with safe behavior,
> obviously - all in line with existing BPF sandboxing principles.
>
> 'Kernel modules' don't enter this consideration at all, not sure why you mention
> them - all this functionality is also available if CONFIG_MODULES is turned off
> completely.
>
> Thanks,
>
> 	Ingo
>


Peter Zijlstra Oct. 22, 2015, 9:06 a.m. UTC | #24
On Wed, Oct 21, 2015 at 02:19:49PM -0700, Alexei Starovoitov wrote:

> >Urgh, that's still horridly inconsistent. Can we please come up with a
> >consistent interface to perf?

> My suggestion was to do ioctl(enable/disable) of events from userspace
> after receiving notification from kernel via my bpf_perf_event_output()
> helper.

> Wangnan's argument was that display refresh happens often and it's fast,
> so the time taken by user space to enable events on all cpus is too
> slow and ioctl does ipi and disturbs other cpus even more.
> So soft_disable done by the program to enable/disable particular events
> on all cpus kinda makes sense.

And this all makes me think I still have no clue what you're all trying
to do here.

Who cares about display updates and why. And why should there be an
active userspace part to eBPF programs?
Peter Zijlstra Oct. 22, 2015, 9:24 a.m. UTC | #25
On Thu, Oct 22, 2015 at 03:51:37PM +0800, Wangnan (F) wrote:
> Because I'm not very sure what the meaning of "inconsistent" in
> Peter's words...

What's inconsistent is that some perf actions can be done only on local
events while others can be done on !local.

And I can't say I particularly like having to pass in events from
userspace, but I think that Alexei convinced me that it made some sense
at some point in time.


Peter Zijlstra Oct. 22, 2015, 9:49 a.m. UTC | #26
On Thu, Oct 22, 2015 at 11:12:16AM +0800, Wangnan (F) wrote:
> On 2015/10/22 11:09, Alexei Starovoitov wrote:
> >On 10/21/15 6:56 PM, Wangnan (F) wrote:
> >>>One alternative solution I can image is to attach a BPF program
> >>>at sampling like kprobe, and return 0 if we don't want sampling
> >>>take action. Thought?
> >>
> >>Do you think attaching BPF programs to sampling is an acceptable idea?
> >
> >If you mean to extend 'filter' concept to sampling events?
> >So instead of soft_disable of non-local events, you'll attach bpf
> >program to sampling events and use map lookup to decide whether
> >to filter out or not such sampling event?
> 
> Yes.

One could overload or stack the overflow handler I suppose. But this
would be in line with the software/tracepoint events calling eBPF muck
on trigger, right?
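
A stacked handler might look roughly like this (sketch; filter_prog is a
hypothetical field, not an existing member of struct perf_event):

static void filtered_overflow_handler(struct perf_event *event,
				      struct perf_sample_data *data,
				      struct pt_regs *regs)
{
	/* let an attached BPF program decide whether to keep this sample */
	if (event->filter_prog && !trace_call_bpf(event->filter_prog, regs))
		return;

	/* fall through to the normal output path */
	perf_event_output(event, data, regs);
}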

> >What pt_regs would be in such case?

> Sampling is based on interruption. We can use pt_reg captured by the IRQ
> handler,

s/IRQ/NMI/

Also, we 'edit' the pt_regs on the way down to the overflow handler as
sometimes 'better' information can be had from PMU state. But a pt_regs
is available there.
Wang Nan Oct. 22, 2015, 10:28 a.m. UTC | #27
On 2015/10/22 17:06, Peter Zijlstra wrote:
> On Wed, Oct 21, 2015 at 02:19:49PM -0700, Alexei Starovoitov wrote:
>
>>> Urgh, that's still horridly inconsistent. Can we please come up with a
>>> consistent interface to perf?
>> My suggestion was to do ioctl(enable/disable) of events from userspace
>> after receiving notification from kernel via my bpf_perf_event_output()
>> helper.
>> Wangnan's argument was that display refresh happens often and it's fast,
>> so the time taken by user space to enable events on all cpus is too
>> slow and ioctl does ipi and disturbs other cpus even more.
>> So soft_disable done by the program to enable/disable particular events
>> on all cpus kinda makes sense.
> And this all makes me think I still have no clue what you're all trying
> to do here.
>
> Who cares about display updates and why. And why should there be an
> active userspace part to eBPF programs?

So you want the background story? OK, let me describe it. This mail is not
short, so please be patient.

On a smartphone, if the time between two frames is longer than 16ms, the
user can notice it. This is a display glitch. We want to examine those
glitches with perf to find what causes them. The basic idea is: use
'perf record' to collect enough information, find the samples generated
just before the glitch in perf.data, then analyze them offline.

A lot of work needs to be done before perf can support such analysis.
One important thing is to reduce the overhead of perf, so that perf itself
does not become the cause of glitches. We could do this by reducing the
number of events perf collects, but then we wouldn't have enough
information to analyze when a glitch happens. Another way we are trying
to implement now is to dynamically turn events on and off, or at least
enable/disable sampling dynamically, because the overhead of copying
those samples is a big part of perf's total overhead. With that we can
trace as many events as possible, but only fetch data from them when we
detect a glitch.

A BPF program is the best choice to describe our relatively complex
glitch detection model. For simplicity, you can think of our glitch
detection model as containing 3 points in user programs: point A means
display refreshing begins, point B means display refreshing has lasted
for 16ms, point C means display refreshing has finished. They can be
found in user programs and probed through uprobes, to which we can
attach BPF programs.

Then we want to use perf this way:

Sample the 'cycles' event to collect callgraphs so we know what all CPUs
are doing during the refreshing, but only start sampling when we detect
a frame refresh lasting longer than 16ms.

We can translate the above logic into BPF programs: at point B, enable
the 'cycles' event to generate samples; at point C, disable the 'cycles'
event to avoid generating useless samples.

Then run 'perf record -e cycles' to collect call graphs and other
information through the cycles event. From perf.data, we can use
'perf script' and other tools to analyze the resulting data.

We have considered some potential solutions and found them inappropriate
or needing too much work:

  1. As you may prefer, create BPF functions to call pmu->stop() /
     pmu->start() for the perf event on the CPU on which the BPF program
     is triggered.

     The shortcoming of this method is that we can only turn on the perf
     event on the CPU that executes point B. We are unable to know what
     the other CPUs are doing during the glitch, but what we want is
     system-wide information. In addition, point B and point C are not
     necessarily executed on the same core, so we may shut down the wrong
     event if the scheduler decides to run point C on another core.

  2. As Alexei suggested, output something through his
     bpf_perf_event_output() helper and let perf disable and enable those
     events using ioctl from userspace.

     This is a good idea, but it introduces an asynchrony problem.
     bpf_perf_event_output() writes something to perf's ring buffer, but
     perf only notices that message when epoll_wait() returns. We tested
     a tracepoint event and found that an event may be noticed by perf
     several seconds after it is generated. One solution is to use
     --no-buffer, but it still needs perf to be scheduled on time and to
     parse its ring buffer quickly. Also, please note that point C can
     appear very shortly after point B, because some apps are optimized
     to make their refreshing time very close to 16ms.

This is the background story. It comes down to whether we implement some
cross-CPU control or are satisfied with coarse-grained control. The best
method we can think of is to use an atomic operation to soft
enable/disable perf events. We believe it is simple enough and won't
cause problems.

Thank you.

Peter Zijlstra Oct. 23, 2015, 12:52 p.m. UTC | #28
On Thu, Oct 22, 2015 at 06:28:22PM +0800, Wangnan (F) wrote:
> information to analysis when glitch happen. Another way we are trying to
> implement
> now is to dynamically turn events on and off, or at least enable/disable
> sampling dynamically because the overhead of copying those samples
> is a big part of perf's total overhead. After that we can trace as many
> event as possible, but only fetch data from them when we detect a glitch.

So why don't you 'fix' the flight recorder mode and just leave the data
in memory and not bother copying it out until a glitch happens?

Something like this:

lkml.kernel.org/r/20130708121557.GA17211@twins.programming.kicks-ass.net

it appears we never quite finished that.
Kaixu Xia Oct. 27, 2015, 6:43 a.m. UTC | #29
On 2015/10/23 23:12, Peter Zijlstra wrote:
> On Fri, Oct 23, 2015 at 02:52:11PM +0200, Peter Zijlstra wrote:

>> On Thu, Oct 22, 2015 at 06:28:22PM +0800, Wangnan (F) wrote:

>>> information to analysis when glitch happen. Another way we are trying to

>>> implement

>>> now is to dynamically turn events on and off, or at least enable/disable

>>> sampling dynamically because the overhead of copying those samples

>>> is a big part of perf's total overhead. After that we can trace as many

>>> event as possible, but only fetch data from them when we detect a glitch.

>>

>> So why don't you 'fix' the flight recorder mode and just leave the data

>> in memory and not bother copying it out until a glitch happens?

>>

>> Something like this:

>>

>> lkml.kernel.org/r/20130708121557.GA17211@twins.programming.kicks-ass.net

>>

>> it appears we never quite finished that.

> 

> Updated to current sources, compile tested only.

> 

> It obviously needs testing and performance numbers..  and some

> userspace.

> 

> ---

> Subject: perf: Update event buffer tail when overwriting old events

> From: Peter Zijlstra <peterz@infradead.org>

> 

>> From: "Yan, Zheng" <zheng.z.yan@intel.com>

>>

>> If perf event buffer is in overwrite mode, the kernel only updates

>> the data head when it overwrites old samples. The program that owns

>> the buffer need periodically check the buffer and update a variable

>> that tracks the date tail. If the program fails to do this in time,

>> the data tail can be overwritted by new samples. The program has to

>> rewind the buffer because it does not know where is the first vaild

>> sample.

>>

>> This patch makes the kernel update the date tail when it overwrites

>> old events. So the program that owns the event buffer can always

>> read the latest samples. This is convenient for programs that use

>> perf to do branch tracing. One use case is GDB branch tracing:

>> (http://sourceware.org/ml/gdb-patches/2012-06/msg00172.html)

>> It uses perf interface to read BTS, but only cares the branches

>> before the ptrace event.

> 

> Original-patch-by: "Yan, Zheng" <zheng.z.yan@intel.com>

> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

> ---

>  arch/x86/kernel/cpu/perf_event_intel_ds.c |    2 

>  include/linux/perf_event.h                |    6 --

>  kernel/events/core.c                      |   56 +++++++++++++++++----

>  kernel/events/internal.h                  |    2 

>  kernel/events/ring_buffer.c               |   77 +++++++++++++++++++++---------

>  5 files changed, 107 insertions(+), 36 deletions(-)

> 

> --- a/arch/x86/kernel/cpu/perf_event_intel_ds.c

> +++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c

> @@ -1140,7 +1140,7 @@ static void __intel_pmu_pebs_event(struc

>  

>  	while (count > 1) {

>  		setup_pebs_sample_data(event, iregs, at, &data, &regs);

> -		perf_event_output(event, &data, &regs);

> +		event->overflow_handler(event, &data, &regs);
>  		at += x86_pmu.pebs_record_size;
>  		at = get_next_pebs_record_by_bit(at, top, bit);
>  		count--;
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -828,10 +828,6 @@ extern int perf_event_overflow(struct pe
>  				 struct perf_sample_data *data,
>  				 struct pt_regs *regs);
>  
> -extern void perf_event_output(struct perf_event *event,
> -				struct perf_sample_data *data,
> -				struct pt_regs *regs);
> -
>  extern void
>  perf_event_header__init_id(struct perf_event_header *header,
>  			   struct perf_sample_data *data,
> @@ -1032,6 +1028,8 @@ static inline bool has_aux(struct perf_e
>  
>  extern int perf_output_begin(struct perf_output_handle *handle,
>  			     struct perf_event *event, unsigned int size);
> +extern int perf_output_begin_overwrite(struct perf_output_handle *handle,
> +			     struct perf_event *event, unsigned int size);
>  extern void perf_output_end(struct perf_output_handle *handle);
>  extern unsigned int perf_output_copy(struct perf_output_handle *handle,
>  			     const void *buf, unsigned int len);
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -4515,6 +4515,8 @@ static int perf_mmap_fault(struct vm_are
>  	return ret;
>  }
>  
> +static void perf_event_set_overflow(struct perf_event *event, struct ring_buffer *rb);
> +
>  static void ring_buffer_attach(struct perf_event *event,
>  			       struct ring_buffer *rb)
>  {
> @@ -4546,6 +4548,8 @@ static void ring_buffer_attach(struct pe
>  		spin_lock_irqsave(&rb->event_lock, flags);
>  		list_add_rcu(&event->rb_entry, &rb->event_list);
>  		spin_unlock_irqrestore(&rb->event_lock, flags);
> +
> +		perf_event_set_overflow(event, rb);
>  	}
>  
>  	rcu_assign_pointer(event->rb, rb);
> @@ -5579,9 +5583,12 @@ void perf_prepare_sample(struct perf_eve
>  	}
>  }
>  
> -void perf_event_output(struct perf_event *event,
> -			struct perf_sample_data *data,
> -			struct pt_regs *regs)
> +static __always_inline void
> +__perf_event_output(struct perf_event *event,
> +		    struct perf_sample_data *data,
> +		    struct pt_regs *regs,
> +		    int (*output_begin)(struct perf_output_handle *,
> +			                struct perf_event *, unsigned int))
>  {
>  	struct perf_output_handle handle;
>  	struct perf_event_header header;
> @@ -5591,7 +5598,7 @@ void perf_event_output(struct perf_event
>  
>  	perf_prepare_sample(&header, data, event, regs);
>  
> -	if (perf_output_begin(&handle, event, header.size))
> +	if (output_begin(&handle, event, header.size))
>  		goto exit;
>  
>  	perf_output_sample(&handle, &header, data, event);
> @@ -5602,6 +5609,33 @@ void perf_event_output(struct perf_event
>  	rcu_read_unlock();
>  }
>  
> +static void perf_event_output(struct perf_event *event,
> +				struct perf_sample_data *data,
> +				struct pt_regs *regs)
> +{
> +	__perf_event_output(event, data, regs, perf_output_begin);
> +}
> +
> +static void perf_event_output_overwrite(struct perf_event *event,
> +				struct perf_sample_data *data,
> +				struct pt_regs *regs)
> +{
> +	__perf_event_output(event, data, regs, perf_output_begin_overwrite);
> +}
> +
> +static void
> +perf_event_set_overflow(struct perf_event *event, struct ring_buffer *rb)
> +{
> +	if (event->overflow_handler != perf_event_output &&
> +	    event->overflow_handler != perf_event_output_overwrite)
> +		return;
> +
> +	if (rb->overwrite)
> +		event->overflow_handler = perf_event_output_overwrite;
> +	else
> +		event->overflow_handler = perf_event_output;
> +}
> +
>  /*
>   * read event_id
>   */
> @@ -6426,10 -6460,7 @@ static int __perf_event_overflow(struct
>  		irq_work_queue(&event->pending);
>  	}
>  
> -	if (event->overflow_handler)
> -		event->overflow_handler(event, data, regs);
> -	else
> -		perf_event_output(event, data, regs);
> +	event->overflow_handler(event, data, regs);
>  
>  	if (*perf_event_fasync(event) && event->pending_kill) {
>  		event->pending_wakeup = 1;
> @@ -7904,8 +7935,13 @@ perf_event_alloc(struct perf_event_attr
>  		context = parent_event->overflow_handler_context;
>  	}
>  
> -	event->overflow_handler	= overflow_handler;
> -	event->overflow_handler_context = context;
> +	if (overflow_handler) {
> +		event->overflow_handler	= overflow_handler;
> +		event->overflow_handler_context = context;
> +	} else {
> +		event->overflow_handler = perf_event_output;
> +		event->overflow_handler_context = NULL;
> +	}
>  
>  	perf_event__state_init(event);
>  
> --- a/kernel/events/internal.h
> +++ b/kernel/events/internal.h
> @@ -21,6 +21,8 @@ struct ring_buffer {
>  
>  	atomic_t			poll;		/* POLL_ for wakeups */
>  
> +	local_t				tail;		/* read position     */
> +	local_t				next_tail;	/* next read position */
>  	local_t				head;		/* write position    */
>  	local_t				nest;		/* nested writers    */
>  	local_t				events;		/* event limit       */
> --- a/kernel/events/ring_buffer.c
> +++ b/kernel/events/ring_buffer.c
> @@ -102,11 +102,11 @@ static void perf_output_put_handle(struc
>  	preempt_enable();
>  }
>  
> -int perf_output_begin(struct perf_output_handle *handle,
> -		      struct perf_event *event, unsigned int size)
> +static __always_inline int __perf_output_begin(struct perf_output_handle *handle,
> +		      struct perf_event *event, unsigned int size, bool overwrite)
>  {
>  	struct ring_buffer *rb;
> -	unsigned long tail, offset, head;
> +	unsigned long tail, offset, head, max_size;
>  	int have_lost, page_shift;
>  	struct {
>  		struct perf_event_header header;
> @@ -125,7 +125,8 @@ int perf_output_begin(struct perf_output
>  	if (unlikely(!rb))
>  		goto out;
>  
> -	if (unlikely(!rb->nr_pages))
> +	max_size = perf_data_size(rb);
> +	if (unlikely(size > max_size))
>  		goto out;
>  
>  	handle->rb    = rb;
> @@ -140,27 +141,49 @@ int perf_output_begin(struct perf_output
>  
>  	perf_output_get_handle(handle);
>  
> -	do {
> -		tail = READ_ONCE_CTRL(rb->user_page->data_tail);
> -		offset = head = local_read(&rb->head);
> -		if (!rb->overwrite &&
> -		    unlikely(CIRC_SPACE(head, tail, perf_data_size(rb)) < size))
> -			goto fail;
> +	if (overwrite) {
> +		do {
> +			tail = local_read(&rb->tail);
> +			offset = local_read(&rb->head);
> +			head = offset + size;
> +			if (unlikely(CIRC_SPACE(head, tail, max_size) < size)) {

Should this be 'if (unlikely(CIRC_SPACE(offset, tail, max_size) < size))'? At this point @head is already offset + size, so checking CIRC_SPACE() against @head only passes when there is room for two records; e.g. with max_size = 64, tail = 0, offset = 32 and size = 24, CIRC_SPACE(56, 0, 64) = 7 < 24 forces a tail advance even though CIRC_SPACE(32, 0, 64) = 31 shows the record still fits.

> +				tail = local_read(&rb->next_tail);
> +				local_set(&rb->tail, tail);
> +				rb->user_page->data_tail = tail;
> +			}
> +		} while (local_cmpxchg(&rb->head, offset, head) != offset);
>  
>  		/*
> -		 * The above forms a control dependency barrier separating the
> -		 * @tail load above from the data stores below. Since the @tail
> -		 * load is required to compute the branch to fail below.
> -		 *
> -		 * A, matches D; the full memory barrier userspace SHOULD issue
> -		 * after reading the data and before storing the new tail
> -		 * position.
> -		 *
> -		 * See perf_output_put_handle().
> +		 * Save the start of next event when half of the buffer
> +		 * has been filled. Later when the event buffer overflows,
> +		 * update the tail pointer to point to it.
>  		 */
> +		if (tail == local_read(&rb->next_tail) &&
> +		    CIRC_CNT(head, tail, max_size) >= (max_size / 2))
> +			local_cmpxchg(&rb->next_tail, tail, head);
> +	} else {
> +		do {
> +			tail = READ_ONCE_CTRL(rb->user_page->data_tail);
> +			offset = head = local_read(&rb->head);
> +			if (!rb->overwrite &&
> +			    unlikely(CIRC_SPACE(head, tail, perf_data_size(rb)) < size))
> +				goto fail;
> +
> +			/*
> +			 * The above forms a control dependency barrier separating the
> +			 * @tail load above from the data stores below. Since the @tail
> +			 * load is required to compute the branch to fail below.
> +			 *
> +			 * A, matches D; the full memory barrier userspace SHOULD issue
> +			 * after reading the data and before storing the new tail
> +			 * position.
> +			 *
> +			 * See perf_output_put_handle().
> +			 */
>  
> -		head += size;
> -	} while (local_cmpxchg(&rb->head, offset, head) != offset);
> +			head += size;
> +		} while (local_cmpxchg(&rb->head, offset, head) != offset);
> +	}
>  
>  	/*
>  	 * We rely on the implied barrier() by local_cmpxchg() to ensure
> @@ -203,6 +226,18 @@ int perf_output_begin(struct perf_output
>  	return -ENOSPC;
>  }
>  
> +int perf_output_begin(struct perf_output_handle *handle,
> +		      struct perf_event *event, unsigned int size)
> +{
> +	return __perf_output_begin(handle, event, size, false);
> +}
> +
> +int perf_output_begin_overwrite(struct perf_output_handle *handle,
> +		      struct perf_event *event, unsigned int size)
> +{
> +	return __perf_output_begin(handle, event, size, true);
> +}
> +
>  unsigned int perf_output_copy(struct perf_output_handle *handle,
>  		      const void *buf, unsigned int len)
>  {
> 
> .
> 

-- 
Regards
Kaixu Xia


Patch

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 092a0e8..bb3bf87 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -472,6 +472,7 @@  struct perf_event {
 	struct irq_work			pending;
 
 	atomic_t			event_limit;
+	atomic_t			soft_enable;
 
 	void (*destroy)(struct perf_event *);
 	struct rcu_head			rcu_head;
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 564f1f0..164d2a9 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -287,6 +287,17 @@  enum bpf_func_id {
 	 * Return: realm if != 0
 	 */
 	BPF_FUNC_get_route_realm,
+
+	/**
+	 * u64 bpf_perf_event_control(&map, index, flags) - control perf events in maps
+	 * @map: pointer to a PERF_EVENT_ARRAY map
+	 * @index: index of the perf event in the map
+	 * @flags: bit 0 - if set, enable the event's output, otherwise disable it
+	 *	   bit 1 - if set, apply to all events in the map instead of @index
+	 *	   other bits - reserved
+	 * Return: 0 on success
+	 */
+	BPF_FUNC_perf_event_control,
 	__BPF_FUNC_MAX_ID,
 };
 
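In terms of the two flag bits documented above, the four meaningful values a program can pass boil down to the following (the macro names are purely illustrative and are not defined by this patch):

  #define EVENT_CTL_DISABLE_ONE	0	/* bit1=0, bit0=0: disable the event at @index    */
  #define EVENT_CTL_ENABLE_ONE	1	/* bit1=0, bit0=1: enable the event at @index     */
  #define EVENT_CTL_DISABLE_ALL	2	/* bit1=1, bit0=0: disable every event in the map */
  #define EVENT_CTL_ENABLE_ALL	3	/* bit1=1, bit0=1: enable every event in the map  */

Any other bit set in @flags is rejected with -EINVAL by the BPF_CTL_BIT_FLAG_MASK check in the bpf_trace.c hunk below.
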
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 2881145..a791b03 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -331,7 +331,8 @@  struct perf_event_attr {
 				comm_exec      :  1, /* flag comm events that are due to an exec */
 				use_clockid    :  1, /* use @clockid for time fields */
 				context_switch :  1, /* context switch data */
-				__reserved_1   : 37;
+				soft_disable   :  1, /* don't output sample data until soft enabled */
+				__reserved_1   : 36;
 
 	union {
 		__u32		wakeup_events;	  /* wakeup every n events */
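
From user space, an event that should stay silent until a bpf program turns it on would be opened roughly as below. This is only a sketch: error handling and the bpf() map-update call that stores the fd into the BPF_MAP_TYPE_PERF_EVENT_ARRAY slot are omitted.

  struct perf_event_attr attr = {
  	.type		= PERF_TYPE_HARDWARE,
  	.config		= PERF_COUNT_HW_CPU_CYCLES,
  	.size		= sizeof(struct perf_event_attr),
  	.sample_period	= 100000,
  	.soft_disable	= 1,	/* start with soft_enable == 0 */
  };
  int fd;

  fd = syscall(__NR_perf_event_open, &attr, -1 /* pid */, 0 /* cpu */,
  	     -1 /* group_fd */, 0 /* flags */);
  /* fd is then written into the map slot for cpu 0 with
   * bpf(BPF_MAP_UPDATE_ELEM, ...) so the helper can reach it. */
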
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 1d6b97b..ffec14b 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -245,6 +245,7 @@  static const struct {
 } func_limit[] = {
 	{BPF_MAP_TYPE_PROG_ARRAY, BPF_FUNC_tail_call},
 	{BPF_MAP_TYPE_PERF_EVENT_ARRAY, BPF_FUNC_perf_event_read},
+	{BPF_MAP_TYPE_PERF_EVENT_ARRAY, BPF_FUNC_perf_event_control},
 };
 
 static void print_verifier_state(struct verifier_env *env)
@@ -910,7 +911,7 @@  static int check_map_func_compatibility(struct bpf_map *map, int func_id)
 		 * don't allow any other map type to be passed into
 		 * the special func;
 		 */
-		if (bool_map != bool_func)
+		if (bool_func && bool_map != bool_func)
 			return -EINVAL;
 	}
 
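For reference, and assuming the relaxed check is read as "only enforce the map type when the called helper appears in func_limit[]", the combinations resolve as follows (a reading of the check, not part of the patch):

  map type                 helper called                              result
  PROG_ARRAY               bpf_tail_call                              allowed
  PERF_EVENT_ARRAY         bpf_perf_event_read / _control             allowed
  PERF_EVENT_ARRAY         bpf_tail_call                              -EINVAL
  PROG_ARRAY               bpf_perf_event_read / _control             -EINVAL
  any map type             helpers not listed in func_limit[]         not restricted here

Note that, unlike the old 'bool_map != bool_func' form, the relaxed check no longer rejects passing a PROG_ARRAY or PERF_EVENT_ARRAY map to helpers outside the table.
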
diff --git a/kernel/events/core.c b/kernel/events/core.c
index b11756f..5219635 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6337,6 +6337,9 @@  static int __perf_event_overflow(struct perf_event *event,
 		irq_work_queue(&event->pending);
 	}
 
+	if (unlikely(!atomic_read(&event->soft_enable)))
+		return 0;
+
 	if (event->overflow_handler)
 		event->overflow_handler(event, data, regs);
 	else
@@ -7709,6 +7712,14 @@  static void account_event(struct perf_event *event)
 	account_event_cpu(event, event->cpu);
 }
 
+static void perf_event_check_dump_flag(struct perf_event *event)
+{
+	if (event->attr.soft_disable == 1)
+		atomic_set(&event->soft_enable, 0);
+	else
+		atomic_set(&event->soft_enable, 1);
+}
+
 /*
  * Allocate and initialize a event structure
  */
@@ -7840,6 +7851,8 @@  perf_event_alloc(struct perf_event_attr *attr, int cpu,
 		}
 	}
 
+	perf_event_check_dump_flag(event);
+
 	return event;
 
 err_per_task:
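
Because the helper below uses atomic_inc_unless_negative()/atomic_dec_if_positive() on soft_enable (see the bpf_trace.c hunk), enables and disables nest as a counter rather than a flag. A rough illustration of the resulting states, assuming the event was opened with soft_disable = 1:

  soft_enable = 0              /* initial, output suppressed       */
  enable  -> soft_enable = 1   /* samples flow                     */
  enable  -> soft_enable = 2   /* still flowing                    */
  disable -> soft_enable = 1   /* still flowing                    */
  disable -> soft_enable = 0   /* suppressed again                 */
  disable -> soft_enable = 0   /* dec_if_positive stops at zero    */
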
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 0fe96c7..398ed94 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -215,6 +215,66 @@  const struct bpf_func_proto bpf_perf_event_read_proto = {
 	.arg2_type	= ARG_ANYTHING,
 };
 
+/* flags for PERF_EVENT_ARRAY maps */
+enum {
+	BPF_EVENT_CTL_BIT_CUR = 0,
+	BPF_EVENT_CTL_BIT_ALL = 1,
+	__NR_BPF_EVENT_CTL_BITS,
+};
+
+#define	BPF_CTL_BIT_FLAG_MASK	GENMASK_ULL(63, __NR_BPF_EVENT_CTL_BITS)
+#define	BPF_CTL_BIT_DUMP_CUR	BIT_ULL(BPF_EVENT_CTL_BIT_CUR)
+#define	BPF_CTL_BIT_DUMP_ALL	BIT_ULL(BPF_EVENT_CTL_BIT_ALL)
+
+static u64 bpf_perf_event_control(u64 r1, u64 index, u64 flags, u64 r4, u64 r5)
+{
+	struct bpf_map *map = (struct bpf_map *) (unsigned long) r1;
+	struct bpf_array *array = container_of(map, struct bpf_array, map);
+	struct perf_event *event;
+	int i;
+
+	if (unlikely(index >= array->map.max_entries))
+		return -E2BIG;
+
+	if (flags & BPF_CTL_BIT_FLAG_MASK)
+		return -EINVAL;
+
+	if (flags & BPF_CTL_BIT_DUMP_ALL) {
+		bool dump_control = flags & BPF_CTL_BIT_DUMP_CUR;
+
+		for (i = 0; i < array->map.max_entries; i++) {
+			event = (struct perf_event *)array->ptrs[i];
+			if (!event)
+				continue;
+
+			if (dump_control)
+				atomic_inc_unless_negative(&event->soft_enable);
+			else
+				atomic_dec_if_positive(&event->soft_enable);
+		}
+		return 0;
+	}
+
+	event = (struct perf_event *)array->ptrs[index];
+	if (!event)
+		return -ENOENT;
+
+	if (flags & BPF_CTL_BIT_DUMP_CUR)
+		atomic_inc_unless_negative(&event->soft_enable);
+	else
+		atomic_dec_if_positive(&event->soft_enable);
+	return 0;
+}
+
+static const struct bpf_func_proto bpf_perf_event_control_proto = {
+	.func		= bpf_perf_event_control,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_CONST_MAP_PTR,
+	.arg2_type	= ARG_ANYTHING,
+	.arg3_type	= ARG_ANYTHING,
+};
+
 static const struct bpf_func_proto *kprobe_prog_func_proto(enum bpf_func_id func_id)
 {
 	switch (func_id) {
@@ -242,6 +302,8 @@  static const struct bpf_func_proto *kprobe_prog_func_proto(enum bpf_func_id func
 		return &bpf_get_smp_processor_id_proto;
 	case BPF_FUNC_perf_event_read:
 		return &bpf_perf_event_read_proto;
+	case BPF_FUNC_perf_event_control:
+		return &bpf_perf_event_control_proto;
 	default:
 		return NULL;
 	}
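
For completeness, a kprobe program driving the new helper might look roughly like this. The map definition, section names and the helper declaration follow the samples/bpf conventions and are illustrative only; they are not part of this patch.

  /* assumed samples/bpf-style declaration of the new helper */
  static int (*bpf_perf_event_control)(void *map, int index, int flags) =
  	(void *) BPF_FUNC_perf_event_control;

  struct bpf_map_def SEC("maps") cycles_map = {
  	.type		= BPF_MAP_TYPE_PERF_EVENT_ARRAY,
  	.key_size	= sizeof(int),
  	.value_size	= sizeof(u32),
  	.max_entries	= 64,	/* one slot per possible cpu */
  };

  SEC("kprobe/sys_connect")
  int start_sampling(struct pt_regs *ctx)
  {
  	int key = bpf_get_smp_processor_id();

  	/* bit 0 set, bit 1 clear: enable output of this cpu's event */
  	bpf_perf_event_control(&cycles_map, key, 1);
  	return 0;
  }

  SEC("kprobe/sys_close")
  int stop_sampling(struct pt_regs *ctx)
  {
  	int key = bpf_get_smp_processor_id();

  	/* bit 0 clear: soft disable it again */
  	bpf_perf_event_control(&cycles_map, key, 0);
  	return 0;
  }

User space would pair this with perf events opened with attr.soft_disable = 1 and stored into cycles_map, so samples are only written between the two probes.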