perf/core: generate overflow signal when samples are dropped (WAS: Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region)

Message ID 20170628105600.GC5981@leverpostej
State New
Headers show

Commit Message

Mark Rutland June 28, 2017, 10:56 a.m.
On Wed, Jun 28, 2017 at 11:12:48AM +0100, Mark Rutland wrote:
> On Tue, Jun 27, 2017 at 09:51:00PM -0700, Kyle Huey wrote:


> As we're trying to avoid smapling state, I think we can move the check

> into perf_prepare_sample() or __perf_event_output(), where that state is

> actually sampled. I'll take a look at that momentarily.

> 

> Just to clarify, you don't care about the sample state at all? i.e. you

> don't need the user program counter?

> 

> Is that signal delivered to the tracee, or to a different process that

> traces it? If the latter, what ensures that the task is stopped

> sufficiently quickly?

> 

> > It seems to me that it might be reasonable to ignore the interrupt if

> > the purpose of the interrupt is to trigger sampling of the CPUs

> > register state.  But if the interrupt will trigger some other

> > operation, such as a signal on an fd, then there's no reason to drop

> > it.

> 

> Agreed. I'll try to have a patch for this soon.

> 

> I just need to figure out exactly where that overflow signal is

> generated by the perf core.


I've figured that out now. That's handled by perf_pending_event(), whcih
is the irq_work we schedule in __perf_event_overflow().

Does the below patch work for you?

---->8----
From bb1f99dace508ce34ab0818f91d59e73450fa142 Mon Sep 17 00:00:00 2001
From: Mark Rutland <mark.rutland@arm.com>

Date: Wed, 28 Jun 2017 11:39:25 +0100
Subject: [PATCH] perf/core: generate overflow signal when samples are dropped

We recently tried to kill an information leak where users could receive
kernel samples due to skid on the PMU interrupt. To block this, we
bailed out early in perf_event_overflow(), as we do for non-sampling
events.

This broke rr, which uses sampling events to receive a signal on
overflow (but does not care about the contents of the sample). These
signals are critical to the correct operation of rr.

Instead of bailing out early in perf_event_overflow, we can bail prior
to performing the actual sampling in __perf_event_output(). This avoids
the information leak, but preserves the generation of the signal.

Since we don't place any sample data into the ring buffer, the signal is
arguably spurious. However, a userspace ringbuffer consumer can already
consume data prior to taking the associated signals, and therefore must
handle spurious signals to operate correctly. Thus, this signal
shouldn't be harmful.

Fixes: cc1582c231ea041f ("perf/core: Drop kernel samples even though :u is specified")
Reported-by: Kyle Huey <me@kylehuey.com>
Signed-off-by: Mark Rutland <mark.rutland@arm.com>

Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jin Yao <yao.jin@linux.intel.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/events/core.c | 46 +++++++++++++++++++++++++---------------------
 1 file changed, 25 insertions(+), 21 deletions(-)

-- 
1.9.1

Comments

Vince Weaver June 28, 2017, 12:40 p.m. | #1
On Wed, 28 Jun 2017, Mark Rutland wrote:

> On Wed, Jun 28, 2017 at 11:12:48AM +0100, Mark Rutland wrote:


> Instead of bailing out early in perf_event_overflow, we can bail prior

> to performing the actual sampling in __perf_event_output(). This avoids

> the information leak, but preserves the generation of the signal.

> 

> Since we don't place any sample data into the ring buffer, the signal is

> arguably spurious. However, a userspace ringbuffer consumer can already

> consume data prior to taking the associated signals, and therefore must

> handle spurious signals to operate correctly. Thus, this signal

> shouldn't be harmful.


this could still break some of my perf_event validation tests.

Ones that set up a sampling event for every 1M instructions, run for 100M 
instructions, and expect there to be 100 samples received.

If we're so worried about info leakage, can't we just zero-out the problem 
address (or randomize the kernel address) rather than just pretending the 
interrupt didn't happen?

Vince
Mark Rutland June 28, 2017, 1:07 p.m. | #2
On Wed, Jun 28, 2017 at 08:40:30AM -0400, Vince Weaver wrote:
> On Wed, 28 Jun 2017, Mark Rutland wrote:

> 

> > On Wed, Jun 28, 2017 at 11:12:48AM +0100, Mark Rutland wrote:

> 

> > Instead of bailing out early in perf_event_overflow, we can bail prior

> > to performing the actual sampling in __perf_event_output(). This avoids

> > the information leak, but preserves the generation of the signal.

> > 

> > Since we don't place any sample data into the ring buffer, the signal is

> > arguably spurious. However, a userspace ringbuffer consumer can already

> > consume data prior to taking the associated signals, and therefore must

> > handle spurious signals to operate correctly. Thus, this signal

> > shouldn't be harmful.

> 

> this could still break some of my perf_event validation tests.

> 

> Ones that set up a sampling event for every 1M instructions, run for 100M 

> instructions, and expect there to be 100 samples received.


Is that test reliable today?

I'd expect that at least on ARM it's not, given that events can be
counted imprecisely, and mode filters can be applied imprecisely. So you
might get fewer (or more) samples. I'd imagine similar is true on other
archtiectures.

If sampling took long enough, the existing ratelimiting could come into
effect, too.

Surely that already has some error margin?

> If we're so worried about info leakage, can't we just zero-out the problem 

> address (or randomize the kernel address) rather than just pretending the 

> interrupt didn't happen?


Making up zeroed or randomized data is going to confuse users. I can't
imagine that real users are going to want bogus samples that they have
to identify (somehow) in order to skip when processing the data.

I can see merit in signalling "lost" samples to userspace, so long as
they're easily distinguished from real samples.

One option is to fake up a sample using the user regs regardless, but
that's both fragile and surprising in other cases.

Thanks,
Mark.
Kyle Huey June 28, 2017, 4:48 p.m. | #3
On Wed, Jun 28, 2017 at 3:56 AM, Mark Rutland <mark.rutland@arm.com> wrote:
> On Wed, Jun 28, 2017 at 11:12:48AM +0100, Mark Rutland wrote:

>> On Tue, Jun 27, 2017 at 09:51:00PM -0700, Kyle Huey wrote:

>

>> As we're trying to avoid smapling state, I think we can move the check

>> into perf_prepare_sample() or __perf_event_output(), where that state is

>> actually sampled. I'll take a look at that momentarily.

>>

>> Just to clarify, you don't care about the sample state at all? i.e. you

>> don't need the user program counter?

>>

>> Is that signal delivered to the tracee, or to a different process that

>> traces it? If the latter, what ensures that the task is stopped

>> sufficiently quickly?

>>

>> > It seems to me that it might be reasonable to ignore the interrupt if

>> > the purpose of the interrupt is to trigger sampling of the CPUs

>> > register state.  But if the interrupt will trigger some other

>> > operation, such as a signal on an fd, then there's no reason to drop

>> > it.

>>

>> Agreed. I'll try to have a patch for this soon.

>>

>> I just need to figure out exactly where that overflow signal is

>> generated by the perf core.

>

> I've figured that out now. That's handled by perf_pending_event(), whcih

> is the irq_work we schedule in __perf_event_overflow().

>

> Does the below patch work for you?

>

> ---->8----

> From bb1f99dace508ce34ab0818f91d59e73450fa142 Mon Sep 17 00:00:00 2001

> From: Mark Rutland <mark.rutland@arm.com>

> Date: Wed, 28 Jun 2017 11:39:25 +0100

> Subject: [PATCH] perf/core: generate overflow signal when samples are dropped

>

> We recently tried to kill an information leak where users could receive

> kernel samples due to skid on the PMU interrupt. To block this, we

> bailed out early in perf_event_overflow(), as we do for non-sampling

> events.

>

> This broke rr, which uses sampling events to receive a signal on

> overflow (but does not care about the contents of the sample). These

> signals are critical to the correct operation of rr.

>

> Instead of bailing out early in perf_event_overflow, we can bail prior

> to performing the actual sampling in __perf_event_output(). This avoids

> the information leak, but preserves the generation of the signal.

>

> Since we don't place any sample data into the ring buffer, the signal is

> arguably spurious. However, a userspace ringbuffer consumer can already

> consume data prior to taking the associated signals, and therefore must

> handle spurious signals to operate correctly. Thus, this signal

> shouldn't be harmful.

>

> Fixes: cc1582c231ea041f ("perf/core: Drop kernel samples even though :u is specified")

> Reported-by: Kyle Huey <me@kylehuey.com>

> Signed-off-by: Mark Rutland <mark.rutland@arm.com>

> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>

> Cc: Arnaldo Carvalho de Melo <acme@kernel.org>

> Cc: Ingo Molnar <mingo@kernel.org>

> Cc: Jin Yao <yao.jin@linux.intel.com>

> Cc: Peter Zijlstra (Intel) <peterz@infradead.org>

> ---

>  kernel/events/core.c | 46 +++++++++++++++++++++++++---------------------

>  1 file changed, 25 insertions(+), 21 deletions(-)

>

> diff --git a/kernel/events/core.c b/kernel/events/core.c

> index 6c4e523..6b263f3 100644

> --- a/kernel/events/core.c

> +++ b/kernel/events/core.c

> @@ -6090,6 +6090,21 @@ void perf_prepare_sample(struct perf_event_header *header,

>         }

>  }

>

> +static bool sample_is_allowed(struct perf_event *event, struct pt_regs *regs)

> +{

> +       /*

> +        * Due to interrupt latency (AKA "skid"), we may enter the

> +        * kernel before taking an overflow, even if the PMU is only

> +        * counting user events.

> +        * To avoid leaking information to userspace, we must always

> +        * reject kernel samples when exclude_kernel is set.

> +        */

> +       if (event->attr.exclude_kernel && !user_mode(regs))

> +               return false;

> +

> +       return true;

> +}

> +

>  static void __always_inline

>  __perf_event_output(struct perf_event *event,

>                     struct perf_sample_data *data,

> @@ -6101,6 +6116,12 @@ void perf_prepare_sample(struct perf_event_header *header,

>         struct perf_output_handle handle;

>         struct perf_event_header header;

>

> +       /*

> +        * For security, drop the skid kernel samples if necessary.

> +        */

> +       if (!sample_is_allowed(event, regs))

> +               return ret;


Just a bare return here.

> +

>         /* protect the callchain buffers */

>         rcu_read_lock();

>

> @@ -7316,21 +7337,6 @@ int perf_event_account_interrupt(struct perf_event *event)

>         return __perf_event_account_interrupt(event, 1);

>  }

>

> -static bool sample_is_allowed(struct perf_event *event, struct pt_regs *regs)

> -{

> -       /*

> -        * Due to interrupt latency (AKA "skid"), we may enter the

> -        * kernel before taking an overflow, even if the PMU is only

> -        * counting user events.

> -        * To avoid leaking information to userspace, we must always

> -        * reject kernel samples when exclude_kernel is set.

> -        */

> -       if (event->attr.exclude_kernel && !user_mode(regs))

> -               return false;

> -

> -       return true;

> -}

> -

>  /*

>   * Generic event overflow handling, sampling.

>   */

> @@ -7352,12 +7358,6 @@ static int __perf_event_overflow(struct perf_event *event,

>         ret = __perf_event_account_interrupt(event, throttle);

>

>         /*

> -        * For security, drop the skid kernel samples if necessary.

> -        */

> -       if (!sample_is_allowed(event, regs))

> -               return ret;

> -

> -       /*

>          * XXX event_limit might not quite work as expected on inherited

>          * events

>          */

> @@ -7372,6 +7372,10 @@ static int __perf_event_overflow(struct perf_event *event,

>

>         READ_ONCE(event->overflow_handler)(event, data, regs);

>

> +       /*

> +        * We must generate a wakeup regardless of whether we actually

> +        * generated a sample. This is relied upon by rr.

> +        */

>         if (*perf_event_fasync(event) && event->pending_kill) {

>                 event->pending_wakeup = 1;

>                 irq_work_queue(&event->pending);

> --

> 1.9.1

>


I can confirm that with that fixed to compile, this patch fixes rr.

Thanks!

- Kyle
Mark Rutland June 28, 2017, 5:49 p.m. | #4
On Wed, Jun 28, 2017 at 09:48:27AM -0700, Kyle Huey wrote:
> On Wed, Jun 28, 2017 at 3:56 AM, Mark Rutland <mark.rutland@arm.com> wrote:

> > @@ -6101,6 +6116,12 @@ void perf_prepare_sample(struct perf_event_header *header,

> >         struct perf_output_handle handle;

> >         struct perf_event_header header;

> >

> > +       /*

> > +        * For security, drop the skid kernel samples if necessary.

> > +        */

> > +       if (!sample_is_allowed(event, regs))

> > +               return ret;

> 

> Just a bare return here.


Ugh, yes. Sorry about that. I'll fix that up.

[...]

> I can confirm that with that fixed to compile, this patch fixes rr.


Thanks for giving this a go.

Having thought about this some more, I think Vince does make a good
point that throwing away samples is liable to break stuff, e.g. that
which only relies on (non-sensitive) samples.

It still seems wrong to make up data, though.

Maybe for exclude_kernel && !exclude_user events we can always generate
samples from the user regs, rather than the exception regs. That's going
to be closer to what the user wants, regardless. I'll take a look
tomorrow.

Thanks,
Mark.
Kyle Huey June 28, 2017, 10:55 p.m. | #5
On Wed, Jun 28, 2017 at 10:49 AM, Mark Rutland <mark.rutland@arm.com> wrote:
> On Wed, Jun 28, 2017 at 09:48:27AM -0700, Kyle Huey wrote:

>> On Wed, Jun 28, 2017 at 3:56 AM, Mark Rutland <mark.rutland@arm.com> wrote:

>> > @@ -6101,6 +6116,12 @@ void perf_prepare_sample(struct perf_event_header *header,

>> >         struct perf_output_handle handle;

>> >         struct perf_event_header header;

>> >

>> > +       /*

>> > +        * For security, drop the skid kernel samples if necessary.

>> > +        */

>> > +       if (!sample_is_allowed(event, regs))

>> > +               return ret;

>>

>> Just a bare return here.

>

> Ugh, yes. Sorry about that. I'll fix that up.

>

> [...]

>

>> I can confirm that with that fixed to compile, this patch fixes rr.

>

> Thanks for giving this a go.

>

> Having thought about this some more, I think Vince does make a good

> point that throwing away samples is liable to break stuff, e.g. that

> which only relies on (non-sensitive) samples.

>

> It still seems wrong to make up data, though.

>

> Maybe for exclude_kernel && !exclude_user events we can always generate

> samples from the user regs, rather than the exception regs. That's going

> to be closer to what the user wants, regardless. I'll take a look

> tomorrow.


I'm not very familiar with the kernel internals, but the reason I
didn't suggest this originally is it seems like it will be difficult
to determine what the "correct" userspace registers are.  For example,
what happens if a performance counter is fixed to a given tid, the
interrupt fires during a context switch from that task to another that
is not being monitored, and the kernel is far enough along in the
context switch that the current task struct has been switched out?
Reporting the new task's registers seems as bad as reporting the
kernel's registers.  But maybe this is easier than I imagine for
whatever reason.

Something to think about.

- Kyle
Jin, Yao June 29, 2017, 12:27 a.m. | #6
On 6/29/2017 6:55 AM, Kyle Huey wrote:
> On Wed, Jun 28, 2017 at 10:49 AM, Mark Rutland <mark.rutland@arm.com> wrote:

>> On Wed, Jun 28, 2017 at 09:48:27AM -0700, Kyle Huey wrote:

>>> On Wed, Jun 28, 2017 at 3:56 AM, Mark Rutland <mark.rutland@arm.com> wrote:

>>>> @@ -6101,6 +6116,12 @@ void perf_prepare_sample(struct perf_event_header *header,

>>>>          struct perf_output_handle handle;

>>>>          struct perf_event_header header;

>>>>

>>>> +       /*

>>>> +        * For security, drop the skid kernel samples if necessary.

>>>> +        */

>>>> +       if (!sample_is_allowed(event, regs))

>>>> +               return ret;

>>> Just a bare return here.

>> Ugh, yes. Sorry about that. I'll fix that up.

>>

>> [...]

>>

>>> I can confirm that with that fixed to compile, this patch fixes rr.

>> Thanks for giving this a go.

>>

>> Having thought about this some more, I think Vince does make a good

>> point that throwing away samples is liable to break stuff, e.g. that

>> which only relies on (non-sensitive) samples.

>>

>> It still seems wrong to make up data, though.

>>

>> Maybe for exclude_kernel && !exclude_user events we can always generate

>> samples from the user regs, rather than the exception regs. That's going

>> to be closer to what the user wants, regardless. I'll take a look

>> tomorrow.

> I'm not very familiar with the kernel internals, but the reason I

> didn't suggest this originally is it seems like it will be difficult

> to determine what the "correct" userspace registers are.  For example,

> what happens if a performance counter is fixed to a given tid, the

> interrupt fires during a context switch from that task to another that

> is not being monitored, and the kernel is far enough along in the

> context switch that the current task struct has been switched out?

> Reporting the new task's registers seems as bad as reporting the

> kernel's registers.  But maybe this is easier than I imagine for

> whatever reason.

>

> Something to think about.

>

> - Kyle


Yes, I think so.

The skid interrupt may be triggered at a wrong context and return wrong 
indications (e.g. wrong regs) to userspace.

So that's why I think the *skid* interrupt had better be dropped.

Thanks
Jin Yao
Ingo Molnar June 29, 2017, 8:12 a.m. | #7
* Mark Rutland <mark.rutland@arm.com> wrote:

> It still seems wrong to make up data, though.


So what we have here is a hardware quirk: we asked for user-space samples, but 
didn't get them and we cannot expose the kernel-internal address.

The question is, how do we handle the hardware quirk. Since we cannot fix the 
hardware on existing systems there's really just two choices:

 - Lose the sample (and signal it as a lost sample)

 - Keep the sample but change the sensitive kernel-internal address to something 
   that is not sensitive: 0 or -1 works, but we could perhaps also return a 
   well-known user-space address such as the vDSO syscall trampoline or such?

there's no other option really.

I'd lean towards Vince's take: losing samples is more surprising than getting the 
occasional sample with some sanitized data in it.

If we make the artificial data still a meaningful user-space address, related to 
kernel entries, then it might even be a bonus, as users would learn to recognize 
it as: 'oh, skid artifact, I know about that'.

Thanks,

	Ingo
Alexey Budankov June 29, 2017, 8:13 a.m. | #8
Hi Folks,

On 28.06.2017 16:07, Mark Rutland wrote:
> On Wed, Jun 28, 2017 at 08:40:30AM -0400, Vince Weaver wrote:

>> On Wed, 28 Jun 2017, Mark Rutland wrote:

>>

>>> On Wed, Jun 28, 2017 at 11:12:48AM +0100, Mark Rutland wrote:

>>

>>> Instead of bailing out early in perf_event_overflow, we can bail prior

>>> to performing the actual sampling in __perf_event_output(). This avoids

>>> the information leak, but preserves the generation of the signal.

>>>

>>> Since we don't place any sample data into the ring buffer, the signal is

>>> arguably spurious. However, a userspace ringbuffer consumer can already

>>> consume data prior to taking the associated signals, and therefore must

>>> handle spurious signals to operate correctly. Thus, this signal

>>> shouldn't be harmful.

>>

>> this could still break some of my perf_event validation tests.

>>

>> Ones that set up a sampling event for every 1M instructions, run for 100M 

>> instructions, and expect there to be 100 samples received.

> 

> Is that test reliable today?

> 

> I'd expect that at least on ARM it's not, given that events can be

> counted imprecisely, and mode filters can be applied imprecisely. So you

> might get fewer (or more) samples. I'd imagine similar is true on other

> archtiectures.

> 

> If sampling took long enough, the existing ratelimiting could come into

> effect, too.

> 

> Surely that already has some error margin?


FYI.

From my recent experience and observation (on Intel Xeon Phi) 
wakeup_events_overflow and overflow_poll tests may fail if 
/proc/sys/kernel/perf_event_max_sample_rate is low enough:

# cat /proc/sys/kernel/perf_event_max_sample_rate
6000
# abudanko/perf_event_tests/tests/overflow/wakeup_events_overflow
This tests wakeup event overflows.
Testing with wakeup_events=1.
Counts, using mmap buffer 0x7f707b0dc000
	POLL_IN : 10
	POLL_OUT: 0
	POLL_MSG: 0
	POLL_ERR: 0
	POLL_PRI: 0
	POLL_HUP: 0
	UNKNOWN : 0
Testing wakeup events overflow...                            PASSED
# abudanko/perf_event_tests/tests/overflow/overflow_poll
This tests using poll() to catch overflow.
Monitoring pid 131412 status 1407
Child has stopped due to signal 5 (Trace/breakpoint trap)
Continuing child
Returned HUP!
Counts, using mmap buffer 0x7fb3bfe04000
	POLL_IN : 10
	POLL_OUT: 0
	POLL_MSG: 0
	POLL_ERR: 0
	POLL_PRI: 0
	POLL_HUP: 1
	UNKNOWN : 0
Testing catching overflow with poll()...                     PASSED

# echo 1000 > /proc/sys/kernel/perf_event_max_sample_rate
# abudanko/perf_event_tests/tests/overflow/overflow_poll
This tests using poll() to catch overflow.
Monitoring pid 131551 status 1407
Child has stopped due to signal 5 (Trace/breakpoint trap)
Continuing child
Returned HUP!
Counts, using mmap buffer 0x7f80532df000
	POLL_IN : 9
	POLL_OUT: 0
	POLL_MSG: 0
	POLL_ERR: 0
	POLL_PRI: 0
	POLL_HUP: 1
	UNKNOWN : 0
Unexpected POLL_IN interrupt.
Testing catching overflow with poll()...                     FAILED
[root@nntpdsd52-210 ~]# abudanko/perf_event_tests/tests/overflow/overflow_poll
This tests using poll() to catch overflow.
Monitoring pid 131553 status 1407
Child has stopped due to signal 5 (Trace/breakpoint trap)
Continuing child
Returned HUP!
Counts, using mmap buffer 0x7f650952c000
	POLL_IN : 9
	POLL_OUT: 0
	POLL_MSG: 0
	POLL_ERR: 0
	POLL_PRI: 0
	POLL_HUP: 1
	UNKNOWN : 0
Unexpected POLL_IN interrupt.
Testing catching overflow with poll()...                     FAILED

> 

>> If we're so worried about info leakage, can't we just zero-out the problem 

>> address (or randomize the kernel address) rather than just pretending the 

>> interrupt didn't happen?

> 

> Making up zeroed or randomized data is going to confuse users. I can't

> imagine that real users are going to want bogus samples that they have

> to identify (somehow) in order to skip when processing the data.

> 

> I can see merit in signalling "lost" samples to userspace, so long as

> they're easily distinguished from real samples.

> 

> One option is to fake up a sample using the user regs regardless, but

> that's both fragile and surprising in other cases.

> 

> Thanks,

> Mark.

> 


Thanks,
Alexey
Alexey Budankov June 29, 2017, 8:25 a.m. | #9
On 29.06.2017 11:13, Alexey Budankov wrote:
> Hi Folks,

> 

> On 28.06.2017 16:07, Mark Rutland wrote:

>> On Wed, Jun 28, 2017 at 08:40:30AM -0400, Vince Weaver wrote:

>>> On Wed, 28 Jun 2017, Mark Rutland wrote:

>>>

>>>> On Wed, Jun 28, 2017 at 11:12:48AM +0100, Mark Rutland wrote:

>>>

>>>> Instead of bailing out early in perf_event_overflow, we can bail prior

>>>> to performing the actual sampling in __perf_event_output(). This avoids

>>>> the information leak, but preserves the generation of the signal.

>>>>

>>>> Since we don't place any sample data into the ring buffer, the signal is

>>>> arguably spurious. However, a userspace ringbuffer consumer can already

>>>> consume data prior to taking the associated signals, and therefore must

>>>> handle spurious signals to operate correctly. Thus, this signal

>>>> shouldn't be harmful.

>>>

>>> this could still break some of my perf_event validation tests.

>>>

>>> Ones that set up a sampling event for every 1M instructions, run for 100M 

>>> instructions, and expect there to be 100 samples received.

>>

>> Is that test reliable today?

>>

>> I'd expect that at least on ARM it's not, given that events can be

>> counted imprecisely, and mode filters can be applied imprecisely. So you

>> might get fewer (or more) samples. I'd imagine similar is true on other

>> archtiectures.

>>

>> If sampling took long enough, the existing ratelimiting could come into

>> effect, too.

>>

>> Surely that already has some error margin?

> 

> FYI.

> 

>>From my recent experience and observation (on Intel Xeon Phi) 

> wakeup_events_overflow and overflow_poll tests may fail if 

> /proc/sys/kernel/perf_event_max_sample_rate is low enough:

> 

> # cat /proc/sys/kernel/perf_event_max_sample_rate

> 6000

> # abudanko/perf_event_tests/tests/overflow/wakeup_events_overflow

> This tests wakeup event overflows.

> Testing with wakeup_events=1.

> Counts, using mmap buffer 0x7f707b0dc000

> 	POLL_IN : 10

> 	POLL_OUT: 0

> 	POLL_MSG: 0

> 	POLL_ERR: 0

> 	POLL_PRI: 0

> 	POLL_HUP: 0

> 	UNKNOWN : 0

> Testing wakeup events overflow...                            PASSED

> # abudanko/perf_event_tests/tests/overflow/overflow_poll

> This tests using poll() to catch overflow.

> Monitoring pid 131412 status 1407

> Child has stopped due to signal 5 (Trace/breakpoint trap)

> Continuing child

> Returned HUP!

> Counts, using mmap buffer 0x7fb3bfe04000

> 	POLL_IN : 10

> 	POLL_OUT: 0

> 	POLL_MSG: 0

> 	POLL_ERR: 0

> 	POLL_PRI: 0

> 	POLL_HUP: 1

> 	UNKNOWN : 0

> Testing catching overflow with poll()...                     PASSED

> 

> # echo 1000 > /proc/sys/kernel/perf_event_max_sample_rate

> # abudanko/perf_event_tests/tests/overflow/overflow_poll

> This tests using poll() to catch overflow.

> Monitoring pid 131551 status 1407

> Child has stopped due to signal 5 (Trace/breakpoint trap)

> Continuing child

> Returned HUP!

> Counts, using mmap buffer 0x7f80532df000

> 	POLL_IN : 9

> 	POLL_OUT: 0

> 	POLL_MSG: 0

> 	POLL_ERR: 0

> 	POLL_PRI: 0

> 	POLL_HUP: 1

> 	UNKNOWN : 0

> Unexpected POLL_IN interrupt.

> Testing catching overflow with poll()...                     FAILED

> [root@nntpdsd52-210 ~]# abudanko/perf_event_tests/tests/overflow/overflow_poll

> This tests using poll() to catch overflow.

> Monitoring pid 131553 status 1407

> Child has stopped due to signal 5 (Trace/breakpoint trap)

> Continuing child

> Returned HUP!

> Counts, using mmap buffer 0x7f650952c000

> 	POLL_IN : 9

> 	POLL_OUT: 0

> 	POLL_MSG: 0

> 	POLL_ERR: 0

> 	POLL_PRI: 0

> 	POLL_HUP: 1

> 	UNKNOWN : 0

> Unexpected POLL_IN interrupt.

> Testing catching overflow with poll()...                     FAILED


More of the other test:

# echo 1000 > /proc/sys/kernel/perf_event_max_sample_rate
[ ~]# abudanko/perf_event_tests/tests/overflow/wakeup_events_overflow
This tests wakeup event overflows.
Testing with wakeup_events=1.
Counts, using mmap buffer 0x7fe65deff000
	POLL_IN : 4
	POLL_OUT: 0
	POLL_MSG: 0
	POLL_ERR: 0
	POLL_PRI: 0
	POLL_HUP: 0
	UNKNOWN : 0
POLL_IN value 4, expected 10.
Testing wakeup events overflow...                            FAILED
[ ~]# abudanko/perf_event_tests/tests/overflow/wakeup_events_overflow
This tests wakeup event overflows.
Testing with wakeup_events=1.
Counts, using mmap buffer 0x7f818a732000
	POLL_IN : 2
	POLL_OUT: 0
	POLL_MSG: 0
	POLL_ERR: 0
	POLL_PRI: 0
	POLL_HUP: 0
	UNKNOWN : 0
POLL_IN value 2, expected 10.
Testing wakeup events overflow...                            FAILED
[ ~]# abudanko/perf_event_tests/tests/overflow/wakeup_events_overflow
This tests wakeup event overflows.
Testing with wakeup_events=1.
Counts, using mmap buffer 0x7fb3675c5000
	POLL_IN : 4
	POLL_OUT: 0
	POLL_MSG: 0
	POLL_ERR: 0
	POLL_PRI: 0
	POLL_HUP: 0
	UNKNOWN : 0
POLL_IN value 4, expected 10.
Testing wakeup events overflow...                            FAILED

> 

>>

>>> If we're so worried about info leakage, can't we just zero-out the problem 

>>> address (or randomize the kernel address) rather than just pretending the 

>>> interrupt didn't happen?

>>

>> Making up zeroed or randomized data is going to confuse users. I can't

>> imagine that real users are going to want bogus samples that they have

>> to identify (somehow) in order to skip when processing the data.

>>

>> I can see merit in signalling "lost" samples to userspace, so long as

>> they're easily distinguished from real samples.

>>

>> One option is to fake up a sample using the user regs regardless, but

>> that's both fragile and surprising in other cases.

>>

>> Thanks,

>> Mark.

>>

> 

> Thanks,

> Alexey

> 

> 

>
Kyle Huey June 30, 2017, 5:44 p.m. | #10
On Wed, Jun 28, 2017 at 3:55 PM, Kyle Huey <me@kylehuey.com> wrote:
> On Wed, Jun 28, 2017 at 10:49 AM, Mark Rutland <mark.rutland@arm.com> wrote:

>> On Wed, Jun 28, 2017 at 09:48:27AM -0700, Kyle Huey wrote:

>>> On Wed, Jun 28, 2017 at 3:56 AM, Mark Rutland <mark.rutland@arm.com> wrote:

>>> > @@ -6101,6 +6116,12 @@ void perf_prepare_sample(struct perf_event_header *header,

>>> >         struct perf_output_handle handle;

>>> >         struct perf_event_header header;

>>> >

>>> > +       /*

>>> > +        * For security, drop the skid kernel samples if necessary.

>>> > +        */

>>> > +       if (!sample_is_allowed(event, regs))

>>> > +               return ret;

>>>

>>> Just a bare return here.

>>

>> Ugh, yes. Sorry about that. I'll fix that up.

>>

>> [...]

>>

>>> I can confirm that with that fixed to compile, this patch fixes rr.

>>

>> Thanks for giving this a go.

>>

>> Having thought about this some more, I think Vince does make a good

>> point that throwing away samples is liable to break stuff, e.g. that

>> which only relies on (non-sensitive) samples.

>>

>> It still seems wrong to make up data, though.

>>

>> Maybe for exclude_kernel && !exclude_user events we can always generate

>> samples from the user regs, rather than the exception regs. That's going

>> to be closer to what the user wants, regardless. I'll take a look

>> tomorrow.

>

> I'm not very familiar with the kernel internals, but the reason I

> didn't suggest this originally is it seems like it will be difficult

> to determine what the "correct" userspace registers are.  For example,

> what happens if a performance counter is fixed to a given tid, the

> interrupt fires during a context switch from that task to another that

> is not being monitored, and the kernel is far enough along in the

> context switch that the current task struct has been switched out?

> Reporting the new task's registers seems as bad as reporting the

> kernel's registers.  But maybe this is easier than I imagine for

> whatever reason.

>

> Something to think about.

>

> - Kyle


We've noticed that the regressing changeset made it into stable
branches that distros are shipping now[0], and we're starting to
receive bug reports from our users.  We would really like to get this
patch or something similar into 4.12 and the next stable releases if
at all possible.

Thanks!

- Kyle

[0] Full list in https://mail.mozilla.org/pipermail/rr-dev/2017-June/000500.html
Peter Zijlstra July 4, 2017, 9:06 a.m. | #11
On Thu, Jun 29, 2017 at 10:12:33AM +0200, Ingo Molnar wrote:
> 

> * Mark Rutland <mark.rutland@arm.com> wrote:

> 

> > It still seems wrong to make up data, though.

> 

> So what we have here is a hardware quirk: we asked for user-space samples, but 

> didn't get them and we cannot expose the kernel-internal address.

> 

> The question is, how do we handle the hardware quirk. Since we cannot fix the 

> hardware on existing systems there's really just two choices:

> 

>  - Lose the sample (and signal it as a lost sample)

> 

>  - Keep the sample but change the sensitive kernel-internal address to something 

>    that is not sensitive: 0 or -1 works, but we could perhaps also return a 

>    well-known user-space address such as the vDSO syscall trampoline or such?

> 

> there's no other option really.

> 

> I'd lean towards Vince's take: losing samples is more surprising than getting the 

> occasional sample with some sanitized data in it.

> 

> If we make the artificial data still a meaningful user-space address, related to 

> kernel entries, then it might even be a bonus, as users would learn to recognize 

> it as: 'oh, skid artifact, I know about that'.


So while we could easily fake SAMPLE_IP to do as you suggest, other
entries might be much harder to fake. That said, I have no problems with
just 0 stuffing them.

The only real problem is determining how much to stuff I suppose.
Mark Rutland July 4, 2017, 9:33 a.m. | #12
On Tue, Jul 04, 2017 at 11:03:13AM +0200, Peter Zijlstra wrote:
> On Wed, Jun 28, 2017 at 03:55:07PM -0700, Kyle Huey wrote:

> 

> > > Having thought about this some more, I think Vince does make a good

> > > point that throwing away samples is liable to break stuff, e.g. that

> > > which only relies on (non-sensitive) samples.

> > >

> > > It still seems wrong to make up data, though.

> 

> It is something we do in other places as well though. For example the

> printk() %pK thing fakes NULL pointers when kptr_restrict is set.


It looks like I'm outnumbered on that, then. :)

I'd still argue it's worth somehow indicating which samples were thrown
away, so that (updated) userspace can choose to ignore them, but I guess
that can come later.

> Faking data gets a wee bit tricky in how much data we need to clear

> through, its not only IP, pretty much everything we get from the

> interrupt context, like the branch stack and registers is also suspect.


Indeed. I'll take a run through __perf_event_output() and callees, and
see what we need to drop.

> > > Maybe for exclude_kernel && !exclude_user events we can always generate

> > > samples from the user regs, rather than the exception regs. That's going

> > > to be closer to what the user wants, regardless. I'll take a look

> > > tomorrow.

> > 

> > I'm not very familiar with the kernel internals, but the reason I

> > didn't suggest this originally is it seems like it will be difficult

> > to determine what the "correct" userspace registers are.  For example,

> > what happens if a performance counter is fixed to a given tid, the

> > interrupt fires during a context switch from that task to another that

> > is not being monitored, and the kernel is far enough along in the

> > context switch that the current task struct has been switched out?

> > Reporting the new task's registers seems as bad as reporting the

> > kernel's registers.  But maybe this is easier than I imagine for

> > whatever reason.

> 

> If the counter is fixed to a task then its scheduled along with the

> task. We'll schedule out the event before doing the actual task switch

> and switch in the new event after.

> 

> That said, with a per-cpu event the TID sample value is indeed subject

> to skid like you describe.


For per-cpu events, does that matter? Those don't have TID filters in
the first place, no?

Thanks,
Mark.
Peter Zijlstra July 4, 2017, 9:46 a.m. | #13
On Tue, Jul 04, 2017 at 10:33:45AM +0100, Mark Rutland wrote:

> > That said, with a per-cpu event the TID sample value is indeed subject

> > to skid like you describe.

> 

> For per-cpu events, does that matter? Those don't have TID filters in

> the first place, no?


eBPF can do all sorts I suppose.

But yes, at some point you just have to give up, there's only so much
one can do.
Ingo Molnar July 4, 2017, 10:04 a.m. | #14
* Peter Zijlstra <peterz@infradead.org> wrote:

> On Thu, Jun 29, 2017 at 10:12:33AM +0200, Ingo Molnar wrote:

> > 

> > * Mark Rutland <mark.rutland@arm.com> wrote:

> > 

> > > It still seems wrong to make up data, though.

> > 

> > So what we have here is a hardware quirk: we asked for user-space samples, but 

> > didn't get them and we cannot expose the kernel-internal address.

> > 

> > The question is, how do we handle the hardware quirk. Since we cannot fix the 

> > hardware on existing systems there's really just two choices:

> > 

> >  - Lose the sample (and signal it as a lost sample)

> > 

> >  - Keep the sample but change the sensitive kernel-internal address to something 

> >    that is not sensitive: 0 or -1 works, but we could perhaps also return a 

> >    well-known user-space address such as the vDSO syscall trampoline or such?

> > 

> > there's no other option really.

> > 

> > I'd lean towards Vince's take: losing samples is more surprising than getting the 

> > occasional sample with some sanitized data in it.

> > 

> > If we make the artificial data still a meaningful user-space address, related to 

> > kernel entries, then it might even be a bonus, as users would learn to recognize 

> > it as: 'oh, skid artifact, I know about that'.

> 

> So while we could easily fake SAMPLE_IP to do as you suggest, other

> entries might be much harder to fake. That said, I have no problems with

> just 0 stuffing them.

> 

> The only real problem is determining how much to stuff I suppose.


I think the RIP is the most important one to fix up in an informative fashion 
(instead of just zeroing it out), so that mainstream users of 'perf top' or
'perf report' have a chance to see that certain entries have this skid artifact.

The other registers should be zeroed out once we stop trusting a sample.

Thanks,

	Ingo
Robert O'Callahan July 6, 2017, 5:07 a.m. | #15
On Tue, Jul 4, 2017 at 3:21 AM, Mark Rutland <mark.rutland@arm.com> wrote:
> Should any of those be moved into the "should be dropped" pile?


Why not be conservative and clear every sample you're not sure about?

We'd appreciate a fix sooner rather than later here, since rr is
currently broken on every stable Linux kernel and our attempts to
implement a workaround have failed.

(We have separate "interrupt" and "measure" counters, and I thought we
might work around this regression by programming the "interrupt"
counter to count kernel events as well as user events (interrupting
early is OK), but that caused our (completely separate) "measure"
counter to report off-by-one results (!), which seems to be a
different bug present on a range of older kernels.)

Thanks,
Rob
-- 
lbir ye,ea yer.tnietoehr  rdn rdsme,anea lurpr  edna e hnysnenh hhe uresyf toD
selthor  stor  edna  siewaoeodm  or v sstvr  esBa  kbvted,t rdsme,aoreseoouoto
o l euetiuruewFa  kbn e hnystoivateweh uresyf tulsa rehr  rdm  or rnea lurpr
.a war hsrer holsa rodvted,t  nenh hneireseoouot.tniesiewaoeivatewt sstvr  esn
Kyle Huey July 11, 2017, 2:03 a.m. | #16
On Wed, Jul 5, 2017 at 10:07 PM, Robert O'Callahan <robert@ocallahan.org> wrote:
> On Tue, Jul 4, 2017 at 3:21 AM, Mark Rutland <mark.rutland@arm.com> wrote:

>> Should any of those be moved into the "should be dropped" pile?

>

> Why not be conservative and clear every sample you're not sure about?

>

> We'd appreciate a fix sooner rather than later here, since rr is

> currently broken on every stable Linux kernel and our attempts to

> implement a workaround have failed.

>

> (We have separate "interrupt" and "measure" counters, and I thought we

> might work around this regression by programming the "interrupt"

> counter to count kernel events as well as user events (interrupting

> early is OK), but that caused our (completely separate) "measure"

> counter to report off-by-one results (!), which seems to be a

> different bug present on a range of older kernels.)


This seems to have stalled out here unfortunately.

Can we get a consensus (from ingo or peterz?) on Mark's question?  Or,
alternatively, can we move the patch at the top of this thread forward
on the stable branches until we do reach an answer to that question?

We've abandoned hope of working around this problem in rr and are
currently broken for all of our users with an up-to-date kernel, so
the situation for us is rather dire at the moment I'm afraid.

Thanks,

- Kyle
Ingo Molnar July 11, 2017, 9:03 a.m. | #17
* Kyle Huey <me@kylehuey.com> wrote:

> On Wed, Jul 5, 2017 at 10:07 PM, Robert O'Callahan <robert@ocallahan.org> wrote:

> > On Tue, Jul 4, 2017 at 3:21 AM, Mark Rutland <mark.rutland@arm.com> wrote:

> >> Should any of those be moved into the "should be dropped" pile?

> >

> > Why not be conservative and clear every sample you're not sure about?

> >

> > We'd appreciate a fix sooner rather than later here, since rr is

> > currently broken on every stable Linux kernel and our attempts to

> > implement a workaround have failed.

> >

> > (We have separate "interrupt" and "measure" counters, and I thought we

> > might work around this regression by programming the "interrupt"

> > counter to count kernel events as well as user events (interrupting

> > early is OK), but that caused our (completely separate) "measure"

> > counter to report off-by-one results (!), which seems to be a

> > different bug present on a range of older kernels.)

> 

> This seems to have stalled out here unfortunately.

> 

> Can we get a consensus (from ingo or peterz?) on Mark's question?  Or,

> alternatively, can we move the patch at the top of this thread forward

> on the stable branches until we do reach an answer to that question?

> 

> We've abandoned hope of working around this problem in rr and are

> currently broken for all of our users with an up-to-date kernel, so

> the situation for us is rather dire at the moment I'm afraid.


Sorry about that - I've queued up a revert for the original commit and will send 
the fix to Linus later today. I've added a -stable tag as well so it can be 
forwarded to Greg the moment it hits upstream.

We should do the original fix as well, but in a version that does not skip the 
sample but zeroes out the RIP and registers (or sets them all to -1LL) - and also 
covers other possible places where skid-RIP is exposed, such as LBR.

Thanks,

	Ingo
Jin, Yao July 11, 2017, 1:07 p.m. | #18
On 7/11/2017 5:03 PM, Ingo Molnar wrote:
> * Kyle Huey <me@kylehuey.com> wrote:

>

>> On Wed, Jul 5, 2017 at 10:07 PM, Robert O'Callahan <robert@ocallahan.org> wrote:

>>> On Tue, Jul 4, 2017 at 3:21 AM, Mark Rutland <mark.rutland@arm.com> wrote:

>>>> Should any of those be moved into the "should be dropped" pile?

>>> Why not be conservative and clear every sample you're not sure about?

>>>

>>> We'd appreciate a fix sooner rather than later here, since rr is

>>> currently broken on every stable Linux kernel and our attempts to

>>> implement a workaround have failed.

>>>

>>> (We have separate "interrupt" and "measure" counters, and I thought we

>>> might work around this regression by programming the "interrupt"

>>> counter to count kernel events as well as user events (interrupting

>>> early is OK), but that caused our (completely separate) "measure"

>>> counter to report off-by-one results (!), which seems to be a

>>> different bug present on a range of older kernels.)

>> This seems to have stalled out here unfortunately.

>>

>> Can we get a consensus (from ingo or peterz?) on Mark's question?  Or,

>> alternatively, can we move the patch at the top of this thread forward

>> on the stable branches until we do reach an answer to that question?

>>

>> We've abandoned hope of working around this problem in rr and are

>> currently broken for all of our users with an up-to-date kernel, so

>> the situation for us is rather dire at the moment I'm afraid.

> Sorry about that - I've queued up a revert for the original commit and will send

> the fix to Linus later today. I've added a -stable tag as well so it can be

> forwarded to Greg the moment it hits upstream.

>

> We should do the original fix as well, but in a version that does not skip the

> sample but zeroes out the RIP and registers (or sets them all to -1LL) - and also

> covers other possible places where skid-RIP is exposed, such as LBR.

>

> Thanks,

>

> 	Ingo


Could we provide 2 options in user space when enabling the event sampling?

One option is for the use case like rr debugger which only cares the PMI 
interrupt but doesn't care the skid. The skid samples doesn't need to be 
dropped.

The other option is for the use case which needs the accurate sample 
data. For this option, the skid samples are dropped.

I would suggest to let the user space make the decision to choose which 
option.

Thanks
Jin Yao
Kyle Huey July 11, 2017, 3:32 p.m. | #19
On Tue, Jul 11, 2017 at 2:03 AM, Ingo Molnar <mingo@kernel.org> wrote:
>

> * Kyle Huey <me@kylehuey.com> wrote:

>

>> On Wed, Jul 5, 2017 at 10:07 PM, Robert O'Callahan <robert@ocallahan.org> wrote:

>> > On Tue, Jul 4, 2017 at 3:21 AM, Mark Rutland <mark.rutland@arm.com> wrote:

>> >> Should any of those be moved into the "should be dropped" pile?

>> >

>> > Why not be conservative and clear every sample you're not sure about?

>> >

>> > We'd appreciate a fix sooner rather than later here, since rr is

>> > currently broken on every stable Linux kernel and our attempts to

>> > implement a workaround have failed.

>> >

>> > (We have separate "interrupt" and "measure" counters, and I thought we

>> > might work around this regression by programming the "interrupt"

>> > counter to count kernel events as well as user events (interrupting

>> > early is OK), but that caused our (completely separate) "measure"

>> > counter to report off-by-one results (!), which seems to be a

>> > different bug present on a range of older kernels.)

>>

>> This seems to have stalled out here unfortunately.

>>

>> Can we get a consensus (from ingo or peterz?) on Mark's question?  Or,

>> alternatively, can we move the patch at the top of this thread forward

>> on the stable branches until we do reach an answer to that question?

>>

>> We've abandoned hope of working around this problem in rr and are

>> currently broken for all of our users with an up-to-date kernel, so

>> the situation for us is rather dire at the moment I'm afraid.

>

> Sorry about that - I've queued up a revert for the original commit and will send

> the fix to Linus later today. I've added a -stable tag as well so it can be

> forwarded to Greg the moment it hits upstream.


Great, thank you.

- Kyle
Ingo Molnar July 12, 2017, 7:57 a.m. | #20
* Jin, Yao <yao.jin@linux.intel.com> wrote:

> Could we provide 2 options in user space when enabling the event sampling?

> 

> One option is for the use case like rr debugger which only cares the PMI 

> interrupt but doesn't care the skid. The skid samples doesn't need to be 

> dropped.


Since it's an information leak, this is not something we want to expose as an 
interface. It's also a very ugly interface: why not just *clear* the sample, 
instead of dropping it?

The hardware messed up and gave us something we specifically did not permit 
user-space to collect. We have to fix the bad effects to the best extent we can, 
and not based on some knob.

Thanks,

	Ingo
Kyle Huey July 18, 2017, 12:07 a.m. | #21
On Tue, Jul 11, 2017 at 8:32 AM, Kyle Huey <me@kylehuey.com> wrote:
> On Tue, Jul 11, 2017 at 2:03 AM, Ingo Molnar <mingo@kernel.org> wrote:

>>

>> * Kyle Huey <me@kylehuey.com> wrote:

>>

>>> On Wed, Jul 5, 2017 at 10:07 PM, Robert O'Callahan <robert@ocallahan.org> wrote:

>>> > On Tue, Jul 4, 2017 at 3:21 AM, Mark Rutland <mark.rutland@arm.com> wrote:

>>> >> Should any of those be moved into the "should be dropped" pile?

>>> >

>>> > Why not be conservative and clear every sample you're not sure about?

>>> >

>>> > We'd appreciate a fix sooner rather than later here, since rr is

>>> > currently broken on every stable Linux kernel and our attempts to

>>> > implement a workaround have failed.

>>> >

>>> > (We have separate "interrupt" and "measure" counters, and I thought we

>>> > might work around this regression by programming the "interrupt"

>>> > counter to count kernel events as well as user events (interrupting

>>> > early is OK), but that caused our (completely separate) "measure"

>>> > counter to report off-by-one results (!), which seems to be a

>>> > different bug present on a range of older kernels.)

>>>

>>> This seems to have stalled out here unfortunately.

>>>

>>> Can we get a consensus (from ingo or peterz?) on Mark's question?  Or,

>>> alternatively, can we move the patch at the top of this thread forward

>>> on the stable branches until we do reach an answer to that question?

>>>

>>> We've abandoned hope of working around this problem in rr and are

>>> currently broken for all of our users with an up-to-date kernel, so

>>> the situation for us is rather dire at the moment I'm afraid.

>>

>> Sorry about that - I've queued up a revert for the original commit and will send

>> the fix to Linus later today. I've added a -stable tag as well so it can be

>> forwarded to Greg the moment it hits upstream.

>

> Great, thank you.

>

> - Kyle


Hi again,

I saw that the revert landed on perf/urgent but it doesn't look like
that got pulled by Linus in time for 4.13-rc1.  Consider this a gentle
poke to please get this merged :)

- Kyle

Patch

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 6c4e523..6b263f3 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6090,6 +6090,21 @@  void perf_prepare_sample(struct perf_event_header *header,
 	}
 }
 
+static bool sample_is_allowed(struct perf_event *event, struct pt_regs *regs)
+{
+	/*
+	 * Due to interrupt latency (AKA "skid"), we may enter the
+	 * kernel before taking an overflow, even if the PMU is only
+	 * counting user events.
+	 * To avoid leaking information to userspace, we must always
+	 * reject kernel samples when exclude_kernel is set.
+	 */
+	if (event->attr.exclude_kernel && !user_mode(regs))
+		return false;
+
+	return true;
+}
+
 static void __always_inline
 __perf_event_output(struct perf_event *event,
 		    struct perf_sample_data *data,
@@ -6101,6 +6116,12 @@  void perf_prepare_sample(struct perf_event_header *header,
 	struct perf_output_handle handle;
 	struct perf_event_header header;
 
+	/*
+	 * For security, drop the skid kernel samples if necessary.
+	 */
+	if (!sample_is_allowed(event, regs))
+		return ret;
+
 	/* protect the callchain buffers */
 	rcu_read_lock();
 
@@ -7316,21 +7337,6 @@  int perf_event_account_interrupt(struct perf_event *event)
 	return __perf_event_account_interrupt(event, 1);
 }
 
-static bool sample_is_allowed(struct perf_event *event, struct pt_regs *regs)
-{
-	/*
-	 * Due to interrupt latency (AKA "skid"), we may enter the
-	 * kernel before taking an overflow, even if the PMU is only
-	 * counting user events.
-	 * To avoid leaking information to userspace, we must always
-	 * reject kernel samples when exclude_kernel is set.
-	 */
-	if (event->attr.exclude_kernel && !user_mode(regs))
-		return false;
-
-	return true;
-}
-
 /*
  * Generic event overflow handling, sampling.
  */
@@ -7352,12 +7358,6 @@  static int __perf_event_overflow(struct perf_event *event,
 	ret = __perf_event_account_interrupt(event, throttle);
 
 	/*
-	 * For security, drop the skid kernel samples if necessary.
-	 */
-	if (!sample_is_allowed(event, regs))
-		return ret;
-
-	/*
 	 * XXX event_limit might not quite work as expected on inherited
 	 * events
 	 */
@@ -7372,6 +7372,10 @@  static int __perf_event_overflow(struct perf_event *event,
 
 	READ_ONCE(event->overflow_handler)(event, data, regs);
 
+	/*
+	 * We must generate a wakeup regardless of whether we actually
+	 * generated a sample. This is relied upon by rr.
+	 */
 	if (*perf_event_fasync(event) && event->pending_kill) {
 		event->pending_wakeup = 1;
 		irq_work_queue(&event->pending);