Message ID: 20170628105600.GC5981@leverpostej
State: New
On Wed, 28 Jun 2017, Mark Rutland wrote:

> On Wed, Jun 28, 2017 at 11:12:48AM +0100, Mark Rutland wrote:

> Instead of bailing out early in perf_event_overflow, we can bail prior
> to performing the actual sampling in __perf_event_output(). This avoids
> the information leak, but preserves the generation of the signal.
>
> Since we don't place any sample data into the ring buffer, the signal is
> arguably spurious. However, a userspace ringbuffer consumer can already
> consume data prior to taking the associated signals, and therefore must
> handle spurious signals to operate correctly. Thus, this signal
> shouldn't be harmful.

this could still break some of my perf_event validation tests.

Ones that set up a sampling event for every 1M instructions, run for 100M
instructions, and expect there to be 100 samples received.

If we're so worried about info leakage, can't we just zero-out the problem
address (or randomize the kernel address) rather than just pretending the
interrupt didn't happen?

Vince
On Wed, Jun 28, 2017 at 08:40:30AM -0400, Vince Weaver wrote:
> On Wed, 28 Jun 2017, Mark Rutland wrote:
>
> > On Wed, Jun 28, 2017 at 11:12:48AM +0100, Mark Rutland wrote:
>
> > Instead of bailing out early in perf_event_overflow, we can bail prior
> > to performing the actual sampling in __perf_event_output(). This avoids
> > the information leak, but preserves the generation of the signal.
> >
> > Since we don't place any sample data into the ring buffer, the signal is
> > arguably spurious. However, a userspace ringbuffer consumer can already
> > consume data prior to taking the associated signals, and therefore must
> > handle spurious signals to operate correctly. Thus, this signal
> > shouldn't be harmful.
>
> this could still break some of my perf_event validation tests.
>
> Ones that set up a sampling event for every 1M instructions, run for 100M
> instructions, and expect there to be 100 samples received.

Is that test reliable today?

I'd expect that at least on ARM it's not, given that events can be
counted imprecisely, and mode filters can be applied imprecisely. So you
might get fewer (or more) samples. I'd imagine similar is true on other
architectures.

If sampling took long enough, the existing ratelimiting could come into
effect, too.

Surely that already has some error margin?

> If we're so worried about info leakage, can't we just zero-out the problem
> address (or randomize the kernel address) rather than just pretending the
> interrupt didn't happen?

Making up zeroed or randomized data is going to confuse users. I can't
imagine that real users are going to want bogus samples that they have
to identify (somehow) in order to skip when processing the data.

I can see merit in signalling "lost" samples to userspace, so long as
they're easily distinguished from real samples.

One option is to fake up a sample using the user regs regardless, but
that's both fragile and surprising in other cases.

Thanks,
Mark.
On Wed, Jun 28, 2017 at 3:56 AM, Mark Rutland <mark.rutland@arm.com> wrote:
> On Wed, Jun 28, 2017 at 11:12:48AM +0100, Mark Rutland wrote:
>> On Tue, Jun 27, 2017 at 09:51:00PM -0700, Kyle Huey wrote:
>
>> As we're trying to avoid sampling state, I think we can move the check
>> into perf_prepare_sample() or __perf_event_output(), where that state is
>> actually sampled. I'll take a look at that momentarily.
>>
>> Just to clarify, you don't care about the sample state at all? i.e. you
>> don't need the user program counter?
>>
>> Is that signal delivered to the tracee, or to a different process that
>> traces it? If the latter, what ensures that the task is stopped
>> sufficiently quickly?
>>
>> > It seems to me that it might be reasonable to ignore the interrupt if
>> > the purpose of the interrupt is to trigger sampling of the CPUs
>> > register state. But if the interrupt will trigger some other
>> > operation, such as a signal on an fd, then there's no reason to drop
>> > it.
>>
>> Agreed. I'll try to have a patch for this soon.
>>
>> I just need to figure out exactly where that overflow signal is
>> generated by the perf core.
>
> I've figured that out now. That's handled by perf_pending_event(), which
> is the irq_work we schedule in __perf_event_overflow().
>
> Does the below patch work for you?
>
> ---->8----
> From bb1f99dace508ce34ab0818f91d59e73450fa142 Mon Sep 17 00:00:00 2001
> From: Mark Rutland <mark.rutland@arm.com>
> Date: Wed, 28 Jun 2017 11:39:25 +0100
> Subject: [PATCH] perf/core: generate overflow signal when samples are dropped
>
> We recently tried to kill an information leak where users could receive
> kernel samples due to skid on the PMU interrupt. To block this, we
> bailed out early in perf_event_overflow(), as we do for non-sampling
> events.
>
> This broke rr, which uses sampling events to receive a signal on
> overflow (but does not care about the contents of the sample). These
> signals are critical to the correct operation of rr.
>
> Instead of bailing out early in perf_event_overflow, we can bail prior
> to performing the actual sampling in __perf_event_output(). This avoids
> the information leak, but preserves the generation of the signal.
>
> Since we don't place any sample data into the ring buffer, the signal is
> arguably spurious. However, a userspace ringbuffer consumer can already
> consume data prior to taking the associated signals, and therefore must
> handle spurious signals to operate correctly. Thus, this signal
> shouldn't be harmful.
>
> Fixes: cc1582c231ea041f ("perf/core: Drop kernel samples even though :u is specified")
> Reported-by: Kyle Huey <me@kylehuey.com>
> Signed-off-by: Mark Rutland <mark.rutland@arm.com>
> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
> Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Jin Yao <yao.jin@linux.intel.com>
> Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/events/core.c | 46 +++++++++++++++++++++++++---------------------
>  1 file changed, 25 insertions(+), 21 deletions(-)
>
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 6c4e523..6b263f3 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -6090,6 +6090,21 @@ void perf_prepare_sample(struct perf_event_header *header,
>  	}
>  }
>
> +static bool sample_is_allowed(struct perf_event *event, struct pt_regs *regs)
> +{
> +	/*
> +	 * Due to interrupt latency (AKA "skid"), we may enter the
> +	 * kernel before taking an overflow, even if the PMU is only
> +	 * counting user events.
> +	 * To avoid leaking information to userspace, we must always
> +	 * reject kernel samples when exclude_kernel is set.
> +	 */
> +	if (event->attr.exclude_kernel && !user_mode(regs))
> +		return false;
> +
> +	return true;
> +}
> +
>  static void __always_inline
>  __perf_event_output(struct perf_event *event,
>  		    struct perf_sample_data *data,
> @@ -6101,6 +6116,12 @@ void perf_prepare_sample(struct perf_event_header *header,
>  	struct perf_output_handle handle;
>  	struct perf_event_header header;
>
> +	/*
> +	 * For security, drop the skid kernel samples if necessary.
> +	 */
> +	if (!sample_is_allowed(event, regs))
> +		return ret;

Just a bare return here.

> +
>  	/* protect the callchain buffers */
>  	rcu_read_lock();
>
> @@ -7316,21 +7337,6 @@ int perf_event_account_interrupt(struct perf_event *event)
>  	return __perf_event_account_interrupt(event, 1);
>  }
>
> -static bool sample_is_allowed(struct perf_event *event, struct pt_regs *regs)
> -{
> -	/*
> -	 * Due to interrupt latency (AKA "skid"), we may enter the
> -	 * kernel before taking an overflow, even if the PMU is only
> -	 * counting user events.
> -	 * To avoid leaking information to userspace, we must always
> -	 * reject kernel samples when exclude_kernel is set.
> -	 */
> -	if (event->attr.exclude_kernel && !user_mode(regs))
> -		return false;
> -
> -	return true;
> -}
> -
>  /*
>   * Generic event overflow handling, sampling.
>   */
> @@ -7352,12 +7358,6 @@ static int __perf_event_overflow(struct perf_event *event,
>  	ret = __perf_event_account_interrupt(event, throttle);
>
>  	/*
> -	 * For security, drop the skid kernel samples if necessary.
> -	 */
> -	if (!sample_is_allowed(event, regs))
> -		return ret;
> -
> -	/*
>  	 * XXX event_limit might not quite work as expected on inherited
>  	 * events
>  	 */
> @@ -7372,6 +7372,10 @@ static int __perf_event_overflow(struct perf_event *event,
>
>  	READ_ONCE(event->overflow_handler)(event, data, regs);
>
> +	/*
> +	 * We must generate a wakeup regardless of whether we actually
> +	 * generated a sample. This is relied upon by rr.
> +	 */
>  	if (*perf_event_fasync(event) && event->pending_kill) {
>  		event->pending_wakeup = 1;
>  		irq_work_queue(&event->pending);
> --
> 1.9.1
>

I can confirm that with that fixed to compile, this patch fixes rr.

Thanks!

- Kyle
On Wed, Jun 28, 2017 at 09:48:27AM -0700, Kyle Huey wrote:
> On Wed, Jun 28, 2017 at 3:56 AM, Mark Rutland <mark.rutland@arm.com> wrote:
> > @@ -6101,6 +6116,12 @@ void perf_prepare_sample(struct perf_event_header *header,
> >  	struct perf_output_handle handle;
> >  	struct perf_event_header header;
> >
> > +	/*
> > +	 * For security, drop the skid kernel samples if necessary.
> > +	 */
> > +	if (!sample_is_allowed(event, regs))
> > +		return ret;
>
> Just a bare return here.

Ugh, yes. Sorry about that. I'll fix that up.

[...]

> I can confirm that with that fixed to compile, this patch fixes rr.

Thanks for giving this a go.

Having thought about this some more, I think Vince does make a good
point that throwing away samples is liable to break stuff, e.g. that
which only relies on (non-sensitive) samples.

It still seems wrong to make up data, though.

Maybe for exclude_kernel && !exclude_user events we can always generate
samples from the user regs, rather than the exception regs. That's going
to be closer to what the user wants, regardless. I'll take a look
tomorrow.

Thanks,
Mark.
On Wed, Jun 28, 2017 at 10:49 AM, Mark Rutland <mark.rutland@arm.com> wrote:
> On Wed, Jun 28, 2017 at 09:48:27AM -0700, Kyle Huey wrote:
>> On Wed, Jun 28, 2017 at 3:56 AM, Mark Rutland <mark.rutland@arm.com> wrote:
>> > @@ -6101,6 +6116,12 @@ void perf_prepare_sample(struct perf_event_header *header,
>> >  	struct perf_output_handle handle;
>> >  	struct perf_event_header header;
>> >
>> > +	/*
>> > +	 * For security, drop the skid kernel samples if necessary.
>> > +	 */
>> > +	if (!sample_is_allowed(event, regs))
>> > +		return ret;
>>
>> Just a bare return here.
>
> Ugh, yes. Sorry about that. I'll fix that up.
>
> [...]
>
>> I can confirm that with that fixed to compile, this patch fixes rr.
>
> Thanks for giving this a go.
>
> Having thought about this some more, I think Vince does make a good
> point that throwing away samples is liable to break stuff, e.g. that
> which only relies on (non-sensitive) samples.
>
> It still seems wrong to make up data, though.
>
> Maybe for exclude_kernel && !exclude_user events we can always generate
> samples from the user regs, rather than the exception regs. That's going
> to be closer to what the user wants, regardless. I'll take a look
> tomorrow.

I'm not very familiar with the kernel internals, but the reason I
didn't suggest this originally is it seems like it will be difficult
to determine what the "correct" userspace registers are. For example,
what happens if a performance counter is fixed to a given tid, the
interrupt fires during a context switch from that task to another that
is not being monitored, and the kernel is far enough along in the
context switch that the current task struct has been switched out?
Reporting the new task's registers seems as bad as reporting the
kernel's registers. But maybe this is easier than I imagine for
whatever reason.

Something to think about.

- Kyle
On 6/29/2017 6:55 AM, Kyle Huey wrote: > On Wed, Jun 28, 2017 at 10:49 AM, Mark Rutland <mark.rutland@arm.com> wrote: >> On Wed, Jun 28, 2017 at 09:48:27AM -0700, Kyle Huey wrote: >>> On Wed, Jun 28, 2017 at 3:56 AM, Mark Rutland <mark.rutland@arm.com> wrote: >>>> @@ -6101,6 +6116,12 @@ void perf_prepare_sample(struct perf_event_header *header, >>>> struct perf_output_handle handle; >>>> struct perf_event_header header; >>>> >>>> + /* >>>> + * For security, drop the skid kernel samples if necessary. >>>> + */ >>>> + if (!sample_is_allowed(event, regs)) >>>> + return ret; >>> Just a bare return here. >> Ugh, yes. Sorry about that. I'll fix that up. >> >> [...] >> >>> I can confirm that with that fixed to compile, this patch fixes rr. >> Thanks for giving this a go. >> >> Having thought about this some more, I think Vince does make a good >> point that throwing away samples is liable to break stuff, e.g. that >> which only relies on (non-sensitive) samples. >> >> It still seems wrong to make up data, though. >> >> Maybe for exclude_kernel && !exclude_user events we can always generate >> samples from the user regs, rather than the exception regs. That's going >> to be closer to what the user wants, regardless. I'll take a look >> tomorrow. > I'm not very familiar with the kernel internals, but the reason I > didn't suggest this originally is it seems like it will be difficult > to determine what the "correct" userspace registers are. For example, > what happens if a performance counter is fixed to a given tid, the > interrupt fires during a context switch from that task to another that > is not being monitored, and the kernel is far enough along in the > context switch that the current task struct has been switched out? > Reporting the new task's registers seems as bad as reporting the > kernel's registers. But maybe this is easier than I imagine for > whatever reason. > > Something to think about. > > - Kyle Yes, I think so. 
The skid interrupt may be triggered in the wrong context and report wrong
information (e.g. wrong regs) to userspace. So that's why I think the
*skid* interrupt had better be dropped.

Thanks,
Jin Yao
* Mark Rutland <mark.rutland@arm.com> wrote:
> It still seems wrong to make up data, though.
So what we have here is a hardware quirk: we asked for user-space samples, but
didn't get them and we cannot expose the kernel-internal address.
The question is, how do we handle the hardware quirk. Since we cannot fix the
hardware on existing systems there's really just two choices:
- Lose the sample (and signal it as a lost sample)
- Keep the sample but change the sensitive kernel-internal address to something
that is not sensitive: 0 or -1 works, but we could perhaps also return a
well-known user-space address such as the vDSO syscall trampoline or such?
there's no other option really.
I'd lean towards Vince's take: losing samples is more surprising than getting the
occasional sample with some sanitized data in it.
If we make the artificial data still a meaningful user-space address, related to
kernel entries, then it might even be a bonus, as users would learn to recognize
it as: 'oh, skid artifact, I know about that'.
Thanks,
Ingo
Hi Folks, On 28.06.2017 16:07, Mark Rutland wrote: > On Wed, Jun 28, 2017 at 08:40:30AM -0400, Vince Weaver wrote: >> On Wed, 28 Jun 2017, Mark Rutland wrote: >> >>> On Wed, Jun 28, 2017 at 11:12:48AM +0100, Mark Rutland wrote: >> >>> Instead of bailing out early in perf_event_overflow, we can bail prior >>> to performing the actual sampling in __perf_event_output(). This avoids >>> the information leak, but preserves the generation of the signal. >>> >>> Since we don't place any sample data into the ring buffer, the signal is >>> arguably spurious. However, a userspace ringbuffer consumer can already >>> consume data prior to taking the associated signals, and therefore must >>> handle spurious signals to operate correctly. Thus, this signal >>> shouldn't be harmful. >> >> this could still break some of my perf_event validation tests. >> >> Ones that set up a sampling event for every 1M instructions, run for 100M >> instructions, and expect there to be 100 samples received. > > Is that test reliable today? > > I'd expect that at least on ARM it's not, given that events can be > counted imprecisely, and mode filters can be applied imprecisely. So you > might get fewer (or more) samples. I'd imagine similar is true on other > archtiectures. > > If sampling took long enough, the existing ratelimiting could come into > effect, too. > > Surely that already has some error margin? FYI. From my recent experience and observation (on Intel Xeon Phi) wakeup_events_overflow and overflow_poll tests may fail if /proc/sys/kernel/perf_event_max_sample_rate is low enough: # cat /proc/sys/kernel/perf_event_max_sample_rate 6000 # abudanko/perf_event_tests/tests/overflow/wakeup_events_overflow This tests wakeup event overflows. Testing with wakeup_events=1. Counts, using mmap buffer 0x7f707b0dc000 POLL_IN : 10 POLL_OUT: 0 POLL_MSG: 0 POLL_ERR: 0 POLL_PRI: 0 POLL_HUP: 0 UNKNOWN : 0 Testing wakeup events overflow... 
PASSED # abudanko/perf_event_tests/tests/overflow/overflow_poll This tests using poll() to catch overflow. Monitoring pid 131412 status 1407 Child has stopped due to signal 5 (Trace/breakpoint trap) Continuing child Returned HUP! Counts, using mmap buffer 0x7fb3bfe04000 POLL_IN : 10 POLL_OUT: 0 POLL_MSG: 0 POLL_ERR: 0 POLL_PRI: 0 POLL_HUP: 1 UNKNOWN : 0 Testing catching overflow with poll()... PASSED # echo 1000 > /proc/sys/kernel/perf_event_max_sample_rate # abudanko/perf_event_tests/tests/overflow/overflow_poll This tests using poll() to catch overflow. Monitoring pid 131551 status 1407 Child has stopped due to signal 5 (Trace/breakpoint trap) Continuing child Returned HUP! Counts, using mmap buffer 0x7f80532df000 POLL_IN : 9 POLL_OUT: 0 POLL_MSG: 0 POLL_ERR: 0 POLL_PRI: 0 POLL_HUP: 1 UNKNOWN : 0 Unexpected POLL_IN interrupt. Testing catching overflow with poll()... FAILED [root@nntpdsd52-210 ~]# abudanko/perf_event_tests/tests/overflow/overflow_poll This tests using poll() to catch overflow. Monitoring pid 131553 status 1407 Child has stopped due to signal 5 (Trace/breakpoint trap) Continuing child Returned HUP! Counts, using mmap buffer 0x7f650952c000 POLL_IN : 9 POLL_OUT: 0 POLL_MSG: 0 POLL_ERR: 0 POLL_PRI: 0 POLL_HUP: 1 UNKNOWN : 0 Unexpected POLL_IN interrupt. Testing catching overflow with poll()... FAILED > >> If we're so worried about info leakage, can't we just zero-out the problem >> address (or randomize the kernel address) rather than just pretending the >> interrupt didn't happen? > > Making up zeroed or randomized data is going to confuse users. I can't > imagine that real users are going to want bogus samples that they have > to identify (somehow) in order to skip when processing the data. > > I can see merit in signalling "lost" samples to userspace, so long as > they're easily distinguished from real samples. > > One option is to fake up a sample using the user regs regardless, but > that's both fragile and surprising in other cases. 
> > Thanks, > Mark. > Thanks, Alexey
On 29.06.2017 11:13, Alexey Budankov wrote: > Hi Folks, > > On 28.06.2017 16:07, Mark Rutland wrote: >> On Wed, Jun 28, 2017 at 08:40:30AM -0400, Vince Weaver wrote: >>> On Wed, 28 Jun 2017, Mark Rutland wrote: >>> >>>> On Wed, Jun 28, 2017 at 11:12:48AM +0100, Mark Rutland wrote: >>> >>>> Instead of bailing out early in perf_event_overflow, we can bail prior >>>> to performing the actual sampling in __perf_event_output(). This avoids >>>> the information leak, but preserves the generation of the signal. >>>> >>>> Since we don't place any sample data into the ring buffer, the signal is >>>> arguably spurious. However, a userspace ringbuffer consumer can already >>>> consume data prior to taking the associated signals, and therefore must >>>> handle spurious signals to operate correctly. Thus, this signal >>>> shouldn't be harmful. >>> >>> this could still break some of my perf_event validation tests. >>> >>> Ones that set up a sampling event for every 1M instructions, run for 100M >>> instructions, and expect there to be 100 samples received. >> >> Is that test reliable today? >> >> I'd expect that at least on ARM it's not, given that events can be >> counted imprecisely, and mode filters can be applied imprecisely. So you >> might get fewer (or more) samples. I'd imagine similar is true on other >> archtiectures. >> >> If sampling took long enough, the existing ratelimiting could come into >> effect, too. >> >> Surely that already has some error margin? > > FYI. > >>From my recent experience and observation (on Intel Xeon Phi) > wakeup_events_overflow and overflow_poll tests may fail if > /proc/sys/kernel/perf_event_max_sample_rate is low enough: > > # cat /proc/sys/kernel/perf_event_max_sample_rate > 6000 > # abudanko/perf_event_tests/tests/overflow/wakeup_events_overflow > This tests wakeup event overflows. > Testing with wakeup_events=1. 
> Counts, using mmap buffer 0x7f707b0dc000 > POLL_IN : 10 > POLL_OUT: 0 > POLL_MSG: 0 > POLL_ERR: 0 > POLL_PRI: 0 > POLL_HUP: 0 > UNKNOWN : 0 > Testing wakeup events overflow... PASSED > # abudanko/perf_event_tests/tests/overflow/overflow_poll > This tests using poll() to catch overflow. > Monitoring pid 131412 status 1407 > Child has stopped due to signal 5 (Trace/breakpoint trap) > Continuing child > Returned HUP! > Counts, using mmap buffer 0x7fb3bfe04000 > POLL_IN : 10 > POLL_OUT: 0 > POLL_MSG: 0 > POLL_ERR: 0 > POLL_PRI: 0 > POLL_HUP: 1 > UNKNOWN : 0 > Testing catching overflow with poll()... PASSED > > # echo 1000 > /proc/sys/kernel/perf_event_max_sample_rate > # abudanko/perf_event_tests/tests/overflow/overflow_poll > This tests using poll() to catch overflow. > Monitoring pid 131551 status 1407 > Child has stopped due to signal 5 (Trace/breakpoint trap) > Continuing child > Returned HUP! > Counts, using mmap buffer 0x7f80532df000 > POLL_IN : 9 > POLL_OUT: 0 > POLL_MSG: 0 > POLL_ERR: 0 > POLL_PRI: 0 > POLL_HUP: 1 > UNKNOWN : 0 > Unexpected POLL_IN interrupt. > Testing catching overflow with poll()... FAILED > [root@nntpdsd52-210 ~]# abudanko/perf_event_tests/tests/overflow/overflow_poll > This tests using poll() to catch overflow. > Monitoring pid 131553 status 1407 > Child has stopped due to signal 5 (Trace/breakpoint trap) > Continuing child > Returned HUP! > Counts, using mmap buffer 0x7f650952c000 > POLL_IN : 9 > POLL_OUT: 0 > POLL_MSG: 0 > POLL_ERR: 0 > POLL_PRI: 0 > POLL_HUP: 1 > UNKNOWN : 0 > Unexpected POLL_IN interrupt. > Testing catching overflow with poll()... FAILED More of the other test: # echo 1000 > /proc/sys/kernel/perf_event_max_sample_rate [ ~]# abudanko/perf_event_tests/tests/overflow/wakeup_events_overflow This tests wakeup event overflows. Testing with wakeup_events=1. Counts, using mmap buffer 0x7fe65deff000 POLL_IN : 4 POLL_OUT: 0 POLL_MSG: 0 POLL_ERR: 0 POLL_PRI: 0 POLL_HUP: 0 UNKNOWN : 0 POLL_IN value 4, expected 10. 
Testing wakeup events overflow... FAILED [ ~]# abudanko/perf_event_tests/tests/overflow/wakeup_events_overflow This tests wakeup event overflows. Testing with wakeup_events=1. Counts, using mmap buffer 0x7f818a732000 POLL_IN : 2 POLL_OUT: 0 POLL_MSG: 0 POLL_ERR: 0 POLL_PRI: 0 POLL_HUP: 0 UNKNOWN : 0 POLL_IN value 2, expected 10. Testing wakeup events overflow... FAILED [ ~]# abudanko/perf_event_tests/tests/overflow/wakeup_events_overflow This tests wakeup event overflows. Testing with wakeup_events=1. Counts, using mmap buffer 0x7fb3675c5000 POLL_IN : 4 POLL_OUT: 0 POLL_MSG: 0 POLL_ERR: 0 POLL_PRI: 0 POLL_HUP: 0 UNKNOWN : 0 POLL_IN value 4, expected 10. Testing wakeup events overflow... FAILED > >> >>> If we're so worried about info leakage, can't we just zero-out the problem >>> address (or randomize the kernel address) rather than just pretending the >>> interrupt didn't happen? >> >> Making up zeroed or randomized data is going to confuse users. I can't >> imagine that real users are going to want bogus samples that they have >> to identify (somehow) in order to skip when processing the data. >> >> I can see merit in signalling "lost" samples to userspace, so long as >> they're easily distinguished from real samples. >> >> One option is to fake up a sample using the user regs regardless, but >> that's both fragile and surprising in other cases. >> >> Thanks, >> Mark. >> > > Thanks, > Alexey > > >
On Wed, Jun 28, 2017 at 3:55 PM, Kyle Huey <me@kylehuey.com> wrote: > On Wed, Jun 28, 2017 at 10:49 AM, Mark Rutland <mark.rutland@arm.com> wrote: >> On Wed, Jun 28, 2017 at 09:48:27AM -0700, Kyle Huey wrote: >>> On Wed, Jun 28, 2017 at 3:56 AM, Mark Rutland <mark.rutland@arm.com> wrote: >>> > @@ -6101,6 +6116,12 @@ void perf_prepare_sample(struct perf_event_header *header, >>> > struct perf_output_handle handle; >>> > struct perf_event_header header; >>> > >>> > + /* >>> > + * For security, drop the skid kernel samples if necessary. >>> > + */ >>> > + if (!sample_is_allowed(event, regs)) >>> > + return ret; >>> >>> Just a bare return here. >> >> Ugh, yes. Sorry about that. I'll fix that up. >> >> [...] >> >>> I can confirm that with that fixed to compile, this patch fixes rr. >> >> Thanks for giving this a go. >> >> Having thought about this some more, I think Vince does make a good >> point that throwing away samples is liable to break stuff, e.g. that >> which only relies on (non-sensitive) samples. >> >> It still seems wrong to make up data, though. >> >> Maybe for exclude_kernel && !exclude_user events we can always generate >> samples from the user regs, rather than the exception regs. That's going >> to be closer to what the user wants, regardless. I'll take a look >> tomorrow. > > I'm not very familiar with the kernel internals, but the reason I > didn't suggest this originally is it seems like it will be difficult > to determine what the "correct" userspace registers are. For example, > what happens if a performance counter is fixed to a given tid, the > interrupt fires during a context switch from that task to another that > is not being monitored, and the kernel is far enough along in the > context switch that the current task struct has been switched out? > Reporting the new task's registers seems as bad as reporting the > kernel's registers. But maybe this is easier than I imagine for > whatever reason. > > Something to think about. 
>
> - Kyle

We've noticed that the regressing changeset made it into stable branches
that distros are shipping now[0], and we're starting to receive bug
reports from our users. We would really like to get this patch or
something similar into 4.12 and the next stable releases if at all
possible.

Thanks!

- Kyle

[0] Full list in https://mail.mozilla.org/pipermail/rr-dev/2017-June/000500.html
On Thu, Jun 29, 2017 at 10:12:33AM +0200, Ingo Molnar wrote:
>
> * Mark Rutland <mark.rutland@arm.com> wrote:
>
> > It still seems wrong to make up data, though.
>
> So what we have here is a hardware quirk: we asked for user-space samples, but
> didn't get them and we cannot expose the kernel-internal address.
>
> The question is, how do we handle the hardware quirk. Since we cannot fix the
> hardware on existing systems there's really just two choices:
>
>  - Lose the sample (and signal it as a lost sample)
>
>  - Keep the sample but change the sensitive kernel-internal address to something
>    that is not sensitive: 0 or -1 works, but we could perhaps also return a
>    well-known user-space address such as the vDSO syscall trampoline or such?
>
> there's no other option really.
>
> I'd lean towards Vince's take: losing samples is more surprising than getting the
> occasional sample with some sanitized data in it.
>
> If we make the artificial data still a meaningful user-space address, related to
> kernel entries, then it might even be a bonus, as users would learn to recognize
> it as: 'oh, skid artifact, I know about that'.

So while we could easily fake SAMPLE_IP to do as you suggest, other
entries might be much harder to fake. That said, I have no problems with
just 0 stuffing them.

The only real problem is determining how much to stuff I suppose.
On Tue, Jul 04, 2017 at 11:03:13AM +0200, Peter Zijlstra wrote:
> On Wed, Jun 28, 2017 at 03:55:07PM -0700, Kyle Huey wrote:
> >
> > > Having thought about this some more, I think Vince does make a good
> > > point that throwing away samples is liable to break stuff, e.g. that
> > > which only relies on (non-sensitive) samples.
> > >
> > > It still seems wrong to make up data, though.
>
> It is something we do in other places as well though. For example the
> printk() %pK thing fakes NULL pointers when kptr_restrict is set.

It looks like I'm outnumbered on that, then. :)

I'd still argue it's worth somehow indicating which samples were thrown
away, so that (updated) userspace can choose to ignore them, but I guess
that can come later.

> Faking data gets a wee bit tricky in how much data we need to clear
> through, its not only IP, pretty much everything we get from the
> interrupt context, like the branch stack and registers is also suspect.

Indeed. I'll take a run through __perf_event_output() and callees, and
see what we need to drop.

> > > Maybe for exclude_kernel && !exclude_user events we can always generate
> > > samples from the user regs, rather than the exception regs. That's going
> > > to be closer to what the user wants, regardless. I'll take a look
> > > tomorrow.
> >
> > I'm not very familiar with the kernel internals, but the reason I
> > didn't suggest this originally is it seems like it will be difficult
> > to determine what the "correct" userspace registers are. For example,
> > what happens if a performance counter is fixed to a given tid, the
> > interrupt fires during a context switch from that task to another that
> > is not being monitored, and the kernel is far enough along in the
> > context switch that the current task struct has been switched out?
> > Reporting the new task's registers seems as bad as reporting the
> > kernel's registers. But maybe this is easier than I imagine for
> > whatever reason.
>
> If the counter is fixed to a task then its scheduled along with the
> task. We'll schedule out the event before doing the actual task switch
> and switch in the new event after.
>
> That said, with a per-cpu event the TID sample value is indeed subject
> to skid like you describe.

For per-cpu events, does that matter? Those don't have TID filters in
the first place, no?

Thanks,
Mark.
On Tue, Jul 04, 2017 at 10:33:45AM +0100, Mark Rutland wrote:
> > That said, with a per-cpu event the TID sample value is indeed subject
> > to skid like you describe.
>
> For per-cpu events, does that matter? Those don't have TID filters in
> the first place, no?

eBPF can do all sorts I suppose. But yes, at some point you just have to
give up, there's only so much one can do.
* Peter Zijlstra <peterz@infradead.org> wrote:

> On Thu, Jun 29, 2017 at 10:12:33AM +0200, Ingo Molnar wrote:
> >
> > * Mark Rutland <mark.rutland@arm.com> wrote:
> >
> > > It still seems wrong to make up data, though.
> >
> > So what we have here is a hardware quirk: we asked for user-space samples, but
> > didn't get them and we cannot expose the kernel-internal address.
> >
> > The question is, how do we handle the hardware quirk. Since we cannot fix the
> > hardware on existing systems there's really just two choices:
> >
> >  - Lose the sample (and signal it as a lost sample)
> >
> >  - Keep the sample but change the sensitive kernel-internal address to something
> >    that is not sensitive: 0 or -1 works, but we could perhaps also return a
> >    well-known user-space address such as the vDSO syscall trampoline or such?
> >
> > there's no other option really.
> >
> > I'd lean towards Vince's take: losing samples is more surprising than getting the
> > occasional sample with some sanitized data in it.
> >
> > If we make the artificial data still a meaningful user-space address, related to
> > kernel entries, then it might even be a bonus, as users would learn to recognize
> > it as: 'oh, skid artifact, I know about that'.
>
> So while we could easily fake SAMPLE_IP to do as you suggest, other
> entries might be much harder to fake. That said, I have no problems with
> just 0 stuffing them.
>
> The only real problem is determining how much to stuff I suppose.

I think the RIP is the most important one to fix up in an informative
fashion (instead of just zeroing it out), so that mainstream users of
'perf top' or 'perf report' have a chance to see that certain entries
have this skid artifact.

The other registers should be zeroed out once we stop trusting a sample.

Thanks,

Ingo
On Tue, Jul 4, 2017 at 3:21 AM, Mark Rutland <mark.rutland@arm.com> wrote:
> Should any of those be moved into the "should be dropped" pile?
Why not be conservative and clear every sample you're not sure about?
We'd appreciate a fix sooner rather than later here, since rr is
currently broken on every stable Linux kernel and our attempts to
implement a workaround have failed.
(We have separate "interrupt" and "measure" counters, and I thought we
might work around this regression by programming the "interrupt"
counter to count kernel events as well as user events (interrupting
early is OK), but that caused our (completely separate) "measure"
counter to report off-by-one results (!), which seems to be a
different bug present on a range of older kernels.)
Thanks,
Rob
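The two-counter arrangement described above can be sketched roughly as follows. This is a guess at the shape of the setup, not rr's actual code; the event choice, period, and function names are all illustrative.

```c
#include <linux/perf_event.h>
#include <string.h>

/* Sketch of a separate "interrupt" counter (samples and raises the PMI)
 * and "measure" counter (read only for its count).  The attempted
 * workaround was to let the interrupt counter count kernel events too,
 * so a skidded PMI is never rejected -- interrupting early is OK. */
static void setup_attrs(struct perf_event_attr *interrupt_attr,
                        struct perf_event_attr *measure_attr)
{
    memset(interrupt_attr, 0, sizeof(*interrupt_attr));
    interrupt_attr->size = sizeof(*interrupt_attr);
    interrupt_attr->type = PERF_TYPE_HARDWARE;
    interrupt_attr->config = PERF_COUNT_HW_INSTRUCTIONS; /* illustrative */
    interrupt_attr->sample_period = 1000000;   /* PMI every ~1M events */
    interrupt_attr->exclude_kernel = 0;        /* kernel too: early PMI is fine */
    interrupt_attr->exclude_hv = 1;

    memset(measure_attr, 0, sizeof(*measure_attr));
    measure_attr->size = sizeof(*measure_attr);
    measure_attr->type = PERF_TYPE_HARDWARE;
    measure_attr->config = PERF_COUNT_HW_INSTRUCTIONS;
    measure_attr->exclude_kernel = 1;          /* measurement stays user-only */
    measure_attr->exclude_hv = 1;
}
```

Each attr would then be passed to perf_event_open(2); the surprise reported above is that widening the interrupt counter's filter perturbed the separate measure counter's result.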
On Wed, Jul 5, 2017 at 10:07 PM, Robert O'Callahan <robert@ocallahan.org> wrote:
> On Tue, Jul 4, 2017 at 3:21 AM, Mark Rutland <mark.rutland@arm.com> wrote:
>> Should any of those be moved into the "should be dropped" pile?
>
> Why not be conservative and clear every sample you're not sure about?
>
> We'd appreciate a fix sooner rather than later here, since rr is
> currently broken on every stable Linux kernel and our attempts to
> implement a workaround have failed.
>
> (We have separate "interrupt" and "measure" counters, and I thought we
> might work around this regression by programming the "interrupt"
> counter to count kernel events as well as user events (interrupting
> early is OK), but that caused our (completely separate) "measure"
> counter to report off-by-one results (!), which seems to be a
> different bug present on a range of older kernels.)

This seems to have stalled out here unfortunately.

Can we get a consensus (from ingo or peterz?) on Mark's question? Or,
alternatively, can we move the patch at the top of this thread forward
on the stable branches until we do reach an answer to that question?

We've abandoned hope of working around this problem in rr and are
currently broken for all of our users with an up-to-date kernel, so
the situation for us is rather dire at the moment I'm afraid.

Thanks,

- Kyle
* Kyle Huey <me@kylehuey.com> wrote:

> This seems to have stalled out here unfortunately.
>
> Can we get a consensus (from ingo or peterz?) on Mark's question? Or,
> alternatively, can we move the patch at the top of this thread forward
> on the stable branches until we do reach an answer to that question?
>
> We've abandoned hope of working around this problem in rr and are
> currently broken for all of our users with an up-to-date kernel, so
> the situation for us is rather dire at the moment I'm afraid.

Sorry about that - I've queued up a revert for the original commit and will
send the fix to Linus later today. I've added a -stable tag as well so it
can be forwarded to Greg the moment it hits upstream.

We should do the original fix as well, but in a version that does not skip
the sample but zeroes out the RIP and registers (or sets them all to -1LL) -
and also covers other possible places where skid-RIP is exposed, such as LBR.

Thanks,

	Ingo
On 7/11/2017 5:03 PM, Ingo Molnar wrote:
> Sorry about that - I've queued up a revert for the original commit and will
> send the fix to Linus later today. I've added a -stable tag as well so it
> can be forwarded to Greg the moment it hits upstream.
>
> We should do the original fix as well, but in a version that does not skip
> the sample but zeroes out the RIP and registers (or sets them all to -1LL)
> - and also covers other possible places where skid-RIP is exposed, such as
> LBR.
>
> Thanks,
>
> 	Ingo

Could we provide two options in user space when enabling event sampling?

One option would be for use cases like the rr debugger, which only care
about the PMI itself and not about the skid; for them the skid samples
wouldn't need to be dropped. The other option would be for use cases which
need accurate sample data; for those, the skid samples would be dropped.

I would suggest letting user space make the decision about which option to
choose.

Thanks,

Jin Yao
On Tue, Jul 11, 2017 at 2:03 AM, Ingo Molnar <mingo@kernel.org> wrote:
> Sorry about that - I've queued up a revert for the original commit and will
> send the fix to Linus later today. I've added a -stable tag as well so it
> can be forwarded to Greg the moment it hits upstream.

Great, thank you.

- Kyle
* Jin, Yao <yao.jin@linux.intel.com> wrote:

> Could we provide two options in user space when enabling event sampling?
>
> One option would be for use cases like the rr debugger, which only care
> about the PMI itself and not about the skid; for them the skid samples
> wouldn't need to be dropped.

Since it's an information leak, this is not something we want to expose as
an interface. It's also a very ugly interface: why not just *clear* the
sample, instead of dropping it?

The hardware messed up and gave us something we specifically did not permit
user-space to collect. We have to fix the bad effects to the best extent we
can, and not based on some knob.

Thanks,

	Ingo
On Tue, Jul 11, 2017 at 8:32 AM, Kyle Huey <me@kylehuey.com> wrote:
> On Tue, Jul 11, 2017 at 2:03 AM, Ingo Molnar <mingo@kernel.org> wrote:
>> Sorry about that - I've queued up a revert for the original commit and
>> will send the fix to Linus later today. I've added a -stable tag as well
>> so it can be forwarded to Greg the moment it hits upstream.
>
> Great, thank you.
>
> - Kyle

Hi again,

I saw that the revert landed on perf/urgent but it doesn't look like it got
pulled by Linus in time for 4.13-rc1.

Consider this a gentle poke to please get this merged :)

- Kyle
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 6c4e523..6b263f3 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6090,6 +6090,21 @@ void perf_prepare_sample(struct perf_event_header *header,
 	}
 }
 
+static bool sample_is_allowed(struct perf_event *event, struct pt_regs *regs)
+{
+	/*
+	 * Due to interrupt latency (AKA "skid"), we may enter the
+	 * kernel before taking an overflow, even if the PMU is only
+	 * counting user events.
+	 * To avoid leaking information to userspace, we must always
+	 * reject kernel samples when exclude_kernel is set.
+	 */
+	if (event->attr.exclude_kernel && !user_mode(regs))
+		return false;
+
+	return true;
+}
+
 static void __always_inline
 __perf_event_output(struct perf_event *event,
 		    struct perf_sample_data *data,
@@ -6101,6 +6116,12 @@ void perf_prepare_sample(struct perf_event_header *header,
 	struct perf_output_handle handle;
 	struct perf_event_header header;
 
+	/*
+	 * For security, drop the skid kernel samples if necessary.
+	 */
+	if (!sample_is_allowed(event, regs))
+		return ret;
+
 	/* protect the callchain buffers */
 	rcu_read_lock();
 
@@ -7316,21 +7337,6 @@ int perf_event_account_interrupt(struct perf_event *event)
 	return __perf_event_account_interrupt(event, 1);
 }
 
-static bool sample_is_allowed(struct perf_event *event, struct pt_regs *regs)
-{
-	/*
-	 * Due to interrupt latency (AKA "skid"), we may enter the
-	 * kernel before taking an overflow, even if the PMU is only
-	 * counting user events.
-	 * To avoid leaking information to userspace, we must always
-	 * reject kernel samples when exclude_kernel is set.
-	 */
-	if (event->attr.exclude_kernel && !user_mode(regs))
-		return false;
-
-	return true;
-}
-
 /*
  * Generic event overflow handling, sampling.
  */
@@ -7352,12 +7358,6 @@ static int __perf_event_overflow(struct perf_event *event,
 	ret = __perf_event_account_interrupt(event, throttle);
 
 	/*
-	 * For security, drop the skid kernel samples if necessary.
-	 */
-	if (!sample_is_allowed(event, regs))
-		return ret;
-
-	/*
 	 * XXX event_limit might not quite work as expected on inherited
 	 * events
 	 */
@@ -7372,6 +7372,10 @@ static int __perf_event_overflow(struct perf_event *event,
 
 	READ_ONCE(event->overflow_handler)(event, data, regs);
 
+	/*
+	 * We must generate a wakeup regardless of whether we actually
+	 * generated a sample. This is relied upon by rr.
+	 */
 	if (*perf_event_fasync(event) && event->pending_kill) {
 		event->pending_wakeup = 1;
 		irq_work_queue(&event->pending);
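The last hunk preserves a wakeup even when no sample is written, so a ring-buffer consumer has to treat "signal, but no new data" as normal. A minimal sketch of that check follows; the field names come from the perf_event_open(2) mmap layout, while the helper name and the use of a GCC atomic builtin are our own choices.

```c
#include <linux/perf_event.h>
#include <stdbool.h>
#include <stdint.h>

/* Check the mmap'd metadata page for unread sample data.  If the head
 * hasn't advanced past our tail, the wakeup was spurious (e.g. a skid
 * sample was suppressed) and the consumer should simply return. */
static bool ring_has_data(struct perf_event_mmap_page *meta)
{
    uint64_t head = __atomic_load_n(&meta->data_head, __ATOMIC_ACQUIRE);
    return head != meta->data_tail;
}
```

A signal handler built on this would call ring_has_data() first and bail out quietly on false, which is exactly the tolerance the commentary earlier in the thread says userspace must already have.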