diff mbox series

[RFC] perf/core: don't sample kernel regs upon skid

Message ID 20180702151250.14536-1-mark.rutland@arm.com
State New
Headers show
Series [RFC] perf/core: don't sample kernel regs upon skid | expand

Commit Message

Mark Rutland July 2, 2018, 3:12 p.m. UTC
Users can request that general purpose registers, instruction pointer,
etc, are sampled when a perf event counter overflows. To try to avoid
this resulting in kernel state being leaked, unprivileged users are
usually forbidden from opening events which count while the kernel is
running.

Unfortunately, this is not sufficient to avoid leading kernel state.

For various reasons, there can be a delay between the overflow occurring
and the resulting overflow exception (e.g. an NMI) being taken. During
this window, other instructions may be executed, resulting in skid.

This skid means that a userspace-only event overflowing may result in an
exception being taken *after* entry to the kernel, allowing kernel
registers to be sampled. Depending on the amount of skid, this may only
leak the PC (breaking KASLR), or it may leak secrets which are currently
live in GPRs.

Let's avoid this by only sampling from the user registers when an event
is supposed to exclude the kernel, providing the illusion that the
overflow exception is taken from userspace.

We also have similar cases when sampling a guest, where we get the host
regs in some cases. It's not entirely clear to me how we should handle
these.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jin Yao <yao.jin@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
---
 kernel/events/core.c | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

-- 
2.11.0

Comments

Peter Zijlstra July 2, 2018, 3:46 p.m. UTC | #1
On Mon, Jul 02, 2018 at 04:12:50PM +0100, Mark Rutland wrote:
> Users can request that general purpose registers, instruction pointer,

> etc, are sampled when a perf event counter overflows. To try to avoid

> this resulting in kernel state being leaked, unprivileged users are

> usually forbidden from opening events which count while the kernel is

> running.

> 

> Unfortunately, this is not sufficient to avoid leading kernel state.


'leaking' surely.

> 

> For various reasons, there can be a delay between the overflow occurring

> and the resulting overflow exception (e.g. an NMI) being taken. During

> this window, other instructions may be executed, resulting in skid.

> 

> This skid means that a userspace-only event overflowing may result in an

> exception being taken *after* entry to the kernel, allowing kernel

> registers to be sampled. Depending on the amount of skid, this may only

> leak the PC (breaking KASLR), or it may leak secrets which are currently

> live in GPRs.

> 

> Let's avoid this by only sampling from the user registers when an event

> is supposed to exclude the kernel, providing the illusion that the

> overflow exception is taken from userspace.

> 

> We also have similar cases when sampling a guest, where we get the host

> regs in some cases. It's not entirely clear to me how we should handle

> these.


Would not a similar:

	if ((event->attr.exclude_hv || event->attr.exclude_host) /* WTF both !? */ &&
	    perf_guest_cbs && !perf_guest_cbs->is_in_guest())
		return perf_guest_cbs->guest_pt_regs();

work there? Of course, perf_guest_info_callbacks is currently lacking
that guest_pt_regs() thingy..

> 

> Signed-off-by: Mark Rutland <mark.rutland@arm.com>

> Cc: Ingo Molnar <mingo@redhat.com>

> Cc: Jin Yao <yao.jin@linux.intel.com>

> Cc: Peter Zijlstra <peterz@infradead.org>

> ---

>  kernel/events/core.c | 28 ++++++++++++++++++++++++++++

>  1 file changed, 28 insertions(+)

> 

> diff --git a/kernel/events/core.c b/kernel/events/core.c

> index 8f0434a9951a..2ab2548b2e66 100644

> --- a/kernel/events/core.c

> +++ b/kernel/events/core.c

> @@ -6361,6 +6361,32 @@ perf_callchain(struct perf_event *event, struct pt_regs *regs)

>  	return callchain ?: &__empty_callchain;

>  }

>  

> +static struct pt_regs *perf_get_sample_regs(struct perf_event *event,

> +					    struct pt_regs *regs)

> +{

> +	/*

> +	 * Due to interrupt latency (AKA "skid"), we may enter the kernel

> +	 * before taking an overflow, even if the PMU is only counting user

> +	 * events.

> +	 *

> +	 * If we're not counting kernel events, always use the user regs when

> +	 * sampling.

> +	 *

> +	 * TODO: what do we do about sampling a guest's registers? The IP is

> +	 * special-cased, but for the rest of the regs they'll get the

> +	 * user/kernel regs depending on whether exclude_kernel is set, which

> +	 * is nonsensical.

> +	 *

> +	 * We can't get at the full set of regs in all cases (e.g. Xen's PV PMU

> +	 * can't provide the GPRs), so should we just zero the GPRs when in a

> +	 * guest? Or skip outputting the regs in perf_output_sample?


Seems daft Xen cannot provide registers; why is that? Boris?

> +	 */

> +	if (event->attr.exclude_kernel && !user_mode(regs))

> +		return task_pt_regs(current);

> +

> +	return regs;

> +}

> +

>  void perf_prepare_sample(struct perf_event_header *header,

>  			 struct perf_sample_data *data,

>  			 struct perf_event *event,

> @@ -6368,6 +6394,8 @@ void perf_prepare_sample(struct perf_event_header *header,

>  {

>  	u64 sample_type = event->attr.sample_type;

>  

> +	regs = perf_get_sample_regs(event, regs);

> +

>  	header->type = PERF_RECORD_SAMPLE;

>  	header->size = sizeof(*header) + event->header_size;


In any case ACK for this thing.
Mark Rutland July 2, 2018, 4:02 p.m. UTC | #2
On Mon, Jul 02, 2018 at 05:46:55PM +0200, Peter Zijlstra wrote:
> On Mon, Jul 02, 2018 at 04:12:50PM +0100, Mark Rutland wrote:

> > Users can request that general purpose registers, instruction pointer,

> > etc, are sampled when a perf event counter overflows. To try to avoid

> > this resulting in kernel state being leaked, unprivileged users are

> > usually forbidden from opening events which count while the kernel is

> > running.

> > 

> > Unfortunately, this is not sufficient to avoid leading kernel state.

> 

> 'leaking' surely.

> 

> > 

> > For various reasons, there can be a delay between the overflow occurring

> > and the resulting overflow exception (e.g. an NMI) being taken. During

> > this window, other instructions may be executed, resulting in skid.

> > 

> > This skid means that a userspace-only event overflowing may result in an

> > exception being taken *after* entry to the kernel, allowing kernel

> > registers to be sampled. Depending on the amount of skid, this may only

> > leak the PC (breaking KASLR), or it may leak secrets which are currently

> > live in GPRs.

> > 

> > Let's avoid this by only sampling from the user registers when an event

> > is supposed to exclude the kernel, providing the illusion that the

> > overflow exception is taken from userspace.

> > 

> > We also have similar cases when sampling a guest, where we get the host

> > regs in some cases. It's not entirely clear to me how we should handle

> > these.

> 

> Would not a similar:

> 

> 	if ((event->attr.exclude_hv || event->attr.exclude_host) /* WTF both !? */ &&

> 	    perf_guest_cbs && !perf_guest_cbs->is_in_guest())

> 		return perf_guest_cbs->guest_pt_regs();

>

> work there? 


Mostly. It's fun if the user also passed exclude_guest -- I have no idea
what should be sampled there, if anything.

It's easy to get stuck in a rabbit hole looking at this.

> Of course, perf_guest_info_callbacks is currently lacking that

> guest_pt_regs() thingy..


Yeah; I started looking at implementing it, but ran away since it wasn't
clear to me how to build that on most architectures.

> > Signed-off-by: Mark Rutland <mark.rutland@arm.com>

> > Cc: Ingo Molnar <mingo@redhat.com>

> > Cc: Jin Yao <yao.jin@linux.intel.com>

> > Cc: Peter Zijlstra <peterz@infradead.org>

> > ---

> >  kernel/events/core.c | 28 ++++++++++++++++++++++++++++

> >  1 file changed, 28 insertions(+)

> > 

> > diff --git a/kernel/events/core.c b/kernel/events/core.c

> > index 8f0434a9951a..2ab2548b2e66 100644

> > --- a/kernel/events/core.c

> > +++ b/kernel/events/core.c

> > @@ -6361,6 +6361,32 @@ perf_callchain(struct perf_event *event, struct pt_regs *regs)

> >  	return callchain ?: &__empty_callchain;

> >  }

> >  

> > +static struct pt_regs *perf_get_sample_regs(struct perf_event *event,

> > +					    struct pt_regs *regs)

> > +{

> > +	/*

> > +	 * Due to interrupt latency (AKA "skid"), we may enter the kernel

> > +	 * before taking an overflow, even if the PMU is only counting user

> > +	 * events.

> > +	 *

> > +	 * If we're not counting kernel events, always use the user regs when

> > +	 * sampling.

> > +	 *

> > +	 * TODO: what do we do about sampling a guest's registers? The IP is

> > +	 * special-cased, but for the rest of the regs they'll get the

> > +	 * user/kernel regs depending on whether exclude_kernel is set, which

> > +	 * is nonsensical.

> > +	 *

> > +	 * We can't get at the full set of regs in all cases (e.g. Xen's PV PMU

> > +	 * can't provide the GPRs), so should we just zero the GPRs when in a

> > +	 * guest? Or skip outputting the regs in perf_output_sample?

> 

> Seems daft Xen cannot provide registers; why is that? Boris?


The xen_pmu_regs structure simply doesn't have them, so I assume there's
no API to get them.

Given we don't currently sample the guest regs, I'd be tempted to just
zero them for now, or skip the sample at output time (if that doesn't
break some other case).

Thanks,
Mark.
Boris Ostrovsky July 9, 2018, 10:42 p.m. UTC | #3
On 07/02/2018 12:02 PM, Mark Rutland wrote:
> On Mon, Jul 02, 2018 at 05:46:55PM +0200, Peter Zijlstra wrote:

>> On Mon, Jul 02, 2018 at 04:12:50PM +0100, Mark Rutland wrote:

>>> Users can request that general purpose registers, instruction pointer,

>>> etc, are sampled when a perf event counter overflows. To try to avoid

>>> this resulting in kernel state being leaked, unprivileged users are

>>> usually forbidden from opening events which count while the kernel is

>>> running.

>>>

>>> Unfortunately, this is not sufficient to avoid leading kernel state.

>> 'leaking' surely.

>>

>>> For various reasons, there can be a delay between the overflow occurring

>>> and the resulting overflow exception (e.g. an NMI) being taken. During

>>> this window, other instructions may be executed, resulting in skid.

>>>

>>> This skid means that a userspace-only event overflowing may result in an

>>> exception being taken *after* entry to the kernel, allowing kernel

>>> registers to be sampled. Depending on the amount of skid, this may only

>>> leak the PC (breaking KASLR), or it may leak secrets which are currently

>>> live in GPRs.

>>>

>>> Let's avoid this by only sampling from the user registers when an event

>>> is supposed to exclude the kernel, providing the illusion that the

>>> overflow exception is taken from userspace.

>>>

>>> We also have similar cases when sampling a guest, where we get the host

>>> regs in some cases. It's not entirely clear to me how we should handle

>>> these.

>> Would not a similar:

>>

>> 	if ((event->attr.exclude_hv || event->attr.exclude_host) /* WTF both !? */ &&

>> 	    perf_guest_cbs && !perf_guest_cbs->is_in_guest())

>> 		return perf_guest_cbs->guest_pt_regs();

>>

>> work there? 

> Mostly. It's fun if the user also passed exclude_guest -- I have no idea

> what should be sampled there, if anything.

>

> It's easy to get stuck in a rabbit hole looking at this.

>

>> Of course, perf_guest_info_callbacks is currently lacking that

>> guest_pt_regs() thingy..

> Yeah; I started looking at implementing it, but ran away since it wasn't

> clear to me how to build that on most architectures.

>

>>> Signed-off-by: Mark Rutland <mark.rutland@arm.com>

>>> Cc: Ingo Molnar <mingo@redhat.com>

>>> Cc: Jin Yao <yao.jin@linux.intel.com>

>>> Cc: Peter Zijlstra <peterz@infradead.org>

>>> ---

>>>  kernel/events/core.c | 28 ++++++++++++++++++++++++++++

>>>  1 file changed, 28 insertions(+)

>>>

>>> diff --git a/kernel/events/core.c b/kernel/events/core.c

>>> index 8f0434a9951a..2ab2548b2e66 100644

>>> --- a/kernel/events/core.c

>>> +++ b/kernel/events/core.c

>>> @@ -6361,6 +6361,32 @@ perf_callchain(struct perf_event *event, struct pt_regs *regs)

>>>  	return callchain ?: &__empty_callchain;

>>>  }

>>>  

>>> +static struct pt_regs *perf_get_sample_regs(struct perf_event *event,

>>> +					    struct pt_regs *regs)

>>> +{

>>> +	/*

>>> +	 * Due to interrupt latency (AKA "skid"), we may enter the kernel

>>> +	 * before taking an overflow, even if the PMU is only counting user

>>> +	 * events.

>>> +	 *

>>> +	 * If we're not counting kernel events, always use the user regs when

>>> +	 * sampling.

>>> +	 *

>>> +	 * TODO: what do we do about sampling a guest's registers? The IP is

>>> +	 * special-cased, but for the rest of the regs they'll get the

>>> +	 * user/kernel regs depending on whether exclude_kernel is set, which

>>> +	 * is nonsensical.

>>> +	 *

>>> +	 * We can't get at the full set of regs in all cases (e.g. Xen's PV PMU

>>> +	 * can't provide the GPRs), so should we just zero the GPRs when in a

>>> +	 * guest? Or skip outputting the regs in perf_output_sample?

>> Seems daft Xen cannot provide registers; why is that? Boris?

> The xen_pmu_regs structure simply doesn't have them, so I assume there's

> no API to get them.

>

> Given we don't currently sample the guest regs, I'd be tempted to just

> zero them for now, or skip the sample at output time (if that doesn't

> break some other case).


(Was out on vacation, couldn't respond earlier)

Yes, PV guests only get a limited set of registers passed to the handler
by the hypervisor. GPRs are not part of this set.

I think we need do

diff --git a/arch/x86/xen/pmu.c b/arch/x86/xen/pmu.c
index 7d00d4a..95997e6 100644
--- a/arch/x86/xen/pmu.c
+++ b/arch/x86/xen/pmu.c
@@ -478,7 +478,7 @@ static void xen_convert_regs(const struct
xen_pmu_regs *xen_regs,
 irqreturn_t xen_pmu_irq_handler(int irq, void *dev_id)
 {
        int err, ret = IRQ_NONE;
-       struct pt_regs regs;
+       struct pt_regs regs = {0};
        const struct xen_pmu_data *xenpmu_data = get_xenpmu_data();
        uint8_t xenpmu_flags = get_xenpmu_flags();


Do you want me to submit a separate patch or can you make this part of
yours?

Thanks.
-boris
Mark Rutland July 11, 2018, 5:59 a.m. UTC | #4
On Mon, Jul 09, 2018 at 06:42:29PM -0400, Boris Ostrovsky wrote:
> On 07/02/2018 12:02 PM, Mark Rutland wrote:

> > On Mon, Jul 02, 2018 at 05:46:55PM +0200, Peter Zijlstra wrote:

> >> On Mon, Jul 02, 2018 at 04:12:50PM +0100, Mark Rutland wrote:

> >>> +static struct pt_regs *perf_get_sample_regs(struct perf_event *event,

> >>> +					    struct pt_regs *regs)

> >>> +{

> >>> +	/*

> >>> +	 * Due to interrupt latency (AKA "skid"), we may enter the kernel

> >>> +	 * before taking an overflow, even if the PMU is only counting user

> >>> +	 * events.

> >>> +	 *

> >>> +	 * If we're not counting kernel events, always use the user regs when

> >>> +	 * sampling.

> >>> +	 *

> >>> +	 * TODO: what do we do about sampling a guest's registers? The IP is

> >>> +	 * special-cased, but for the rest of the regs they'll get the

> >>> +	 * user/kernel regs depending on whether exclude_kernel is set, which

> >>> +	 * is nonsensical.

> >>> +	 *

> >>> +	 * We can't get at the full set of regs in all cases (e.g. Xen's PV PMU

> >>> +	 * can't provide the GPRs), so should we just zero the GPRs when in a

> >>> +	 * guest? Or skip outputting the regs in perf_output_sample?

> >> Seems daft Xen cannot provide registers; why is that? Boris?

> > The xen_pmu_regs structure simply doesn't have them, so I assume there's

> > no API to get them.

> >

> > Given we don't currently sample the guest regs, I'd be tempted to just

> > zero them for now, or skip the sample at output time (if that doesn't

> > break some other case).

> 

> (Was out on vacation, couldn't respond earlier)

> 

> Yes, PV guests only get a limited set of registers passed to the handler

> by the hypervisor. GPRs are not part of this set.


Is that also true for Dom0?

> I think we need do

> 

> diff --git a/arch/x86/xen/pmu.c b/arch/x86/xen/pmu.c

> index 7d00d4a..95997e6 100644

> --- a/arch/x86/xen/pmu.c

> +++ b/arch/x86/xen/pmu.c

> @@ -478,7 +478,7 @@ static void xen_convert_regs(const struct

> xen_pmu_regs *xen_regs,

>  irqreturn_t xen_pmu_irq_handler(int irq, void *dev_id)

>  {

>         int err, ret = IRQ_NONE;

> -       struct pt_regs regs;

> +       struct pt_regs regs = {0};

>         const struct xen_pmu_data *xenpmu_data = get_xenpmu_data();

>         uint8_t xenpmu_flags = get_xenpmu_flags();

> 

> 

> Do you want me to submit a separate patch or can you make this part of

> yours?


I think this is going to become a series rather than a single patch, but I can
have a go. I need to get my head around how the various cases interact with
each other.

Thanks,
Mark.
Boris Ostrovsky July 11, 2018, 3:33 p.m. UTC | #5
On 07/11/2018 01:59 AM, Mark Rutland wrote:
> On Mon, Jul 09, 2018 at 06:42:29PM -0400, Boris Ostrovsky wrote:

>> On 07/02/2018 12:02 PM, Mark Rutland wrote:

>>> On Mon, Jul 02, 2018 at 05:46:55PM +0200, Peter Zijlstra wrote:

>>>> On Mon, Jul 02, 2018 at 04:12:50PM +0100, Mark Rutland wrote:

>>>>> +static struct pt_regs *perf_get_sample_regs(struct perf_event *event,

>>>>> +					    struct pt_regs *regs)

>>>>> +{

>>>>> +	/*

>>>>> +	 * Due to interrupt latency (AKA "skid"), we may enter the kernel

>>>>> +	 * before taking an overflow, even if the PMU is only counting user

>>>>> +	 * events.

>>>>> +	 *

>>>>> +	 * If we're not counting kernel events, always use the user regs when

>>>>> +	 * sampling.

>>>>> +	 *

>>>>> +	 * TODO: what do we do about sampling a guest's registers? The IP is

>>>>> +	 * special-cased, but for the rest of the regs they'll get the

>>>>> +	 * user/kernel regs depending on whether exclude_kernel is set, which

>>>>> +	 * is nonsensical.

>>>>> +	 *

>>>>> +	 * We can't get at the full set of regs in all cases (e.g. Xen's PV PMU

>>>>> +	 * can't provide the GPRs), so should we just zero the GPRs when in a

>>>>> +	 * guest? Or skip outputting the regs in perf_output_sample?

>>>> Seems daft Xen cannot provide registers; why is that? Boris?

>>> The xen_pmu_regs structure simply doesn't have them, so I assume there's

>>> no API to get them.

>>>

>>> Given we don't currently sample the guest regs, I'd be tempted to just

>>> zero them for now, or skip the sample at output time (if that doesn't

>>> break some other case).

>> (Was out on vacation, couldn't respond earlier)

>>

>> Yes, PV guests only get a limited set of registers passed to the handler

>> by the hypervisor. GPRs are not part of this set.

> Is that also true for Dom0?



Yes. As far as VPMU is concerned there is (almost) no differences
between privileged (i.e. dom0) and unprivileged PV guests.


-boris
Mark Rutland July 12, 2018, 11:56 a.m. UTC | #6
On Mon, Jul 09, 2018 at 06:42:29PM -0400, Boris Ostrovsky wrote:
> On 07/02/2018 12:02 PM, Mark Rutland wrote:

> > On Mon, Jul 02, 2018 at 05:46:55PM +0200, Peter Zijlstra wrote:

> >> On Mon, Jul 02, 2018 at 04:12:50PM +0100, Mark Rutland wrote:

> >>> +static struct pt_regs *perf_get_sample_regs(struct perf_event *event,

> >>> +					    struct pt_regs *regs)

> >>> +{

> >>> +	/*

> >>> +	 * Due to interrupt latency (AKA "skid"), we may enter the kernel

> >>> +	 * before taking an overflow, even if the PMU is only counting user

> >>> +	 * events.

> >>> +	 *

> >>> +	 * If we're not counting kernel events, always use the user regs when

> >>> +	 * sampling.

> >>> +	 *

> >>> +	 * TODO: what do we do about sampling a guest's registers? The IP is

> >>> +	 * special-cased, but for the rest of the regs they'll get the

> >>> +	 * user/kernel regs depending on whether exclude_kernel is set, which

> >>> +	 * is nonsensical.

> >>> +	 *

> >>> +	 * We can't get at the full set of regs in all cases (e.g. Xen's PV PMU

> >>> +	 * can't provide the GPRs), so should we just zero the GPRs when in a

> >>> +	 * guest? Or skip outputting the regs in perf_output_sample?

> >> Seems daft Xen cannot provide registers; why is that? Boris?

> > The xen_pmu_regs structure simply doesn't have them, so I assume there's

> > no API to get them.

> >

> > Given we don't currently sample the guest regs, I'd be tempted to just

> > zero them for now, or skip the sample at output time (if that doesn't

> > break some other case).

> 

> (Was out on vacation, couldn't respond earlier)

> 

> Yes, PV guests only get a limited set of registers passed to the handler

> by the hypervisor. GPRs are not part of this set.

> 

> I think we need do

> 

> diff --git a/arch/x86/xen/pmu.c b/arch/x86/xen/pmu.c

> index 7d00d4a..95997e6 100644

> --- a/arch/x86/xen/pmu.c

> +++ b/arch/x86/xen/pmu.c

> @@ -478,7 +478,7 @@ static void xen_convert_regs(const struct

> xen_pmu_regs *xen_regs,

>  irqreturn_t xen_pmu_irq_handler(int irq, void *dev_id)

>  {

>         int err, ret = IRQ_NONE;

> -       struct pt_regs regs;

> +       struct pt_regs regs = {0};

>         const struct xen_pmu_data *xenpmu_data = get_xenpmu_data();

>         uint8_t xenpmu_flags = get_xenpmu_flags();

> 

> 

> Do you want me to submit a separate patch or can you make this part of

> yours?


I've only just realised that this is an issue today, without my
synthezied pt_regs changes. For any PERF_SAMPLE_REGS_* event under Xen,
we'll leak uninitialised kernel stack to userspace in the GPR fields.

Boris, I think it's worth spinning a patch to address that now, with Cc
stable, if you're still happy to do so?

Thanks,
Mark.
Boris Ostrovsky July 12, 2018, 4:32 p.m. UTC | #7
On 07/12/2018 07:56 AM, Mark Rutland wrote:
> On Mon, Jul 09, 2018 at 06:42:29PM -0400, Boris Ostrovsky wrote:

>> On 07/02/2018 12:02 PM, Mark Rutland wrote:

>>> On Mon, Jul 02, 2018 at 05:46:55PM +0200, Peter Zijlstra wrote:

>>>> On Mon, Jul 02, 2018 at 04:12:50PM +0100, Mark Rutland wrote:

>>>>> +static struct pt_regs *perf_get_sample_regs(struct perf_event *event,

>>>>> +					    struct pt_regs *regs)

>>>>> +{

>>>>> +	/*

>>>>> +	 * Due to interrupt latency (AKA "skid"), we may enter the kernel

>>>>> +	 * before taking an overflow, even if the PMU is only counting user

>>>>> +	 * events.

>>>>> +	 *

>>>>> +	 * If we're not counting kernel events, always use the user regs when

>>>>> +	 * sampling.

>>>>> +	 *

>>>>> +	 * TODO: what do we do about sampling a guest's registers? The IP is

>>>>> +	 * special-cased, but for the rest of the regs they'll get the

>>>>> +	 * user/kernel regs depending on whether exclude_kernel is set, which

>>>>> +	 * is nonsensical.

>>>>> +	 *

>>>>> +	 * We can't get at the full set of regs in all cases (e.g. Xen's PV PMU

>>>>> +	 * can't provide the GPRs), so should we just zero the GPRs when in a

>>>>> +	 * guest? Or skip outputting the regs in perf_output_sample?

>>>> Seems daft Xen cannot provide registers; why is that? Boris?

>>> The xen_pmu_regs structure simply doesn't have them, so I assume there's

>>> no API to get them.

>>>

>>> Given we don't currently sample the guest regs, I'd be tempted to just

>>> zero them for now, or skip the sample at output time (if that doesn't

>>> break some other case).

>> (Was out on vacation, couldn't respond earlier)

>>

>> Yes, PV guests only get a limited set of registers passed to the handler

>> by the hypervisor. GPRs are not part of this set.

>>

>> I think we need do

>>

>> diff --git a/arch/x86/xen/pmu.c b/arch/x86/xen/pmu.c

>> index 7d00d4a..95997e6 100644

>> --- a/arch/x86/xen/pmu.c

>> +++ b/arch/x86/xen/pmu.c

>> @@ -478,7 +478,7 @@ static void xen_convert_regs(const struct

>> xen_pmu_regs *xen_regs,

>>  irqreturn_t xen_pmu_irq_handler(int irq, void *dev_id)

>>  {

>>         int err, ret = IRQ_NONE;

>> -       struct pt_regs regs;

>> +       struct pt_regs regs = {0};

>>         const struct xen_pmu_data *xenpmu_data = get_xenpmu_data();

>>         uint8_t xenpmu_flags = get_xenpmu_flags();

>>

>>

>> Do you want me to submit a separate patch or can you make this part of

>> yours?

> I've only just realised that this is an issue today, without my

> synthezied pt_regs changes. For any PERF_SAMPLE_REGS_* event under Xen,

> we'll leak uninitialised kernel stack to userspace in the GPR fields.

>

> Boris, I think it's worth spinning a patch to address that now, with Cc

> stable, if you're still happy to do so?



Sure, I can do this.

(As a side note, I also noticed this issue but wasn't especially
concerned about it since VPMU is not supported in Xen from security
perspective: http://xenbits.xenproject.org/xsa/advisory-163.html).

-boris
Mark Rutland Aug. 2, 2018, 8:57 a.m. UTC | #8
On Wed, Jul 11, 2018 at 06:59:28AM +0100, Mark Rutland wrote:
> On Mon, Jul 09, 2018 at 06:42:29PM -0400, Boris Ostrovsky wrote:

> > On 07/02/2018 12:02 PM, Mark Rutland wrote:

> > > On Mon, Jul 02, 2018 at 05:46:55PM +0200, Peter Zijlstra wrote:

> > >> On Mon, Jul 02, 2018 at 04:12:50PM +0100, Mark Rutland wrote:

> > >>> +static struct pt_regs *perf_get_sample_regs(struct perf_event *event,

> > >>> +					    struct pt_regs *regs)

> > >>> +{

> > >>> +	/*

> > >>> +	 * Due to interrupt latency (AKA "skid"), we may enter the kernel

> > >>> +	 * before taking an overflow, even if the PMU is only counting user

> > >>> +	 * events.

> > >>> +	 *

> > >>> +	 * If we're not counting kernel events, always use the user regs when

> > >>> +	 * sampling.

> > >>> +	 *

> > >>> +	 * TODO: what do we do about sampling a guest's registers? The IP is

> > >>> +	 * special-cased, but for the rest of the regs they'll get the

> > >>> +	 * user/kernel regs depending on whether exclude_kernel is set, which

> > >>> +	 * is nonsensical.

> > >>> +	 *

> > >>> +	 * We can't get at the full set of regs in all cases (e.g. Xen's PV PMU

> > >>> +	 * can't provide the GPRs), so should we just zero the GPRs when in a

> > >>> +	 * guest? Or skip outputting the regs in perf_output_sample?

> > >> Seems daft Xen cannot provide registers; why is that? Boris?

> > > The xen_pmu_regs structure simply doesn't have them, so I assume there's

> > > no API to get them.

> > >

> > > Given we don't currently sample the guest regs, I'd be tempted to just

> > > zero them for now, or skip the sample at output time (if that doesn't

> > > break some other case).


> I think this is going to become a series rather than a single patch, but I can

> have a go. I need to get my head around how the various cases interact with

> each other.


As a heads-up, due to sabbatical leave I'm probably not going to have
the chance to dig into this until September.

If someone else is able to take a look at this, that would be great.
Otherwise, I'm happy to take a look once I get back.

Thanks,
Mark.
diff mbox series

Patch

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 8f0434a9951a..2ab2548b2e66 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6361,6 +6361,32 @@  perf_callchain(struct perf_event *event, struct pt_regs *regs)
 	return callchain ?: &__empty_callchain;
 }
 
+static struct pt_regs *perf_get_sample_regs(struct perf_event *event,
+					    struct pt_regs *regs)
+{
+	/*
+	 * Due to interrupt latency (AKA "skid"), we may enter the kernel
+	 * before taking an overflow, even if the PMU is only counting user
+	 * events.
+	 *
+	 * If we're not counting kernel events, always use the user regs when
+	 * sampling.
+	 *
+	 * TODO: what do we do about sampling a guest's registers? The IP is
+	 * special-cased, but for the rest of the regs they'll get the
+	 * user/kernel regs depending on whether exclude_kernel is set, which
+	 * is nonsensical.
+	 *
+	 * We can't get at the full set of regs in all cases (e.g. Xen's PV PMU
+	 * can't provide the GPRs), so should we just zero the GPRs when in a
+	 * guest? Or skip outputting the regs in perf_output_sample?
+	 */
+	if (event->attr.exclude_kernel && !user_mode(regs))
+		return task_pt_regs(current);
+
+	return regs;
+}
+
 void perf_prepare_sample(struct perf_event_header *header,
 			 struct perf_sample_data *data,
 			 struct perf_event *event,
@@ -6368,6 +6394,8 @@  void perf_prepare_sample(struct perf_event_header *header,
 {
 	u64 sample_type = event->attr.sample_type;
 
+	regs = perf_get_sample_regs(event, regs);
+
 	header->type = PERF_RECORD_SAMPLE;
 	header->size = sizeof(*header) + event->header_size;