
Possible race between CPU hotplug and perf_pmu_migrate_context

Message ID 20140903115013.GA3127@leverpostej
State New

Commit Message

Mark Rutland Sept. 3, 2014, 11:50 a.m. UTC
Hi all,

Further to my earlier reply I've come up with a potential fix below,
which has survived my stress test for both my WIP driver and the intel
uncore imc driver.

As it's impossible to safely synchronize on event->ctx, I'd hoped it
would be possible to synchronize on a field of the event itself.
Unfortunately all I managed to come up with were some shiny new ABBA
deadlocks.

Instead I've followed the example set by perf_event_open and inhibited
CPU hotplug for the portion of put_event that removes an event from its
context, which will prevent perf_pmu_migrate_context from modifying
event->ctx under our feet.

While there's the potential to starve CPU hotplug, that's already the
case for the perf_event_open path, so I'm not sure if that's a big deal
or not.

Thoughts?

Mark.

---->8----
From 6465beace3ad9b12039127468f4596b8e87a53e8 Mon Sep 17 00:00:00 2001
From: Mark Rutland <mark.rutland@arm.com>
Date: Wed, 3 Sep 2014 11:06:22 +0100
Subject: [PATCH] perf: prevent hotplug race on event->ctx

The perf_pmu_migrate_context code introduced in 0cda4c023132 (perf:
Introduce perf_pmu_migrate_context()) didn't take the tear-down of
events into account, and left open a race with put_event on event->ctx.
A resulting duplicate put_ctx of an event's original context can lead to
the context being erroneously kfreed via RCU, resulting in the below
splat with the intel uncore_imc PMU driver:

[   66.621306] ------------[ cut here ]------------
[   66.625933] kernel BUG at mm/slub.c:3380!
[   66.629947] invalid opcode: 0000 [#1] SMP
[   66.634101] Modules linked in: vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) x86_pkg_temp_thermal
[   66.643476] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G           O  3.16.1-uncore-pmu-test #2
[   66.653132] Hardware name: LENOVO 10A6A03EUK/SHARKBAY, BIOS FBKT72AUS 01/26/2014
[   66.660530] task: ffff88040b584f50 ti: ffff88040b5d4000 task.ti: ffff88040b5d4000
[   66.668009] RIP: 0010:[<ffffffff8114a443>]  [<ffffffff8114a443>] kfree+0x133/0x140
[   66.675615] RSP: 0018:ffff88041dc43ea8  EFLAGS: 00010246
[   66.680930] RAX: 0200000000000400 RBX: ffff88041dc18100 RCX: 00000000000000c8
[   66.688066] RDX: 0200000000000000 RSI: ffff8800db601800 RDI: ffff88041dc18100
[   66.695202] RBP: ffff88041dc43ec0 R08: 00000000000156e0 R09: ffff88041dc556e0
[   66.702334] R10: ffffea0010770600 R11: ffffea00036d8000 R12: ffffffff81c3dec0
[   66.709472] R13: ffffffff8109dd33 R14: ffff880409b96b08 R15: 0000000000000006
[   66.716607] FS:  0000000000000000(0000) GS:ffff88041dc40000(0000) knlGS:0000000000000000
[   66.724697] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   66.730443] CR2: 00007fae8a93b000 CR3: 00000000dc962000 CR4: 00000000001407e0
[   66.737580] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   66.744714] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   66.751852] Stack:
[   66.753873]  ffff88041dc4d300 ffffffff81c3dec0 000000000000000a ffff88041dc43f20
[   66.761371]  ffffffff8109dd33 ffff8800db600500 ffff88040b584f50 ffff88040b5d7fd8
[   66.768873]  ffff88041dc4d328 0000000000000000 0000000000000009 ffffffff81c090c8
[   66.776371] Call Trace:
[   66.778823]  <IRQ>
[   66.780759]  [<ffffffff8109dd33>] rcu_process_callbacks+0x1e3/0x540
[   66.787254]  [<ffffffff8104e70e>] __do_softirq+0xee/0x280
[   66.792654]  [<ffffffff8104eaad>] irq_exit+0x9d/0xb0
[   66.797625]  [<ffffffff81032b4f>] smp_apic_timer_interrupt+0x3f/0x50
[   66.803982]  [<ffffffff817de68a>] apic_timer_interrupt+0x6a/0x70
[   66.809994]  <EOI>
[   66.811926]  [<ffffffff81590ce7>] ? cpuidle_enter_state+0x47/0xc0
[   66.818250]  [<ffffffff81590e12>] cpuidle_enter+0x12/0x20
[   66.823650]  [<ffffffff81086aa6>] cpu_startup_entry+0x256/0x3f0
[   66.829572]  [<ffffffff81030d82>] start_secondary+0x192/0x200
[   66.835319] Code: 49 8b 02 31 f6 f6 c4 40 74 04 41 8b 72 68 4c 89 d7 e8 92 ed fb ff eb 93 4c 8b 50 30 48 8b 10 80 e6 80 4c 0f 44 d0 e9 36 ff ff ff <0f> 0b 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 c7 c0 ea ff ff ff
[   66.855859] RIP  [<ffffffff8114a443>] kfree+0x133/0x140
[   66.861113]  RSP <ffff88041dc43ea8>
[   66.864617] ---[ end trace 825fa0ba52ca10eb ]---
[   66.869240] Kernel panic - not syncing: Fatal exception in interrupt
[   66.875616] Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)
[   66.885791] ---[ end Kernel panic - not syncing: Fatal exception in interrupt

In response to a CPU notifier an uncore PMU driver calls
perf_pmu_migrate_context, which will remove all events from the old CPU
context before placing them all into the new CPU context. For a short
period the events are in limbo and are part of neither context, though
their event->ctx pointers still point at the old context.

During this period another CPU may enter put_event, which will try to
remove the event from event->ctx. As this may still point at the old
context, put_ctx can be called twice for a given event on the original
context. The context's refcount may drop to zero unexpectedly, whereupon
put_ctx will queue up a kfree with RCU. This blows up at the end of the
next grace period as the uncore PMU contexts are housed within
perf_cpu_context and weren't directly allocated with k*alloc.

This patch prevents the issue by inhibiting hotplug for the portion of
put_event which must access event->ctx, preventing the notifiers which
call perf_pmu_migrate_context from running concurrently. Once the event
has been removed from its context perf_pmu_migrate_context will no
longer be able to access it, so it is not necessary to inhibit hotplug
for the duration of event tear-down.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Zheng, Yan <zheng.z.yan@intel.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Vince Weaver <vincent.weaver@maine.edu>
---
 kernel/events/core.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

Comments

Peter Zijlstra Sept. 4, 2014, 10:44 a.m. UTC | #1
On Wed, Sep 03, 2014 at 12:50:14PM +0100, Mark Rutland wrote:
> From 6465beace3ad9b12039127468f4596b8e87a53e8 Mon Sep 17 00:00:00 2001
> From: Mark Rutland <mark.rutland@arm.com>
> Date: Wed, 3 Sep 2014 11:06:22 +0100
> Subject: [PATCH] perf: prevent hotplug race on event->ctx
> 
> The perf_pmu_migrate_context code introduced in 0cda4c023132 (perf:
> Introduce perf_pmu_migrate_context()) didn't take the tear-down of
> events into account, and left open a race with put_event on event->ctx.
> A resulting duplicate put_ctx of an event's original context can lead to
> the context being erroneously kfreed via RCU, resulting in the below
> splat with the intel uncore_imc PMU driver:

<snip>

> In response to a CPU notifier an uncore PMU driver calls
> perf_pmu_migrate_context, which will remove all events from the old CPU
> context before placing them all into the new CPU context. For a short
> period the events are in limbo and are part of neither context, though
> their event->ctx pointers still point at the old context.
> 
> During this period another CPU may enter put_event, which will try to
> remove the event from event->ctx. As this may still point at the old
> context, put_ctx can be called twice for a given event on the original
> context. The context's refcount may drop to zero unexpectedly, whereupon
> put_ctx will queue up a kfree with RCU. This blows up at the end of the
> next grace period as the uncore PMU contexts are housed within
> perf_cpu_context and weren't directly allocated with k*alloc.
> 
> This patch prevents the issue by inhibiting hotplug for the portion of
> put_event which must access event->ctx, preventing the notifiers which
> call perf_pmu_migrate_context from running concurrently. Once the event
> has been removed from its context perf_pmu_migrate_context will no
> longer be able to access it, so it is not necessary to inhibit hotplug
> for the duration of event tear-down.

Right, so that works I suppose. The thing is, get_online_cpus() is a
global lock and we can potentially do a lot of put_event()s. I had a
patch a while back that rewrote the cpuhotplug locking, but Linus didn't
particularly like that at the time.

I'll try and see if I can come up with anything else, but so far I've
only discovered a lot of ways that don't work (like I'm sure you did
too).
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Mark Rutland Sept. 4, 2014, 11:07 a.m. UTC | #2
On Thu, Sep 04, 2014 at 11:44:02AM +0100, Peter Zijlstra wrote:
> On Wed, Sep 03, 2014 at 12:50:14PM +0100, Mark Rutland wrote:
> > From 6465beace3ad9b12039127468f4596b8e87a53e8 Mon Sep 17 00:00:00 2001
> > From: Mark Rutland <mark.rutland@arm.com>
> > Date: Wed, 3 Sep 2014 11:06:22 +0100
> > Subject: [PATCH] perf: prevent hotplug race on event->ctx
> > 
> > The perf_pmu_migrate_context code introduced in 0cda4c023132 (perf:
> > Introduce perf_pmu_migrate_context()) didn't take the tear-down of
> > events into account, and left open a race with put_event on event->ctx.
> > A resulting duplicate put_ctx of an event's original context can lead to
> > the context being erroneously kfreed via RCU, resulting in the below
> > splat with the intel uncore_imc PMU driver:
> 
> <snip>
> 
> > In response to a CPU notifier an uncore PMU driver calls
> > perf_pmu_migrate_context, which will remove all events from the old CPU
> > context before placing them all into the new CPU context. For a short
> > period the events are in limbo and are part of neither context, though
> > their event->ctx pointers still point at the old context.
> > 
> > During this period another CPU may enter put_event, which will try to
> > remove the event from event->ctx. As this may still point at the old
> > context, put_ctx can be called twice for a given event on the original
> > context. The context's refcount may drop to zero unexpectedly, whereupon
> > put_ctx will queue up a kfree with RCU. This blows up at the end of the
> > next grace period as the uncore PMU contexts are housed within
> > perf_cpu_context and weren't directly allocated with k*alloc.
> > 
> > This patch prevents the issue by inhibiting hotplug for the portion of
> > put_event which must access event->ctx, preventing the notifiers which
> > call perf_pmu_migrate_context from running concurrently. Once the event
> > has been removed from its context perf_pmu_migrate_context will no
> > longer be able to access it, so it is not necessary to inhibit hotplug
> > for the duration of event tear-down.
> 
> Right, so that works I suppose. The thing is, get_online_cpus() is a
> global lock and we can potentially do a lot of put_event()s. I had a
> patch a while back that rewrote the cpuhotplug locking, but Linus didn't
> particularly like that at the time.

Yeah, calling {get,put}_online_cpus() is far from ideal.

When testing, open/close and hotplug had a rather noticeable effect on
each other's progress (judging by the visible rate of output over
serial; I didn't make any actual measurements). Killing a few tasks with
~1024 events open each would crawl to completion over a few seconds.

> I'll try and see if I can come up with anything else, but so far I've
> only discovered a lot of ways that don't work (like I'm sure you did
> too).

Yup; ABBA deadlock or too many/few put_ctx(old_ctx) calls.

Thanks for taking a look. If you have any ideas I'm happy to try another
approach.

Mark.
Linus Torvalds Sept. 5, 2014, 3:41 p.m. UTC | #3
On Fri, Sep 5, 2014 at 8:16 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> How horrible is the below patch (performance wise). It does pretty much
> the same thing except that percpu_rw_semaphore is a lot saner, its
> read side performance should be minimal in the absence of writes.

Ugh. Why do any locking at all (whether a new 'perf_rwsem' or using
'get_online_cpus()').

Wouldn't it be much nicer to just do what memory management routines
are *supposed* to do, and get a reference count to the context while
having a pointer to it?

IOW, why doesn't put_event() just have a

   get_ctx(ctx);
   ..
   put_ctx(ctx);

around its use of the context pointer? So if the context ends up being
migrated during this time, it doesn't get freed.

However, the more fundamental question is "what protects accesses to
'event->ctx'". Why is "put_event()" so special that *it* gets locking
for the reading of "event->ctx", but none of the other cases of
reading the ctx pointer gets it or needs it?

I'm getting the feeling that this race is bigger than just put_event().

            Linus
Vince Weaver Sept. 5, 2014, 4:50 p.m. UTC | #4
On Fri, 5 Sep 2014, Linus Torvalds wrote:


> However, the more fundamental question is "what protects accesses to
> 'event->ctx'". Why is "put_event()" so special that *it* gets locking
> for the reading of "event->ctx", but none of the other cases of
> reading the ctx pointer gets it or needs it?
> 
> I'm getting the feeling that this race is bigger than just put_event().

I've been chasing a bug triggered by my perf_fuzzer program (with a
forking workload) for the past few months. It will reliably oops the
machine or worse (it has somehow not only taken down the test machine,
but the whole local network).

Often it seems to come from deep inside the perf_event context locking, in 
conjunction with complex open/fork/close/migrate workloads.

Here's a link to an older bug writeup, I've had it happen more recently 
but I've been too busy to bother writing it up.

	http://web.eece.maine.edu/~vweaver/projects/perf_events/fuzzer/3.15-rc5.get_cpu_context_gpf.html

Is there hope that we've finally found a plausible source for this bug?

Vince
Mark Rutland Sept. 5, 2014, 4:59 p.m. UTC | #5
On Fri, Sep 05, 2014 at 04:41:43PM +0100, Linus Torvalds wrote:
> On Fri, Sep 5, 2014 at 8:16 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > How horrible is the below patch (performance wise). It does pretty much
> > the same thing except that percpu_rw_semaphore is a lot saner, its
> > read side performance should be minimal in the absence of writes.
> 
> Ugh. Why do any locking at all (whether a new 'perf_rwsem' or using
> 'get_online_cpus()').
> 
> Wouldn't it be much nicer to just do what memory management routines
> are *supposed* to do, and get a reference count to the context while
> having a pointer to it?
> 
> IOW, why doesn't put_event() just have a
> 
>    get_ctx(ctx);
>    ..
>    put_ctx(ctx);
> 
> around its use of the context pointer? So if the context ends up being
> migrated during this time, it doesn't get freed.

For the duration of put_event, the event holds a ref on the context. That only
gets decremented _after_ we're done dealing with event->ctx, at the very end of
put_event. Follow the callchain:

	put_event(event)
	-> _free_event(event)
	-> __free_event(event)
	-> put_ctx(event->ctx).

As you point out below, the race on event->ctx is the fundamental issue. That
is what results in decrementing the refcount twice (once on a stale event->ctx
pointer).

> However, the more fundamental question is "what protects accesses to
> 'event->ctx'". Why is "put_event()" so special that *it* gets locking
> for the reading of "event->ctx", but none of the other cases of
> reading the ctx pointer gets it or needs it?

The key point is that it doesn't, which is precisely what this patch attempted
to correct. 

Regardless, you're right that other uses of event->ctx are just as broken. What
perf_pmu_migrate_context failed to take into account was that it is possible to
access an event without going via its owning context and holding ctx->mutex.

> I'm getting the feeling that this race is bigger than just put_event().

We certainly have at least one more race; for event groups perf_read can lock
the stale context.

Mark.
Linus Torvalds Sept. 5, 2014, 5:31 p.m. UTC | #6
On Fri, Sep 5, 2014 at 9:59 AM, Mark Rutland <mark.rutland@arm.com> wrote:
>
> As you point out below, the race on event->ctx is the fundamental issue. That
> is what results in decrementing the refcount twice (once on a stale event->ctx
> pointer).

So quite frankly, the whole perf_pmu_migrate_context() thing looks
completely and fundamentally broken.

Your patch doesn't really solve anything in general, and would need to
be extended to do that get_online_cpus() around every single case
where we read it. Which isn't really acceptable.

perf_pmu_migrate_context() is just f*cked up. It needs to be reverted.
It seems to think that "the event->ctx" pointer is a RCU-protected
pointer, but it really isn't. Its lifetime is protected by
refcounting, and the pointer access itself isn't protected by
anything, because "event->ctx" isn't supposed to change over the
lifetime of "event". Nobody does all the necessary rcu_read_[un]lock()
around it, nor access it with rcu_dereference(), because all users
know that it's just fixed.

Quite frankly, I think commit 0cda4c023132 ("perf: Introduce
perf_pmu_migrate_context()") is unfixably broken. The whole thing
needs to be re-thought. It violates everything that everybody else
does with the context. No amount of (sane) locking can fix the
breakage, I suspect.

A possible (but unfortunate) real fix is to try to keep the broken
code around, and make its broken assumptions be actually true. Make
"event->ctx" truly be an RCU pointer. Add the RCU annotations, add the
required rcu read-locking, and make everything really do what that
currently completely broken perf_pmu_migrate_context() thinks it does.

   That implies, for example, that any time you use the thing in a
sleeping context, you'd need to do something like this (possibly with
a helper function to do that: "get_event_ctx()"):

        rcu_read_lock();
        ctx = rcu_dereference(event->ctx);
        get_ctx(ctx);
        rcu_read_unlock();

        .. you can now sleep here and use 'ctx' ...

        put_ctx(ctx);

    and if you don't need to sleep, you can just do

        rcu_read_lock();
        ctx = rcu_dereference(event->ctx);
        .. use ctx in an atomic context ...
        rcu_read_unlock();

but that is a much bigger patch than either of the "introduce a
completely broken ad-hoc lock around just one random use case"
patches.

Alternatively (and this sounds like the better fix), use some way to
avoid that broken perf_pmu_migrate_context() model entirely. Never
"migrate" events. Create completely new ones on the new CPU, and leave
the old stale ones alone even on dead CPU's (they will presumably not
actually be activated, and who the hell cares?).

Or even just say: "if somebody takes down a CPU with existing uncore
events, those events are now effectively dead". Don't try to migrate
them, don't try to create new events on another CPU, just let it go.
The CPU is down, the events are inactive.

Keep it simple. Not the current completely broken thing that clearly
doesn't honor the actual real rules for "event->ctx". Because the
current rules are effectively that event->ctx is a constant over the
whole lifetime of the event. Agreed?

                    Linus
Peter Zijlstra Sept. 5, 2014, 7:54 p.m. UTC | #7
On Fri, Sep 05, 2014 at 10:31:49AM -0700, Linus Torvalds wrote:
> So quite frankly, the whole perf_pmu_migrate_context() thing looks
> completely and fundamentally broken.

Yes, agreed. We can go play very nasty games, but fundamentally agreed.

> Or even just say: "if somebody takes down a CPU with existing uncore
> events, those events are now effectively dead". Don't try to migrate
> them, don't try to create new events on another CPU, just let it go.
> The CPU is down, the events are inactive.
> 
> Keep it simple. Not the current completely broken thing that clearly
> doesn't honor the actual real rules for "event->ctx". Because the
> current rules are effectively that event->ctx is a constant over the
> whole lifetime of the event. Agreed?

I like this best; I've never really been a supporter of hotplug games.
Peter Zijlstra Sept. 8, 2014, 8:39 a.m. UTC | #8
On Fri, Sep 05, 2014 at 10:31:49AM -0700, Linus Torvalds wrote:
> So quite frankly, the whole perf_pmu_migrate_context() thing looks
> completely and fundamentally broken.
> 
> Your patch doesn't really solve anything in general, and would need to
> be extended to do that get_online_cpus() around every single case
> where we read it. Which isn't really acceptable.
> 
> perf_pmu_migrate_context() is just f*cked up. It needs to be reverted.
> It seems to think that "the event->ctx" pointer is a RCU-protected
> pointer, but it really isn't. Its lifetime is protected by
> refcounting, and the pointer access itself isn't protected by
> anything, because "event->ctx" isn't supposed to change over the
> lifetime of "event". Nobody does all the necessary rcu_read_[un]lock()
> around it, nor access it with rcu_dereference(), because all users
> know that it's just fixed.

It's worse; there's a second site that does exactly the same broken
thing :/

The 'move_group' logic in sys_perf_event_open() migrates a software
event into a hardware context when creating event groups.

> Quite frankly, I think commit 0cda4c023132 ("perf: Introduce
> perf_pmu_migrate_context()") is unfixably broken. The whole thing
> needs to be re-thought. It violates everything that everybody else
> does with the context. No amount of (sane) locking can fix the
> breakage, I suspect.

So yes and no; yes for sanity and consistency, no because after we've
done perf_remove_from_context() the event itself is inactive and we only
need to 'worry' about external (mostly userspace and exit paths)
interaction.

The hard exclusion provided by a rwsem/hotplug lock can do this, but I
agree its quite horrible.

> A possible (but unfortunate) real fix is to try to keep the broken
> code around, and make its broken assumptions be actually true. Make
> "event->ctx" truly be an RCU pointer. Add the RCU annotations, add the
> required rcu read-locking, and make everything really do what that
> currently completely broken perf_pmu_migrate_context() thinks it does.

I'm not entirely sure RCU works here.

>    That implies, for example, that any time you use the thing in a
> sleeping context, you'd need to do something like this (possibly with
> a helper function to do that: "get_event_ctx()"
> 
>         rcu_read_lock();
>         ctx = rcu_dereference(event->ctx);
>         get_ctx(ctx);
>         rcu_read_unlock();
> 
>         .. you can now sleep here and use 'ctx' ...
> 
>         put_ctx(ctx);
> 
>     and if you don't need to sleep, you can just do
> 
>         rcu_read_lock();
>         ctx = rcu_dereference(event->ctx);
>         .. use ctx in an atomic context ...
>         rcu_read_unlock();
> 
> but that is a much bigger patch than either of the "introduce a
> completely broken ad-hoc lock around just one random use case"
> patches.

So the problem with an RCU like approach is that it would allow multiple
users to see different ctxs. Even calling synchronize_rcu() in between
doesn't really solve that because we don't clear the event->ctx pointer
after remove_from_context, install_in_context simply sets a new one.

If we were to clear the context pointer, all usage sites would have to
read/wait until there's a valid pointer again, at which point we've just
built a rwsem with rcu.

> Alternatively (and this sounds like the better fix), use some way to
> avoid that broken perf_pmu_migrate_context() model entirely. Never
> "migrate" events. Create completely new ones on the new CPU, and leave
> the old stale ones alone even on dead CPU's (they will presumably not
> actually be activated, and who the hell cares?).

That appears to simply move the problem; instead of event->ctx we then
have the exact same problem with filp->private_data. It might be a
better-delimited problem perhaps, but it adds a little complexity
because we need to copy the event state around.

Similarly we can't use RCU here because allowing concurrent operations
on different 'versions' of the event makes it very 'interesting' to sync
state between the old and new object.

> Or even just say: "if somebody takes down a CPU with existing uncore
> events, those events are now effectively dead". Don't try to migrate
> them, don't try to create new events on another CPU, just let it go.
> The CPU is down, the events are inactive.

Which is indeed very attractive; however, with the additional
'move_group' problem I'm not entirely seeing how we can make that
happen. It would end up disallowing certain event groups that are
currently possible. Most notable a software group leader with hardware
events -- a most useful construct for those people who don't have
interrupts on their hardware PMUs.

Lemme ponder this a bit more.

Patch

diff --git a/kernel/events/core.c b/kernel/events/core.c
index f9c1ed0..9e44cae 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3317,7 +3317,7 @@  static void free_event(struct perf_event *event)
  */
 static void put_event(struct perf_event *event)
 {
-	struct perf_event_context *ctx = event->ctx;
+	struct perf_event_context *ctx;
 	struct task_struct *owner;
 
 	if (!atomic_long_dec_and_test(&event->refcount))
@@ -3356,6 +3356,16 @@  static void put_event(struct perf_event *event)
 		put_task_struct(owner);
 	}
 
+	/*
+	 * We must ensure that perf_pmu_migrate_context can't race on
+	 * event->ctx. Inhibit hotplug (and hence the notifiers
+	 * perf_pmu_migrate_context is called from) until the event is removed
+	 * from its current context.
+	 */
+	get_online_cpus();
+
+	ctx = event->ctx;
+
 	WARN_ON_ONCE(ctx->parent_ctx);
 	/*
 	 * There are two ways this annotation is useful:
@@ -3373,6 +3383,8 @@  static void put_event(struct perf_event *event)
 	perf_remove_from_context(event, true);
 	mutex_unlock(&ctx->mutex);
 
+	put_online_cpus();
+
 	_free_event(event);
 }