
perf: fix pmu::filter_match for SW-led groups

Message ID 1465917041-15339-1-git-send-email-mark.rutland@arm.com
State Accepted
Commit 2c81a6477081966fe80b8c6daa68459bca896774

Commit Message

Mark Rutland June 14, 2016, 3:10 p.m. UTC
Commit 66eb579e66ecfea5 ("perf: allow for PMU-specific event filtering")
added pmu::filter_match. This was intended to avoid HW constraints on
events from resulting in extremely pessimistic scheduling.

However, pmu::filter_match is only called for the leader of each event
group. When the leader is a SW event, the group is not filtered, and its
HW events may fail at pmu::add time. When this happens, we give up on
scheduling any event groups later in the list until they are rotated
ahead of the failing group.

This can result in extremely sub-optimal scheduling behaviour, e.g. if
running the following on a big.LITTLE platform:

$ taskset -c 0 ./perf stat \
 -e 'a57{context-switches,armv8_cortex_a57/config=0x11/}' \
 -e 'a53{context-switches,armv8_cortex_a53/config=0x11/}' \
 ls

     <not counted>      context-switches                                              (0.00%)
     <not counted>      armv8_cortex_a57/config=0x11/                                     (0.00%)
                24      context-switches                                              (37.36%)
          57589154      armv8_cortex_a53/config=0x11/                                     (37.36%)

Here the 'a53' event group was always eligible to be scheduled, but the
'a57' group was never eligible, as the task was always affine to a
Cortex-A53 CPU. The SW (group leader) event in the 'a57' group was
eligible, but the HW event failed at pmu::add time, resulting in
ctx_flexible_sched_in giving up on scheduling further groups with HW
events.

One way of avoiding this is to check pmu::filter_match on siblings as
well as the group leader. If any of these fail their pmu::filter_match,
we must skip the entire group before attempting to add any events.

Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Mark Rutland <mark.rutland@arm.com>

---
 kernel/events/core.c | 22 +++++++++++++++++++++-
 1 file changed, 21 insertions(+), 1 deletion(-)

I've tried to find a better way of handling this (without needing to walk the
siblings list), but so far I'm at a loss. At least it's "only" O(n) in the size
of the sibling list we were going to walk anyway.

I suspect that at a more fundamental level, I need to stop sharing a
perf_hw_context between HW PMUs (i.e. replace task_struct::perf_event_ctxp with
something that can handle multiple HW PMUs). From previous attempts I'm not
sure if that's going to be possible.

Any ideas appreciated!

Mark.

-- 
1.9.1

Comments

Mark Rutland July 4, 2016, 6:05 p.m. UTC | #1
On Sat, Jul 02, 2016 at 06:40:25PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 14, 2016 at 04:10:41PM +0100, Mark Rutland wrote:
> > However, pmu::filter_match is only called for the leader of each event
> > group. When the leader is a SW event, we do not filter the groups, and
> > may fail at pmu::add time, and when this happens we'll give up on
> > scheduling any event groups later in the list until they are rotated
> > ahead of the failing group.
> 
> Ha! indeed.
> 
> > I've tried to find a better way of handling this (without needing to walk the
> > siblings list), but so far I'm at a loss. At least it's "only" O(n) in the size
> > of the sibling list we were going to walk anyway.
> > 
> > I suspect that at a more fundamental level, I need to stop sharing a
> > perf_hw_context between HW PMUs (i.e. replace task_struct::perf_event_ctxp with
> > something that can handle multiple HW PMUs). From previous attempts I'm not
> > sure if that's going to be possible.
> > 
> > Any ideas appreciated!
> 
> So I think I have half-cooked ideas.
> 
> One of the problems I've been wanting to solve for a long time is that
> the per-cpu flexible list has priority over the per-task flexible list.
> 
> I would like them to rotate together.

Makes sense.

> One of the ways I was looking at getting that done is a virtual runtime
> scheduler (just like cfs). The tricky point is merging two virtual
> runtime trees. But I think that should be doable if we sort the trees on
> lag.
> 
> In any case, the relevance to your question is that once we have a tree,
> we can play games with order; that is, if we first order on PMU-id and
> only second on lag, we get whole subtree clusters specific for a PMU.

Hmm... I'm not sure how that helps in this case. Wouldn't we still need
to walk the sibling list to get the HW PMU-id in the case of a SW group
leader?

For the heterogeneous case we'd need a different sort order per-cpu
(well, per microarchitecture), which sounds like we're going to have to
fully sort the events every time they move between CPUs. :/

> Lots of details missing in that picture, but I think something along
> those lines might get us what we want.

Perhaps! Hopefully I'm just missing those details above. :)

I also had another thought about solving the SW-led group case: if the
leader had a reference to the group's HW PMU (of which there should only
be one), we can filter on that alone, and can also use that in
group_sched_in rather than the ctx->pmu, avoiding the issue that
ctx->pmu is not the same as the group's HW PMU.

I'll have a play with that approach in the mean time.

Thanks,
Mark.
Mark Rutland July 5, 2016, 9:44 a.m. UTC | #2
On Tue, Jul 05, 2016 at 10:35:26AM +0200, Peter Zijlstra wrote:
> On Mon, Jul 04, 2016 at 07:05:35PM +0100, Mark Rutland wrote:
> > On Sat, Jul 02, 2016 at 06:40:25PM +0200, Peter Zijlstra wrote:
> > > One of the ways I was looking at getting that done is a virtual runtime
> > > scheduler (just like cfs). The tricky point is merging two virtual
> > > runtime trees. But I think that should be doable if we sort the trees on
> > > lag.
> > > 
> > > In any case, the relevance to your question is that once we have a tree,
> > > we can play games with order; that is, if we first order on PMU-id and
> > > only second on lag, we get whole subtree clusters specific for a PMU.
> > 
> > Hmm... I'm not sure how that helps in this case. Wouldn't we still need
> > to walk the sibling list to get the HW PMU-id in the case of a SW group
> > leader?
> 
> Since there is a hardware event in the group, it must be part of the
> hardware pmu list/tree and would thus end up classified (and sorted) by
> that (hardware) PMU-id.
> 
> > For the heterogeneous case we'd need a different sort order per-cpu
> > (well, per microarchitecture), which sounds like we're going to have to
> > fully sort the events every time they move between CPUs. :/
> 
> Confused, I thought that for the HG case you had multiple events, one
> for each PMU. If we classify these events differently we'd simply use a
> different subtree depending on which CPU the task lands.

My bad; I assumed that for both PMUs we'd start at the root, and thus
would need to re-sort in order to get the current CPU's PMU ordered
first, much like currently with rotation.

I guess I'm having difficulty figuring out the structure of that tree.
If we can easily/cheaply find the relevant sub-tree then the above isn't
an issue.

> Currently we've munged the two PMUs together, because, well, that's the
> only way.

Yeah. Splitting them by any means would be great. In the past I'd looked
at changing task_struct::perf_event_ctxp into something that could
handle an arbitrary number of contexts, such that we could avoid
sharing, but ran away after considering the locking/rcu implications.

Thanks,
Mark.
Mark Rutland July 5, 2016, 12:52 p.m. UTC | #3
On Tue, Jul 05, 2016 at 02:04:26PM +0200, Peter Zijlstra wrote:
> On Tue, Jul 05, 2016 at 10:44:48AM +0100, Mark Rutland wrote:
> > My bad; I assumed that for both PMUs we'd start at the root, and thus
> > would need to re-sort in order to get the current CPU's PMU ordered
> > first, much like currently with rotation.
> > 
> > I guess I'm having difficulty figuring out the structure of that tree.
> > If we can easily/cheaply find the relevant sub-tree then the above isn't
> > an issue.
> 
> struct event {
> 	struct rb_node node;
> 	int pmu_id;
> 	s64 lag;
> 	...
> };
> 
> bool event_less(struct rb_node *a, struct rb_node *b)
> {
> 	struct event *left = rb_entry(a, struct event, node);
> 	struct event *right = rb_entry(b, struct event, node);
> 
> 	if (left->pmu_id < right->pmu_id)
> 		return true;
> 
> 	if (left->pmu_id > right->pmu_id)
> 		return false;
> 
> 	/* left->pmu_id == right->pmu_id */
> 	if (left->lag < right->lag)
> 		return true;
> 
> 	return false;
> }
> 
> Will give you a tree with primary order @pmu_id and secondary order
> @lag.
> 
> Which you'd iterate like:
> 
> 	for (event = event_find(pmu_id); event && event->pmu_id == pmu_id; event = event_next(event)) {
> 	}
> 
> And get only the events matching @pmu_id in @lag order.

Cheers! Sorry for being thick; I think I understand now.

I'll have a tinker with the idea.

Thanks,
Mark.

Patch

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 9c51ec3..c0b6db0 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1678,12 +1678,32 @@  static bool is_orphaned_event(struct perf_event *event)
 	return event->state == PERF_EVENT_STATE_DEAD;
 }
 
-static inline int pmu_filter_match(struct perf_event *event)
+static inline int __pmu_filter_match(struct perf_event *event)
 {
 	struct pmu *pmu = event->pmu;
 	return pmu->filter_match ? pmu->filter_match(event) : 1;
 }
 
+/*
+ * Check whether we should attempt to schedule an event group based on
+ * PMU-specific filtering. An event group can consist of HW and SW events,
+ * potentially with a SW leader, so we must check all the filters to determine
+ * whether a group is schedulable.
+ */
+static inline int pmu_filter_match(struct perf_event *event)
+{
+	struct perf_event *child;
+
+	if (!__pmu_filter_match(event))
+		return 0;
+
+	list_for_each_entry(child, &event->sibling_list, group_entry)
+		if (!__pmu_filter_match(child))
+			return 0;
+
+	return 1;
+}
+
 static inline int
 event_filter_match(struct perf_event *event)
 {