
[09/16] sched/fair: Let asymmetric cpu configurations balance at wake-up

Message ID 1464001138-25063-10-git-send-email-morten.rasmussen@arm.com
State New

Commit Message

Morten Rasmussen May 23, 2016, 10:58 a.m. UTC
Currently, SD_WAKE_AFFINE always takes priority over wakeup balancing if
SD_BALANCE_WAKE is set on the sched_domains. For asymmetric
configurations SD_WAKE_AFFINE is only desirable if the waking task's
compute demand (utilization) is suitable for the cpu capacities
available within the SD_WAKE_AFFINE sched_domain. If not, let wakeup
balancing take over (find_idlest_{group, cpu}()).

The assumption is that SD_WAKE_AFFINE is never set for a sched_domain
containing cpus with different capacities. This is enforced by a
previous patch based on the SD_ASYM_CPUCAPACITY flag.

Ideally, we shouldn't set 'want_affine' in the first place, but we don't
know if SD_BALANCE_WAKE is enabled on the sched_domain(s) until we start
traversing them.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>

---
 kernel/sched/fair.c | 28 +++++++++++++++++++++++++++-
 1 file changed, 27 insertions(+), 1 deletion(-)

-- 
1.9.1

Comments

Morten Rasmussen May 24, 2016, 8:10 a.m. UTC | #1
On Tue, May 24, 2016 at 08:04:24AM +0800, Yuyang Du wrote:
> On Mon, May 23, 2016 at 11:58:51AM +0100, Morten Rasmussen wrote:
> 
> [...]
> 
> > +static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
> > +{
> > +	long delta;
> > +	long prev_cap = capacity_of(prev_cpu);
> > +
> > +	delta = cpu_rq(cpu)->rd->max_cpu_capacity - prev_cap;
> > +
> > +	/* prev_cpu is fairly close to max, no need to abort wake_affine */
> > +	if (delta < prev_cap >> 3)
> > +		return 0;
> 
> delta can be negative? still return 0?

I could add an abs() around delta.

Do you have a specific scenario in mind? Under normal circumstances, I
don't think it can be negative?
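To make Yuyang's question concrete, here is a standalone sketch of the bail-out check with plain integers (the values and the helper name are hypothetical; the real code reads rd->max_cpu_capacity and capacity_of(prev_cpu)). A negative delta also satisfies `delta < prev_cap >> 3`, so wake_cap() would still return 0, i.e. wake_affine is kept:

```c
#include <assert.h>

/*
 * Standalone model of wake_cap()'s first condition.
 * Returns 1 when the function would bail out early (keep wake_affine).
 */
int bails_out(long max_cpu_capacity, long prev_cap)
{
	long delta = max_cpu_capacity - prev_cap;

	/*
	 * A negative delta (prev_cap momentarily above the recorded max,
	 * e.g. due to capacity_of() fluctuations) is trivially below
	 * prev_cap >> 3, so the check still returns 0 in wake_cap().
	 */
	return delta < prev_cap >> 3;
}
```

Note that `>>` binds tighter than `<` in C, so the comparison reads `delta < (prev_cap >> 3)` as intended.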
Morten Rasmussen May 25, 2016, 9:49 a.m. UTC | #2
On Wed, May 25, 2016 at 02:57:00PM +0800, Wanpeng Li wrote:
> 2016-05-23 18:58 GMT+08:00 Morten Rasmussen <morten.rasmussen@arm.com>:
> 
> [...]
> 
> > +static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
> > +{
> > +       long delta;
> > +       long prev_cap = capacity_of(prev_cpu);
> > +
> > +       delta = cpu_rq(cpu)->rd->max_cpu_capacity - prev_cap;
> > +
> > +       /* prev_cpu is fairly close to max, no need to abort wake_affine */
> > +       if (delta < prev_cap >> 3)
> > +               return 0;
> > +
> > +       return prev_cap * 1024 < task_util(p) * capacity_margin;
> > +}
> 
> If one task util_avg is SCHED_CAPACITY_SCALE and running on x86 box w/
> SMT enabled, then each HT has capacity 589, wake_cap() will result in
> always not wake affine, right?

The idea is that SMT systems would bail out already at the previous
condition. We should have max_cpu_capacity == prev_cap == 589, so delta
should be zero, making the first condition true and wake_cap() always
return 0 on any system with symmetric capacities, regardless of the
actual capacity values.

Note that this isn't entirely true as I used capacity_of() for prev_cap;
if I change that to capacity_orig_of() it should be true.

By making the !wake_cap() condition always true for want_affine, we
should preserve existing behaviour for SMT/SMP. The only overhead is the
capacity delta computation and comparison, which should be cheap.

Does that make sense?

Btw, task util_avg == SCHED_CAPACITY_SCALE should only be possible
temporarily; it should decay to util_avg <=
capacity_orig_of(task_cpu(p)) over time. That doesn't affect your
question though, as the second condition would still evaluate true if
util_avg == capacity_orig_of(task_cpu(p)), but as said above the first
condition should bail out before we get here.

Morten
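The margin comparison discussed above can be read as "the task must fit within roughly 80% of the previous cpu's capacity". A standalone sketch of just that arithmetic (hypothetical helper name and capacity figures, not kernel code):

```c
#include <assert.h>

/*
 * The capacity_margin comparison from the patch, isolated:
 * cap * 1024 < util * 1280 is equivalent to util > cap * 0.8,
 * i.e. the task's utilization exceeds ~80% of the capacity.
 */
int capacity_insufficient(long cap, long util)
{
	const long capacity_margin = 1280;	/* ~20%, as in the patch */

	return cap * 1024 < util * capacity_margin;
}
```

With a hypothetical little-cpu capacity of 430, a task with util 400 does not fit (400 > 344 = 430 * 0.8), while a task with util exactly 344 still does.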
Morten Rasmussen May 25, 2016, 10:54 a.m. UTC | #3
On Wed, May 25, 2016 at 06:29:33PM +0800, Wanpeng Li wrote:
> 2016-05-25 17:49 GMT+08:00 Morten Rasmussen <morten.rasmussen@arm.com>:
> > On Wed, May 25, 2016 at 02:57:00PM +0800, Wanpeng Li wrote:
> 
> [...]
> 
> > > Does that make sense?
> 
> Fair enough, thanks for your explanation.
> 
> > > Btw, task util_avg == SCHED_CAPACITY_SCALE should only be possible
> > > temporarily, it should decay to util_avg <=
> > > capacity_orig_of(task_cpu(p)) over time.
> 
> Sorry, I didn't find it will decay to capacity_orig in
> __update_load_avg(), could you elaborate?

I should have checked the code before writing that :-( I thought the
scaling by arch_scale_cpu_capacity() in __update_load_avg() would do
that, but it turns out that the default implementation of
arch_scale_cpu_capacity() doesn't do that when we pass a NULL pointer
for the sched_domain; it would have returned smt_gain/span_weight ==
capacity_orig_of(cpu) otherwise.

Sorry for the confusion, though I'm not sure if it is right to return
SCHED_CAPACITY_SCALE for SMT systems.
Morten Rasmussen June 8, 2016, 11:29 a.m. UTC | #4
On Thu, Jun 02, 2016 at 04:21:05PM +0200, Peter Zijlstra wrote:
> On Mon, May 23, 2016 at 11:58:51AM +0100, Morten Rasmussen wrote:
> 
> [...]
> 
> > Ideally, we shouldn't set 'want_affine' in the first place, but we don't
> > know if SD_BALANCE_WAKE is enabled on the sched_domain(s) until we start
> > traversing them.
> 
> I'm a bit confused...
> 
> Lets assume a 2+2 big.little thing with shared LLC:
> 
> 	---------- SD2 ----------
> 
> 	-- SD1 --	-- SD1 --
> 
> 	0	1	2	3
> 
> SD1: WAKE_AFFINE, BALANCE_WAKE
> SD2: ASYM_CAPACITY, BALANCE_WAKE
> 
> t0 used to run on cpu1, t1 used to run on cpu2
> 
> cpu0 wakes t0:
> 
>   want_affine = 1
>   SD1:
>     WAKE_AFFINE
>       cpumask_test_cpu(prev_cpu, sd_mask) == true
>     affine_sd = SD1
>     break;
> 
>   affine_sd != NULL -> affine-wakeup
> 
> cpu0 wakes t1:
> 
>   want_affine = 1
>   SD1:
>     WAKE_AFFINE
>       cpumask_test_cpu(prev_cpu, sd_mask) == false
>   SD2:
>     BALANCE_WAKE
>       sd = SD2
> 
>   affine_sd == NULL, sd == SD2 -> find_idlest_*()
> 
> All without this patch...
> 
> So what is this thing doing?

Not very much in those cases, but it makes one important difference in
one case. We could do fine without the patch if we could assume that all
tasks are already in the right SD according to their PELT utilization,
and if not, that they will be woken up by a cpu in the right SD (so we
do find_idlest_*()). But we can't :-(

Let's take your example above and add that t0 should really be running
on cpu2/3 due to its utilization, assuming SD1[01] are little cpus and
SD1[23] are big cpus. In that case we would still do affine-wakeup and
stick the task on cpu0 despite it being a little cpu.

To avoid that, this patch sets want_affine = 0 in that case so we go
find_idlest_*() to give the task a chance of being put on cpu2/3. The
patch also sets want_affine = 0 for other cases which already take the
find_idlest_*() route due to the cpumask test, as illustrated by your
example above.

We can have the following scenarios:

b = big cpu capacity/task util
l = little cpu capacity/task util
x = don't care

case	task util	prev_cpu	this_cpu	wakeup
-------------------------------------------------------------------
1	b		b		b		affine (b)
2	b		b		l		slow (b)
3	b		l		b		slow (b)
4	b		l		l		slow (b)
5	l		b		b		affine (x)
6	l		b		l		slow (x)
7	l		l		b		slow (x)
8	l		l		l		affine (x)

Without the patch we would do affine-wakeup on little in case 4, where
we want to wake up on a big cpu. We only do affine-wakeup when both
this_cpu and prev_cpu have the same capacity and that capacity is
sufficient.

Vincent pointed out that this is overly restrictive, as it is perfectly
safe to do affine-wakeup in cases 6 and 7, where the waker and the
previous cpu have sufficient capacity but not the same capacity.

If we made wake_affine() consider cpu capacity, it should be possible to
do affine-wakeup even for cases 2 and 3, leaving only case 4 requiring
the find_idlest_*() route.

There are more cases for taking the slow wakeup path if you have more
than two cpu capacities to deal with, but I'm going to spare you the
full detailed table ;-)
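The table above combines two mechanisms: wake_cap() itself (which only looks at prev_cpu's capacity versus the root-domain max) and the domain traversal (an affine wakeup additionally requires this_cpu and prev_cpu to share an SD_WAKE_AFFINE domain, i.e. the same cluster in the 2+2 example). A standalone two-cluster sketch of that combined decision, under the assumption of hypothetical big/little capacities of 1024/430 (not kernel code):

```c
#include <assert.h>

/*
 * Two-cluster model of the wakeup table: returns 1 for an affine
 * wakeup, 0 for the slow find_idlest_*() path. In this model, equal
 * capacity stands in for "same cluster / shared SD_WAKE_AFFINE domain".
 */
int affine_wakeup(long task_util, long prev_cap, long this_cap,
		  long max_cap)
{
	long delta = max_cap - prev_cap;

	/* Different clusters: no shared SD_WAKE_AFFINE domain -> slow. */
	if (prev_cap != this_cap)
		return 0;

	/*
	 * wake_cap(): abort affine when prev_cpu is far from max
	 * capacity and the task does not fit within the ~20% margin.
	 */
	if (delta >= prev_cap >> 3 && prev_cap * 1024 < task_util * 1280)
		return 0;

	return 1;
}
```

With util 900 standing in for "b" and 200 for "l", this reproduces the table: case 1 and case 8 come out affine, cases 2-4 come out slow.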

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 564215d..ce44fa7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -114,6 +114,12 @@  unsigned int __read_mostly sysctl_sched_shares_window = 10000000UL;
 unsigned int sysctl_sched_cfs_bandwidth_slice = 5000UL;
 #endif
 
+/*
+ * The margin used when comparing utilization with cpu capacity:
+ * util * 1024 < capacity * margin
+ */
+unsigned int capacity_margin = 1280; /* ~20% */
+
 static inline void update_load_add(struct load_weight *lw, unsigned long inc)
 {
 	lw->weight += inc;
@@ -5293,6 +5299,25 @@  static int cpu_util(int cpu)
 	return (util >= capacity) ? capacity : util;
 }
 
+static inline int task_util(struct task_struct *p)
+{
+	return p->se.avg.util_avg;
+}
+
+static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
+{
+	long delta;
+	long prev_cap = capacity_of(prev_cpu);
+
+	delta = cpu_rq(cpu)->rd->max_cpu_capacity - prev_cap;
+
+	/* prev_cpu is fairly close to max, no need to abort wake_affine */
+	if (delta < prev_cap >> 3)
+		return 0;
+
+	return prev_cap * 1024 < task_util(p) * capacity_margin;
+}
+
 /*
  * select_task_rq_fair: Select target runqueue for the waking task in domains
  * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
@@ -5316,7 +5341,8 @@  select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 
 	if (sd_flag & SD_BALANCE_WAKE) {
 		record_wakee(p);
-		want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
+		want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu)
+			      && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
 	}
 
 	rcu_read_lock();