
sched/fair: fix 1 task per CPU

Message ID 1536334803-21059-1-git-send-email-vincent.guittot@linaro.org
State Superseded
Series sched/fair: fix 1 task per CPU

Commit Message

Vincent Guittot Sept. 7, 2018, 3:40 p.m. UTC
When CPUs have different capacities because of RT/DL tasks,
micro-architecture or max frequency differences, there are situations
where the imbalance is not correctly set to migrate a waiting task to
the idle CPU.

The use case hits the force_balance case:
	if (env->idle != CPU_NOT_IDLE && group_has_capacity(env, local) &&
	    busiest->group_no_capacity)
		goto force_balance;

But calculate_imbalance fails to set the right amount of load to migrate
a task because of the special condition:
  busiest->avg_load <= sds->avg_load || local->avg_load >= sds->avg_load)

Add this special case, which triggered the force balance, to
fix_small_imbalance() in order to make sure that the amount of load to
migrate will be enough.
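
For context, the special condition above sits in calculate_imbalance()
roughly as follows (paraphrased from the fair.c of this era, not part
of this patch); it zeroes the imbalance and falls back to
fix_small_imbalance(), which is why the fix is placed there:

	if (busiest->avg_load <= sds->avg_load ||
	    local->avg_load >= sds->avg_load) {
		env->imbalance = 0;
		return fix_small_imbalance(env, sds);
	}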

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

---
 kernel/sched/fair.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

-- 
2.7.4

Comments

Valentin Schneider Sept. 8, 2018, 8:17 p.m. UTC | #1
Hi Vincent,

On 07/09/18 08:40, Vincent Guittot wrote:
> When CPUs have different capacity because of RT/DL tasks or
> micro-architecture or max frequency differences, there are situation where
> the imbalance is not correctly set to migrate waiting task on the idle CPU.
> 

This is essentially always for down migrations (high capacity src CPU to lower
capacity dst CPU), right?

For instance, I've often seen that issue on HiKey960 (4x4 big.LITTLE) where
if you spawn 8 CPU-hogs you can end up with 5 on the bigs and 3 on the
LITTLES, but since avg_load scales with the inverse of group_capacity it's
seen as balanced. I don't recall seeing this problem for low capacity ->
big capacity migration, especially with the misfit logic that makes this a 
non-problem.
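
For reference, the per-group average is computed roughly as

	sgs->avg_load = (sgs->group_load * SCHED_CAPACITY_SCALE) / sgs->group_capacity;

in update_sg_lb_stats() (paraphrased), so a group with a smaller
group_capacity reports a proportionally larger avg_load for the same
absolute load, and the big cluster carrying the extra task does not
necessarily stand out as busiest.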

Also, a nit on the commit header: although we've called it "1 task per CPU"
in the past, it might be more explicit to call it "1 long running task per
CPU", if just for the sake of clarity for people not directly involved in
our previous discussions.

> The UC uses the force_balance case :
> 	if (env->idle != CPU_NOT_IDLE && group_has_capacity(env, local) &&
> 	    busiest->group_no_capacity)
> 		goto force_balance;
> 
> But calculate_imbalance fails to set the right amount of load to migrate
> a task because of the special condition:
>   busiest->avg_load <= sds->avg_load || local->avg_load >= sds->avg_load)
> 
> Add in fix_small_imbalance, this special case that triggered the force
> balance in order to make sure that the amount of load to migrate will be
> enough.
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---

So we happen to have come up with a (relatively) similar patch for Android
(see [1]) which also stemmed from the 1 always running task per CPU issue. It
only applies to SD_ASYM_CPUCAPACITY sched domains though, and targets
down-migrations (local->group_capacity < busiest->group_capacity).
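
(Purely as an illustrative sketch of the shape of such a check, not the
actual Android change at [1]; the SD_ASYM_CPUCAPACITY and group-capacity
gating is as described above, the rest is hypothetical:)

	if ((env->sd->flags & SD_ASYM_CPUCAPACITY) &&
	    local->group_capacity < busiest->group_capacity &&
	    busiest->sum_nr_running > busiest->group_weight) {
		env->imbalance = busiest->load_per_task;
		return;
	}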

I'd argue that we should make it so this kind of imbalance gets detected way
before we hit fix_small_imbalance(), but until we get there I think we do want
an extra condition like this.

>  kernel/sched/fair.c | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 309c93f..57b4d83 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8048,6 +8048,20 @@ void fix_small_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
>  	local = &sds->local_stat;
>  	busiest = &sds->busiest_stat;
>  
> +	/*
> +	 * There is available capacity in local group and busiest group is
> +	 * overloaded but calculate_imbalance can't compute the amount of load
> +	 * to migrate because they became meaningless because asymetric

What do you mean by 'they'? 

Also, s/because/due to/ on the second "because".

> +	 * capacity between group. In such case, we only want to migrate at
> +	 * least one tasks of the busiest group and rely of the average load
> +	 * per task to ensure the migration.
> +	 */
> +	if (env->idle != CPU_NOT_IDLE && group_has_capacity(env, local) &&
> +	    busiest->group_no_capacity) {
> +		env->imbalance = busiest->load_per_task;
> +		return;
> +	}
> +

The condition is an exact copy of the one in that goto force_balance
path you mention (in find_busiest_group()), is that just a coincidence? If
not you might want to mention it in the comment.

>  	if (!local->sum_nr_running)
>  		local->load_per_task = cpu_avg_load_per_task(env->dst_cpu);
>  	else if (busiest->load_per_task > local->load_per_task)
> 

[1]: https://android.googlesource.com/kernel/common/+/9fae72e924961f3b32816193fcb56d19c1f644c2%5E%21/#F0
Vincent Guittot Sept. 10, 2018, 2:22 p.m. UTC | #2
On Sat, 8 Sep 2018 at 22:17, Valentin Schneider
<valentin.schneider@arm.com> wrote:
>
> Hi Vincent,
>
> On 07/09/18 08:40, Vincent Guittot wrote:
> > When CPUs have different capacity because of RT/DL tasks or
> > micro-architecture or max frequency differences, there are situation where
> > the imbalance is not correctly set to migrate waiting task on the idle CPU.
> >
>
> This is essentially always for down migrations (high capacity src CPU to lower
> capacity dst CPU), right?

At least for the UC that I have created

>
> For instance, I've often seen that issue on HiKey960 (4x4 big.LITTLE) where
> if you spawn 8 CPU-hogs you can end up with 5 on the bigs and 3 on the
> LITTLES, but since avg_load scales with the inverse of group_capacity it's
> seen as balanced. I don't recall seeing this problem for low capacity ->
> big capacity migration, especially with the misfit logic that makes this a
> non-problem.
>
> Also, a nit on the commit header: although we've called it "1 task per CPU"
> in the past, it might be more explicit to call it "1 long running task per
> CPU", if just for the sake of clarity for people not directly involved in
> our previous discussions.

In fact it's not only long running tasks; I can reproduce the problem
with 1 task with a load of 30%.

The problem comes from busiest->avg_load <= sds->avg_load ||
local->avg_load >= sds->avg_load
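
For reference, the domain-wide average used by that check is computed
roughly as

	sds.avg_load = (SCHED_CAPACITY_SCALE * sds.total_load) / sds.total_capacity;

in find_busiest_group() (paraphrased), so with asymmetric capacities an
overloaded busiest group can still end up at or below this average and
calculate_imbalance bails out to fix_small_imbalance() with a zero
imbalance.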

>
> > The UC uses the force_balance case :
> >       if (env->idle != CPU_NOT_IDLE && group_has_capacity(env, local) &&
> >           busiest->group_no_capacity)
> >               goto force_balance;
> >
> > But calculate_imbalance fails to set the right amount of load to migrate
> > a task because of the special condition:
> >   busiest->avg_load <= sds->avg_load || local->avg_load >= sds->avg_load)
> >
> > Add in fix_small_imbalance, this special case that triggered the force
> > balance in order to make sure that the amount of load to migrate will be
> > enough.
> >
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > ---
>
> So we happen to have come up with a (relatively) similar patch for Android
> (see [1]) which also stemmed from the 1 always running task per CPU issue. It
> only applies to SD_ASYM_CPUCAPACITY sched domains though, and targets
> down-migrations (local->group_capacity < busiest->group_capacity).
>
> I'd argue that we should make it so this kind of imbalance gets detected way
> before we hit fix_small_imbalance(), but until we get there I think we do want
> an extra condition like this.
>
> >  kernel/sched/fair.c | 14 ++++++++++++++
> >  1 file changed, 14 insertions(+)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 309c93f..57b4d83 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -8048,6 +8048,20 @@ void fix_small_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
> >       local = &sds->local_stat;
> >       busiest = &sds->busiest_stat;
> >
> > +     /*
> > +      * There is available capacity in local group and busiest group is
> > +      * overloaded but calculate_imbalance can't compute the amount of load
> > +      * to migrate because they became meaningless because asymetric
>
> What do you mean by 'they'?

The loads

>
> Also, s/because/due to/ on the second "because".

ok

>
> > +      * capacity between group. In such case, we only want to migrate at
> > +      * least one tasks of the busiest group and rely of the average load
> > +      * per task to ensure the migration.
> > +      */
> > +     if (env->idle != CPU_NOT_IDLE && group_has_capacity(env, local) &&
> > +         busiest->group_no_capacity) {
> > +             env->imbalance = busiest->load_per_task;
> > +             return;
> > +     }
> > +
>
> The condition is an exact copy of the one in that goto force_balance
> path you mention (in find_busiest_group()), is that just a coincidence? If

It's not a coincidence; it's to catch what has been missed by
calculate_imbalance.

> not you might want to mention it in the comment.
>
> >       if (!local->sum_nr_running)
> >               local->load_per_task = cpu_avg_load_per_task(env->dst_cpu);
> >       else if (busiest->load_per_task > local->load_per_task)
> >
>
> [1]: https://android.googlesource.com/kernel/common/+/9fae72e924961f3b32816193fcb56d19c1f644c2%5E%21/#F0

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 309c93f..57b4d83 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8048,6 +8048,20 @@  void fix_small_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
 	local = &sds->local_stat;
 	busiest = &sds->busiest_stat;
 
+	/*
+	 * There is available capacity in local group and busiest group is
+	 * overloaded but calculate_imbalance can't compute the amount of load
+	 * to migrate because they became meaningless because asymetric
+	 * capacity between group. In such case, we only want to migrate at
+	 * least one tasks of the busiest group and rely of the average load
+	 * per task to ensure the migration.
+	 */
+	if (env->idle != CPU_NOT_IDLE && group_has_capacity(env, local) &&
+	    busiest->group_no_capacity) {
+		env->imbalance = busiest->load_per_task;
+		return;
+	}
+
 	if (!local->sum_nr_running)
 		local->load_per_task = cpu_avg_load_per_task(env->dst_cpu);
 	else if (busiest->load_per_task > local->load_per_task)