
[v2] sched/fair: fix 1 task per CPU

Message ID 1536590589-437-1-git-send-email-vincent.guittot@linaro.org
State New

Commit Message

Vincent Guittot Sept. 10, 2018, 2:43 p.m. UTC
When CPUs have different capacities because of RT/DL tasks or
micro-architecture or max frequency differences, there are situations where
the imbalance is not correctly set to migrate a waiting task onto the idle CPU.

The use case hits the force_balance case:
	if (env->idle != CPU_NOT_IDLE && group_has_capacity(env, local) &&
	    busiest->group_no_capacity)
		goto force_balance;

But calculate_imbalance fails to set the right amount of load to migrate
a task because of the special condition:
  busiest->avg_load <= sds->avg_load || local->avg_load >= sds->avg_load)

Add to fix_small_imbalance() this special case that triggered the force
balance, in order to make sure that the amount of load to migrate will be
enough.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

---
 kernel/sched/fair.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

-- 
2.7.4
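
To see why avg_load comparisons go wrong here, consider a toy computation
(standalone C, illustrative numbers only, shaped like a big.LITTLE system
with one idle LITTLE CPU; the avg_load formula mirrors the v4.18-era fair.c):

/*
 * Toy illustration (not kernel code) of the special condition in
 * calculate_imbalance() firing on an asymmetric system. The capacity
 * and load numbers are made up; the formula
 * avg_load = load * SCHED_CAPACITY_SCALE / capacity follows the
 * v4.18-era scheduler.
 */
#include <stdio.h>

#define SCHED_CAPACITY_SCALE 1024

int main(void)
{
	/* Busiest group: 4 big CPUs, 5 fully-running tasks. */
	unsigned long busiest_capacity = 4 * 1024;
	unsigned long busiest_load     = 5 * 1024;
	/* Local group: 4 LITTLE CPUs, 3 tasks, one CPU idle. */
	unsigned long local_capacity   = 4 * 462;
	unsigned long local_load       = 3 * 1024;

	unsigned long busiest_avg = busiest_load * SCHED_CAPACITY_SCALE / busiest_capacity;
	unsigned long local_avg   = local_load * SCHED_CAPACITY_SCALE / local_capacity;
	unsigned long sds_avg     = (busiest_load + local_load) * SCHED_CAPACITY_SCALE
				    / (busiest_capacity + local_capacity);

	printf("busiest %lu, local %lu, domain %lu\n",
	       busiest_avg, local_avg, sds_avg);	/* 1280, 1702, 1411 */

	/* The special condition from calculate_imbalance(): */
	if (busiest_avg <= sds_avg || local_avg >= sds_avg)
		printf("imbalance zeroed -> fall through to fix_small_imbalance()\n");

	return 0;
}

With these numbers both halves of the condition hold: scaled per capacity,
the LITTLE group looks more loaded than the bigs even though one of its CPUs
sits idle, so calculate_imbalance() zeroes the imbalance.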

Comments

Valentin Schneider Sept. 14, 2018, 3:22 a.m. UTC | #1
Hi,

On 10/09/18 07:43, Vincent Guittot wrote:
> When CPUs have different capacity because of RT/DL tasks or
> micro-architecture or max frequency differences, there are situation where
> the imbalance is not correctly set to migrate waiting task on the idle CPU.
> 
> The UC uses the force_balance case :
> 	if (env->idle != CPU_NOT_IDLE && group_has_capacity(env, local) &&
> 	    busiest->group_no_capacity)
> 		goto force_balance;
> 
> But calculate_imbalance fails to set the right amount of load to migrate
> a task because of the special condition:
>   busiest->avg_load <= sds->avg_load || local->avg_load >= sds->avg_load)
> 
> Add in fix_small_imbalance, this special case that triggered the force
> balance in order to make sure that the amount of load to migrate will be
> enough.
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>


Other than the commit nit, LGTM. Out of curiosity I ran some kernel compile
on my HiKey960 (-j8) but didn't see much change - something along the lines 
of ~1% speedup, and although it was consistent over a few iterations, I'd
need a whole lot more of them to back this up. 

I kind of expected it because some sporadic task can show up and tip the
scale in the right direction, so even without the patch the situation can
"fix itself" eventually, and it becomes less noticeable on really long
workloads.

I do see a difference by looking at the trace of a simple 8 100% tasks rt-app
workload though, as I no longer see that idling LITTLE I sometimes get
without the patch, which is what we expect, so:

Tested-by: Valentin Schneider <valentin.schneider@arm.com>


> ---

Again, I'd argue for a slightly more explicit header. As you pointed out in
v1, it's not just long running tasks, so maybe just "fix 1 *running* task per
CPU"? Otherwise I feel it's a tad obscure.

>  kernel/sched/fair.c | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 309c93f..72bc5e8 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8048,6 +8048,20 @@ void fix_small_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
>  	local = &sds->local_stat;
>  	busiest = &sds->busiest_stat;
>  
> +	/*
> +	 * There is available capacity in local group and busiest group is
> +	 * overloaded but calculate_imbalance can't compute the amount of load
> +	 * to migrate because load_avg became meaningless due to asymetric
> +	 * capacity between groups.

Could you add something along the lines of "(see similar condition in
find_busiest_group())"?

> +	 * In such case, we only want to migrate at
> +	 * least one tasks of the busiest group and rely of the average load
> +	 * per task to ensure the migration.
> +	 */
> +	if (env->idle != CPU_NOT_IDLE && group_has_capacity(env, local) &&
> +	    busiest->group_no_capacity) {
> +		env->imbalance = busiest->load_per_task;
> +		return;
> +	}
> +
>  	if (!local->sum_nr_running)
>  		local->load_per_task = cpu_avg_load_per_task(env->dst_cpu);
>  	else if (busiest->load_per_task > local->load_per_task)
>
Vincent Guittot Sept. 14, 2018, 4:26 p.m. UTC | #2
On Fri, 14 Sep 2018 at 05:22, Valentin Schneider
<valentin.schneider@arm.com> wrote:
>
> Hi,
>
> On 10/09/18 07:43, Vincent Guittot wrote:
> > When CPUs have different capacity because of RT/DL tasks or
> > micro-architecture or max frequency differences, there are situation where
> > the imbalance is not correctly set to migrate waiting task on the idle CPU.
> >
> > The UC uses the force_balance case :
> >       if (env->idle != CPU_NOT_IDLE && group_has_capacity(env, local) &&
> >           busiest->group_no_capacity)
> >               goto force_balance;
> >
> > But calculate_imbalance fails to set the right amount of load to migrate
> > a task because of the special condition:
> >   busiest->avg_load <= sds->avg_load || local->avg_load >= sds->avg_load)
> >
> > Add in fix_small_imbalance, this special case that triggered the force
> > balance in order to make sure that the amount of load to migrate will be
> > enough.
> >
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>
> Other than the commit nit, LGTM. Out of curiosity I ran some kernel compile
> on my HiKey960 (-j8) but didn't see much change - something along the lines
> of ~1% speedup, and although it was consistent over a few iterations, I'd
> need a whole lot more of them to back this up.
>
> I kind of expected it because some sporadic task can show up and tip the
> scale in the right direction, so even without the patch the situation can
> "fix itself" eventually, and it becomes less noticeable on really long
> workloads.


I have seen a better stdev and shorter duration for the tests that you
used for the misfit patch.
The tests have been done with asym packing and the few fixes that I
sent in another patchset for asym packing.

>
> I do see a difference by looking at the trace of a simple 8 100% tasks rt-app
> workload though, as I no longer see that idling LITTLE I sometimes get
> without the patch, which is what we expect, so:
>
> Tested-by: Valentin Schneider <valentin.schneider@arm.com>


Thanks

>
> > ---
>
> Again, I'd argue for a slightly more explicit header. As you pointed out in
> v1, it's not just long running tasks, so maybe just "fix 1 *running* task per
> CPU"? Otherwise I feel it's a tad obscure.


To be honest I don't mind about the header, but I don't see the benefit of
adding *running*.
So I'll let Peter or Ingo decide what they prefer.

>
> >  kernel/sched/fair.c | 14 ++++++++++++++
> >  1 file changed, 14 insertions(+)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 309c93f..72bc5e8 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -8048,6 +8048,20 @@ void fix_small_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
> >       local = &sds->local_stat;
> >       busiest = &sds->busiest_stat;
> >
> > +     /*
> > +      * There is available capacity in local group and busiest group is
> > +      * overloaded but calculate_imbalance can't compute the amount of load
> > +      * to migrate because load_avg became meaningless due to asymetric
> > +      * capacity between groups.
>
> Could you add something along the lines of "(see similar condition in
> find_busiest_group())"?
>
> > +      * In such case, we only want to migrate at
> > +      * least one tasks of the busiest group and rely of the average load
> > +      * per task to ensure the migration.
> > +      */
> > +     if (env->idle != CPU_NOT_IDLE && group_has_capacity(env, local) &&
> > +         busiest->group_no_capacity) {
> > +             env->imbalance = busiest->load_per_task;
> > +             return;
> > +     }
> > +
> >       if (!local->sum_nr_running)
> >               local->load_per_task = cpu_avg_load_per_task(env->dst_cpu);
> >       else if (busiest->load_per_task > local->load_per_task)
> >
Vincent Guittot Dec. 13, 2018, 8:05 a.m. UTC | #3
Hi Peter, Ingo,

On Fri, 14 Sep 2018 at 18:26, Vincent Guittot
<vincent.guittot@linaro.org> wrote:
>
> On Fri, 14 Sep 2018 at 05:22, Valentin Schneider
> <valentin.schneider@arm.com> wrote:
> >
> > Hi,
> >
> > On 10/09/18 07:43, Vincent Guittot wrote:
> > > When CPUs have different capacity because of RT/DL tasks or
> > > micro-architecture or max frequency differences, there are situation where
> > > the imbalance is not correctly set to migrate waiting task on the idle CPU.
> > >
> > > The UC uses the force_balance case :
> > >       if (env->idle != CPU_NOT_IDLE && group_has_capacity(env, local) &&
> > >           busiest->group_no_capacity)
> > >               goto force_balance;
> > >
> > > But calculate_imbalance fails to set the right amount of load to migrate
> > > a task because of the special condition:
> > >   busiest->avg_load <= sds->avg_load || local->avg_load >= sds->avg_load)
> > >
> > > Add in fix_small_imbalance, this special case that triggered the force
> > > balance in order to make sure that the amount of load to migrate will be
> > > enough.
> > >
> > > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> >
> > Other than the commit nit, LGTM. Out of curiosity I ran some kernel compile
> > on my HiKey960 (-j8) but didn't see much change - something along the lines
> > of ~1% speedup, and although it was consistent over a few iterations, I'd
> > need a whole lot more of them to back this up.
> >
> > I kind of expected it because some sporadic task can show up and tip the
> > scale in the right direction, so even without the patch the situation can
> > "fix itself" eventually, and it becomes less noticeable on really long
> > workloads.
>
> I have seen a better stdev and shorter duration for the tests that you
> used for misfit patch.
> The test have been done with asym packing and the few fixes that I
> sent in another patchset for asym packing
>
> >
> > I do see a difference by looking at the trace of a simple 8 100% tasks rt-app
> > workload though, as I no longer see that idling LITTLE I sometimes get
> > without the patch, which is what we expect, so:
> >
> > Tested-by: Valentin Schneider <valentin.schneider@arm.com>
>
> Thanks
>
> >
> > > ---
> >
> > Again, I'd argue for a slightly more explicit header. As you pointed out in
> > v1, it's not just long running tasks, so maybe just "fix 1 *running* task per
> > CPU"? Otherwise I feel it's a tad obscure.
>
> To be honest i don't mind about header but I don't see the benefit of
> adding *running*.
> So I let Peter or Ingo decide what they prefer


This patch has been on the list for a while.
What should I do to move forward with it?

Thanks
Vincent

>
> >
> > >  kernel/sched/fair.c | 14 ++++++++++++++
> > >  1 file changed, 14 insertions(+)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 309c93f..72bc5e8 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -8048,6 +8048,20 @@ void fix_small_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
> > >       local = &sds->local_stat;
> > >       busiest = &sds->busiest_stat;
> > >
> > > +     /*
> > > +      * There is available capacity in local group and busiest group is
> > > +      * overloaded but calculate_imbalance can't compute the amount of load
> > > +      * to migrate because load_avg became meaningless due to asymetric
> > > +      * capacity between groups.
> >
> > Could you add something along the lines of "(see similar condition in
> > find_busiest_group())"?
> >
> > > +      * In such case, we only want to migrate at
> > > +      * least one tasks of the busiest group and rely of the average load
> > > +      * per task to ensure the migration.
> > > +      */
> > > +     if (env->idle != CPU_NOT_IDLE && group_has_capacity(env, local) &&
> > > +         busiest->group_no_capacity) {
> > > +             env->imbalance = busiest->load_per_task;
> > > +             return;
> > > +     }
> > > +
> > >       if (!local->sum_nr_running)
> > >               local->load_per_task = cpu_avg_load_per_task(env->dst_cpu);
> > >       else if (busiest->load_per_task > local->load_per_task)
> > >
Peter Zijlstra Dec. 13, 2018, 10:44 a.m. UTC | #4
On Mon, Sep 10, 2018 at 04:43:09PM +0200, Vincent Guittot wrote:
> When CPUs have different capacity because of RT/DL tasks or
> micro-architecture or max frequency differences, there are situation where
> the imbalance is not correctly set to migrate waiting task on the idle CPU.
> 
> The UC uses the force_balance case:
>
> 	if (env->idle != CPU_NOT_IDLE && group_has_capacity(env, local) &&
> 	    busiest->group_no_capacity)
> 		goto force_balance;
> 
> But calculate_imbalance fails to set the right amount of load to migrate
> a task because of the special condition:
>
>   busiest->avg_load <= sds->avg_load || local->avg_load >= sds->avg_load)
> 
> Add in fix_small_imbalance, this special case that triggered the force
> balance in order to make sure that the amount of load to migrate will be
> enough.


So I think this patch is going in the wrong direction for a number of
reasons:

 - we'd like to get rid of fix_small_imbalance(), and this adds to it;

 - the whole load_per_task stuff is terminally broken, it _cannot_ work
   right.


What I've suggested in the past is parameterizing the load balancer and
picking different criteria to balance on:

 - nr_running ; if there are idle CPUs around
 - utilization ; if there's idle time (u<1)
 - weight

And I suppose you're hitting one of the nr_running cases here.
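
As a rough illustration of that suggestion (every name and struct field
below is hypothetical, not the actual fair.c API of any kernel version),
the criterion selection might be shaped like this:

/* Hypothetical sketch of a parameterized balance criterion, following
 * the three cases listed above. Purely illustrative.
 */
enum balance_type {
	BALANCE_NR_RUNNING,	/* idle CPUs around: equalize task counts */
	BALANCE_UTIL,		/* idle time (u < 1): equalize utilization */
	BALANCE_WEIGHT,		/* fully loaded: equalize weighted load */
};

struct balance_stats {		/* stand-in for sd_lb_stats */
	int has_idle_cpus;
	unsigned long total_util;
	unsigned long total_capacity;
};

static enum balance_type pick_balance_type(const struct balance_stats *sds)
{
	if (sds->has_idle_cpus)
		return BALANCE_NR_RUNNING;
	if (sds->total_util < sds->total_capacity)
		return BALANCE_UTIL;
	return BALANCE_WEIGHT;
}

The case this patch targets (an idle CPU with a task waiting elsewhere)
would fall under the nr_running criterion, where no load arithmetic is
needed at all.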

> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>  kernel/sched/fair.c | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 309c93f..72bc5e8 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8048,6 +8048,20 @@ void fix_small_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
>  	local = &sds->local_stat;
>  	busiest = &sds->busiest_stat;
>  
> +	/*
> +	 * There is available capacity in local group and busiest group is
> +	 * overloaded but calculate_imbalance can't compute the amount of load
> +	 * to migrate because load_avg became meaningless due to asymetric
> +	 * capacity between groups. In such case, we only want to migrate at
> +	 * least one tasks of the busiest group and rely of the average load
> +	 * per task to ensure the migration.
> +	 */
> +	if (env->idle != CPU_NOT_IDLE && group_has_capacity(env, local) &&
> +	    busiest->group_no_capacity) {
> +		env->imbalance = busiest->load_per_task;
> +		return;
> +	}
> +
>  	if (!local->sum_nr_running)
>  		local->load_per_task = cpu_avg_load_per_task(env->dst_cpu);
>  	else if (busiest->load_per_task > local->load_per_task)
> -- 
> 2.7.4
>
Vincent Guittot Dec. 13, 2018, 11:04 a.m. UTC | #5
On Thu, 13 Dec 2018 at 11:44, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Sep 10, 2018 at 04:43:09PM +0200, Vincent Guittot wrote:
> > When CPUs have different capacity because of RT/DL tasks or
> > micro-architecture or max frequency differences, there are situation where
> > the imbalance is not correctly set to migrate waiting task on the idle CPU.
> >
> > The UC uses the force_balance case:
> >
> >       if (env->idle != CPU_NOT_IDLE && group_has_capacity(env, local) &&
> >           busiest->group_no_capacity)
> >               goto force_balance;
> >
> > But calculate_imbalance fails to set the right amount of load to migrate
> > a task because of the special condition:
> >
> >   busiest->avg_load <= sds->avg_load || local->avg_load >= sds->avg_load)
> >
> > Add in fix_small_imbalance, this special case that triggered the force
> > balance in order to make sure that the amount of load to migrate will be
> > enough.
>
> So I think this patch is going in the wrong direction for a number of
> reasons:
>
>  - we'd like to get rid of fix_small_imbalance(), and this adds to it;
>
>  - the whole load_per_task stuff is terminally broken, it _cannot_ work
>    right.
>
>
> What I've suggested in the past is parameterizing the load balancer and
> picking different criteria to balance on:


This patch is clearly a fix of the current implementation.
What you suggest below makes sense but implies a significant rework of
calculate_imbalance and the load balancer in general, and will need
more time to reach a stable state.
Nevertheless, I will have a look at that.

I imagine that your feedback for https://lkml.org/lkml/2018/10/2/283
will be the same?

Vincent

>
>  - nr_running ; if there are idle CPUs around
>  - utilization ; if there's idle time (u<1)
>  - weight
>
> And I suppose you're hitting one of the nr_running cases here.

>
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > ---
> >  kernel/sched/fair.c | 14 ++++++++++++++
> >  1 file changed, 14 insertions(+)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 309c93f..72bc5e8 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -8048,6 +8048,20 @@ void fix_small_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
> >       local = &sds->local_stat;
> >       busiest = &sds->busiest_stat;
> >
> > +     /*
> > +      * There is available capacity in local group and busiest group is
> > +      * overloaded but calculate_imbalance can't compute the amount of load
> > +      * to migrate because load_avg became meaningless due to asymetric
> > +      * capacity between groups. In such case, we only want to migrate at
> > +      * least one tasks of the busiest group and rely of the average load
> > +      * per task to ensure the migration.
> > +      */
> > +     if (env->idle != CPU_NOT_IDLE && group_has_capacity(env, local) &&
> > +         busiest->group_no_capacity) {
> > +             env->imbalance = busiest->load_per_task;
> > +             return;
> > +     }
> > +
> >       if (!local->sum_nr_running)
> >               local->load_per_task = cpu_avg_load_per_task(env->dst_cpu);
> >       else if (busiest->load_per_task > local->load_per_task)
> > --
> > 2.7.4
> >
Peter Zijlstra Dec. 13, 2018, 1:12 p.m. UTC | #6
On Thu, Dec 13, 2018 at 12:04:20PM +0100, Vincent Guittot wrote:
> On Thu, 13 Dec 2018 at 11:44, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Mon, Sep 10, 2018 at 04:43:09PM +0200, Vincent Guittot wrote:
> > > When CPUs have different capacity because of RT/DL tasks or
> > > micro-architecture or max frequency differences, there are situation where
> > > the imbalance is not correctly set to migrate waiting task on the idle CPU.
> > >
> > > The UC uses the force_balance case:
> > >
> > >       if (env->idle != CPU_NOT_IDLE && group_has_capacity(env, local) &&
> > >           busiest->group_no_capacity)
> > >               goto force_balance;
> > >
> > > But calculate_imbalance fails to set the right amount of load to migrate
> > > a task because of the special condition:
> > >
> > >   busiest->avg_load <= sds->avg_load || local->avg_load >= sds->avg_load)
> > >
> > > Add in fix_small_imbalance, this special case that triggered the force
> > > balance in order to make sure that the amount of load to migrate will be
> > > enough.
> >
> > So I think this patch is going in the wrong direction for a number of
> > reasons:
> >
> >  - we'd like to get rid of fix_small_imbalance(), and this adds to it;
> >
> >  - the whole load_per_task stuff is terminally broken, it _cannot_ work
> >    right.
> >
> >
> > What I've suggested in the past is parameterizing the load balancer and
> > picking different criteria to balance on:
> 
> This patch is clearly a fix of the current implementation.
> What you suggest below makes sense but implies a significant rework in
> the calculate_imbalance and the load_balancer in general and will need
> more time to reach a stable state.
> Nevertheless, I will have a look at that
> 
> I imagine that your feedback for https://lkml.org/lkml/2018/10/2/283
> will be the same ?


No; those actually look ok. It is mostly that I really don't think
load_per_task makes any kind of sense.

It sorta works when all tasks are of the same weight, but if you start
using nice -- or way worse, cgroups -- then the number is complete
bollocks.
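
To make that concrete with real weight-table values (in the kernel's
sched_prio_to_weight[], nice 0 weighs 1024 and nice 19 weighs 15), here is
a toy computation, not kernel code:

/*
 * Toy illustration: load_per_task is the group load divided by the task
 * count, which matches no actual task once weights differ.
 */
#include <stdio.h>

int main(void)
{
	unsigned long weights[] = { 1024, 15 };	/* one nice 0, one nice 19 */
	unsigned long sum = weights[0] + weights[1];
	unsigned long load_per_task = sum / 2;

	/* Prints 519: ~35x the nice 19 task, half the nice 0 task.
	 * An imbalance budget of 519 fits neither task cleanly.
	 */
	printf("load_per_task = %lu\n", load_per_task);
	return 0;
}

Cgroup hierarchies make this worse, since a group entity's weight can be
split arbitrarily thinly across its tasks.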
Vincent Guittot Dec. 13, 2018, 3:14 p.m. UTC | #7
On Thu, 13 Dec 2018 at 14:12, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Dec 13, 2018 at 12:04:20PM +0100, Vincent Guittot wrote:
> > On Thu, 13 Dec 2018 at 11:44, Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > On Mon, Sep 10, 2018 at 04:43:09PM +0200, Vincent Guittot wrote:
> > > > When CPUs have different capacity because of RT/DL tasks or
> > > > micro-architecture or max frequency differences, there are situation where
> > > > the imbalance is not correctly set to migrate waiting task on the idle CPU.
> > > >
> > > > The UC uses the force_balance case:
> > > >
> > > >       if (env->idle != CPU_NOT_IDLE && group_has_capacity(env, local) &&
> > > >           busiest->group_no_capacity)
> > > >               goto force_balance;
> > > >
> > > > But calculate_imbalance fails to set the right amount of load to migrate
> > > > a task because of the special condition:
> > > >
> > > >   busiest->avg_load <= sds->avg_load || local->avg_load >= sds->avg_load)
> > > >
> > > > Add in fix_small_imbalance, this special case that triggered the force
> > > > balance in order to make sure that the amount of load to migrate will be
> > > > enough.
> > >
> > > So I think this patch is going in the wrong direction for a number of
> > > reasons:
> > >
> > >  - we'd like to get rid of fix_small_imbalance(), and this adds to it;
> > >
> > >  - the whole load_per_task stuff is terminally broken, it _cannot_ work
> > >    right.
> > >
> > >
> > > What I've suggested in the past is parameterizing the load balancer and
> > > picking different criteria to balance on:
> >
> > This patch is clearly a fix of the current implementation.
> > What you suggest below makes sense but implies a significant rework in
> > the calculate_imbalance and the load_balancer in general and will need
> > more time to reach a stable state.
> > Nevertheless, I will have a look at that
> >
> > I imagine that your feedback for https://lkml.org/lkml/2018/10/2/283
> > will be the same ?
>
> No; those actually look ok. It is mostly that I really don't think
> load_per_task makes any kind of sense.
>
> It sorta works when all tasks are of the same weight, but if you start
> using nice -- or way worse, cgroups -- then the number is complete
> bollocks.


Yes. As soon as we go beyond a simple load-balance use case, load_per_task
becomes a kind of default random choice.

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 309c93f..72bc5e8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8048,6 +8048,20 @@  void fix_small_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
 	local = &sds->local_stat;
 	busiest = &sds->busiest_stat;
 
+	/*
+	 * There is available capacity in local group and busiest group is
+	 * overloaded but calculate_imbalance can't compute the amount of load
+	 * to migrate because load_avg became meaningless due to asymetric
+	 * capacity between groups. In such case, we only want to migrate at
+	 * least one tasks of the busiest group and rely of the average load
+	 * per task to ensure the migration.
+	 */
+	if (env->idle != CPU_NOT_IDLE && group_has_capacity(env, local) &&
+	    busiest->group_no_capacity) {
+		env->imbalance = busiest->load_per_task;
+		return;
+	}
+
 	if (!local->sum_nr_running)
 		local->load_per_task = cpu_avg_load_per_task(env->dst_cpu);
 	else if (busiest->load_per_task > local->load_per_task)
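
For context on why setting env->imbalance to busiest->load_per_task is
enough to move one task: in the detach_tasks() logic of that era, a task is
skipped only when half its load exceeds the remaining imbalance budget. A
toy model of that budget check (simplified, not kernel code):

/*
 * Toy model of the detach_tasks() budget logic: a task is migrated while
 * half its load fits within the remaining imbalance. With
 * imbalance = load_per_task, at least one task of average load always
 * fits, since load/2 <= load_per_task.
 */
#include <stdio.h>

int main(void)
{
	unsigned long task_load[] = { 1024, 1024, 1024 };
	long imbalance = 1024;	/* what the patch sets: load_per_task */
	int i, detached = 0;

	for (i = 0; i < 3; i++) {
		if ((long)(task_load[i] / 2) > imbalance)
			continue;	/* task too heavy for the budget */
		imbalance -= task_load[i];
		detached++;
	}
	printf("detached %d task(s)\n", detached);	/* prints 1 */
	return 0;
}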