
sched/fair: fix load_balance redo for null imbalance

Message ID 1536306664-29827-1-git-send-email-vincent.guittot@linaro.org
State New
Series sched/fair: fix load_balance redo for null imbalance

Commit Message

Vincent Guittot Sept. 7, 2018, 7:51 a.m. UTC
It can happen that load_balance finds a busiest group and then a busiest rq
but the calculated imbalance is in fact null.

In such a situation, detach_tasks returns immediately and leaves the
LBF_ALL_PINNED flag set. The busiest CPU is then wrongly assumed to have only
pinned tasks and is removed from the load balance mask. We then redo a load
balance without the busiest CPU, which creates a wrong load balance situation
and generates wrong task migrations.

If the calculated imbalance is null, it's useless to try to find a busiest
rq as no task will be migrated and we can return immediately.

This situation can happen on heterogeneous systems, or on SMP systems when RT
tasks reduce the available capacity of some CPUs.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

-- 
2.7.4

Comments

Peter Zijlstra Sept. 7, 2018, 11:37 a.m. UTC | #1
On Fri, Sep 07, 2018 at 09:51:04AM +0200, Vincent Guittot wrote:
> It can happen that load_balance finds a busiest group and then a busiest rq
> but the calculated imbalance is in fact null.

Cute. Does that happen often?

> If the calculated imbalance is null, it's useless to try to find a busiest
> rq as no task will be migrated and we can return immediately.
>
> This situation can happen with heterogeneous system or smp system when RT
> tasks are decreasing the capacity of some CPUs.

Is it the result of one of those "force_balance" conditions in
find_busiest_group() ? Should we not fix that to then return NULL
instead?
Vincent Guittot Sept. 7, 2018, 12:35 p.m. UTC | #2
Le Friday 07 Sep 2018 à 13:37:49 (+0200), Peter Zijlstra a écrit :
> On Fri, Sep 07, 2018 at 09:51:04AM +0200, Vincent Guittot wrote:
> > It can happen that load_balance finds a busiest group and then a busiest rq
> > but the calculated imbalance is in fact null.
>
> Cute. Does that happen often?

I have a use case with RT tasks that reproduces the problem regularly.
It happens at least when we have CPUs with different capacity, either because
of heterogeneous CPUs or because RT/DL reduce the available capacity for cfs.
I have put the call path that triggers the problem below, and according to the
comment it seems that we can reach a similar state when playing with priority.

>
> > If the calculated imbalance is null, it's useless to try to find a busiest
> > rq as no task will be migrated and we can return immediately.
> >
> > This situation can happen with heterogeneous system or smp system when RT
> > tasks are decreasing the capacity of some CPUs.
>
> Is it the result of one of those "force_balance" conditions in
> find_busiest_group() ? Should we not fix that to then return NULL
> instead?

The use case is:
We have a newly_idle load balance that is triggered when an RT task becomes idle
(but I think that I have seen it with the idle load balance too)

We trigger:
	if (env->idle != CPU_NOT_IDLE && group_has_capacity(env, local) &&
	    busiest->group_no_capacity)
		goto force_balance;

In calculate_imbalance we take this path:
	/*
	 * Avg load of busiest sg can be less and avg load of local sg can
	 * be greater than avg load across all sgs of sd because avg load
	 * factors in sg capacity and sgs with smaller group_type are
	 * skipped when updating the busiest sg:
	 */
	if (busiest->avg_load <= sds->avg_load ||
	    local->avg_load >= sds->avg_load) {
		env->imbalance = 0;
		return fix_small_imbalance(env, sds);
	}

but fix_small_imbalance finally decides to return without modifying the
imbalance, like here:
	if (busiest->avg_load + scaled_busy_load_per_task >=
	    local->avg_load + (scaled_busy_load_per_task * imbn)) {
		env->imbalance = busiest->load_per_task;
		return;
	}

Besides this patch, I'm preparing another patch in fix_small_imbalance to
ensure 1 task per CPU in a similar situation, but according to the comment
above, we can reach this situation because of task priorities.
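
To make the capacity effect above concrete: avg_load scales group load by group
capacity, so a group whose capacity has largely been eaten by RT can be picked
as busiest (via group_no_capacity) while its avg_load still sits at or below the
sched domain average, which lands us in the branch quoted above. Below is a toy
standalone model of just that comparison; the load/capacity numbers are made up
and only the arithmetic mirrors the kernel, so treat it as an illustration, not
as fair.c code.

#include <stdio.h>

#define SCHED_CAPACITY_SCALE 1024UL

/* avg_load as computed for a sched_group: load scaled by group capacity */
static unsigned long avg_load(unsigned long group_load, unsigned long group_capacity)
{
	return group_load * SCHED_CAPACITY_SCALE / group_capacity;
}

int main(void)
{
	/* Busiest group: one CPU whose capacity is mostly consumed by RT work. */
	unsigned long busiest_load = 100, busiest_cap = 200;
	/* Local group: one full-capacity CPU running a bigger cfs load. */
	unsigned long local_load = 600, local_cap = 1024;

	unsigned long busiest_avg = avg_load(busiest_load, busiest_cap);	/* 512 */
	unsigned long local_avg   = avg_load(local_load, local_cap);		/* 600 */
	unsigned long sds_avg     = avg_load(busiest_load + local_load,
					     busiest_cap + local_cap);		/* ~585 */

	printf("busiest %lu, local %lu, domain %lu\n", busiest_avg, local_avg, sds_avg);

	/* Same shape as the calculate_imbalance() branch quoted above. */
	if (busiest_avg <= sds_avg || local_avg >= sds_avg)
		printf("-> imbalance zeroed, fix_small_imbalance() decides\n");

	return 0;
}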
Vincent Guittot Sept. 7, 2018, 12:55 p.m. UTC | #3
On Fri, 7 Sep 2018 at 14:35, Vincent Guittot <vincent.guittot@linaro.org> wrote:
>
> Le Friday 07 Sep 2018 à 13:37:49 (+0200), Peter Zijlstra a écrit :
> > On Fri, Sep 07, 2018 at 09:51:04AM +0200, Vincent Guittot wrote:
> > > It can happen that load_balance finds a busiest group and then a busiest rq
> > > but the calculated imbalance is in fact null.
> >
> > Cute. Does that happen often?
>
> I have a use case with RT tasks that reproduces the problem regularly.
> It happens at least when we have CPUs with different capacity either because
> of heterogeous CPU or because of RT/DL reducing available capacity for cfs
> I have put the call path that trigs the problem below and accroding to the
> comment it seems that we can reach similar state when playing with priority.
>
> >
> > > If the calculated imbalance is null, it's useless to try to find a busiest
> > > rq as no task will be migrated and we can return immediately.
> > >
> > > This situation can happen with heterogeneous system or smp system when RT
> > > tasks are decreasing the capacity of some CPUs.
> >
> > Is it the result of one of those "force_balance" conditions in
> > find_busiest_group() ? Should we not fix that to then return NULL
> > instead?
>
> The UC is:
> We have a newly_idle load balance that is triggered when RT task becomes idle
> ( but I think that I have seen that with idle load balance too)
>
> we trigs:
> 	if (env->idle != CPU_NOT_IDLE && group_has_capacity(env, local) &&
> 	    busiest->group_no_capacity)
> 		goto force_balance;
>
> In calculate_imbalance we use the path
> 	/*
> 	 * Avg load of busiest sg can be less and avg load of local sg can
> 	 * be greater than avg load across all sgs of sd because avg load
> 	 * factors in sg capacity and sgs with smaller group_type are
> 	 * skipped when updating the busiest sg:
> 	 */
> 	if (busiest->avg_load <= sds->avg_load ||
> 	    local->avg_load >= sds->avg_load) {
> 		env->imbalance = 0;
> 		return fix_small_imbalance(env, sds);
> 	}
>
> but fix_small_imbalance finally decides to return without modifying imbalance
> like here
> 	if (busiest->avg_load + scaled_busy_load_per_task >=
> 	    local->avg_load + (scaled_busy_load_per_task * imbn)) {
> 		env->imbalance = busiest->load_per_task;
> 		return;
> 	}
>
> Beside this patch, I'm preparing another patch in fix small imbalance to
> ensure 1 task per CPU in similar situation but according to the comment above,
> we can reach this situation because of tasks priority

I have just done a quick test on my SMP hikey board (dual quad core
arm64) by adding a log in dmesg each time we hit the condition
busiest != NULL and imbalance == 0. The log shows up from time to time
when I generate some activity on the board, like syncing the filesystem
before running a test, but I don't have the details. The logs happen
with and without the next patch that I mentioned above, so it probably
means that we can trigger this situation with other use cases.

>
Peter Zijlstra Sept. 7, 2018, 12:55 p.m. UTC | #4
On Fri, Sep 07, 2018 at 02:35:51PM +0200, Vincent Guittot wrote:
> Le Friday 07 Sep 2018 à 13:37:49 (+0200), Peter Zijlstra a écrit :
> > On Fri, Sep 07, 2018 at 09:51:04AM +0200, Vincent Guittot wrote:
> > > It can happen that load_balance finds a busiest group and then a busiest rq
> > > but the calculated imbalance is in fact null.
> >
> > Cute. Does that happen often?
>
> I have a use case with RT tasks that reproduces the problem regularly.
> It happens at least when we have CPUs with different capacity either because
> of heterogeous CPU or because of RT/DL reducing available capacity for cfs
> I have put the call path that trigs the problem below and accroding to the
> comment it seems that we can reach similar state when playing with priority.
>
> >
> > > If the calculated imbalance is null, it's useless to try to find a busiest
> > > rq as no task will be migrated and we can return immediately.
> > >
> > > This situation can happen with heterogeneous system or smp system when RT
> > > tasks are decreasing the capacity of some CPUs.
> >
> > Is it the result of one of those "force_balance" conditions in
> > find_busiest_group() ? Should we not fix that to then return NULL
> > instead?
>
> The UC is:
> We have a newly_idle load balance that is triggered when RT task becomes idle
> ( but I think that I have seen that with idle load balance too)
>
> we trigs:
> 	if (env->idle != CPU_NOT_IDLE && group_has_capacity(env, local) &&
> 	    busiest->group_no_capacity)
> 		goto force_balance;
>
> In calculate_imbalance we use the path
> 	/*
> 	 * Avg load of busiest sg can be less and avg load of local sg can
> 	 * be greater than avg load across all sgs of sd because avg load
> 	 * factors in sg capacity and sgs with smaller group_type are
> 	 * skipped when updating the busiest sg:
> 	 */
> 	if (busiest->avg_load <= sds->avg_load ||
> 	    local->avg_load >= sds->avg_load) {
> 		env->imbalance = 0;
> 		return fix_small_imbalance(env, sds);
> 	}
>
> but fix_small_imbalance finally decides to return without modifying imbalance
> like here
> 	if (busiest->avg_load + scaled_busy_load_per_task >=
> 	    local->avg_load + (scaled_busy_load_per_task * imbn)) {
> 		env->imbalance = busiest->load_per_task;
> 		return;
> 	}

That one actually does modify imbalance :-) But I get your point.

> Beside this patch, I'm preparing another patch in fix small imbalance to
> ensure 1 task per CPU in similar situation but according to the comment above,
> we can reach this situation because of tasks priority

Didn't we all hate fix_small_imbalance() ?

Anyway, I think I'd prefer something like the below; although it might
be nicer to thread the return value through calculate_imbalance() and
fix_small_imbalance(), but looking at them that's not going to be
particularly nicer.

Do you agree with this? If so, I'll stick your original Changelog on it
and pretend this is what you sent me :-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b39fb596f6c1..0596a29f3d2a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8269,7 +8269,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 force_balance:
 	/* Looks like there is an imbalance. Compute it */
 	calculate_imbalance(env, &sds);
-	return sds.busiest;
+	return env->imbalance ? sds.busiest : NULL;
 
 out_balanced:
 	env->imbalance = 0;
Peter Zijlstra Sept. 7, 2018, 12:57 p.m. UTC | #5
On Fri, Sep 07, 2018 at 02:55:54PM +0200, Peter Zijlstra wrote:
> > Beside this patch, I'm preparing another patch in fix small imbalance to
> > ensure 1 task per CPU in similar situation but according to the comment above,
> > we can reach this situation because of tasks priority
>
> Didn't we all hate fix_small_imbalance() ?

That is, all that load_per_task business is complete rubbish in this
cgroup world.
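
As a worked example of that point (the numbers and cgroup layout below are made
up; load_per_task is roughly sum_weighted_load / sum_nr_running in the group
stats): with cgroups, "one task worth of load" can mean very different things on
different CPUs, so heuristics expressed in units of load_per_task end up
comparing incommensurable quantities.

#include <stdio.h>

#define NICE_0_LOAD 1024UL

int main(void)
{
	/*
	 * CPU A: four tasks inside one default-share cgroup. The group entity
	 * weighs roughly NICE_0_LOAD at the root, so each task contributes ~256.
	 */
	unsigned long a_load = NICE_0_LOAD, a_nr = 4;
	/* CPU B: a single standalone nice-0 task. */
	unsigned long b_load = NICE_0_LOAD, b_nr = 1;

	printf("CPU A load_per_task = %lu\n", a_load / a_nr);	/* 256  */
	printf("CPU B load_per_task = %lu\n", b_load / b_nr);	/* 1024 */

	/*
	 * Both CPUs carry the same total load, yet "one task worth of load"
	 * differs by 4x, which is why heuristics built on load_per_task (as
	 * in fix_small_imbalance()) stop making sense once cgroups are used.
	 */
	return 0;
}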
Vincent Guittot Sept. 7, 2018, 1:08 p.m. UTC | #6
On Fri, 7 Sep 2018 at 14:56, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Sep 07, 2018 at 02:35:51PM +0200, Vincent Guittot wrote:
> > Le Friday 07 Sep 2018 à 13:37:49 (+0200), Peter Zijlstra a écrit :
> > > On Fri, Sep 07, 2018 at 09:51:04AM +0200, Vincent Guittot wrote:
> > > > It can happen that load_balance finds a busiest group and then a busiest rq
> > > > but the calculated imbalance is in fact null.
> > >
> > > Cute. Does that happen often?
> >
> > I have a use case with RT tasks that reproduces the problem regularly.
> > It happens at least when we have CPUs with different capacity either because
> > of heterogeous CPU or because of RT/DL reducing available capacity for cfs
> > I have put the call path that trigs the problem below and accroding to the
> > comment it seems that we can reach similar state when playing with priority.
> >
> > >
> > > > If the calculated imbalance is null, it's useless to try to find a busiest
> > > > rq as no task will be migrated and we can return immediately.
> > > >
> > > > This situation can happen with heterogeneous system or smp system when RT
> > > > tasks are decreasing the capacity of some CPUs.
> > >
> > > Is it the result of one of those "force_balance" conditions in
> > > find_busiest_group() ? Should we not fix that to then return NULL
> > > instead?
> >
> > The UC is:
> > We have a newly_idle load balance that is triggered when RT task becomes idle
> > ( but I think that I have seen that with idle load balance too)
> >
> > we trigs:
> >       if (env->idle != CPU_NOT_IDLE && group_has_capacity(env, local) &&
> >           busiest->group_no_capacity)
> >               goto force_balance;
> >
> > In calculate_imbalance we use the path
> >       /*
> >        * Avg load of busiest sg can be less and avg load of local sg can
> >        * be greater than avg load across all sgs of sd because avg load
> >        * factors in sg capacity and sgs with smaller group_type are
> >        * skipped when updating the busiest sg:
> >        */
> >       if (busiest->avg_load <= sds->avg_load ||
> >           local->avg_load >= sds->avg_load) {
> >               env->imbalance = 0;
> >               return fix_small_imbalance(env, sds);
> >       }
> >
> > but fix_small_imbalance finally decides to return without modifying imbalance
> > like here
> >       if (busiest->avg_load + scaled_busy_load_per_task >=
> >           local->avg_load + (scaled_busy_load_per_task * imbn)) {
> >               env->imbalance = busiest->load_per_task;
> >               return;
> >       }
> >
> That one actually does modify imbalance :-) But I get your point.
>
> > Beside this patch, I'm preparing another patch in fix small imbalance to
> > ensure 1 task per CPU in similar situation but according to the comment above,
> > we can reach this situation because of tasks priority
>
> Didn't we all hate fix_small_imbalance() ?

Yes. The rationale of some decisions in the function is more and more
opaque to me. For now I'm trying to fix all the use cases that I can find,
and then try to get a generic rule.
As an example, I have some situations (other than those discussed
here) in fix_small_imbalance where the task migration decision mainly
depends on a variation of +/-5 in the load of a task. This finally
generates good fairness between tasks, but the root cause of this
fairness is a bit random.

>
> Anyway, I think I'd prefer something like the below; although it might
> be nicer to thread the return value through calculate_imbalance() and
> fix_small_imbalance(), but looking at them that's not going to be
> particularly nicer.
>
> Do you agree with this?, If so, I'll stick your orignal Changelog on it
> and pretend this is what you send me :-)

That's fine to me :-)

>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b39fb596f6c1..0596a29f3d2a 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8269,7 +8269,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
>  force_balance:
>         /* Looks like there is an imbalance. Compute it */
>         calculate_imbalance(env, &sds);
> -       return sds.busiest;
> +       return env->imbalance ? sds.busiest : NULL;
>
>  out_balanced:
>         env->imbalance = 0;

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 309c93f..224bfae 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8464,7 +8464,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 	}
 
 	group = find_busiest_group(&env);
-	if (!group) {
+	if (!group || !env.imbalance) {
 		schedstat_inc(sd->lb_nobusyg[idle]);
 		goto out_balanced;
 	}
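
For context, here is a toy standalone model of the failure mode this hunk avoids
(plain C, not kernel code; the flag and cpumask handling are heavily simplified
from fair.c): when find_busiest_group() reports a busiest group but the computed
imbalance is 0, detach_tasks() moves nothing and the all-pinned flag is never
cleared, so the busiest CPU is dropped from the balance mask and the balance is
redone without it. Checking the imbalance up front, as the hunk above does,
short-circuits that.

#include <stdio.h>
#include <stdbool.h>

#define NR_CPUS 4

struct lb_env {
	unsigned long imbalance;
	bool all_pinned;		/* stands in for LBF_ALL_PINNED */
};

/* Moves nothing when there is no imbalance, so all_pinned is never cleared. */
static int detach_tasks(struct lb_env *env)
{
	if (env->imbalance == 0)
		return 0;
	env->all_pinned = false;	/* a movable task was found */
	return 1;
}

static void load_balance(unsigned long imbalance, bool check_imbalance)
{
	bool cpus[NR_CPUS] = { true, true, true, true };
	int busiest = 3, redo = 0;

	while (true) {
		struct lb_env env = { .imbalance = imbalance, .all_pinned = true };

		/* the patched check: no imbalance -> nothing to do */
		if (check_imbalance && env.imbalance == 0)
			break;

		if (detach_tasks(&env) > 0)
			break;

		if (env.all_pinned) {
			/* busiest CPU wrongly treated as full of pinned tasks */
			cpus[busiest] = false;
			redo++;
			printf("  redo %d: dropped CPU%d from the balance mask\n",
			       redo, busiest);
			if (--busiest < 0)
				break;
			continue;
		}
		break;
	}

	printf("  remaining mask:");
	for (int i = 0; i < NR_CPUS; i++)
		if (cpus[i])
			printf(" CPU%d", i);
	printf("\n");
}

int main(void)
{
	puts("without the fix:");
	load_balance(0, false);	/* drops CPUs one by one and redoes the balance */

	puts("with the fix:");
	load_balance(0, true);	/* returns immediately, no bogus redo */
	return 0;
}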