
[v3,04/10] sched/fair: rework load_balance

Message ID 1568878421-12301-5-git-send-email-vincent.guittot@linaro.org
State Superseded
Series sched/fair: rework the CFS load balance

Commit Message

Vincent Guittot Sept. 19, 2019, 7:33 a.m. UTC
The load_balance algorithm contains some heuristics which have become
meaningless since the rework of the scheduler's metrics like the
introduction of PELT.

Furthermore, load is an ill-suited metric for solving certain task
placement imbalance scenarios. For instance, in the presence of idle CPUs,
we should simply try to get at least one task per CPU, whereas the current
load-based algorithm can actually leave idle CPUs alone simply because the
load is somewhat balanced. The current algorithm ends up creating virtual
and meaningless values like avg_load_per_task or tweaking the state of a
group to make it look overloaded whereas it is not, in order to try to
migrate tasks.

load_balance should better qualify the imbalance of the group and clearly
define what has to be moved to fix this imbalance.

The type of sched_group has been extended to better reflect the type of
imbalance. We now have:
	group_has_spare
	group_fully_busy
	group_misfit_task
	group_asym_capacity
	group_imbalanced
	group_overloaded

Based on the type fo sched_group, load_balance now sets what it wants to
move in order to fix the imbalance. It can be some load as before but also
some utilization, a number of tasks or a type of task (a sketch of the
corresponding enums follows the list):
	migrate_task
	migrate_util
	migrate_load
	migrate_misfit
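
For illustration only, here is a rough sketch of these two enums,
reconstructed from the names listed above and from the hunks quoted later
in the thread; the exact declarations in the diff may differ:

	/*
	 * Sketch, not the actual patch. The ordering of group_type matters:
	 * the code compares types directly (e.g. "busiest->group_type >
	 * group_fully_busy"), so the enum goes from least to most loaded.
	 * Note: the hunks in this thread use group_asym_packing rather than
	 * the group_asym_capacity name listed above.
	 */
	enum group_type {
		group_has_spare = 0,
		group_fully_busy,
		group_misfit_task,
		group_asym_packing,
		group_imbalanced,
		group_overloaded
	};

	/* What load_balance() has decided to migrate to fix the imbalance. */
	enum migration_type {
		migrate_load,
		migrate_util,
		migrate_task,
		migrate_misfit
	};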

This new load_balance algorithm fixes several pending wrong task
placements:
- the 1 task per CPU case with asymmetrics system
- the case of a cfs task preempted by another scheduling class
- the case of tasks not evenly spread on groups with spare capacity

Also the load balance decisions have been consolidated in the 3 functions
below after removing the few bypasses and hacks of the current code
(a simplified toy of this flow follows the list):
- update_sd_pick_busiest() selects the busiest sched_group.
- find_busiest_group() checks if there is an imbalance between the local and
  busiest groups.
- calculate_imbalance() decides what has to be moved.
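
As a stand-alone toy of that flow (all type and helper names below are
invented for illustration and do not exist in kernel/sched/fair.c):

	/* Toy model of the 3-step decision flow; not kernel code. */
	#include <stdbool.h>
	#include <stdio.h>

	struct toy_stats { unsigned int nr_running, idle_cpus; };

	/* 1) statistics pass: pick the busiest group (hard-coded here) */
	static void toy_update_stats(struct toy_stats *local,
				     struct toy_stats *busiest)
	{
		*local   = (struct toy_stats){ .nr_running = 2, .idle_cpus = 2 };
		*busiest = (struct toy_stats){ .nr_running = 6, .idle_cpus = 0 };
	}

	/* 2) is there an imbalance between local and busiest? */
	static bool toy_imbalanced(struct toy_stats *local,
				   struct toy_stats *busiest)
	{
		return busiest->nr_running > local->nr_running + 1;
	}

	/* 3) decide what has to be moved (here: a number of tasks) */
	static unsigned int toy_calculate_imbalance(struct toy_stats *local,
						    struct toy_stats *busiest)
	{
		return (busiest->nr_running - local->nr_running) >> 1;
	}

	int main(void)
	{
		struct toy_stats local, busiest;

		toy_update_stats(&local, &busiest);
		if (toy_imbalanced(&local, &busiest))
			printf("migrate %u task(s)\n",
			       toy_calculate_imbalance(&local, &busiest));
		return 0;
	}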

Finally, the now unused field total_running of struct sd_lb_stats has been
removed.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

---
 kernel/sched/fair.c | 585 ++++++++++++++++++++++++++++++++++------------------
 1 file changed, 380 insertions(+), 205 deletions(-)

-- 
2.7.4

Comments

Rik van Riel Sept. 30, 2019, 1:12 a.m. UTC | #1
On Thu, 2019-09-19 at 09:33 +0200, Vincent Guittot wrote:
> 

> Also the load balance decisions have been consolidated in the 3

> functions

> below after removing the few bypasses and hacks of the current code:

> - update_sd_pick_busiest() select the busiest sched_group.

> - find_busiest_group() checks if there is an imbalance between local

> and

>   busiest group.

> - calculate_imbalance() decides what have to be moved.


I really like the direction this series is going.

However, I suppose I should run these patches for
a few days with some of our test workloads before
I send out an ack for this patch :)

-- 
All Rights Reversed.
Vincent Guittot Sept. 30, 2019, 7:44 a.m. UTC | #2
On Mon, 30 Sep 2019 at 03:13, Rik van Riel <riel@surriel.com> wrote:
>

> On Thu, 2019-09-19 at 09:33 +0200, Vincent Guittot wrote:

> >

> > Also the load balance decisions have been consolidated in the 3

> > functions

> > below after removing the few bypasses and hacks of the current code:

> > - update_sd_pick_busiest() select the busiest sched_group.

> > - find_busiest_group() checks if there is an imbalance between local

> > and

> >   busiest group.

> > - calculate_imbalance() decides what have to be moved.

>

> I really like the direction this series is going.


Thanks

>

> However, I suppose I should run these patches for

> a few days with some of our test workloads before

> I send out an ack for this patch :)


Yes, more tests on different platforms are welcome

>

> --

> All Rights Reversed.
Dietmar Eggemann Sept. 30, 2019, 4:24 p.m. UTC | #3
Hi Vincent,

On 19/09/2019 09:33, Vincent Guittot wrote:

these are just some comments & questions based on a code study. Haven't
run any tests with it yet.

[...]

> The type of sched_group has been extended to better reflect the type of

> imbalance. We now have :

>         group_has_spare

>         group_fully_busy

>         group_misfit_task

>         group_asym_capacity


s/group_asym_capacity/group_asym_packing

>         group_imbalanced

>         group_overloaded

>

> Based on the type fo sched_group, load_balance now sets what it wants to


s/fo/for

> move in order to fix the imbalance. It can be some load as before but also

> some utilization, a number of task or a type of task:

>         migrate_task

>         migrate_util

>         migrate_load

>         migrate_misfit

>

> This new load_balance algorithm fixes several pending wrong tasks

> placement:

> - the 1 task per CPU case with asymmetrics system


s/asymmetrics/asymmetric

This stands for ASYM_CPUCAPACITY and ASYM_PACKING, right?

[...]

>   #define LBF_ALL_PINNED       0x01

> @@ -7115,7 +7130,7 @@ struct lb_env {

>         unsigned int            loop_max;

>  

>         enum fbq_type           fbq_type;

> -     enum group_type         src_grp_type;

> +     enum migration_type     balance_type;


Minor thing:

Why not
     enum migration_type migration_type;
or
     enum balance_type balance_type;

We do the same for other enums like fbq_type or group_type.

>         struct list_head        tasks;

>   };

>  


The detach_tasks() comment still mentions only runnable load.

> @@ -7347,7 +7362,7 @@ static int detach_tasks(struct lb_env *env)

>   {

>         struct list_head *tasks = &env->src_rq->cfs_tasks;

>         struct task_struct *p;

> -     unsigned long load;

> +     unsigned long util, load;


Minor: Order by length or reduce scope to while loop ?

>         int detached = 0;

>  

>         lockdep_assert_held(&env->src_rq->lock);

> @@ -7380,19 +7395,53 @@ static int detach_tasks(struct lb_env *env)

>                 if (!can_migrate_task(p, env))

>                         goto next;

>  

> -             load = task_h_load(p);

> +             switch (env->balance_type) {

> +             case migrate_load:

> +                     load = task_h_load(p);

>  

> -             if (sched_feat(LB_MIN) && load < 16 && !env->sd->nr_balance_failed)

> -                     goto next;

> +                     if (sched_feat(LB_MIN) &&

> +                         load < 16 && !env->sd->nr_balance_failed)

> +                             goto next;

>  

> -             if ((load / 2) > env->imbalance)

> -                     goto next;

> +                     if ((load / 2) > env->imbalance)

> +                             goto next;

> +

> +                     env->imbalance -= load;

> +                     break;

> +

> +             case migrate_util:

> +                     util = task_util_est(p);

> +

> +                     if (util > env->imbalance)

> +                             goto next;

> +

> +                     env->imbalance -= util;

> +                     break;

> +

> +             case migrate_task:

> +                     /* Migrate task */


Minor: IMHO, this comment is not necessary.

> +                     env->imbalance--;

> +                     break;

> +

> +             case migrate_misfit:

> +                     load = task_h_load(p);

> +

> +                     /*

> +                      * utilization of misfit task might decrease a bit


This patch still uses load. IMHO this comment becomes true only with 08/10.

[...]

> @@ -7707,13 +7755,11 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds)

>         *sds = (struct sd_lb_stats){

>                 .busiest = NULL,

>                 .local = NULL,

> -             .total_running = 0UL,

>                 .total_load = 0UL,

>                 .total_capacity = 0UL,

>                 .busiest_stat = {

> -                     .avg_load = 0UL,


There is a sentence in the comment above explaining why avg_load has to
be cleared. And IMHO local group isn't cleared anymore but set/initialized.

> -                     .sum_h_nr_running = 0,

> -                     .group_type = group_other,

> +                     .idle_cpus = UINT_MAX,

> +                     .group_type = group_has_spare,

>                 },

>         };

>   }


[...]

> @@ -8042,14 +8104,24 @@ static inline void update_sg_lb_stats(struct lb_env *env,

>                 }

>         }

>  

> -     /* Adjust by relative CPU capacity of the group */

> +     /* Check if dst cpu is idle and preferred to this group */


s/preferred to/preferred by ? or the preferred CPU of this group ?

[...]

> @@ -8283,69 +8363,133 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd

>    */

>   static inline void calculate_imbalance(struct lb_env *env, struct

sd_lb_stats *sds)
>   {

> -     unsigned long max_pull, load_above_capacity = ~0UL;

>         struct sg_lb_stats *local, *busiest;

>  

>         local = &sds->local_stat;

>         busiest = &sds->busiest_stat;

>  

> -     if (busiest->group_asym_packing) {

> +     if (busiest->group_type == group_misfit_task) {

> +             /* Set imbalance to allow misfit task to be balanced. */

> +             env->balance_type = migrate_misfit;

> +             env->imbalance = busiest->group_misfit_task_load;

> +             return;

> +     }

> +

> +     if (busiest->group_type == group_asym_packing) {

> +             /*

> +              * In case of asym capacity, we will try to migrate all load to


Does asym capacity stands for asym packing or asym cpu capacity?

> +              * the preferred CPU.

> +              */

> +             env->balance_type = migrate_load;

>                 env->imbalance = busiest->group_load;

>                 return;

>         }

>  

> +     if (busiest->group_type == group_imbalanced) {

> +             /*

> +              * In the group_imb case we cannot rely on group-wide averages

> +              * to ensure CPU-load equilibrium, try to move any task to fix

> +              * the imbalance. The next load balance will take care of

> +              * balancing back the system.


balancing back ?

> +              */

> +             env->balance_type = migrate_task;

> +             env->imbalance = 1;

> +             return;

> +     }

> +

>         /*

> -      * Avg load of busiest sg can be less and avg load of local sg can

> -      * be greater than avg load across all sgs of sd because avg load

> -      * factors in sg capacity and sgs with smaller group_type are

> -      * skipped when updating the busiest sg:

> +      * Try to use spare capacity of local group without overloading it or

> +      * emptying busiest

>          */

> -     if (busiest->group_type != group_misfit_task &&

> -         (busiest->avg_load <= sds->avg_load ||

> -          local->avg_load >= sds->avg_load)) {

> -             env->imbalance = 0;

> +     if (local->group_type == group_has_spare) {

> +             if (busiest->group_type > group_fully_busy) {


So this could be 'busiest->group_type == group_overloaded' here to match
the comment below? Since you handle group_misfit_task,
group_asym_packing, group_imbalanced above and return.

> +                     /*

> +                      * If busiest is overloaded, try to fill spare

> +                      * capacity. This might end up creating spare capacity

> +                      * in busiest or busiest still being overloaded but

> +                      * there is no simple way to directly compute the

> +                      * amount of load to migrate in order to balance the

> +                      * system.

> +                      */

> +                     env->balance_type = migrate_util;

> +                     env->imbalance = max(local->group_capacity, local->group_util) -

> +                                 local->group_util;

> +                     return;

> +             }

> +

> +             if (busiest->group_weight == 1 || sds->prefer_sibling) {

> +                     /*

> +                      * When prefer sibling, evenly spread running tasks on

> +                      * groups.

> +                      */

> +                     env->balance_type = migrate_task;

> +                     env->imbalance = (busiest->sum_h_nr_running - local->sum_h_nr_running) >> 1;

> +                     return;

> +             }

> +

> +             /*

> +              * If there is no overload, we just want to even the number of

> +              * idle cpus.

> +              */

> +             env->balance_type = migrate_task;

> +             env->imbalance = max_t(long, 0, (local->idle_cpus - busiest->idle_cpus) >> 1);


Why do we need a max_t(long, 0, ...) here and not for the 'if
(busiest->group_weight == 1 || sds->prefer_sibling)' case?

>                 return;

>         }

>  

>         /*

> -      * If there aren't any idle CPUs, avoid creating some.

> +      * Local is fully busy but have to take more load to relieve the


s/have/has

> +      * busiest group

>          */


I thought that 'local->group_type == group_imbalanced' is allowed as
well? So 'if (local->group_type < group_overloaded)' (further down)
could include that?

> -     if (busiest->group_type == group_overloaded &&

> -         local->group_type   == group_overloaded) {

> -             load_above_capacity = busiest->sum_h_nr_running * SCHED_CAPACITY_SCALE;

> -             if (load_above_capacity > busiest->group_capacity) {

> -                     load_above_capacity -= busiest->group_capacity;

> -                     load_above_capacity *= scale_load_down(NICE_0_LOAD);

> -                     load_above_capacity /= busiest->group_capacity;

> -             } else

> -                     load_above_capacity = ~0UL;

> +     if (local->group_type < group_overloaded) {

> +             /*

> +              * Local will become overloaded so the avg_load metrics are

> +              * finally needed.

> +              */


How does this relate to the decision_matrix[local, busiest] (dm[])? E.g.
dm[overload, overload] == avg_load or dm[fully_busy, overload] == force.
It would be nice to be able to match all allowed fields of dm to code sections.

[...]

>   /******* find_busiest_group() helpers end here *********************/

>  

> +/*

> + * Decision matrix according to the local and busiest group state


Minor s/state/type ?

> + *

> + * busiest \ local has_spare fully_busy misfit asym imbalanced overloaded

> + * has_spare        nr_idle   balanced   N/A    N/A  balanced   balanced

> + * fully_busy       nr_idle   nr_idle    N/A    N/A  balanced   balanced

> + * misfit_task      force     N/A        N/A    N/A  force      force

> + * asym_capacity    force     force      N/A    N/A  force      force


s/asym_capacity/asym_packing

> + * imbalanced       force     force      N/A    N/A  force      force

> + * overloaded       force     force      N/A    N/A  force      avg_load

> + *

> + * N/A :      Not Applicable because already filtered while updating

> + *            statistics.

> + * balanced : The system is balanced for these 2 groups.

> + * force :    Calculate the imbalance as load migration is probably needed.

> + * avg_load : Only if imbalance is significant enough.

> + * nr_idle :  dst_cpu is not busy and the number of idle cpus is quite

> + *            different in groups.

> + */

> +

>   /**

>    * find_busiest_group - Returns the busiest group within the sched_domain

>    * if there is an imbalance.

> @@ -8380,17 +8524,17 @@ static struct sched_group *find_busiest_group(struct lb_env *env)

>         local = &sds.local_stat;

>         busiest = &sds.busiest_stat;

>  

> -     /* ASYM feature bypasses nice load balance check */

> -     if (busiest->group_asym_packing)

> -             goto force_balance;

> -

>         /* There is no busy sibling group to pull tasks from */

> -     if (!sds.busiest || busiest->sum_h_nr_running == 0)

> +     if (!sds.busiest)

>                 goto out_balanced;

>  

> -     /* XXX broken for overlapping NUMA groups */

> -     sds.avg_load = (SCHED_CAPACITY_SCALE * sds.total_load)

> -                                             / sds.total_capacity;

> +     /* Misfit tasks should be dealt with regardless of the avg load */

> +     if (busiest->group_type == group_misfit_task)

> +             goto force_balance;

> +

> +     /* ASYM feature bypasses nice load balance check */


Minor: s/ASYM feature/ASYM_PACKING ... to distinguish clearly from
ASYM_CPUCAPACITY.

> +     if (busiest->group_type == group_asym_packing)

> +             goto force_balance;


[...]
Vincent Guittot Oct. 1, 2019, 8:14 a.m. UTC | #4
On Mon, 30 Sep 2019 at 18:24, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>

> Hi Vincent,

>

> On 19/09/2019 09:33, Vincent Guittot wrote:

>

> these are just some comments & questions based on a code study. Haven't

> run any tests with it yet.

>

> [...]

>

> > The type of sched_group has been extended to better reflect the type of

> > imbalance. We now have :

> >         group_has_spare

> >         group_fully_busy

> >         group_misfit_task

> >         group_asym_capacity

>

> s/group_asym_capacity/group_asym_packing


Yes, I forgot to change the commit message and the comments

>

> >         group_imbalanced

> >         group_overloaded

> >

> > Based on the type fo sched_group, load_balance now sets what it wants to

>

> s/fo/for


s/fo/of/

>

> > move in order to fix the imbalance. It can be some load as before but also

> > some utilization, a number of task or a type of task:

> >         migrate_task

> >         migrate_util

> >         migrate_load

> >         migrate_misfit

> >

> > This new load_balance algorithm fixes several pending wrong tasks

> > placement:

> > - the 1 task per CPU case with asymmetrics system

>

> s/asymmetrics/asymmetric


yes

>

> This stands for ASYM_CPUCAPACITY and ASYM_PACKING, right?


yes

>

> [...]

>

> >   #define LBF_ALL_PINNED       0x01

> > @@ -7115,7 +7130,7 @@ struct lb_env {

> >         unsigned int            loop_max;

> >

> >         enum fbq_type           fbq_type;

> > -     enum group_type         src_grp_type;

> > +     enum migration_type     balance_type;

>

> Minor thing:

>

> Why not

>      enum migration_type migration_type;

> or

>      enum balance_type balance_type;

>

> We do the same for other enums like fbq_type or group_type.


yes, I can align

>

> >         struct list_head        tasks;

> >   };

> >

>

> The detach_tasks() comment still mentions only runnable load.


ok

>

> > @@ -7347,7 +7362,7 @@ static int detach_tasks(struct lb_env *env)

> >   {

> >         struct list_head *tasks = &env->src_rq->cfs_tasks;

> >         struct task_struct *p;

> > -     unsigned long load;

> > +     unsigned long util, load;

>

> Minor: Order by length or reduce scope to while loop ?


I don't get your point here

>

> >         int detached = 0;

> >

> >         lockdep_assert_held(&env->src_rq->lock);

> > @@ -7380,19 +7395,53 @@ static int detach_tasks(struct lb_env *env)

> >                 if (!can_migrate_task(p, env))

> >                         goto next;

> >

> > -             load = task_h_load(p);

> > +             switch (env->balance_type) {

> > +             case migrate_load:

> > +                     load = task_h_load(p);

> >

> > -             if (sched_feat(LB_MIN) && load < 16 && !env->sd->nr_balance_failed)

> > -                     goto next;

> > +                     if (sched_feat(LB_MIN) &&

> > +                         load < 16 && !env->sd->nr_balance_failed)

> > +                             goto next;

> >

> > -             if ((load / 2) > env->imbalance)

> > -                     goto next;

> > +                     if ((load / 2) > env->imbalance)

> > +                             goto next;

> > +

> > +                     env->imbalance -= load;

> > +                     break;

> > +

> > +             case migrate_util:

> > +                     util = task_util_est(p);

> > +

> > +                     if (util > env->imbalance)

> > +                             goto next;

> > +

> > +                     env->imbalance -= util;

> > +                     break;

> > +

> > +             case migrate_task:

> > +                     /* Migrate task */

>

> Minor: IMHO, this comment is not necessary.


yes

>

> > +                     env->imbalance--;

> > +                     break;

> > +

> > +             case migrate_misfit:

> > +                     load = task_h_load(p);

> > +

> > +                     /*

> > +                      * utilization of misfit task might decrease a bit

>

> This patch still uses load. IMHO this comments becomes true only with 08/10.


even with 8/10 it's not correct, so it should be removed.
I'm going to remove it.

>

> [...]

>

> > @@ -7707,13 +7755,11 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds)

> >         *sds = (struct sd_lb_stats){

> >                 .busiest = NULL,

> >                 .local = NULL,

> > -             .total_running = 0UL,

> >                 .total_load = 0UL,

> >                 .total_capacity = 0UL,

> >                 .busiest_stat = {

> > -                     .avg_load = 0UL,

>

> There is a sentence in the comment above explaining why avg_load has to

> be cleared. And IMHO local group isn't cleared anymore but set/initialized.


Yes, I have to update it

>

> > -                     .sum_h_nr_running = 0,

> > -                     .group_type = group_other,

> > +                     .idle_cpus = UINT_MAX,

> > +                     .group_type = group_has_spare,

> >                 },

> >         };

> >   }

>

> [...]

>

> > @@ -8042,14 +8104,24 @@ static inline void update_sg_lb_stats(struct lb_env *env,

> >                 }

> >         }

> >

> > -     /* Adjust by relative CPU capacity of the group */

> > +     /* Check if dst cpu is idle and preferred to this group */

>

> s/preferred to/preferred by ? or the preferred CPU of this group ?


dst cpu doesn't belong to this group. We compare asym_prefer_cpu of
this group vs dst_cpu which belongs to another group

>

> [...]

>

> > @@ -8283,69 +8363,133 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd

> >    */

> >   static inline void calculate_imbalance(struct lb_env *env, struct

> sd_lb_stats *sds)

> >   {

> > -     unsigned long max_pull, load_above_capacity = ~0UL;

> >         struct sg_lb_stats *local, *busiest;

> >

> >         local = &sds->local_stat;

> >         busiest = &sds->busiest_stat;

> >

> > -     if (busiest->group_asym_packing) {

> > +     if (busiest->group_type == group_misfit_task) {

> > +             /* Set imbalance to allow misfit task to be balanced. */

> > +             env->balance_type = migrate_misfit;

> > +             env->imbalance = busiest->group_misfit_task_load;

> > +             return;

> > +     }

> > +

> > +     if (busiest->group_type == group_asym_packing) {

> > +             /*

> > +              * In case of asym capacity, we will try to migrate all load to

>

> Does asym capacity stands for asym packing or asym cpu capacity?


busiest->group_type == group_asym_packing

will fix it

>

> > +              * the preferred CPU.

> > +              */

> > +             env->balance_type = migrate_load;

> >                 env->imbalance = busiest->group_load;

> >                 return;

> >         }

> >

> > +     if (busiest->group_type == group_imbalanced) {

> > +             /*

> > +              * In the group_imb case we cannot rely on group-wide averages

> > +              * to ensure CPU-load equilibrium, try to move any task to fix

> > +              * the imbalance. The next load balance will take care of

> > +              * balancing back the system.

>

> balancing back ?


In case of imbalance, we don't try to balance the system but only try
to get rid of the pinned tasks problem. The system will still be
unbalanced after the migration and the next load balance will take
care of balancing the system

>

> > +              */

> > +             env->balance_type = migrate_task;

> > +             env->imbalance = 1;

> > +             return;

> > +     }

> > +

> >         /*

> > -      * Avg load of busiest sg can be less and avg load of local sg can

> > -      * be greater than avg load across all sgs of sd because avg load

> > -      * factors in sg capacity and sgs with smaller group_type are

> > -      * skipped when updating the busiest sg:

> > +      * Try to use spare capacity of local group without overloading it or

> > +      * emptying busiest

> >          */

> > -     if (busiest->group_type != group_misfit_task &&

> > -         (busiest->avg_load <= sds->avg_load ||

> > -          local->avg_load >= sds->avg_load)) {

> > -             env->imbalance = 0;

> > +     if (local->group_type == group_has_spare) {

> > +             if (busiest->group_type > group_fully_busy) {

>

> So this could be 'busiest->group_type == group_overloaded' here to match

> the comment below? Since you handle group_misfit_task,

> group_asym_packing, group_imbalanced above and return.


This is just to be more robust in case some new states are added later

>

> > +                     /*

> > +                      * If busiest is overloaded, try to fill spare

> > +                      * capacity. This might end up creating spare capacity

> > +                      * in busiest or busiest still being overloaded but

> > +                      * there is no simple way to directly compute the

> > +                      * amount of load to migrate in order to balance the

> > +                      * system.

> > +                      */

> > +                     env->balance_type = migrate_util;

> > +                     env->imbalance = max(local->group_capacity, local->group_util) -

> > +                                 local->group_util;

> > +                     return;

> > +             }

> > +

> > +             if (busiest->group_weight == 1 || sds->prefer_sibling) {

> > +                     /*

> > +                      * When prefer sibling, evenly spread running tasks on

> > +                      * groups.

> > +                      */

> > +                     env->balance_type = migrate_task;

> > +                     env->imbalance = (busiest->sum_h_nr_running - local->sum_h_nr_running) >> 1;

> > +                     return;

> > +             }

> > +

> > +             /*

> > +              * If there is no overload, we just want to even the number of

> > +              * idle cpus.

> > +              */

> > +             env->balance_type = migrate_task;

> > +             env->imbalance = max_t(long, 0, (local->idle_cpus - busiest->idle_cpus) >> 1);

>

> Why do we need a max_t(long, 0, ...) here and not for the 'if

> (busiest->group_weight == 1 || sds->prefer_sibling)' case?


For env->imbalance = (busiest->sum_h_nr_running - local->sum_h_nr_running) >> 1;

either we have sds->prefer_sibling && busiest->sum_nr_running >
local->sum_nr_running + 1

or busiest->group_weight == 1 and env->idle != CPU_NOT_IDLE so local
cpu is idle or newly idle

>

> >                 return;

> >         }

> >

> >         /*

> > -      * If there aren't any idle CPUs, avoid creating some.

> > +      * Local is fully busy but have to take more load to relieve the

>

> s/have/has

>

> > +      * busiest group

> >          */

>

> I thought that 'local->group_type == group_imbalanced' is allowed as

> well? So 'if (local->group_type < group_overloaded)' (further down)

> could include that?


yes.
The imbalanced state is not very useful for the local group; it only reflects
that the group is not overloaded, so either fully busy or has spare
capacity.
In this case we assume the worst: fully_busy

>

> > -     if (busiest->group_type == group_overloaded &&

> > -         local->group_type   == group_overloaded) {

> > -             load_above_capacity = busiest->sum_h_nr_running * SCHED_CAPACITY_SCALE;

> > -             if (load_above_capacity > busiest->group_capacity) {

> > -                     load_above_capacity -= busiest->group_capacity;

> > -                     load_above_capacity *= scale_load_down(NICE_0_LOAD);

> > -                     load_above_capacity /= busiest->group_capacity;

> > -             } else

> > -                     load_above_capacity = ~0UL;

> > +     if (local->group_type < group_overloaded) {

> > +             /*

> > +              * Local will become overloaded so the avg_load metrics are

> > +              * finally needed.

> > +              */

>

> How does this relate to the decision_matrix[local, busiest] (dm[])? E.g.

> dm[overload, overload] == avg_load or dm[fully_busy, overload] == force.

> It would be nice to be able to match all allowed fields of dm to code sections.


decision_matrix describes how it decides between balanced and unbalanced.
In case of dm[overload, overload], we use the avg_load to decide if it
is balanced or not.
In case of dm[fully_busy, overload], the groups are unbalanced because
fully_busy < overload and we force the balance. Then
calculate_imbalance() uses the avg_load to decide how much will be
moved.

dm[x, y] = force means that we force the balance and we compute the
imbalance later; avg_load may be used to calculate the imbalance.
dm[x, y] = avg_load means that we compare the avg_load to decide
whether we need to balance load between groups.
dm[x, y] = nr_idle means that we compare the number of idle cpus to
decide whether we need to balance. In fact this is no longer true with
patch 7 because we also take nr_h_running into account when weight == 1.

>

> [...]

>

> >   /******* find_busiest_group() helpers end here *********************/

> >

> > +/*

> > + * Decision matrix according to the local and busiest group state

>

> Minor s/state/type ?


ok

>

> > + *

> > + * busiest \ local has_spare fully_busy misfit asym imbalanced overloaded

> > + * has_spare        nr_idle   balanced   N/A    N/A  balanced   balanced

> > + * fully_busy       nr_idle   nr_idle    N/A    N/A  balanced   balanced

> > + * misfit_task      force     N/A        N/A    N/A  force      force

> > + * asym_capacity    force     force      N/A    N/A  force      force

>

> s/asym_capacity/asym_packing


yes

>

> > + * imbalanced       force     force      N/A    N/A  force      force

> > + * overloaded       force     force      N/A    N/A  force      avg_load

> > + *

> > + * N/A :      Not Applicable because already filtered while updating

> > + *            statistics.

> > + * balanced : The system is balanced for these 2 groups.

> > + * force :    Calculate the imbalance as load migration is probably needed.

> > + * avg_load : Only if imbalance is significant enough.

> > + * nr_idle :  dst_cpu is not busy and the number of idle cpus is quite

> > + *            different in groups.

> > + */

> > +

> >   /**

> >    * find_busiest_group - Returns the busiest group within the sched_domain

> >    * if there is an imbalance.

> > @@ -8380,17 +8524,17 @@ static struct sched_group *find_busiest_group(struct lb_env *env)

> >         local = &sds.local_stat;

> >         busiest = &sds.busiest_stat;

> >

> > -     /* ASYM feature bypasses nice load balance check */

> > -     if (busiest->group_asym_packing)

> > -             goto force_balance;

> > -

> >         /* There is no busy sibling group to pull tasks from */

> > -     if (!sds.busiest || busiest->sum_h_nr_running == 0)

> > +     if (!sds.busiest)

> >                 goto out_balanced;

> >

> > -     /* XXX broken for overlapping NUMA groups */

> > -     sds.avg_load = (SCHED_CAPACITY_SCALE * sds.total_load)

> > -                                             / sds.total_capacity;

> > +     /* Misfit tasks should be dealt with regardless of the avg load */

> > +     if (busiest->group_type == group_misfit_task)

> > +             goto force_balance;

> > +

> > +     /* ASYM feature bypasses nice load balance check */

>

> Minor: s/ASYM feature/ASYM_PACKING ... to distinguish clearly from

> ASYM_CPUCAPACITY.


yes

>

> > +     if (busiest->group_type == group_asym_packing)

> > +             goto force_balance;

>

> [...]

>
Dietmar Eggemann Oct. 1, 2019, 8:15 a.m. UTC | #5
On 19/09/2019 09:33, Vincent Guittot wrote:


[...]

> @@ -8042,14 +8104,24 @@ static inline void update_sg_lb_stats(struct lb_env *env,

>  		}

>  	}

>  

> -	/* Adjust by relative CPU capacity of the group */

> +	/* Check if dst cpu is idle and preferred to this group */

> +	if (env->sd->flags & SD_ASYM_PACKING &&

> +	    env->idle != CPU_NOT_IDLE &&

> +	    sgs->sum_h_nr_running &&

> +	    sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu)) {

> +		sgs->group_asym_packing = 1;

> +	}

> +


Since the asym_packing check is done per-sg rather than per-CPU (like misfit_task), can you
not check for asym_packing in group_classify() directly (like for overloaded) and
so get rid of struct sg_lb_stats::group_asym_packing?

Something like (only compile tested).

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d0c3aa1dc290..b2848d6f8a2a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7692,7 +7692,6 @@ struct sg_lb_stats {
        unsigned int idle_cpus;
        unsigned int group_weight;
        enum group_type group_type;
-       unsigned int group_asym_packing; /* Tasks should be moved to preferred CPU */
        unsigned long group_misfit_task_load; /* A CPU has a task too big for its capacity */
 #ifdef CONFIG_NUMA_BALANCING
        unsigned int nr_numa_running;
@@ -7952,6 +7951,20 @@ group_is_overloaded(struct lb_env *env, struct sg_lb_stats *sgs)
        return false;
 }
 
+static inline bool
+group_has_asym_packing(struct lb_env *env, struct sched_group *sg,
+                      struct sg_lb_stats *sgs)
+{
+       if (env->sd->flags & SD_ASYM_PACKING &&
+           env->idle != CPU_NOT_IDLE &&
+           sgs->sum_h_nr_running &&
+           sched_asym_prefer(env->dst_cpu, sg->asym_prefer_cpu)) {
+               return true;
+       }
+
+       return false;
+}
+
 /*
  * group_smaller_min_cpu_capacity: Returns true if sched_group sg has smaller
  * per-CPU capacity than sched_group ref.
@@ -7983,7 +7996,7 @@ group_type group_classify(struct lb_env *env,
        if (sg_imbalanced(group))
                return group_imbalanced;
 
-       if (sgs->group_asym_packing)
+       if (group_has_asym_packing(env, group, sgs))
                return group_asym_packing;
 
        if (sgs->group_misfit_task_load)
@@ -8076,14 +8089,6 @@ static inline void update_sg_lb_stats(struct lb_env *env,
                }
        }
 
-       /* Check if dst cpu is idle and preferred to this group */
-       if (env->sd->flags & SD_ASYM_PACKING &&
-           env->idle != CPU_NOT_IDLE &&
-           sgs->sum_h_nr_running &&
-           sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu)) {
-               sgs->group_asym_packing = 1;
-       }
-
        sgs->group_capacity = group->sgc->capacity;

[...]
Vincent Guittot Oct. 1, 2019, 9:14 a.m. UTC | #6
group_asym_packing

On Tue, 1 Oct 2019 at 10:15, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>

> On 19/09/2019 09:33, Vincent Guittot wrote:

>

>

> [...]

>

> > @@ -8042,14 +8104,24 @@ static inline void update_sg_lb_stats(struct lb_env *env,

> >               }

> >       }

> >

> > -     /* Adjust by relative CPU capacity of the group */

> > +     /* Check if dst cpu is idle and preferred to this group */

> > +     if (env->sd->flags & SD_ASYM_PACKING &&

> > +         env->idle != CPU_NOT_IDLE &&

> > +         sgs->sum_h_nr_running &&

> > +         sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu)) {

> > +             sgs->group_asym_packing = 1;

> > +     }

> > +

>

> Since asym_packing check is done per-sg rather per-CPU (like misfit_task), can you

> not check for asym_packing in group_classify() directly (like for overloaded) and

> so get rid of struct sg_lb_stats::group_asym_packing?


asym_packing uses a lot of fields of env to set group_asym_packing but
env is not statistics and I'd like to remove env from the
group_classify() arguments so it will use only statistics.
Right now, env is an arg of group_classify() only for the imbalance_pct field
and I have replaced env by imbalance_pct in a patch that is under
preparation.
With this change, I can use group_classify() outside load_balance().
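
As a rough sketch, the change under preparation could look like the
following (hypothetical, based only on the description above; the real
patch may differ):

-group_type group_classify(struct lb_env *env,
+group_type group_classify(unsigned int imbalance_pct,
 			  struct sched_group *group, struct sg_lb_stats *sgs)
 {
-	if (group_is_overloaded(env, sgs))
+	if (group_is_overloaded(imbalance_pct, sgs))
 		return group_overloaded;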

>

> Something like (only compile tested).

>

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c

> index d0c3aa1dc290..b2848d6f8a2a 100644

> --- a/kernel/sched/fair.c

> +++ b/kernel/sched/fair.c

> @@ -7692,7 +7692,6 @@ struct sg_lb_stats {

>         unsigned int idle_cpus;

>         unsigned int group_weight;

>         enum group_type group_type;

> -       unsigned int group_asym_packing; /* Tasks should be moved to preferred CPU */

>         unsigned long group_misfit_task_load; /* A CPU has a task too big for its capacity */

>  #ifdef CONFIG_NUMA_BALANCING

>         unsigned int nr_numa_running;

> @@ -7952,6 +7951,20 @@ group_is_overloaded(struct lb_env *env, struct sg_lb_stats *sgs)

>         return false;

>  }

>

> +static inline bool

> +group_has_asym_packing(struct lb_env *env, struct sched_group *sg,

> +                      struct sg_lb_stats *sgs)

> +{

> +       if (env->sd->flags & SD_ASYM_PACKING &&

> +           env->idle != CPU_NOT_IDLE &&

> +           sgs->sum_h_nr_running &&

> +           sched_asym_prefer(env->dst_cpu, sg->asym_prefer_cpu)) {

> +               return true;

> +       }

> +

> +       return false;

> +}

> +

>  /*

>   * group_smaller_min_cpu_capacity: Returns true if sched_group sg has smaller

>   * per-CPU capacity than sched_group ref.

> @@ -7983,7 +7996,7 @@ group_type group_classify(struct lb_env *env,

>         if (sg_imbalanced(group))

>                 return group_imbalanced;

>

> -       if (sgs->group_asym_packing)

> +       if (group_has_asym_packing(env, group, sgs))

>                 return group_asym_packing;

>

>         if (sgs->group_misfit_task_load)

> @@ -8076,14 +8089,6 @@ static inline void update_sg_lb_stats(struct lb_env *env,

>                 }

>         }

>

> -       /* Check if dst cpu is idle and preferred to this group */

> -       if (env->sd->flags & SD_ASYM_PACKING &&

> -           env->idle != CPU_NOT_IDLE &&

> -           sgs->sum_h_nr_running &&

> -           sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu)) {

> -               sgs->group_asym_packing = 1;

> -       }

> -

>         sgs->group_capacity = group->sgc->capacity;

>

> [...]
Dietmar Eggemann Oct. 1, 2019, 4:52 p.m. UTC | #7
On 01/10/2019 10:14, Vincent Guittot wrote:
> On Mon, 30 Sep 2019 at 18:24, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:

>>

>> Hi Vincent,

>>

>> On 19/09/2019 09:33, Vincent Guittot wrote:


[...]

>>> @@ -7347,7 +7362,7 @@ static int detach_tasks(struct lb_env *env)

>>>   {

>>>         struct list_head *tasks = &env->src_rq->cfs_tasks;

>>>         struct task_struct *p;

>>> -     unsigned long load;

>>> +     unsigned long util, load;

>>

>> Minor: Order by length or reduce scope to while loop ?

> 

> I don't get your point here


Nothing dramatic here! Just

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d0c3aa1dc290..a08f342ead89 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7333,8 +7333,8 @@ static const unsigned int sched_nr_migrate_break = 32;
 static int detach_tasks(struct lb_env *env)
 {
        struct list_head *tasks = &env->src_rq->cfs_tasks;
-       struct task_struct *p;
        unsigned long load, util;
+       struct task_struct *p;
        int detached = 0;

        lockdep_assert_held(&env->src_rq->lock);

or

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d0c3aa1dc290..4d1864d43ed7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7334,7 +7334,6 @@ static int detach_tasks(struct lb_env *env)
 {
        struct list_head *tasks = &env->src_rq->cfs_tasks;
        struct task_struct *p;
-       unsigned long load, util;
        int detached = 0;

        lockdep_assert_held(&env->src_rq->lock);
@@ -7343,6 +7342,8 @@ static int detach_tasks(struct lb_env *env)
                return 0;

        while (!list_empty(tasks)) {
+               unsigned long load, util;
+
                /*

[...]

>>> @@ -8042,14 +8104,24 @@ static inline void update_sg_lb_stats(struct lb_env *env,

>>>                 }

>>>         }

>>>

>>> -     /* Adjust by relative CPU capacity of the group */

>>> +     /* Check if dst cpu is idle and preferred to this group */

>>

>> s/preferred to/preferred by ? or the preferred CPU of this group ?

> 

> dst cpu doesn't belong to this group. We compare asym_prefer_cpu of

> this group vs dst_cpu which belongs to another group


Ah, in the sense of 'preferred over'. Got it now!

[...]

>>> +     if (busiest->group_type == group_imbalanced) {

>>> +             /*

>>> +              * In the group_imb case we cannot rely on group-wide averages

>>> +              * to ensure CPU-load equilibrium, try to move any task to fix

>>> +              * the imbalance. The next load balance will take care of

>>> +              * balancing back the system.

>>

>> balancing back ?

> 

> In case of imbalance, we don't try to balance the system but only try

> to get rid of the pinned tasks problem. The system will still be

> unbalanced after the migration and the next load balance will take

> care of balancing the system


OK.

[...]

>>>         /*

>>> -      * Avg load of busiest sg can be less and avg load of local sg can

>>> -      * be greater than avg load across all sgs of sd because avg load

>>> -      * factors in sg capacity and sgs with smaller group_type are

>>> -      * skipped when updating the busiest sg:

>>> +      * Try to use spare capacity of local group without overloading it or

>>> +      * emptying busiest

>>>          */

>>> -     if (busiest->group_type != group_misfit_task &&

>>> -         (busiest->avg_load <= sds->avg_load ||

>>> -          local->avg_load >= sds->avg_load)) {

>>> -             env->imbalance = 0;

>>> +     if (local->group_type == group_has_spare) {

>>> +             if (busiest->group_type > group_fully_busy) {

>>

>> So this could be 'busiest->group_type == group_overloaded' here to match

>> the comment below? Since you handle group_misfit_task,

>> group_asym_packing, group_imbalanced above and return.

> 

> This is just to be more robust in case some new states are added later


OK, although I doubt that additional states can be added easily w/o
carefully auditing the entire lb code ;-)

[...]

>>> +             if (busiest->group_weight == 1 || sds->prefer_sibling) {

>>> +                     /*

>>> +                      * When prefer sibling, evenly spread running tasks on

>>> +                      * groups.

>>> +                      */

>>> +                     env->balance_type = migrate_task;

>>> +                     env->imbalance = (busiest->sum_h_nr_running - local->sum_h_nr_running) >> 1;

>>> +                     return;

>>> +             }

>>> +

>>> +             /*

>>> +              * If there is no overload, we just want to even the number of

>>> +              * idle cpus.

>>> +              */

>>> +             env->balance_type = migrate_task;

>>> +             env->imbalance = max_t(long, 0, (local->idle_cpus - busiest->idle_cpus) >> 1);

>>

>> Why do we need a max_t(long, 0, ...) here and not for the 'if

>> (busiest->group_weight == 1 || sds->prefer_sibling)' case?

> 

> For env->imbalance = (busiest->sum_h_nr_running - local->sum_h_nr_running) >> 1;

> 

> either we have sds->prefer_sibling && busiest->sum_nr_running >

> local->sum_nr_running + 1


I see, this corresponds to

/* Try to move all excess tasks to child's sibling domain */
       if (sds.prefer_sibling && local->group_type == group_has_spare &&
           busiest->sum_h_nr_running > local->sum_h_nr_running + 1)
               goto force_balance;

in find_busiest_group, I assume.

Haven't been able to recreate this yet on my arm64 platform since there
is no prefer_sibling and in case local and busiest have
group_type=group_has_spare they bailout in

         if (busiest->group_type != group_overloaded &&
              (env->idle == CPU_NOT_IDLE ||
               local->idle_cpus <= (busiest->idle_cpus + 1)))
                 goto out_balanced;


[...]

>>> -     if (busiest->group_type == group_overloaded &&

>>> -         local->group_type   == group_overloaded) {

>>> -             load_above_capacity = busiest->sum_h_nr_running * SCHED_CAPACITY_SCALE;

>>> -             if (load_above_capacity > busiest->group_capacity) {

>>> -                     load_above_capacity -= busiest->group_capacity;

>>> -                     load_above_capacity *= scale_load_down(NICE_0_LOAD);

>>> -                     load_above_capacity /= busiest->group_capacity;

>>> -             } else

>>> -                     load_above_capacity = ~0UL;

>>> +     if (local->group_type < group_overloaded) {

>>> +             /*

>>> +              * Local will become overloaded so the avg_load metrics are

>>> +              * finally needed.

>>> +              */

>>

>> How does this relate to the decision_matrix[local, busiest] (dm[])? E.g.

>> dm[overload, overload] == avg_load or dm[fully_busy, overload] == force.

>> It would be nice to be able to match all allowed fields of dm to code sections.

> 

> decision_matrix describes how it decides between balanced or unbalanced.

> In case of dm[overload, overload], we use the avg_load to decide if it

> is balanced or not


OK, that's why you calculate sgs->avg_load in update_sg_lb_stats() only
for 'sgs->group_type == group_overloaded'.

> In case of dm[fully_busy, overload], the groups are unbalanced because

> fully_busy < overload and we force the balance. Then

> calculate_imbalance() uses the avg_load to decide how much will be

> moved


And in this case 'local->group_type < group_overloaded' in
calculate_imbalance(), 'local->avg_load' and 'sds->avg_load' have to be
calculated before using them in env->imbalance = min(...).

OK, got it now.

> dm[overload, overload]=force means that we force the balance and we

> will compute later the imbalance. avg_load may be used to calculate

> the imbalance

> dm[overload, overload]=avg_load means that we compare the avg_load to

> decide whether we need to balance load between groups

> dm[overload, overload]=nr_idle means that we compare the number of

> idle cpus to decide whether we need to balance.  In fact this is no

> more true with patch 7 because we also take into account the number of

> nr_h_running when weight =1


This becomes clearer now ... slowly.

[...]
Dietmar Eggemann Oct. 1, 2019, 4:57 p.m. UTC | #8
On 01/10/2019 11:14, Vincent Guittot wrote:
> group_asym_packing

> 

> On Tue, 1 Oct 2019 at 10:15, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:

>>

>> On 19/09/2019 09:33, Vincent Guittot wrote:

>>

>>

>> [...]

>>

>>> @@ -8042,14 +8104,24 @@ static inline void update_sg_lb_stats(struct lb_env *env,

>>>               }

>>>       }

>>>

>>> -     /* Adjust by relative CPU capacity of the group */

>>> +     /* Check if dst cpu is idle and preferred to this group */

>>> +     if (env->sd->flags & SD_ASYM_PACKING &&

>>> +         env->idle != CPU_NOT_IDLE &&

>>> +         sgs->sum_h_nr_running &&

>>> +         sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu)) {

>>> +             sgs->group_asym_packing = 1;

>>> +     }

>>> +

>>

>> Since asym_packing check is done per-sg rather per-CPU (like misfit_task), can you

>> not check for asym_packing in group_classify() directly (like for overloaded) and

>> so get rid of struct sg_lb_stats::group_asym_packing?

> 

> asym_packing uses a lot of fields of env to set group_asym_packing but

> env is not statistics and i'd like to remove env from

> group_classify() argument so it will use only statistics.

> At now, env is an arg of group_classify only for imbalance_pct field

> and I have replaced env by imabalnce_pct in a patch that is under

> preparation.

> With this change, I can use group_classify outside load_balance()


OK, I see. This relates to the 'update find_idlest_group() to be more
aligned with load_balance()' point mentioned in the cover letter I
assume. To make sure we can use load instead of runnable_load there as well.

[...]
Valentin Schneider Oct. 1, 2019, 5:47 p.m. UTC | #9
On 19/09/2019 08:33, Vincent Guittot wrote:

[...]

> @@ -8283,69 +8363,133 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd

>   */

>  static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *sds)

>  {

> -	unsigned long max_pull, load_above_capacity = ~0UL;

>  	struct sg_lb_stats *local, *busiest;

>  

>  	local = &sds->local_stat;

>  	busiest = &sds->busiest_stat;

>  

> -	if (busiest->group_asym_packing) {

> +	if (busiest->group_type == group_misfit_task) {

> +		/* Set imbalance to allow misfit task to be balanced. */

> +		env->balance_type = migrate_misfit;

> +		env->imbalance = busiest->group_misfit_task_load;

> +		return;

> +	}

> +

> +	if (busiest->group_type == group_asym_packing) {

> +		/*

> +		 * In case of asym capacity, we will try to migrate all load to

> +		 * the preferred CPU.

> +		 */

> +		env->balance_type = migrate_load;

>  		env->imbalance = busiest->group_load;

>  		return;

>  	}

>  

> +	if (busiest->group_type == group_imbalanced) {

> +		/*

> +		 * In the group_imb case we cannot rely on group-wide averages

> +		 * to ensure CPU-load equilibrium, try to move any task to fix

> +		 * the imbalance. The next load balance will take care of

> +		 * balancing back the system.

> +		 */

> +		env->balance_type = migrate_task;

> +		env->imbalance = 1;

> +		return;

> +	}

> +

>  	/*

> -	 * Avg load of busiest sg can be less and avg load of local sg can

> -	 * be greater than avg load across all sgs of sd because avg load

> -	 * factors in sg capacity and sgs with smaller group_type are

> -	 * skipped when updating the busiest sg:

> +	 * Try to use spare capacity of local group without overloading it or

> +	 * emptying busiest

>  	 */

> -	if (busiest->group_type != group_misfit_task &&

> -	    (busiest->avg_load <= sds->avg_load ||

> -	     local->avg_load >= sds->avg_load)) {

> -		env->imbalance = 0;

> +	if (local->group_type == group_has_spare) {

> +		if (busiest->group_type > group_fully_busy) {

> +			/*

> +			 * If busiest is overloaded, try to fill spare

> +			 * capacity. This might end up creating spare capacity

> +			 * in busiest or busiest still being overloaded but

> +			 * there is no simple way to directly compute the

> +			 * amount of load to migrate in order to balance the

> +			 * system.

> +			 */

> +			env->balance_type = migrate_util;

> +			env->imbalance = max(local->group_capacity, local->group_util) -

> +				    local->group_util;

> +			return;

> +		}

> +

> +		if (busiest->group_weight == 1 || sds->prefer_sibling) {

> +			/*

> +			 * When prefer sibling, evenly spread running tasks on

> +			 * groups.

> +			 */

> +			env->balance_type = migrate_task;

> +			env->imbalance = (busiest->sum_h_nr_running - local->sum_h_nr_running) >> 1;


Isn't that one somewhat risky?

Say both groups are classified group_has_spare and we do prefer_sibling.
We'd select busiest as the one with the maximum number of busy CPUs, but it
could be so that busiest.sum_h_nr_running < local.sum_h_nr_running (because
of pinned tasks, or because wakeup failed to properly spread stuff).

The thing should be unsigned so at least we save ourselves from right
shifting a negative value, but we still end up with a ginormous imbalance
(which we then store into env.imbalance which *is* signed... Urgh).
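
To make the concern concrete, a stand-alone illustration (not kernel code;
it assumes the sum_h_nr_running fields are unsigned int, as in
sg_lb_stats):

	#include <stdio.h>

	int main(void)
	{
		/* busiest ended up with fewer runnable tasks than local */
		unsigned int busiest_sum_h_nr_running = 2;
		unsigned int local_sum_h_nr_running = 4;
		long imbalance;

		/* same shape as the calculate_imbalance() expression above */
		imbalance = (busiest_sum_h_nr_running -
			     local_sum_h_nr_running) >> 1;

		/*
		 * The unsigned subtraction wraps around: this prints a huge
		 * positive value (2147483647 with 32-bit int) instead of -1.
		 */
		printf("imbalance = %ld\n", imbalance);
		return 0;
	}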

[...]
Vincent Guittot Oct. 2, 2019, 6:44 a.m. UTC | #10
On Tue, 1 Oct 2019 at 18:53, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>

> On 01/10/2019 10:14, Vincent Guittot wrote:

> > On Mon, 30 Sep 2019 at 18:24, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:

> >>

> >> Hi Vincent,

> >>

> >> On 19/09/2019 09:33, Vincent Guittot wrote:

>

> [...]

>

> >>> @@ -7347,7 +7362,7 @@ static int detach_tasks(struct lb_env *env)

> >>>   {

> >>>         struct list_head *tasks = &env->src_rq->cfs_tasks;

> >>>         struct task_struct *p;

> >>> -     unsigned long load;

> >>> +     unsigned long util, load;

> >>

> >> Minor: Order by length or reduce scope to while loop ?

> >

> > I don't get your point here

>

> Nothing dramatic here! Just

>

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c

> index d0c3aa1dc290..a08f342ead89 100644

> --- a/kernel/sched/fair.c

> +++ b/kernel/sched/fair.c

> @@ -7333,8 +7333,8 @@ static const unsigned int sched_nr_migrate_break = 32;

>  static int detach_tasks(struct lb_env *env)

>  {

>         struct list_head *tasks = &env->src_rq->cfs_tasks;

> -       struct task_struct *p;

>         unsigned long load, util;

> +       struct task_struct *p;


hmm... I still don't get this.
We usually gather pointers instead of interleaving them with other variables

>         int detached = 0;

>

>         lockdep_assert_held(&env->src_rq->lock);

>

> or

>

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c

> index d0c3aa1dc290..4d1864d43ed7 100644

> --- a/kernel/sched/fair.c

> +++ b/kernel/sched/fair.c

> @@ -7334,7 +7334,6 @@ static int detach_tasks(struct lb_env *env)

>  {

>         struct list_head *tasks = &env->src_rq->cfs_tasks;

>         struct task_struct *p;

> -       unsigned long load, util;

>         int detached = 0;

>

>         lockdep_assert_held(&env->src_rq->lock);

> @@ -7343,6 +7342,8 @@ static int detach_tasks(struct lb_env *env)

>                 return 0;

>

>         while (!list_empty(tasks)) {

> +               unsigned long load, util;

> +

>                 /*

>

> [...]

>

> >>> @@ -8042,14 +8104,24 @@ static inline void update_sg_lb_stats(struct lb_env *env,

> >>>                 }

> >>>         }

> >>>

> >>> -     /* Adjust by relative CPU capacity of the group */

> >>> +     /* Check if dst cpu is idle and preferred to this group */

> >>

> >> s/preferred to/preferred by ? or the preferred CPU of this group ?

> >

> > dst cpu doesn't belong to this group. We compare asym_prefer_cpu of

> > this group vs dst_cpu which belongs to another group

>

> Ah, in the sense of 'preferred over'. Got it now!

>

> [...]

>

> >>> +     if (busiest->group_type == group_imbalanced) {

> >>> +             /*

> >>> +              * In the group_imb case we cannot rely on group-wide averages

> >>> +              * to ensure CPU-load equilibrium, try to move any task to fix

> >>> +              * the imbalance. The next load balance will take care of

> >>> +              * balancing back the system.

> >>

> >> balancing back ?

> >

> > In case of imbalance, we don't try to balance the system but only try

> > to get rid of the pinned tasks problem. The system will still be

> > unbalanced after the migration and the next load balance will take

> > care of balancing the system

>

> OK.

>

> [...]

>

> >>>         /*

> >>> -      * Avg load of busiest sg can be less and avg load of local sg can

> >>> -      * be greater than avg load across all sgs of sd because avg load

> >>> -      * factors in sg capacity and sgs with smaller group_type are

> >>> -      * skipped when updating the busiest sg:

> >>> +      * Try to use spare capacity of local group without overloading it or

> >>> +      * emptying busiest

> >>>          */

> >>> -     if (busiest->group_type != group_misfit_task &&

> >>> -         (busiest->avg_load <= sds->avg_load ||

> >>> -          local->avg_load >= sds->avg_load)) {

> >>> -             env->imbalance = 0;

> >>> +     if (local->group_type == group_has_spare) {

> >>> +             if (busiest->group_type > group_fully_busy) {

> >>

> >> So this could be 'busiest->group_type == group_overloaded' here to match

> >> the comment below? Since you handle group_misfit_task,

> >> group_asym_packing, group_imbalanced above and return.

> >

> > This is just to be more robust in case some new states are added later

>

> OK, although I doubt that additional states can be added easily w/o

> carefully auditing the entire lb code ;-)

>

> [...]

>

> >>> +             if (busiest->group_weight == 1 || sds->prefer_sibling) {

> >>> +                     /*

> >>> +                      * When prefer sibling, evenly spread running tasks on

> >>> +                      * groups.

> >>> +                      */

> >>> +                     env->balance_type = migrate_task;

> >>> +                     env->imbalance = (busiest->sum_h_nr_running - local->sum_h_nr_running) >> 1;

> >>> +                     return;

> >>> +             }

> >>> +

> >>> +             /*

> >>> +              * If there is no overload, we just want to even the number of

> >>> +              * idle cpus.

> >>> +              */

> >>> +             env->balance_type = migrate_task;

> >>> +             env->imbalance = max_t(long, 0, (local->idle_cpus - busiest->idle_cpus) >> 1);

> >>

> >> Why do we need a max_t(long, 0, ...) here and not for the 'if

> >> (busiest->group_weight == 1 || sds->prefer_sibling)' case?

> >

> > For env->imbalance = (busiest->sum_h_nr_running - local->sum_h_nr_running) >> 1;

> >

> > either we have sds->prefer_sibling && busiest->sum_nr_running >

> > local->sum_nr_running + 1

>

> I see, this corresponds to

>

> /* Try to move all excess tasks to child's sibling domain */

>        if (sds.prefer_sibling && local->group_type == group_has_spare &&

>            busiest->sum_h_nr_running > local->sum_h_nr_running + 1)

>                goto force_balance;

>

> in find_busiest_group, I assume.


yes, it is

>

> Haven't been able to recreate this yet on my arm64 platform since there

> is no prefer_sibling and in case local and busiest have


You probably have a b.L platform, for which the flag is cleared, whereas
the hikey (a dual quad-core arm64 board) takes advantage of prefer_sibling
at DIE level to spread tasks

> group_type=group_has_spare they bailout in

>

>          if (busiest->group_type != group_overloaded &&

>               (env->idle == CPU_NOT_IDLE ||

>                local->idle_cpus <= (busiest->idle_cpus + 1)))

>                  goto out_balanced;

>

>

> [...]

>

> >>> -     if (busiest->group_type == group_overloaded &&

> >>> -         local->group_type   == group_overloaded) {

> >>> -             load_above_capacity = busiest->sum_h_nr_running * SCHED_CAPACITY_SCALE;

> >>> -             if (load_above_capacity > busiest->group_capacity) {

> >>> -                     load_above_capacity -= busiest->group_capacity;

> >>> -                     load_above_capacity *= scale_load_down(NICE_0_LOAD);

> >>> -                     load_above_capacity /= busiest->group_capacity;

> >>> -             } else

> >>> -                     load_above_capacity = ~0UL;

> >>> +     if (local->group_type < group_overloaded) {

> >>> +             /*

> >>> +              * Local will become overloaded so the avg_load metrics are

> >>> +              * finally needed.

> >>> +              */

> >>

> >> How does this relate to the decision_matrix[local, busiest] (dm[])? E.g.

> >> dm[overload, overload] == avg_load or dm[fully_busy, overload] == force.

> >> It would be nice to be able to match all allowed fields of dm to code sections.

> >

> > decision_matrix describes how it decides between balanced or unbalanced.

> > In case of dm[overload, overload], we use the avg_load to decide if it

> > is balanced or not

>

> OK, that's why you calculate sgs->avg_load in update_sg_lb_stats() only

> for 'sgs->group_type == group_overloaded'.

>

> > In case of dm[fully_busy, overload], the groups are unbalanced because

> > fully_busy < overload and we force the balance. Then

> > calculate_imbalance() uses the avg_load to decide how much will be

> > moved

>

> And in this case 'local->group_type < group_overloaded' in

> calculate_imbalance(), 'local->avg_load' and 'sds->avg_load' have to be

> calculated before using them in env->imbalance = min(...).

>

> OK, got it now.

>

> > dm[overload, overload]=force means that we force the balance and we

> > will compute later the imbalance. avg_load may be used to calculate

> > the imbalance

> > dm[overload, overload]=avg_load means that we compare the avg_load to

> > decide whether we need to balance load between groups

> > dm[overload, overload]=nr_idle means that we compare the number of

> > idle cpus to decide whether we need to balance.  In fact this is no

> > more true with patch 7 because we also take into account the number of

> > nr_h_running when weight =1

>

> This becomes clearer now ... slowly.

>

> [...]
Vincent Guittot Oct. 2, 2019, 8:23 a.m. UTC | #11
On Tue, 1 Oct 2019 at 18:53, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>

> On 01/10/2019 10:14, Vincent Guittot wrote:

> > On Mon, 30 Sep 2019 at 18:24, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:

> >>

> >> Hi Vincent,

> >>

> >> On 19/09/2019 09:33, Vincent Guittot wrote:

>

[...]

>

> >>> +             if (busiest->group_weight == 1 || sds->prefer_sibling) {

> >>> +                     /*

> >>> +                      * When prefer sibling, evenly spread running tasks on

> >>> +                      * groups.

> >>> +                      */

> >>> +                     env->balance_type = migrate_task;

> >>> +                     env->imbalance = (busiest->sum_h_nr_running - local->sum_h_nr_running) >> 1;

> >>> +                     return;

> >>> +             }

> >>> +

> >>> +             /*

> >>> +              * If there is no overload, we just want to even the number of

> >>> +              * idle cpus.

> >>> +              */

> >>> +             env->balance_type = migrate_task;

> >>> +             env->imbalance = max_t(long, 0, (local->idle_cpus - busiest->idle_cpus) >> 1);

> >>

> >> Why do we need a max_t(long, 0, ...) here and not for the 'if

> >> (busiest->group_weight == 1 || sds->prefer_sibling)' case?

> >

> > For env->imbalance = (busiest->sum_h_nr_running - local->sum_h_nr_running) >> 1;

> >

> > either we have sds->prefer_sibling && busiest->sum_nr_running >

> > local->sum_nr_running + 1

>

> I see, this corresponds to

>

> /* Try to move all excess tasks to child's sibling domain */

>        if (sds.prefer_sibling && local->group_type == group_has_spare &&

>            busiest->sum_h_nr_running > local->sum_h_nr_running + 1)

>                goto force_balance;

>

> in find_busiest_group, I assume.


yes. But it seems that I missed a case:

prefer_sibling is set,
busiest->sum_h_nr_running <= local->sum_h_nr_running + 1 so we skip
the goto force_balance above,
but env->idle != CPU_NOT_IDLE and local->idle_cpus >
(busiest->idle_cpus + 1) so we also skip the goto out_balanced and finally
call calculate_imbalance()

in calculate_imbalance with prefer_sibling set, imbalance =
(busiest->sum_h_nr_running - local->sum_h_nr_running) >> 1;

so we probably want something similar to max_t(long, 0,
(busiest->sum_h_nr_running - local->sum_h_nr_running) >> 1)

>

> Haven't been able to recreate this yet on my arm64 platform since there

> is no prefer_sibling and in case local and busiest have

> group_type=group_has_spare they bailout in

>

>          if (busiest->group_type != group_overloaded &&

>               (env->idle == CPU_NOT_IDLE ||

>                local->idle_cpus <= (busiest->idle_cpus + 1)))

>                  goto out_balanced;

>

>

> [...]

>

> >>> -     if (busiest->group_type == group_overloaded &&

> >>> -         local->group_type   == group_overloaded) {

> >>> -             load_above_capacity = busiest->sum_h_nr_running * SCHED_CAPACITY_SCALE;

> >>> -             if (load_above_capacity > busiest->group_capacity) {

> >>> -                     load_above_capacity -= busiest->group_capacity;

> >>> -                     load_above_capacity *= scale_load_down(NICE_0_LOAD);

> >>> -                     load_above_capacity /= busiest->group_capacity;

> >>> -             } else

> >>> -                     load_above_capacity = ~0UL;

> >>> +     if (local->group_type < group_overloaded) {

> >>> +             /*

> >>> +              * Local will become overloaded so the avg_load metrics are

> >>> +              * finally needed.

> >>> +              */

> >>

> >> How does this relate to the decision_matrix[local, busiest] (dm[])? E.g.

> >> dm[overload, overload] == avg_load or dm[fully_busy, overload] == force.

> >> It would be nice to be able to match all allowed fields of dm to code sections.

> >

> > decision_matrix describes how it decides between balanced or unbalanced.

> > In case of dm[overload, overload], we use the avg_load to decide if it

> > is balanced or not

>

> OK, that's why you calculate sgs->avg_load in update_sg_lb_stats() only

> for 'sgs->group_type == group_overloaded'.

>

> > In case of dm[fully_busy, overload], the groups are unbalanced because

> > fully_busy < overload and we force the balance. Then

> > calculate_imbalance() uses the avg_load to decide how much will be

> > moved

>

> And in this case 'local->group_type < group_overloaded' in

> calculate_imbalance(), 'local->avg_load' and 'sds->avg_load' have to be

> calculated before using them in env->imbalance = min(...).

>

> OK, got it now.

>

> > dm[overload, overload]=force means that we force the balance and we

> > will compute later the imbalance. avg_load may be used to calculate

> > the imbalance

> > dm[overload, overload]=avg_load means that we compare the avg_load to

> > decide whether we need to balance load between groups

> > dm[overload, overload]=nr_idle means that we compare the number of

> > idle cpus to decide whether we need to balance.  In fact this is no

> > more true with patch 7 because we also take into account the number of

> > nr_h_running when weight =1

>

> This becomes clearer now ... slowly.

>

> [...]
Vincent Guittot Oct. 2, 2019, 8:30 a.m. UTC | #12
On Tue, 1 Oct 2019 at 19:47, Valentin Schneider
<valentin.schneider@arm.com> wrote:
>

> On 19/09/2019 08:33, Vincent Guittot wrote:

>

> [...]

>

> > @@ -8283,69 +8363,133 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd

> >   */

> >  static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *sds)

> >  {

> > -     unsigned long max_pull, load_above_capacity = ~0UL;

> >       struct sg_lb_stats *local, *busiest;

> >

> >       local = &sds->local_stat;

> >       busiest = &sds->busiest_stat;

> >

> > -     if (busiest->group_asym_packing) {

> > +     if (busiest->group_type == group_misfit_task) {

> > +             /* Set imbalance to allow misfit task to be balanced. */

> > +             env->balance_type = migrate_misfit;

> > +             env->imbalance = busiest->group_misfit_task_load;

> > +             return;

> > +     }

> > +

> > +     if (busiest->group_type == group_asym_packing) {

> > +             /*

> > +              * In case of asym capacity, we will try to migrate all load to

> > +              * the preferred CPU.

> > +              */

> > +             env->balance_type = migrate_load;

> >               env->imbalance = busiest->group_load;

> >               return;

> >       }

> >

> > +     if (busiest->group_type == group_imbalanced) {

> > +             /*

> > +              * In the group_imb case we cannot rely on group-wide averages

> > +              * to ensure CPU-load equilibrium, try to move any task to fix

> > +              * the imbalance. The next load balance will take care of

> > +              * balancing back the system.

> > +              */

> > +             env->balance_type = migrate_task;

> > +             env->imbalance = 1;

> > +             return;

> > +     }

> > +

> >       /*

> > -      * Avg load of busiest sg can be less and avg load of local sg can

> > -      * be greater than avg load across all sgs of sd because avg load

> > -      * factors in sg capacity and sgs with smaller group_type are

> > -      * skipped when updating the busiest sg:

> > +      * Try to use spare capacity of local group without overloading it or

> > +      * emptying busiest

> >        */

> > -     if (busiest->group_type != group_misfit_task &&

> > -         (busiest->avg_load <= sds->avg_load ||

> > -          local->avg_load >= sds->avg_load)) {

> > -             env->imbalance = 0;

> > +     if (local->group_type == group_has_spare) {

> > +             if (busiest->group_type > group_fully_busy) {

> > +                     /*

> > +                      * If busiest is overloaded, try to fill spare

> > +                      * capacity. This might end up creating spare capacity

> > +                      * in busiest or busiest still being overloaded but

> > +                      * there is no simple way to directly compute the

> > +                      * amount of load to migrate in order to balance the

> > +                      * system.

> > +                      */

> > +                     env->balance_type = migrate_util;

> > +                     env->imbalance = max(local->group_capacity, local->group_util) -

> > +                                 local->group_util;

> > +                     return;

> > +             }

> > +

> > +             if (busiest->group_weight == 1 || sds->prefer_sibling) {

> > +                     /*

> > +                      * When prefer sibling, evenly spread running tasks on

> > +                      * groups.

> > +                      */

> > +                     env->balance_type = migrate_task;

> > +                     env->imbalance = (busiest->sum_h_nr_running - local->sum_h_nr_running) >> 1;

>

> Isn't that one somewhat risky?

>

> Say both groups are classified group_has_spare and we do prefer_sibling.

> We'd select busiest as the one with the maximum number of busy CPUs, but it

> could be so that busiest.sum_h_nr_running < local.sum_h_nr_running (because

> pinned tasks or wakeup failed to properly spread stuff).

>

> The thing should be unsigned so at least we save ourselves from right

> shifting a negative value, but we still end up with a gygornous imbalance

> (which we then store into env.imbalance which *is* signed... Urgh).


so it's not clear what happens with a right shift on a negative signed
value, and this seems to be compiler dependent, so even
max_t(long, 0, (local->idle_cpus - busiest->idle_cpus) >> 1) might be wrong

I'm going to update it


>

> [...]
Dietmar Eggemann Oct. 2, 2019, 9:21 a.m. UTC | #13
On 02/10/2019 08:44, Vincent Guittot wrote:
> On Tue, 1 Oct 2019 at 18:53, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:

>>

>> On 01/10/2019 10:14, Vincent Guittot wrote:

>>> On Mon, 30 Sep 2019 at 18:24, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:

>>>>

>>>> Hi Vincent,

>>>>

>>>> On 19/09/2019 09:33, Vincent Guittot wrote:

>>

>> [...]

>>

>>>>> @@ -7347,7 +7362,7 @@ static int detach_tasks(struct lb_env *env)

>>>>>   {

>>>>>         struct list_head *tasks = &env->src_rq->cfs_tasks;

>>>>>         struct task_struct *p;

>>>>> -     unsigned long load;

>>>>> +     unsigned long util, load;

>>>>

>>>> Minor: Order by length or reduce scope to while loop ?

>>>

>>> I don't get your point here

>>

>> Nothing dramatic here! Just

>>

>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c

>> index d0c3aa1dc290..a08f342ead89 100644

>> --- a/kernel/sched/fair.c

>> +++ b/kernel/sched/fair.c

>> @@ -7333,8 +7333,8 @@ static const unsigned int sched_nr_migrate_break = 32;

>>  static int detach_tasks(struct lb_env *env)

>>  {

>>         struct list_head *tasks = &env->src_rq->cfs_tasks;

>> -       struct task_struct *p;

>>         unsigned long load, util;

>> +       struct task_struct *p;

> 

> hmm... I still don't get this.

> We usually gather pointers instead of interleaving them with other varaiables


I thought we should always order local variable declarations from
longest to shortest line but can't find this rule in coding-style.rst
either.

[...]
Dietmar Eggemann Oct. 2, 2019, 9:24 a.m. UTC | #14
On 02/10/2019 10:23, Vincent Guittot wrote:
> On Tue, 1 Oct 2019 at 18:53, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:

>>

>> On 01/10/2019 10:14, Vincent Guittot wrote:

>>> On Mon, 30 Sep 2019 at 18:24, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:

>>>>

>>>> Hi Vincent,

>>>>

>>>> On 19/09/2019 09:33, Vincent Guittot wrote:

>>

> [...]

> 

>>

>>>>> +             if (busiest->group_weight == 1 || sds->prefer_sibling) {

>>>>> +                     /*

>>>>> +                      * When prefer sibling, evenly spread running tasks on

>>>>> +                      * groups.

>>>>> +                      */

>>>>> +                     env->balance_type = migrate_task;

>>>>> +                     env->imbalance = (busiest->sum_h_nr_running - local->sum_h_nr_running) >> 1;

>>>>> +                     return;

>>>>> +             }

>>>>> +

>>>>> +             /*

>>>>> +              * If there is no overload, we just want to even the number of

>>>>> +              * idle cpus.

>>>>> +              */

>>>>> +             env->balance_type = migrate_task;

>>>>> +             env->imbalance = max_t(long, 0, (local->idle_cpus - busiest->idle_cpus) >> 1);

>>>>

>>>> Why do we need a max_t(long, 0, ...) here and not for the 'if

>>>> (busiest->group_weight == 1 || sds->prefer_sibling)' case?

>>>

>>> For env->imbalance = (busiest->sum_h_nr_running - local->sum_h_nr_running) >> 1;

>>>

>>> either we have sds->prefer_sibling && busiest->sum_nr_running >

>>> local->sum_nr_running + 1

>>

>> I see, this corresponds to

>>

>> /* Try to move all excess tasks to child's sibling domain */

>>        if (sds.prefer_sibling && local->group_type == group_has_spare &&

>>            busiest->sum_h_nr_running > local->sum_h_nr_running + 1)

>>                goto force_balance;

>>

>> in find_busiest_group, I assume.

> 

> yes. But it seems that I missed a case:

> 

> prefer_sibling is set

> busiest->sum_h_nr_running <= local->sum_h_nr_running + 1 so we skip

> goto force_balance above

> But env->idle != CPU_NOT_IDLE  and local->idle_cpus >

> (busiest->idle_cpus + 1) so we also skip goto out_balance and finally

> call calculate_imbalance()

> 

> in calculate_imbalance with prefer_sibling set, imbalance =

> (busiest->sum_h_nr_running - local->sum_h_nr_running) >> 1;

> 

> so we probably want something similar to max_t(long, 0,

> (busiest->sum_h_nr_running - local->sum_h_nr_running) >> 1)


Makes sense.

Caught a couple of

[  369.310464] 0-3->4-7 2->5 env->imbalance = 2147483646
[  369.310796] 0-3->4-7 2->4 env->imbalance = 2147483647

in this if condition on h620 running hackbench.
Valentin Schneider Oct. 2, 2019, 10:47 a.m. UTC | #15
On 02/10/2019 09:30, Vincent Guittot wrote:
>> Isn't that one somewhat risky?

>>

>> Say both groups are classified group_has_spare and we do prefer_sibling.

>> We'd select busiest as the one with the maximum number of busy CPUs, but it

>> could be so that busiest.sum_h_nr_running < local.sum_h_nr_running (because

>> pinned tasks or wakeup failed to properly spread stuff).

>>

>> The thing should be unsigned so at least we save ourselves from right

>> shifting a negative value, but we still end up with a gygornous imbalance

>> (which we then store into env.imbalance which *is* signed... Urgh).

> 

> so it's not clear what happen with a right shift on negative signed

> value and this seems to be compiler dependent so even

> max_t(long, 0, (local->idle_cpus - busiest->idle_cpus) >> 1) might be wrong

> 


Yeah, right shifts on negative signed values are implementation defined. This
is what I was worried about initially, but I think the expression resulting
from the subtraction is unsigned (both terms are unsigned) so this would
just wrap when busiest < local - but that is still a problem.


((local->idle_cpus - busiest->idle_cpus) >> 1) should be fine because we do
have this check in find_busiest_group() before heading off to
calculate_imbalance():

  if (busiest->group_type != group_overloaded &&
      (env->idle == CPU_NOT_IDLE ||
       local->idle_cpus <= (busiest->idle_cpus + 1)))
	  /* ... */
	  goto out_balanced;

which ensures the subtraction will be at least 2. We're missing something
equivalent for the sum_h_nr_running case.
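
(For instance, local->idle_cpus = 4 and busiest->idle_cpus = 1 passes that
check, since 4 > 1 + 1, and gives an imbalance of (4 - 1) >> 1 = 1, so the
subtraction can never go negative on this path.)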

> I'm going to update it

> 

> 

>>

>> [...]
Peter Zijlstra Oct. 8, 2019, 1:02 p.m. UTC | #16
On Wed, Oct 02, 2019 at 11:21:20AM +0200, Dietmar Eggemann wrote:

> I thought we should always order local variable declarations from

> longest to shortest line but can't find this rule in coding-style.rst

> either.


You're right though, that is generally encouraged. From last year's
(2018) KS there was the notion of a subsystem handbook and the tip
tree's submissions thereto can be found there:

  https://lore.kernel.org/lkml/20181107171149.165693799@linutronix.de/

But for some raisin that never actually went anywhere...
Peter Zijlstra Oct. 8, 2019, 2:16 p.m. UTC | #17
On Wed, Oct 02, 2019 at 11:47:59AM +0100, Valentin Schneider wrote:

> Yeah, right shift on signed negative values are implementation defined.


Seriously? Even under -fno-strict-overflow? There is a perfectly
sensible operation for signed shift right, this stuff should not be
undefined.
Valentin Schneider Oct. 8, 2019, 2:34 p.m. UTC | #18
On 08/10/2019 15:16, Peter Zijlstra wrote:
> On Wed, Oct 02, 2019 at 11:47:59AM +0100, Valentin Schneider wrote:

> 

>> Yeah, right shift on signed negative values are implementation defined.

> 

> Seriously? Even under -fno-strict-overflow? There is a perfectly

> sensible operation for signed shift right, this stuff should not be

> undefined.

> 


Mmm good point. I didn't see anything relevant in the description of that
flag. All my copy of the C99 standard (draft) says at 6.5.7.5 is:

"""
The result of E1 >> E2 [...] If E1 has a signed type and a negative value,
the resulting value is implementation-defined.
"""

Arithmetic shift would make sense, but I think this stems from two's
complement not being imposed: 6.2.6.2.2 says sign can be done with
sign + magnitude, two's complement or ones' complement...

I suppose when you really just want a division you should ask for division
semantics - i.e. use '/'. I'd expect compilers to be smart enough to turn
that into a shift if a power of 2 is involved, and to do something else
if negative values can be involved.
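
As a minimal userspace sketch of the difference being discussed here (the
exact values assume GCC's documented arithmetic-shift behaviour for negative
operands, which the standard leaves implementation-defined):

#include <stdio.h>

int main(void)
{
	long diff = -3;

	/*
	 * C99 6.5.7.5: right-shifting a negative signed value is
	 * implementation-defined.  GCC documents it as an arithmetic shift,
	 * which rounds toward negative infinity, while signed division
	 * rounds toward zero - so even where the shift is well defined in
	 * practice, '>> 1' and '/ 2' disagree for negative operands.
	 */
	printf("-3 >> 1 = %ld\n", diff >> 1);	/* -2 with GCC */
	printf("-3 / 2  = %ld\n", diff / 2);	/* -1 per C99  */

	return 0;
}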
Vincent Guittot Oct. 8, 2019, 3:30 p.m. UTC | #19
Le Tuesday 08 Oct 2019 à 15:34:04 (+0100), Valentin Schneider a écrit :
> On 08/10/2019 15:16, Peter Zijlstra wrote:

> > On Wed, Oct 02, 2019 at 11:47:59AM +0100, Valentin Schneider wrote:

> > 

> >> Yeah, right shift on signed negative values are implementation defined.

> > 

> > Seriously? Even under -fno-strict-overflow? There is a perfectly

> > sensible operation for signed shift right, this stuff should not be

> > undefined.

> > 

> 

> Mmm good point. I didn't see anything relevant in the description of that

> flag. All my copy of the C99 standard (draft) says at 6.5.7.5 is:

> 

> """

> The result of E1 >> E2 [...] If E1 has a signed type and a negative value,

> the resulting value is implementation-defined.

> """

> 

> Arithmetic shift would make sense, but I think this stems from twos'

> complement not being imposed: 6.2.6.2.2 says sign can be done with

> sign + magnitude, twos complement or ones' complement...

> 

> I suppose when you really just want a division you should ask for division

> semantics - i.e. use '/'. I'd expect compilers to be smart enough to turn

> that into a shift if a power of 2 is involved, and to do something else

> if negative values can be involved.


This is how I plan to get rid of the problem:
+		if (busiest->group_weight == 1 || sds->prefer_sibling) {
+			unsigned int nr_diff = busiest->sum_h_nr_running;
+			/*
+			 * When prefer sibling, evenly spread running tasks on
+			 * groups.
+			 */
+			env->migration_type = migrate_task;
+			lsub_positive(&nr_diff, local->sum_h_nr_running);
+			env->imbalance = nr_diff >> 1;
+			return;
+		}
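
For context, lsub_positive() is the existing fair.c helper that subtracts
while clamping the result at zero, so nr_diff cannot wrap below zero before
the shift. A rough equivalent of what the hunk relies on (simplified sketch,
not the actual kernel macro):

/* subtract 'val' from '*ptr' without letting it go below zero */
static inline void lsub_positive_sketch(unsigned int *ptr, unsigned int val)
{
	*ptr -= (val < *ptr) ? val : *ptr;
}

With busiest->sum_h_nr_running smaller than local->sum_h_nr_running this
leaves nr_diff at 0, so env->imbalance becomes 0 instead of the huge wrapped
values seen in the hackbench traces above.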
Valentin Schneider Oct. 8, 2019, 3:48 p.m. UTC | #20
On 08/10/2019 16:30, Vincent Guittot wrote:
[...]
> 

> This is how I plan to get ride of the problem:

> +		if (busiest->group_weight == 1 || sds->prefer_sibling) {

> +			unsigned int nr_diff = busiest->sum_h_nr_running;

> +			/*

> +			 * When prefer sibling, evenly spread running tasks on

> +			 * groups.

> +			 */

> +			env->migration_type = migrate_task;

> +			lsub_positive(&nr_diff, local->sum_h_nr_running);

> +			env->imbalance = nr_diff >> 1;

> +			return;

> +		}

> 


I think this wants a 

/* Local could have more tasks than busiest */

atop the lsub, otherwise yeah that ought to work.
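
i.e. the hunk would then presumably look roughly like this (a sketch that
just combines the two remarks above):

		if (busiest->group_weight == 1 || sds->prefer_sibling) {
			unsigned int nr_diff = busiest->sum_h_nr_running;
			/*
			 * When prefer sibling, evenly spread running tasks on
			 * groups.
			 */
			env->migration_type = migrate_task;
			/* Local could have more tasks than busiest */
			lsub_positive(&nr_diff, local->sum_h_nr_running);
			env->imbalance = nr_diff >> 1;
			return;
		}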
Peter Zijlstra Oct. 8, 2019, 4:33 p.m. UTC | #21
On Tue, Oct 08, 2019 at 03:34:04PM +0100, Valentin Schneider wrote:
> On 08/10/2019 15:16, Peter Zijlstra wrote:

> > On Wed, Oct 02, 2019 at 11:47:59AM +0100, Valentin Schneider wrote:

> > 

> >> Yeah, right shift on signed negative values are implementation defined.

> > 

> > Seriously? Even under -fno-strict-overflow? There is a perfectly

> > sensible operation for signed shift right, this stuff should not be

> > undefined.

> > 

> 

> Mmm good point. I didn't see anything relevant in the description of that

> flag. All my copy of the C99 standard (draft) says at 6.5.7.5 is:

> 

> """

> The result of E1 >> E2 [...] If E1 has a signed type and a negative value,

> the resulting value is implementation-defined.

> """

> 

> Arithmetic shift would make sense, but I think this stems from twos'

> complement not being imposed: 6.2.6.2.2 says sign can be done with

> sign + magnitude, twos complement or ones' complement...


But -fno-strict-overflow mandates 2s complement for all such signed
issues.
Valentin Schneider Oct. 8, 2019, 4:39 p.m. UTC | #22
On 08/10/2019 17:33, Peter Zijlstra wrote:
> On Tue, Oct 08, 2019 at 03:34:04PM +0100, Valentin Schneider wrote:

>> On 08/10/2019 15:16, Peter Zijlstra wrote:

>>> On Wed, Oct 02, 2019 at 11:47:59AM +0100, Valentin Schneider wrote:

>>>

>>>> Yeah, right shift on signed negative values are implementation defined.

>>>

>>> Seriously? Even under -fno-strict-overflow? There is a perfectly

>>> sensible operation for signed shift right, this stuff should not be

>>> undefined.

>>>

>>

>> Mmm good point. I didn't see anything relevant in the description of that

>> flag. All my copy of the C99 standard (draft) says at 6.5.7.5 is:

>>

>> """

>> The result of E1 >> E2 [...] If E1 has a signed type and a negative value,

>> the resulting value is implementation-defined.

>> """

>>

>> Arithmetic shift would make sense, but I think this stems from twos'

>> complement not being imposed: 6.2.6.2.2 says sign can be done with

>> sign + magnitude, twos complement or ones' complement...

> 

> But -fno-strict-overflow mandates 2s complement for all such signed

> issues.

> 


So then there really shouldn't be any ambiguity. I have no idea if
-fno-strict-overflow then also lifts the undefinedness of the right shifts,
gotta get my spade and dig some more.
Valentin Schneider Oct. 8, 2019, 5:36 p.m. UTC | #23
On 08/10/2019 17:39, Valentin Schneider wrote:
>>

>> But -fno-strict-overflow mandates 2s complement for all such signed

>> issues.

>>

> 

> So then there really shouldn't be any ambiguity. I have no idea if

> -fno-strict-overflow then also lifts the undefinedness of the right shifts,

> gotta get my spade and dig some more.

> 


Bleh, no luck really.

Thinking about it some more, you can't overflow by right shifting
(logical/arithmetic), so I dunno if signed right shift counts as "such
signed issues".
Peter Zijlstra Oct. 8, 2019, 5:39 p.m. UTC | #24
On Tue, Oct 08, 2019 at 05:30:02PM +0200, Vincent Guittot wrote:

> This is how I plan to get ride of the problem:

> +		if (busiest->group_weight == 1 || sds->prefer_sibling) {

> +			unsigned int nr_diff = busiest->sum_h_nr_running;

> +			/*

> +			 * When prefer sibling, evenly spread running tasks on

> +			 * groups.

> +			 */

> +			env->migration_type = migrate_task;

> +			lsub_positive(&nr_diff, local->sum_h_nr_running);

> +			env->imbalance = nr_diff >> 1;

> +			return;

> +		}


I'm thinking the max_t(long, 0, ...); variant reads a lot simpler and
really _should_ work given that -fno-strict-overflow / -fwrapv mandates
2s complement.
Peter Zijlstra Oct. 8, 2019, 5:55 p.m. UTC | #25
On Thu, Sep 19, 2019 at 09:33:35AM +0200, Vincent Guittot wrote:
> +	if (busiest->group_type == group_asym_packing) {

> +		/*

> +		 * In case of asym capacity, we will try to migrate all load to

> +		 * the preferred CPU.

> +		 */

> +		env->balance_type = migrate_load;

>  		env->imbalance = busiest->group_load;

>  		return;

>  	}


I was a bit surprised with this; I sorta expected a migrate_task,1 here.

The asym_packing thing has always been a nr_running issue to me. If
there is something to run, we should run on as many siblings/cores as
possible, but preferably the 'highest' ranked siblings/cores.
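
A sketch of the alternative being hinted at here (what the hunk might look
like if asym packing were treated purely as a nr_running problem - not what
the posted patch does):

	if (busiest->group_type == group_asym_packing) {
		/*
		 * Move one task at a time toward the preferred CPU instead
		 * of trying to migrate the whole group load.
		 */
		env->balance_type = migrate_task;
		env->imbalance = 1;
		return;
	}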
Vincent Guittot Oct. 8, 2019, 6:45 p.m. UTC | #26
On Tue, 8 Oct 2019 at 19:39, Peter Zijlstra <peterz@infradead.org> wrote:
>

> On Tue, Oct 08, 2019 at 05:30:02PM +0200, Vincent Guittot wrote:

>

> > This is how I plan to get ride of the problem:

> > +             if (busiest->group_weight == 1 || sds->prefer_sibling) {

> > +                     unsigned int nr_diff = busiest->sum_h_nr_running;

> > +                     /*

> > +                      * When prefer sibling, evenly spread running tasks on

> > +                      * groups.

> > +                      */

> > +                     env->migration_type = migrate_task;

> > +                     lsub_positive(&nr_diff, local->sum_h_nr_running);

> > +                     env->imbalance = nr_diff >> 1;

> > +                     return;

> > +             }

>

> I'm thinking the max_t(long, 0, ...); variant reads a lot simpler and

> really _should_ work given that -fno-strict-overflow / -fwrapv mandates

> 2s complement.


Another point that I have overlooked is that sum_h_nr_running is
unsigned int whereas imbalance is long.

In fact, (long) (unsigned long A - unsigned long B) >> 1 works correctly,
but
(long) (unsigned int A - unsigned int B) >> 1 doesn't: the wrapped 32-bit
result becomes a huge positive long instead of a negative value.
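
A small sketch of why the operand width matters (LP64 assumed, plus GCC's
usual modular conversion and arithmetic shift for the signed parts, both of
which are implementation-defined):

#include <stdio.h>

int main(void)
{
	unsigned int  a32 = 2, b32 = 3;
	unsigned long a64 = 2, b64 = 3;

	/*
	 * 2 - 3 wraps in the unsigned type.  The wrapped unsigned long
	 * reinterpreted as long is -1, so the shifted result stays small,
	 * whereas the wrapped unsigned int (~4.29e9) fits in a 64-bit long
	 * as a huge positive number - which is where the ~2147483647
	 * imbalance in the hackbench traces comes from.
	 */
	printf("%ld\n", (long)(a64 - b64) >> 1);	/* -1 */
	printf("%ld\n", (long)(a32 - b32) >> 1);	/* 2147483647 */

	return 0;
}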
Vincent Guittot Oct. 8, 2019, 6:47 p.m. UTC | #27
On Tue, 8 Oct 2019 at 19:55, Peter Zijlstra <peterz@infradead.org> wrote:
>

> On Thu, Sep 19, 2019 at 09:33:35AM +0200, Vincent Guittot wrote:

> > +     if (busiest->group_type == group_asym_packing) {

> > +             /*

> > +              * In case of asym capacity, we will try to migrate all load to

> > +              * the preferred CPU.

> > +              */

> > +             env->balance_type = migrate_load;

> >               env->imbalance = busiest->group_load;

> >               return;

> >       }

>

> I was a bit surprised with this; I sorta expected a migrate_task,1 here.


I have just kept the current mechanism

>

> The asym_packing thing has always been a nr_running issue to me. If

> there is something to run, we should run on as many siblings/cores as

> possible, but preferably the 'highest' ranked siblings/cores.


I can probably set the number of running tasks to migrate instead of the load
>
Parth Shah Oct. 16, 2019, 7:21 a.m. UTC | #28
On 9/19/19 1:03 PM, Vincent Guittot wrote:

[...]

> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

> ---

>  kernel/sched/fair.c | 585 ++++++++++++++++++++++++++++++++++------------------

>  1 file changed, 380 insertions(+), 205 deletions(-)

> 

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c

> index 017aad0..d33379c 100644

> --- a/kernel/sched/fair.c

> +++ b/kernel/sched/fair.c

> @@ -7078,11 +7078,26 @@ static unsigned long __read_mostly max_load_balance_interval = HZ/10;

> 

>  enum fbq_type { regular, remote, all };

> 

> +/*

> + * group_type describes the group of CPUs at the moment of the load balance.

> + * The enum is ordered by pulling priority, with the group with lowest priority

> + * first so the groupe_type can be simply compared when selecting the busiest

> + * group. see update_sd_pick_busiest().

> + */

>  enum group_type {

> -	group_other = 0,

> +	group_has_spare = 0,

> +	group_fully_busy,

>  	group_misfit_task,

> +	group_asym_packing,

>  	group_imbalanced,

> -	group_overloaded,

> +	group_overloaded

> +};

> +

> +enum migration_type {

> +	migrate_load = 0,

> +	migrate_util,

> +	migrate_task,

> +	migrate_misfit

>  };

> 

>  #define LBF_ALL_PINNED	0x01

> @@ -7115,7 +7130,7 @@ struct lb_env {

>  	unsigned int		loop_max;

> 

>  	enum fbq_type		fbq_type;

> -	enum group_type		src_grp_type;

> +	enum migration_type	balance_type;

>  	struct list_head	tasks;

>  };

> 

> @@ -7347,7 +7362,7 @@ static int detach_tasks(struct lb_env *env)

>  {

>  	struct list_head *tasks = &env->src_rq->cfs_tasks;

>  	struct task_struct *p;

> -	unsigned long load;

> +	unsigned long util, load;

>  	int detached = 0;

> 

>  	lockdep_assert_held(&env->src_rq->lock);

> @@ -7380,19 +7395,53 @@ static int detach_tasks(struct lb_env *env)

>  		if (!can_migrate_task(p, env))

>  			goto next;

> 

> -		load = task_h_load(p);

> +		switch (env->balance_type) {

> +		case migrate_load:

> +			load = task_h_load(p);

> 

> -		if (sched_feat(LB_MIN) && load < 16 && !env->sd->nr_balance_failed)

> -			goto next;

> +			if (sched_feat(LB_MIN) &&

> +			    load < 16 && !env->sd->nr_balance_failed)

> +				goto next;

> 

> -		if ((load / 2) > env->imbalance)

> -			goto next;

> +			if ((load / 2) > env->imbalance)

> +				goto next;

> +

> +			env->imbalance -= load;

> +			break;

> +

> +		case migrate_util:

> +			util = task_util_est(p);

> +

> +			if (util > env->imbalance)


Can you please explain what would happen for
`if (util/2 > env->imbalance)` ?
Just like when migrating load, even util shouldn't be migrated if
env->imbalance is just near the utilization of the task being moved, should it?

> +				goto next;

> +

> +			env->imbalance -= util;

> +			break;

> +[ ... ]


Thanks,
Parth
Vincent Guittot Oct. 16, 2019, 11:56 a.m. UTC | #29
On Wed, 16 Oct 2019 at 09:21, Parth Shah <parth@linux.ibm.com> wrote:
>

>

>

> On 9/19/19 1:03 PM, Vincent Guittot wrote:

>

> [...]

>

> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

> > ---

> >  kernel/sched/fair.c | 585 ++++++++++++++++++++++++++++++++++------------------

> >  1 file changed, 380 insertions(+), 205 deletions(-)

> >

> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c

> > index 017aad0..d33379c 100644

> > --- a/kernel/sched/fair.c

> > +++ b/kernel/sched/fair.c

> > @@ -7078,11 +7078,26 @@ static unsigned long __read_mostly max_load_balance_interval = HZ/10;

> >

> >  enum fbq_type { regular, remote, all };

> >

> > +/*

> > + * group_type describes the group of CPUs at the moment of the load balance.

> > + * The enum is ordered by pulling priority, with the group with lowest priority

> > + * first so the groupe_type can be simply compared when selecting the busiest

> > + * group. see update_sd_pick_busiest().

> > + */

> >  enum group_type {

> > -     group_other = 0,

> > +     group_has_spare = 0,

> > +     group_fully_busy,

> >       group_misfit_task,

> > +     group_asym_packing,

> >       group_imbalanced,

> > -     group_overloaded,

> > +     group_overloaded

> > +};

> > +

> > +enum migration_type {

> > +     migrate_load = 0,

> > +     migrate_util,

> > +     migrate_task,

> > +     migrate_misfit

> >  };

> >

> >  #define LBF_ALL_PINNED       0x01

> > @@ -7115,7 +7130,7 @@ struct lb_env {

> >       unsigned int            loop_max;

> >

> >       enum fbq_type           fbq_type;

> > -     enum group_type         src_grp_type;

> > +     enum migration_type     balance_type;

> >       struct list_head        tasks;

> >  };

> >

> > @@ -7347,7 +7362,7 @@ static int detach_tasks(struct lb_env *env)

> >  {

> >       struct list_head *tasks = &env->src_rq->cfs_tasks;

> >       struct task_struct *p;

> > -     unsigned long load;

> > +     unsigned long util, load;

> >       int detached = 0;

> >

> >       lockdep_assert_held(&env->src_rq->lock);

> > @@ -7380,19 +7395,53 @@ static int detach_tasks(struct lb_env *env)

> >               if (!can_migrate_task(p, env))

> >                       goto next;

> >

> > -             load = task_h_load(p);

> > +             switch (env->balance_type) {

> > +             case migrate_load:

> > +                     load = task_h_load(p);

> >

> > -             if (sched_feat(LB_MIN) && load < 16 && !env->sd->nr_balance_failed)

> > -                     goto next;

> > +                     if (sched_feat(LB_MIN) &&

> > +                         load < 16 && !env->sd->nr_balance_failed)

> > +                             goto next;

> >

> > -             if ((load / 2) > env->imbalance)

> > -                     goto next;

> > +                     if ((load / 2) > env->imbalance)

> > +                             goto next;

> > +

> > +                     env->imbalance -= load;

> > +                     break;

> > +

> > +             case migrate_util:

> > +                     util = task_util_est(p);

> > +

> > +                     if (util > env->imbalance)

>

> Can you please explain what would happen for

> `if (util/2 > env->imbalance)` ?

> just like when migrating load, even util shouldn't be migrated if

> env->imbalance is just near the utilization of the task being moved, isn't it?


I have chosen util and not util/2 to be conservative because
migrate_util is used to fill spare capacity.
With `if (util/2 > env->imbalance)`, we can more easily overload the
local group or pick too much utilization from the overloaded group.
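
As a rough worked example (numbers made up for illustration): with
group_capacity 1024 and group_util 800 in the local group, env->imbalance is
max(1024, 800) - 800 = 224. A task with util 400 is skipped by the current
check, which is the conservative behaviour wanted here; with util/2 it would
be pulled and push the local group about 176 above its capacity.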

>

> > +                             goto next;

> > +

> > +                     env->imbalance -= util;

> > +                     break;

> > +[ ... ]

>

> Thanks,

> Parth

>
Parth Shah Oct. 18, 2019, 5:34 a.m. UTC | #30
On 10/16/19 5:26 PM, Vincent Guittot wrote:
> On Wed, 16 Oct 2019 at 09:21, Parth Shah <parth@linux.ibm.com> wrote:

>>

>>

>>

>> On 9/19/19 1:03 PM, Vincent Guittot wrote:

>>

>> [...]

>>

>>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

>>> ---

>>>  kernel/sched/fair.c | 585 ++++++++++++++++++++++++++++++++++------------------

>>>  1 file changed, 380 insertions(+), 205 deletions(-)

>>>

>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c

>>> index 017aad0..d33379c 100644

>>> --- a/kernel/sched/fair.c

>>> +++ b/kernel/sched/fair.c

>>> @@ -7078,11 +7078,26 @@ static unsigned long __read_mostly max_load_balance_interval = HZ/10;

>>>

>>>  enum fbq_type { regular, remote, all };

>>>

>>> +/*

>>> + * group_type describes the group of CPUs at the moment of the load balance.

>>> + * The enum is ordered by pulling priority, with the group with lowest priority

>>> + * first so the groupe_type can be simply compared when selecting the busiest

>>> + * group. see update_sd_pick_busiest().

>>> + */

>>>  enum group_type {

>>> -     group_other = 0,

>>> +     group_has_spare = 0,

>>> +     group_fully_busy,

>>>       group_misfit_task,

>>> +     group_asym_packing,

>>>       group_imbalanced,

>>> -     group_overloaded,

>>> +     group_overloaded

>>> +};

>>> +

>>> +enum migration_type {

>>> +     migrate_load = 0,

>>> +     migrate_util,

>>> +     migrate_task,

>>> +     migrate_misfit

>>>  };

>>>

>>>  #define LBF_ALL_PINNED       0x01

>>> @@ -7115,7 +7130,7 @@ struct lb_env {

>>>       unsigned int            loop_max;

>>>

>>>       enum fbq_type           fbq_type;

>>> -     enum group_type         src_grp_type;

>>> +     enum migration_type     balance_type;

>>>       struct list_head        tasks;

>>>  };

>>>

>>> @@ -7347,7 +7362,7 @@ static int detach_tasks(struct lb_env *env)

>>>  {

>>>       struct list_head *tasks = &env->src_rq->cfs_tasks;

>>>       struct task_struct *p;

>>> -     unsigned long load;

>>> +     unsigned long util, load;

>>>       int detached = 0;

>>>

>>>       lockdep_assert_held(&env->src_rq->lock);

>>> @@ -7380,19 +7395,53 @@ static int detach_tasks(struct lb_env *env)

>>>               if (!can_migrate_task(p, env))

>>>                       goto next;

>>>

>>> -             load = task_h_load(p);

>>> +             switch (env->balance_type) {

>>> +             case migrate_load:

>>> +                     load = task_h_load(p);

>>>

>>> -             if (sched_feat(LB_MIN) && load < 16 && !env->sd->nr_balance_failed)

>>> -                     goto next;

>>> +                     if (sched_feat(LB_MIN) &&

>>> +                         load < 16 && !env->sd->nr_balance_failed)

>>> +                             goto next;

>>>

>>> -             if ((load / 2) > env->imbalance)

>>> -                     goto next;

>>> +                     if ((load / 2) > env->imbalance)

>>> +                             goto next;

>>> +

>>> +                     env->imbalance -= load;

>>> +                     break;

>>> +

>>> +             case migrate_util:

>>> +                     util = task_util_est(p);

>>> +

>>> +                     if (util > env->imbalance)

>>

>> Can you please explain what would happen for

>> `if (util/2 > env->imbalance)` ?

>> just like when migrating load, even util shouldn't be migrated if

>> env->imbalance is just near the utilization of the task being moved, isn't it?

> 

> I have chosen uti and not util/2 to be conservative because

> migrate_util is used to fill spare capacity.

> With `if (util/2 > env->imbalance)`, we can more easily overload the

> local group or pick too much utilization from the overloaded group.

> 


Fair enough. I missed the point that unlike migrate_load, with
migrate_util, env->imbalance is just the spare capacity of the local group.

Thanks,
Parth

>>

>>> +                             goto next;

>>> +

>>> +                     env->imbalance -= util;

>>> +                     break;

>>> +[ ... ]

>>

>> Thanks,

>> Parth

>>
diff mbox series

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 017aad0..d33379c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7078,11 +7078,26 @@  static unsigned long __read_mostly max_load_balance_interval = HZ/10;
 
 enum fbq_type { regular, remote, all };
 
+/*
+ * group_type describes the group of CPUs at the moment of the load balance.
+ * The enum is ordered by pulling priority, with the group with lowest priority
+ * first so the group_type can be simply compared when selecting the busiest
+ * group. see update_sd_pick_busiest().
+ */
 enum group_type {
-	group_other = 0,
+	group_has_spare = 0,
+	group_fully_busy,
 	group_misfit_task,
+	group_asym_packing,
 	group_imbalanced,
-	group_overloaded,
+	group_overloaded
+};
+
+enum migration_type {
+	migrate_load = 0,
+	migrate_util,
+	migrate_task,
+	migrate_misfit
 };
 
 #define LBF_ALL_PINNED	0x01
@@ -7115,7 +7130,7 @@  struct lb_env {
 	unsigned int		loop_max;
 
 	enum fbq_type		fbq_type;
-	enum group_type		src_grp_type;
+	enum migration_type	balance_type;
 	struct list_head	tasks;
 };
 
@@ -7347,7 +7362,7 @@  static int detach_tasks(struct lb_env *env)
 {
 	struct list_head *tasks = &env->src_rq->cfs_tasks;
 	struct task_struct *p;
-	unsigned long load;
+	unsigned long util, load;
 	int detached = 0;
 
 	lockdep_assert_held(&env->src_rq->lock);
@@ -7380,19 +7395,53 @@  static int detach_tasks(struct lb_env *env)
 		if (!can_migrate_task(p, env))
 			goto next;
 
-		load = task_h_load(p);
+		switch (env->balance_type) {
+		case migrate_load:
+			load = task_h_load(p);
 
-		if (sched_feat(LB_MIN) && load < 16 && !env->sd->nr_balance_failed)
-			goto next;
+			if (sched_feat(LB_MIN) &&
+			    load < 16 && !env->sd->nr_balance_failed)
+				goto next;
 
-		if ((load / 2) > env->imbalance)
-			goto next;
+			if ((load / 2) > env->imbalance)
+				goto next;
+
+			env->imbalance -= load;
+			break;
+
+		case migrate_util:
+			util = task_util_est(p);
+
+			if (util > env->imbalance)
+				goto next;
+
+			env->imbalance -= util;
+			break;
+
+		case migrate_task:
+			/* Migrate task */
+			env->imbalance--;
+			break;
+
+		case migrate_misfit:
+			load = task_h_load(p);
+
+			/*
+			 * utilization of misfit task might decrease a bit
+			 * since it has been recorded. Be conservative in the
+			 * condition.
+			 */
+			if (load < env->imbalance)
+				goto next;
+
+			env->imbalance = 0;
+			break;
+		}
 
 		detach_task(p, env);
 		list_add(&p->se.group_node, &env->tasks);
 
 		detached++;
-		env->imbalance -= load;
 
 #ifdef CONFIG_PREEMPT
 		/*
@@ -7406,7 +7455,7 @@  static int detach_tasks(struct lb_env *env)
 
 		/*
 		 * We only want to steal up to the prescribed amount of
-		 * runnable load.
+		 * load/util/tasks.
 		 */
 		if (env->imbalance <= 0)
 			break;
@@ -7671,7 +7720,6 @@  struct sg_lb_stats {
 	unsigned int idle_cpus;
 	unsigned int group_weight;
 	enum group_type group_type;
-	int group_no_capacity;
 	unsigned int group_asym_packing; /* Tasks should be moved to preferred CPU */
 	unsigned long group_misfit_task_load; /* A CPU has a task too big for its capacity */
 #ifdef CONFIG_NUMA_BALANCING
@@ -7687,10 +7735,10 @@  struct sg_lb_stats {
 struct sd_lb_stats {
 	struct sched_group *busiest;	/* Busiest group in this sd */
 	struct sched_group *local;	/* Local group in this sd */
-	unsigned long total_running;
 	unsigned long total_load;	/* Total load of all groups in sd */
 	unsigned long total_capacity;	/* Total capacity of all groups in sd */
 	unsigned long avg_load;	/* Average load across all groups in sd */
+	unsigned int prefer_sibling; /* tasks should go to sibling first */
 
 	struct sg_lb_stats busiest_stat;/* Statistics of the busiest group */
 	struct sg_lb_stats local_stat;	/* Statistics of the local group */
@@ -7707,13 +7755,11 @@  static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
 	*sds = (struct sd_lb_stats){
 		.busiest = NULL,
 		.local = NULL,
-		.total_running = 0UL,
 		.total_load = 0UL,
 		.total_capacity = 0UL,
 		.busiest_stat = {
-			.avg_load = 0UL,
-			.sum_h_nr_running = 0,
-			.group_type = group_other,
+			.idle_cpus = UINT_MAX,
+			.group_type = group_has_spare,
 		},
 	};
 }
@@ -7955,19 +8001,26 @@  group_smaller_max_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
 }
 
 static inline enum
-group_type group_classify(struct sched_group *group,
+group_type group_classify(struct lb_env *env,
+			  struct sched_group *group,
 			  struct sg_lb_stats *sgs)
 {
-	if (sgs->group_no_capacity)
+	if (group_is_overloaded(env, sgs))
 		return group_overloaded;
 
 	if (sg_imbalanced(group))
 		return group_imbalanced;
 
+	if (sgs->group_asym_packing)
+		return group_asym_packing;
+
 	if (sgs->group_misfit_task_load)
 		return group_misfit_task;
 
-	return group_other;
+	if (!group_has_capacity(env, sgs))
+		return group_fully_busy;
+
+	return group_has_spare;
 }
 
 static bool update_nohz_stats(struct rq *rq, bool force)
@@ -8004,10 +8057,12 @@  static inline void update_sg_lb_stats(struct lb_env *env,
 				      struct sg_lb_stats *sgs,
 				      int *sg_status)
 {
-	int i, nr_running;
+	int i, nr_running, local_group;
 
 	memset(sgs, 0, sizeof(*sgs));
 
+	local_group = cpumask_test_cpu(env->dst_cpu, sched_group_span(group));
+
 	for_each_cpu_and(i, sched_group_span(group), env->cpus) {
 		struct rq *rq = cpu_rq(i);
 
@@ -8032,9 +8087,16 @@  static inline void update_sg_lb_stats(struct lb_env *env,
 		/*
 		 * No need to call idle_cpu() if nr_running is not 0
 		 */
-		if (!nr_running && idle_cpu(i))
+		if (!nr_running && idle_cpu(i)) {
 			sgs->idle_cpus++;
+			/* Idle cpu can't have misfit task */
+			continue;
+		}
+
+		if (local_group)
+			continue;
 
+		/* Check for a misfit task on the cpu */
 		if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
 		    sgs->group_misfit_task_load < rq->misfit_task_load) {
 			sgs->group_misfit_task_load = rq->misfit_task_load;
@@ -8042,14 +8104,24 @@  static inline void update_sg_lb_stats(struct lb_env *env,
 		}
 	}
 
-	/* Adjust by relative CPU capacity of the group */
+	/* Check if dst cpu is idle and preferred to this group */
+	if (env->sd->flags & SD_ASYM_PACKING &&
+	    env->idle != CPU_NOT_IDLE &&
+	    sgs->sum_h_nr_running &&
+	    sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu)) {
+		sgs->group_asym_packing = 1;
+	}
+
 	sgs->group_capacity = group->sgc->capacity;
-	sgs->avg_load = (sgs->group_load*SCHED_CAPACITY_SCALE) / sgs->group_capacity;
 
 	sgs->group_weight = group->group_weight;
 
-	sgs->group_no_capacity = group_is_overloaded(env, sgs);
-	sgs->group_type = group_classify(group, sgs);
+	sgs->group_type = group_classify(env, group, sgs);
+
+	/* Computing avg_load makes sense only when group is overloaded */
+	if (sgs->group_type == group_overloaded)
+		sgs->avg_load = (sgs->group_load * SCHED_CAPACITY_SCALE) /
+				sgs->group_capacity;
 }
 
 /**
@@ -8072,6 +8144,10 @@  static bool update_sd_pick_busiest(struct lb_env *env,
 {
 	struct sg_lb_stats *busiest = &sds->busiest_stat;
 
+	/* Make sure that there is at least one task to pull */
+	if (!sgs->sum_h_nr_running)
+		return false;
+
 	/*
 	 * Don't try to pull misfit tasks we can't help.
 	 * We can use max_capacity here as reduction in capacity on some
@@ -8080,7 +8156,7 @@  static bool update_sd_pick_busiest(struct lb_env *env,
 	 */
 	if (sgs->group_type == group_misfit_task &&
 	    (!group_smaller_max_cpu_capacity(sg, sds->local) ||
-	     !group_has_capacity(env, &sds->local_stat)))
+	     sds->local_stat.group_type != group_has_spare))
 		return false;
 
 	if (sgs->group_type > busiest->group_type)
@@ -8089,62 +8165,80 @@  static bool update_sd_pick_busiest(struct lb_env *env,
 	if (sgs->group_type < busiest->group_type)
 		return false;
 
-	if (sgs->avg_load <= busiest->avg_load)
-		return false;
-
-	if (!(env->sd->flags & SD_ASYM_CPUCAPACITY))
-		goto asym_packing;
-
 	/*
-	 * Candidate sg has no more than one task per CPU and
-	 * has higher per-CPU capacity. Migrating tasks to less
-	 * capable CPUs may harm throughput. Maximize throughput,
-	 * power/energy consequences are not considered.
+	 * The candidate and the current busiest group are the same type of
+	 * group. Let's check which one is the busiest according to the type.
 	 */
-	if (sgs->sum_h_nr_running <= sgs->group_weight &&
-	    group_smaller_min_cpu_capacity(sds->local, sg))
-		return false;
 
-	/*
-	 * If we have more than one misfit sg go with the biggest misfit.
-	 */
-	if (sgs->group_type == group_misfit_task &&
-	    sgs->group_misfit_task_load < busiest->group_misfit_task_load)
+	switch (sgs->group_type) {
+	case group_overloaded:
+		/* Select the overloaded group with highest avg_load. */
+		if (sgs->avg_load <= busiest->avg_load)
+			return false;
+		break;
+
+	case group_imbalanced:
+		/*
+		 * Select the 1st imbalanced group as we don't have any way to
+		 * choose one more than another.
+		 */
 		return false;
 
-asym_packing:
-	/* This is the busiest node in its class. */
-	if (!(env->sd->flags & SD_ASYM_PACKING))
-		return true;
+	case group_asym_packing:
+		/* Prefer to move from lowest priority CPU's work */
+		if (sched_asym_prefer(sg->asym_prefer_cpu, sds->busiest->asym_prefer_cpu))
+			return false;
+		break;
 
-	/* No ASYM_PACKING if target CPU is already busy */
-	if (env->idle == CPU_NOT_IDLE)
-		return true;
-	/*
-	 * ASYM_PACKING needs to move all the work to the highest
-	 * prority CPUs in the group, therefore mark all groups
-	 * of lower priority than ourself as busy.
-	 *
-	 * This is primarily intended to used at the sibling level.  Some
-	 * cores like POWER7 prefer to use lower numbered SMT threads.  In the
-	 * case of POWER7, it can move to lower SMT modes only when higher
-	 * threads are idle.  When in lower SMT modes, the threads will
-	 * perform better since they share less core resources.  Hence when we
-	 * have idle threads, we want them to be the higher ones.
-	 */
-	if (sgs->sum_h_nr_running &&
-	    sched_asym_prefer(env->dst_cpu, sg->asym_prefer_cpu)) {
-		sgs->group_asym_packing = 1;
-		if (!sds->busiest)
-			return true;
+	case group_misfit_task:
+		/*
+		 * If we have more than one misfit sg go with the biggest
+		 * misfit.
+		 */
+		if (sgs->group_misfit_task_load < busiest->group_misfit_task_load)
+			return false;
+		break;
 
-		/* Prefer to move from lowest priority CPU's work */
-		if (sched_asym_prefer(sds->busiest->asym_prefer_cpu,
-				      sg->asym_prefer_cpu))
-			return true;
+	case group_fully_busy:
+		/*
+		 * Select the fully busy group with highest avg_load. In
+		 * theory, there is no need to pull task from such kind of
+		 * group because tasks have all compute capacity that they need
+		 * but we can still improve the overall throughput by reducing
+		 * contention when accessing shared HW resources.
+		 *
+		 * XXX for now avg_load is not computed and always 0 so we
+		 * select the 1st one.
+		 */
+		if (sgs->avg_load <= busiest->avg_load)
+			return false;
+		break;
+
+	case group_has_spare:
+		/*
+		 * Select not overloaded group with lowest number of
+		 * idle cpus. We could also compare the spare capacity
+		 * which is more stable but it can end up that the
+		 * group has less spare capacity but finally more idle
+		 * cpus which means less opportunity to pull tasks.
+		 */
+		if (sgs->idle_cpus >= busiest->idle_cpus)
+			return false;
+		break;
 	}
 
-	return false;
+	/*
+	 * Candidate sg has no more than one task per CPU and has higher
+	 * per-CPU capacity. Migrating tasks to less capable CPUs may harm
+	 * throughput. Maximize throughput, power/energy consequences are not
+	 * considered.
+	 */
+	if ((env->sd->flags & SD_ASYM_CPUCAPACITY) &&
+	    (sgs->group_type <= group_fully_busy) &&
+	    (group_smaller_min_cpu_capacity(sds->local, sg)))
+		return false;
+
+	return true;
 }
 
 #ifdef CONFIG_NUMA_BALANCING
@@ -8182,13 +8276,13 @@  static inline enum fbq_type fbq_classify_rq(struct rq *rq)
  * @env: The load balancing environment.
  * @sds: variable to hold the statistics for this sched_domain.
  */
+
 static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sds)
 {
 	struct sched_domain *child = env->sd->child;
 	struct sched_group *sg = env->sd->groups;
 	struct sg_lb_stats *local = &sds->local_stat;
 	struct sg_lb_stats tmp_sgs;
-	bool prefer_sibling = child && child->flags & SD_PREFER_SIBLING;
 	int sg_status = 0;
 
 #ifdef CONFIG_NO_HZ_COMMON
@@ -8215,22 +8309,6 @@  static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 		if (local_group)
 			goto next_group;
 
-		/*
-		 * In case the child domain prefers tasks go to siblings
-		 * first, lower the sg capacity so that we'll try
-		 * and move all the excess tasks away. We lower the capacity
-		 * of a group only if the local group has the capacity to fit
-		 * these excess tasks. The extra check prevents the case where
-		 * you always pull from the heaviest group when it is already
-		 * under-utilized (possible with a large weight task outweighs
-		 * the tasks on the system).
-		 */
-		if (prefer_sibling && sds->local &&
-		    group_has_capacity(env, local) &&
-		    (sgs->sum_h_nr_running > local->sum_h_nr_running + 1)) {
-			sgs->group_no_capacity = 1;
-			sgs->group_type = group_classify(sg, sgs);
-		}
 
 		if (update_sd_pick_busiest(env, sds, sg, sgs)) {
 			sds->busiest = sg;
@@ -8239,13 +8317,15 @@  static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 
 next_group:
 		/* Now, start updating sd_lb_stats */
-		sds->total_running += sgs->sum_h_nr_running;
 		sds->total_load += sgs->group_load;
 		sds->total_capacity += sgs->group_capacity;
 
 		sg = sg->next;
 	} while (sg != env->sd->groups);
 
+	/* Tag domain that child domain prefers tasks go to siblings first */
+	sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING;
+
 #ifdef CONFIG_NO_HZ_COMMON
 	if ((env->flags & LBF_NOHZ_AGAIN) &&
 	    cpumask_subset(nohz.idle_cpus_mask, sched_domain_span(env->sd))) {
@@ -8283,69 +8363,133 @@  static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
  */
 static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
 {
-	unsigned long max_pull, load_above_capacity = ~0UL;
 	struct sg_lb_stats *local, *busiest;
 
 	local = &sds->local_stat;
 	busiest = &sds->busiest_stat;
 
-	if (busiest->group_asym_packing) {
+	if (busiest->group_type == group_misfit_task) {
+		/* Set imbalance to allow misfit task to be balanced. */
+		env->balance_type = migrate_misfit;
+		env->imbalance = busiest->group_misfit_task_load;
+		return;
+	}
+
+	if (busiest->group_type == group_asym_packing) {
+		/*
+		 * In case of asym packing, we will try to migrate all load to
+		 * the preferred CPU.
+		 */
+		env->balance_type = migrate_load;
 		env->imbalance = busiest->group_load;
 		return;
 	}
 
+	if (busiest->group_type == group_imbalanced) {
+		/*
+		 * In the group_imb case we cannot rely on group-wide averages
+		 * to ensure CPU-load equilibrium; try to move any task to fix
+		 * the imbalance. The next load balance will take care of
+		 * balancing back the system.
+		 */
+		env->balance_type = migrate_task;
+		env->imbalance = 1;
+		return;
+	}
+
 	/*
-	 * Avg load of busiest sg can be less and avg load of local sg can
-	 * be greater than avg load across all sgs of sd because avg load
-	 * factors in sg capacity and sgs with smaller group_type are
-	 * skipped when updating the busiest sg:
+	 * Try to use the spare capacity of the local group without
+	 * overloading it or emptying busiest.
 	 */
-	if (busiest->group_type != group_misfit_task &&
-	    (busiest->avg_load <= sds->avg_load ||
-	     local->avg_load >= sds->avg_load)) {
-		env->imbalance = 0;
+	if (local->group_type == group_has_spare) {
+		if (busiest->group_type > group_fully_busy) {
+			/*
+			 * If busiest is overloaded, try to fill the spare
+			 * capacity of the local group. This may leave busiest
+			 * with spare capacity of its own, or still overloaded,
+			 * but there is no simple way to directly compute the
+			 * amount of load to migrate in order to balance the
+			 * system.
+			 */
+			env->balance_type = migrate_util;
+			env->imbalance = max(local->group_capacity, local->group_util) -
+				    local->group_util;
+			return;
+		}
+
+		if (busiest->group_weight == 1 || sds->prefer_sibling) {
+			/*
+			 * If the busiest group is a single CPU, or if the
+			 * child domain prefers siblings, spread the running
+			 * tasks evenly between the groups.
+			 */
+			env->balance_type = migrate_task;
+			env->imbalance = (busiest->sum_h_nr_running - local->sum_h_nr_running) >> 1;
+			return;
+		}
+
+		/*
+		 * If there is no overload, we just want to even out the
+		 * number of idle CPUs.
+		 */
+		env->balance_type = migrate_task;
+		env->imbalance = max_t(long, 0, (local->idle_cpus - busiest->idle_cpus) >> 1);
 		return;
 	}
 
 	/*
-	 * If there aren't any idle CPUs, avoid creating some.
+	 * Local is fully busy but has to take more load to relieve the
+	 * busiest group.
 	 */
-	if (busiest->group_type == group_overloaded &&
-	    local->group_type   == group_overloaded) {
-		load_above_capacity = busiest->sum_h_nr_running * SCHED_CAPACITY_SCALE;
-		if (load_above_capacity > busiest->group_capacity) {
-			load_above_capacity -= busiest->group_capacity;
-			load_above_capacity *= scale_load_down(NICE_0_LOAD);
-			load_above_capacity /= busiest->group_capacity;
-		} else
-			load_above_capacity = ~0UL;
+	if (local->group_type < group_overloaded) {
+		/*
+		 * Local will become overloaded, so we finally need the
+		 * avg_load metrics.
+		 */
+
+		local->avg_load = (local->group_load * SCHED_CAPACITY_SCALE) /
+				  local->group_capacity;
+
+		sds->avg_load = (sds->total_load * SCHED_CAPACITY_SCALE) /
+				sds->total_capacity;
 	}
 
 	/*
-	 * We're trying to get all the CPUs to the average_load, so we don't
-	 * want to push ourselves above the average load, nor do we wish to
-	 * reduce the max loaded CPU below the average load. At the same time,
-	 * we also don't want to reduce the group load below the group
-	 * capacity. Thus we look for the minimum possible imbalance.
+	 * Both groups are or will become overloaded and we're trying to get
+	 * all the CPUs to the average_load, so we don't want to push ourselves
+	 * above the average load, nor do we wish to reduce the max loaded CPU
+	 * below the average load. At the same time, we also don't want to
+	 * reduce the group load below the group capacity. Thus we look for
+	 * the minimum possible imbalance.
 	 */
-	max_pull = min(busiest->avg_load - sds->avg_load, load_above_capacity);
-
-	/* How much load to actually move to equalise the imbalance */
+	env->balance_type = migrate_load;
 	env->imbalance = min(
-		max_pull * busiest->group_capacity,
+		(busiest->avg_load - sds->avg_load) * busiest->group_capacity,
 		(sds->avg_load - local->avg_load) * local->group_capacity
 	) / SCHED_CAPACITY_SCALE;
-
-	/* Boost imbalance to allow misfit task to be balanced. */
-	if (busiest->group_type == group_misfit_task) {
-		env->imbalance = max_t(long, env->imbalance,
-				       busiest->group_misfit_task_load);
-	}
-
 }
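
A standalone sketch of the three imbalance computations above, with made-up numbers (not kernel code; SCHED_CAPACITY_SCALE is only assumed to be 1024 here):

#include <stdio.h>

#define SCALE 1024UL	/* stands in for SCHED_CAPACITY_SCALE */

int main(void)
{
	/* migrate_util: fill the spare capacity of the local group. */
	unsigned long local_capacity = 2048, local_util = 1200;
	unsigned long util_max = local_capacity > local_util ? local_capacity : local_util;
	printf("migrate_util imbalance: %lu\n", util_max - local_util);	/* 848 */

	/* migrate_task with no overload: even out the idle CPUs. */
	long local_idle = 3, busiest_idle = 1;
	printf("migrate_task imbalance: %ld\n", (local_idle - busiest_idle) >> 1); /* 1 */

	/* migrate_load, both groups overloaded: move the smaller of what
	 * busiest has above the domain average and what local can absorb
	 * below it, scaled by the respective group capacities.
	 */
	unsigned long busiest_avg = 1400, local_avg = 900, domain_avg = 1100;
	unsigned long busiest_cap = 1024, local_cap = 1024;
	unsigned long from_busiest = (busiest_avg - domain_avg) * busiest_cap;
	unsigned long to_local = (domain_avg - local_avg) * local_cap;
	printf("migrate_load imbalance: %lu\n",
	       (from_busiest < to_local ? from_busiest : to_local) / SCALE);	/* 200 */
	return 0;
}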
 
 /******* find_busiest_group() helpers end here *********************/
 
+/*
+ * Decision matrix according to the local and busiest group state
+ *
+ * busiest \ local has_spare fully_busy misfit asym imbalanced overloaded
+ * has_spare        nr_idle   balanced   N/A    N/A  balanced   balanced
+ * fully_busy       nr_idle   nr_idle    N/A    N/A  balanced   balanced
+ * misfit_task      force     N/A        N/A    N/A  force      force
+ * asym_packing     force     force      N/A    N/A  force      force
+ * imbalanced       force     force      N/A    N/A  force      force
+ * overloaded       force     force      N/A    N/A  force      avg_load
+ *
+ * N/A :      Not Applicable because already filtered while updating
+ *            statistics.
+ * balanced : The system is balanced for these 2 groups.
+ * force :    Calculate the imbalance as load migration is probably needed.
+ * avg_load : Only if imbalance is significant enough.
+ * nr_idle :  dst_cpu is not busy and the number of idle CPUs differs
+ *            significantly between the groups.
+ */
+
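The matrix can also be read as a plain lookup table. A standalone sketch of that reading follows (the enum names and the table are illustrative only; the kernel encodes these decisions in the code paths above, not in a table):

#include <stdio.h>

enum lb_decision { BALANCED, NR_IDLE, FORCE, AVG_LOAD, NA };
enum grp_type { HAS_SPARE, FULLY_BUSY, MISFIT, ASYM, IMBALANCED, OVERLOADED };

/* decision[busiest][local], rows and columns in the order of the matrix. */
static const enum lb_decision decision[6][6] = {
	[HAS_SPARE]  = { NR_IDLE, BALANCED, NA, NA, BALANCED, BALANCED },
	[FULLY_BUSY] = { NR_IDLE, NR_IDLE,  NA, NA, BALANCED, BALANCED },
	[MISFIT]     = { FORCE,   NA,       NA, NA, FORCE,    FORCE    },
	[ASYM]       = { FORCE,   FORCE,    NA, NA, FORCE,    FORCE    },
	[IMBALANCED] = { FORCE,   FORCE,    NA, NA, FORCE,    FORCE    },
	[OVERLOADED] = { FORCE,   FORCE,    NA, NA, FORCE,    AVG_LOAD },
};

int main(void)
{
	/* e.g. busiest overloaded vs local overloaded -> avg_load comparison */
	printf("%s\n", decision[OVERLOADED][OVERLOADED] == AVG_LOAD ?
	       "avg_load" : "other");
	return 0;
}
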
 /**
  * find_busiest_group - Returns the busiest group within the sched_domain
  * if there is an imbalance.
@@ -8380,17 +8524,17 @@  static struct sched_group *find_busiest_group(struct lb_env *env)
 	local = &sds.local_stat;
 	busiest = &sds.busiest_stat;
 
-	/* ASYM feature bypasses nice load balance check */
-	if (busiest->group_asym_packing)
-		goto force_balance;
-
 	/* There is no busy sibling group to pull tasks from */
-	if (!sds.busiest || busiest->sum_h_nr_running == 0)
+	if (!sds.busiest)
 		goto out_balanced;
 
-	/* XXX broken for overlapping NUMA groups */
-	sds.avg_load = (SCHED_CAPACITY_SCALE * sds.total_load)
-						/ sds.total_capacity;
+	/* Misfit tasks should be dealt with regardless of the avg load */
+	if (busiest->group_type == group_misfit_task)
+		goto force_balance;
+
+	/* ASYM feature bypasses nice load balance check */
+	if (busiest->group_type == group_asym_packing)
+		goto force_balance;
 
 	/*
 	 * If the busiest group is imbalanced the below checks don't
@@ -8401,55 +8545,64 @@  static struct sched_group *find_busiest_group(struct lb_env *env)
 		goto force_balance;
 
 	/*
-	 * When dst_cpu is idle, prevent SMP nice and/or asymmetric group
-	 * capacities from resulting in underutilization due to avg_load.
-	 */
-	if (env->idle != CPU_NOT_IDLE && group_has_capacity(env, local) &&
-	    busiest->group_no_capacity)
-		goto force_balance;
-
-	/* Misfit tasks should be dealt with regardless of the avg load */
-	if (busiest->group_type == group_misfit_task)
-		goto force_balance;
-
-	/*
 	 * If the local group is busier than the selected busiest group
 	 * don't try and pull any tasks.
 	 */
-	if (local->avg_load >= busiest->avg_load)
+	if (local->group_type > busiest->group_type)
 		goto out_balanced;
 
 	/*
-	 * Don't pull any tasks if this group is already above the domain
-	 * average load.
+	 * When groups are overloaded, use the avg_load to ensure fairness
+	 * between tasks.
 	 */
-	if (local->avg_load >= sds.avg_load)
-		goto out_balanced;
+	if (local->group_type == group_overloaded) {
+		/*
+		 * If the local group is more loaded than the selected
+		 * busiest group don't try and pull any tasks.
+		 */
+		if (local->avg_load >= busiest->avg_load)
+			goto out_balanced;
+
+		/* XXX broken for overlapping NUMA groups */
+		sds.avg_load = (sds.total_load * SCHED_CAPACITY_SCALE) /
+				sds.total_capacity;
 
-	if (env->idle == CPU_IDLE) {
 		/*
-		 * This CPU is idle. If the busiest group is not overloaded
-		 * and there is no imbalance between this and busiest group
-		 * wrt idle CPUs, it is balanced. The imbalance becomes
-		 * significant if the diff is greater than 1 otherwise we
-		 * might end up to just move the imbalance on another group
+		 * Don't pull any tasks if this group is already above the
+		 * domain average load.
 		 */
-		if ((busiest->group_type != group_overloaded) &&
-				(local->idle_cpus <= (busiest->idle_cpus + 1)))
+		if (local->avg_load >= sds.avg_load)
 			goto out_balanced;
-	} else {
+
 		/*
-		 * In the CPU_NEWLY_IDLE, CPU_NOT_IDLE cases, use
-		 * imbalance_pct to be conservative.
+		 * If the busiest group is more loaded, use imbalance_pct to be
+		 * conservative.
 		 */
 		if (100 * busiest->avg_load <=
 				env->sd->imbalance_pct * local->avg_load)
 			goto out_balanced;
 	}
 
+	/* Try to move all excess tasks to child's sibling domain */
+	if (sds.prefer_sibling && local->group_type == group_has_spare &&
+	    busiest->sum_h_nr_running > local->sum_h_nr_running + 1)
+		goto force_balance;
+
+	if (busiest->group_type != group_overloaded &&
+	     (env->idle == CPU_NOT_IDLE ||
+	      local->idle_cpus <= (busiest->idle_cpus + 1)))
+		/*
+		 * If the busiest group is not overloaded and there is no
+		 * imbalance between this and the busiest group wrt idle
+		 * CPUs, it is balanced. The imbalance becomes significant
+		 * if the diff is greater than 1; otherwise we might end up
+		 * just moving the imbalance to another group.
+		 */
+		goto out_balanced;
+
 force_balance:
 	/* Looks like there is an imbalance. Compute it */
-	env->src_grp_type = busiest->group_type;
 	calculate_imbalance(env, &sds);
 	return env->imbalance ? sds.busiest : NULL;
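
For the overloaded/overloaded cell of the matrix, the "100 * busiest->avg_load <= imbalance_pct * local->avg_load" check above compares percentages. A standalone sketch with made-up numbers (125 is used here purely as an example value for imbalance_pct):

#include <stdio.h>

int main(void)
{
	unsigned int imbalance_pct = 125;	/* example value only */
	unsigned long local_avg = 800, busiest_avg = 950;

	/* busiest must exceed imbalance_pct% of local to be worth balancing:
	 * 100 * 950 = 95000 <= 125 * 800 = 100000, so this pair is treated
	 * as balanced even though busiest carries more load.
	 */
	if (100 * busiest_avg <= imbalance_pct * local_avg)
		printf("out_balanced\n");
	else
		printf("force_balance\n");
	return 0;
}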
 
@@ -8465,11 +8618,13 @@  static struct rq *find_busiest_queue(struct lb_env *env,
 				     struct sched_group *group)
 {
 	struct rq *busiest = NULL, *rq;
-	unsigned long busiest_load = 0, busiest_capacity = 1;
+	unsigned long busiest_util = 0, busiest_load = 0, busiest_capacity = 1;
+	unsigned int busiest_nr = 0;
 	int i;
 
 	for_each_cpu_and(i, sched_group_span(group), env->cpus) {
-		unsigned long capacity, load;
+		unsigned long capacity, load, util;
+		unsigned int nr_running;
 		enum fbq_type rt;
 
 		rq = cpu_rq(i);
@@ -8497,20 +8652,8 @@  static struct rq *find_busiest_queue(struct lb_env *env,
 		if (rt > env->fbq_type)
 			continue;
 
-		/*
-		 * For ASYM_CPUCAPACITY domains with misfit tasks we simply
-		 * seek the "biggest" misfit task.
-		 */
-		if (env->src_grp_type == group_misfit_task) {
-			if (rq->misfit_task_load > busiest_load) {
-				busiest_load = rq->misfit_task_load;
-				busiest = rq;
-			}
-
-			continue;
-		}
-
 		capacity = capacity_of(i);
+		nr_running = rq->cfs.h_nr_running;
 
 		/*
 		 * For ASYM_CPUCAPACITY domains, don't pick a CPU that could
@@ -8520,35 +8663,67 @@  static struct rq *find_busiest_queue(struct lb_env *env,
 		 */
 		if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
 		    capacity_of(env->dst_cpu) < capacity &&
-		    rq->nr_running == 1)
+		    nr_running == 1)
 			continue;
 
-		load = cpu_runnable_load(rq);
+		switch (env->balance_type) {
+		case migrate_load:
+			/*
+			 * When comparing with the load imbalance, use cpu_runnable_load(),
+			 * which is not scaled with the CPU capacity.
+			 */
+			load = cpu_runnable_load(rq);
 
-		/*
-		 * When comparing with imbalance, use cpu_runnable_load()
-		 * which is not scaled with the CPU capacity.
-		 */
+			if (nr_running == 1 && load > env->imbalance &&
+			    !check_cpu_capacity(rq, env->sd))
+				break;
 
-		if (rq->nr_running == 1 && load > env->imbalance &&
-		    !check_cpu_capacity(rq, env->sd))
-			continue;
+			/*
+			 * For the load comparisons with the other CPUs, consider
+			 * the cpu_runnable_load() scaled with the CPU capacity, so
+			 * that the load can be moved away from the CPU that is
+			 * potentially running at a lower capacity.
+			 *
+			 * Thus we're looking for max(load_i / capacity_i), crosswise
+			 * multiplication to rid ourselves of the division works out
+			 * to: load_i * capacity_j > load_j * capacity_i;  where j is
+			 * our previous maximum.
+			 */
+			if (load * busiest_capacity > busiest_load * capacity) {
+				busiest_load = load;
+				busiest_capacity = capacity;
+				busiest = rq;
+			}
+			break;
+
+		case migrate_util:
+			util = cpu_util(cpu_of(rq));
+
+			if (busiest_util < util) {
+				busiest_util = util;
+				busiest = rq;
+			}
+			break;
+
+		case migrate_task:
+			if (busiest_nr < nr_running) {
+				busiest_nr = nr_running;
+				busiest = rq;
+			}
+			break;
+
+		case migrate_misfit:
+			/*
+			 * For ASYM_CPUCAPACITY domains with misfit tasks we simply
+			 * seek the "biggest" misfit task.
+			 */
+			if (rq->misfit_task_load > busiest_load) {
+				busiest_load = rq->misfit_task_load;
+				busiest = rq;
+			}
+
+			break;
 
-		/*
-		 * For the load comparisons with the other CPU's, consider
-		 * the cpu_runnable_load() scaled with the CPU capacity, so
-		 * that the load can be moved away from the CPU that is
-		 * potentially running at a lower capacity.
-		 *
-		 * Thus we're looking for max(load_i / capacity_i), crosswise
-		 * multiplication to rid ourselves of the division works out
-		 * to: load_i * capacity_j > load_j * capacity_i;  where j is
-		 * our previous maximum.
-		 */
-		if (load * busiest_capacity > busiest_load * capacity) {
-			busiest_load = load;
-			busiest_capacity = capacity;
-			busiest = rq;
 		}
 	}
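
A standalone sketch of the cross-multiplication used for migrate_load above, with made-up numbers: we want max(load_i / capacity_i) without doing the division.

#include <stdio.h>

int main(void)
{
	/* CPU j is the current maximum, CPU i is the candidate. */
	unsigned long load_j = 600, capacity_j = 1024;
	unsigned long load_i = 400, capacity_i = 512;

	/* load_i / capacity_i (~0.78) beats load_j / capacity_j (~0.59),
	 * and the cross-multiplied form agrees: 400 * 1024 > 600 * 512.
	 */
	if (load_i * capacity_j > load_j * capacity_i)
		printf("candidate i becomes the busiest rq\n");
	return 0;
}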
 
@@ -8594,7 +8769,7 @@  voluntary_active_balance(struct lb_env *env)
 			return 1;
 	}
 
-	if (env->src_grp_type == group_misfit_task)
+	if (env->balance_type == migrate_misfit)
 		return 1;
 
 	return 0;