
[v4,04/11] sched/fair: rework load_balance

Message ID 1571405198-27570-5-git-send-email-vincent.guittot@linaro.org
State Accepted
Commit 0b0695f2b34a4afa3f6e9aa1ff0e5336d8dad912
Series sched/fair: rework the CFS load balance

Commit Message

Vincent Guittot Oct. 18, 2019, 1:26 p.m. UTC
The load_balance algorithm contains some heuristics which have become
meaningless since the rework of the scheduler's metrics, such as the
introduction of PELT.

Furthermore, load is an ill-suited metric for solving certain task
placement imbalance scenarios. For instance, in the presence of idle CPUs,
we should simply try to get at least one task per CPU, whereas the current
load-based algorithm can actually leave idle CPUs alone simply because the
load is somewhat balanced. The current algorithm ends up creating virtual
and meaningless values like avg_load_per_task, or tweaking the state of a
group to make it look overloaded when it is not, in order to try to migrate
tasks.

load_balance should better qualify the imbalance of the group and clearly
define what has to be moved to fix this imbalance.

The type of sched_group has been extended to better reflect the type of
imbalance. We now have (a short sketch of how these are compared follows
the list):
	group_has_spare
	group_fully_busy
	group_misfit_task
	group_asym_packing
	group_imbalanced
	group_overloaded
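
The enum is ordered by pulling priority, so update_sd_pick_busiest() can
compare two candidates with a plain "<". A minimal sketch of that idea (not
part of the patch itself; the real tie-breaking is in the switch shown in
the diff below):

	/*
	 * Sketch only: a candidate group can replace the current busiest
	 * group only if its group_type is at least as high in the
	 * pulling-priority order.
	 */
	if (sgs->group_type > busiest->group_type)
		return true;
	if (sgs->group_type < busiest->group_type)
		return false;
	/* Same type: fall through to type-specific tie-breaking. */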

Based on the type of sched_group, load_balance now sets what it wants to
move in order to fix the imbalance. It can be some load as before, but also
some utilization, a number of tasks or a specific type of task (see the
pairing sketch after the list):
	migrate_task
	migrate_util
	migrate_load
	migrate_misfit
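
For orientation only, a rough sketch of the pairing set up by
calculate_imbalance() in the hunks below; the unit of env->imbalance
follows the migration type. The last line is not visible in the quoted
hunks and is an assumption based on the rest of the patch:

	/*
	 * busiest group_misfit_task  -> migrate_misfit (misfit task's load)
	 * busiest group_asym_packing -> migrate_task   (nr of running tasks)
	 * busiest group_imbalanced   -> migrate_task   (1 task)
	 * local   group_has_spare    -> migrate_util or migrate_task
	 * otherwise                  -> migrate_load   (amount of load)
	 */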

This new load_balance algorithm fixes several pending wrong task
placements:
- the 1 task per CPU case with an asymmetric system
- the case of CFS tasks preempted by tasks of another scheduling class
- the case of tasks not evenly spread across groups with spare capacity

Also, the load balance decisions have been consolidated into the 3 functions
below after removing the few bypasses and hacks of the current code (a short
call-flow sketch follows the list):
- update_sd_pick_busiest() selects the busiest sched_group.
- find_busiest_group() checks if there is an imbalance between the local and
  busiest groups.
- calculate_imbalance() decides what has to be moved.
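
A rough sketch of the resulting call flow, for orientation only (these are
the existing function names in kernel/sched/fair.c):

	/*
	 * load_balance()
	 *   -> find_busiest_group()
	 *        -> update_sd_lb_stats()
	 *             -> update_sg_lb_stats()      (classify each group)
	 *             -> update_sd_pick_busiest()  (select the busiest group)
	 *        -> calculate_imbalance()          (decide what to move)
	 *   -> detach_tasks()                      (move load/util/tasks)
	 */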

Finally, the now unused field total_running of struct sd_lb_stats has been
removed.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

---
 kernel/sched/fair.c | 611 ++++++++++++++++++++++++++++++++++------------------
 1 file changed, 402 insertions(+), 209 deletions(-)

-- 
2.7.4

Comments

Mel Gorman Oct. 30, 2019, 3:45 p.m. UTC | #1
On Fri, Oct 18, 2019 at 03:26:31PM +0200, Vincent Guittot wrote:
> The load_balance algorithm contains some heuristics which have become

> meaningless since the rework of the scheduler's metrics like the

> introduction of PELT.

> 

> Furthermore, load is an ill-suited metric for solving certain task

> placement imbalance scenarios. For instance, in the presence of idle CPUs,

> we should simply try to get at least one task per CPU, whereas the current

> load-based algorithm can actually leave idle CPUs alone simply because the

> load is somewhat balanced. The current algorithm ends up creating virtual

> and meaningless value like the avg_load_per_task or tweaks the state of a

> group to make it overloaded whereas it's not, in order to try to migrate

> tasks.

> 


I do not think this is necessarily 100% true. With both the previous
load-balancing behaviour and the apparent behaviour of this patch, it's
still possible to pull two communicating tasks apart and across NUMA
domains when utilisation is low. Specifically, a load difference of less
than SCHED_CAPACITY_SCALE between NUMA nodes can be enough to migrate a
task to level out load.

So, load might be ill-suited for some cases but that does not make it
completely useless either.

This type of behaviour can be seen by running netperf via mmtests
(configuration file configs/config-network-netperf-unbound) on a NUMA
machine and noting that the local vs remote NUMA hinting faults are roughly
50%. I had prototyped some fixes around this that took imbalance_pct into
account, but it was too special-cased and was not a universal win. If
I was reviewing my own patch I would have NAKed it on the grounds of "you
added a special-case hack into the load balancer for one workload". I didn't
get back to it before getting cc'd on this series.

> load_balance should better qualify the imbalance of the group and clearly

> define what has to be moved to fix this imbalance.

> 

> The type of sched_group has been extended to better reflect the type of

> imbalance. We now have :

> 	group_has_spare

> 	group_fully_busy

> 	group_misfit_task

> 	group_asym_packing

> 	group_imbalanced

> 	group_overloaded

> 

> Based on the type of sched_group, load_balance now sets what it wants to

> move in order to fix the imbalance. It can be some load as before but also

> some utilization, a number of task or a type of task:

> 	migrate_task

> 	migrate_util

> 	migrate_load

> 	migrate_misfit

> 

> This new load_balance algorithm fixes several pending wrong tasks

> placement:

> - the 1 task per CPU case with asymmetric system

> - the case of cfs task preempted by other class

> - the case of tasks not evenly spread on groups with spare capacity

> 


On the last one, spreading tasks evenly across NUMA domains is not
necessarily a good idea. If I have 2 tasks running on a 2-socket machine
with 24 logical CPUs per socket, it should not automatically mean that
one task should move cross-node, and I have definitely observed this
happening. It's probably bad in terms of locality no matter what, but it's
especially bad if the 2 tasks happen to be communicating, because then
load balancing will pull the tasks apart while wake_affine pushes them
back together (and potentially NUMA balancing as well). Note that this
also applies to some IO workloads because, depending on the filesystem,
the task may be communicating with workqueues (XFS) or a kernel thread
(ext4 with jbd2).

> Also the load balance decisions have been consolidated in the 3 functions

> below after removing the few bypasses and hacks of the current code:

> - update_sd_pick_busiest() select the busiest sched_group.

> - find_busiest_group() checks if there is an imbalance between local and

>   busiest group.

> - calculate_imbalance() decides what have to be moved.

> 

> Finally, the now unused field total_running of struct sd_lb_stats has been

> removed.

> 

> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

> ---

>  kernel/sched/fair.c | 611 ++++++++++++++++++++++++++++++++++------------------

>  1 file changed, 402 insertions(+), 209 deletions(-)

> 

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c

> index e004841..5ae5281 100644

> --- a/kernel/sched/fair.c

> +++ b/kernel/sched/fair.c

> @@ -7068,11 +7068,26 @@ static unsigned long __read_mostly max_load_balance_interval = HZ/10;

>  

>  enum fbq_type { regular, remote, all };

>  

> +/*

> + * group_type describes the group of CPUs at the moment of the load balance.

> + * The enum is ordered by pulling priority, with the group with lowest priority

> + * first so the groupe_type can be simply compared when selecting the busiest

> + * group. see update_sd_pick_busiest().

> + */


s/groupe_type/group_type/

>  enum group_type {

> -	group_other = 0,

> +	group_has_spare = 0,

> +	group_fully_busy,

>  	group_misfit_task,

> +	group_asym_packing,

>  	group_imbalanced,

> -	group_overloaded,

> +	group_overloaded

> +};

> +


While not your fault, it would be nice to comment on the meaning of each
group type. At a glance, it's not obvious to me why a misfit task should
be a higher priority to move than a task from a fully_busy (but not
overloaded) group, given that moving the misfit task might make a group
overloaded.

> +enum migration_type {

> +	migrate_load = 0,

> +	migrate_util,

> +	migrate_task,

> +	migrate_misfit

>  };

>  


Could do with a comment explaining what migration_type is for because
the name is unhelpful. I *think*, at a glance, it's related to what sort
of imbalance is being addressed, which is partially captured by the
group_type. That understanding may change as I continue reading the series,
but right now I have to figure it out, which means it'll be forgotten again
in 6 months.
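
For illustration, the kind of comment being asked for could look roughly
like this (wording is mine, not taken from the patch):

	/*
	 * migration_type tells detach_tasks() which quantity load_balance has
	 * decided to move, and therefore what unit env->imbalance is in:
	 *   migrate_load   - an amount of load
	 *   migrate_util   - an amount of utilization
	 *   migrate_task   - a number of tasks
	 *   migrate_misfit - the load of a single misfit task
	 */
	enum migration_type {
		migrate_load = 0,
		migrate_util,
		migrate_task,
		migrate_misfit
	};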

>  #define LBF_ALL_PINNED	0x01

> @@ -7105,7 +7120,7 @@ struct lb_env {

>  	unsigned int		loop_max;

>  

>  	enum fbq_type		fbq_type;

> -	enum group_type		src_grp_type;

> +	enum migration_type	migration_type;

>  	struct list_head	tasks;

>  };

>  

> @@ -7328,7 +7343,7 @@ static struct task_struct *detach_one_task(struct lb_env *env)

>  static const unsigned int sched_nr_migrate_break = 32;

>  

>  /*

> - * detach_tasks() -- tries to detach up to imbalance runnable load from

> + * detach_tasks() -- tries to detach up to imbalance load/util/tasks from

>   * busiest_rq, as part of a balancing operation within domain "sd".

>   *

>   * Returns number of detached tasks if successful and 0 otherwise.

> @@ -7336,8 +7351,8 @@ static const unsigned int sched_nr_migrate_break = 32;

>  static int detach_tasks(struct lb_env *env)

>  {

>  	struct list_head *tasks = &env->src_rq->cfs_tasks;

> +	unsigned long util, load;

>  	struct task_struct *p;

> -	unsigned long load;

>  	int detached = 0;

>  

>  	lockdep_assert_held(&env->src_rq->lock);

> @@ -7370,19 +7385,51 @@ static int detach_tasks(struct lb_env *env)

>  		if (!can_migrate_task(p, env))

>  			goto next;

>  

> -		load = task_h_load(p);

> +		switch (env->migration_type) {

> +		case migrate_load:

> +			load = task_h_load(p);

>  

> -		if (sched_feat(LB_MIN) && load < 16 && !env->sd->nr_balance_failed)

> -			goto next;

> +			if (sched_feat(LB_MIN) &&

> +			    load < 16 && !env->sd->nr_balance_failed)

> +				goto next;

>  

> -		if ((load / 2) > env->imbalance)

> -			goto next;

> +			if ((load / 2) > env->imbalance)

> +				goto next;

> +

> +			env->imbalance -= load;

> +			break;

> +

> +		case migrate_util:

> +			util = task_util_est(p);

> +

> +			if (util > env->imbalance)

> +				goto next;

> +

> +			env->imbalance -= util;

> +			break;

> +

> +		case migrate_task:

> +			env->imbalance--;

> +			break;

> +

> +		case migrate_misfit:

> +			load = task_h_load(p);

> +

> +			/*

> +			 * load of misfit task might decrease a bit since it has

> +			 * been recorded. Be conservative in the condition.

> +			 */

> +			if (load / 2 < env->imbalance)

> +				goto next;

> +

> +			env->imbalance = 0;

> +			break;

> +		}

>  


So, no problem with this but it brings up another point. migration_type
also determines what env->imbalance means (e.g. load, utilisation,
nr_running etc). That was true before your patch too but now it's much
more explicit, which is nice, but could do with a comment.

>  		detach_task(p, env);

>  		list_add(&p->se.group_node, &env->tasks);

>  

>  		detached++;

> -		env->imbalance -= load;

>  

>  #ifdef CONFIG_PREEMPTION

>  		/*

> @@ -7396,7 +7443,7 @@ static int detach_tasks(struct lb_env *env)

>  

>  		/*

>  		 * We only want to steal up to the prescribed amount of

> -		 * runnable load.

> +		 * load/util/tasks.

>  		 */

>  		if (env->imbalance <= 0)

>  			break;

> @@ -7661,7 +7708,6 @@ struct sg_lb_stats {

>  	unsigned int idle_cpus;

>  	unsigned int group_weight;

>  	enum group_type group_type;

> -	int group_no_capacity;

>  	unsigned int group_asym_packing; /* Tasks should be moved to preferred CPU */

>  	unsigned long group_misfit_task_load; /* A CPU has a task too big for its capacity */

>  #ifdef CONFIG_NUMA_BALANCING


Glad to see group_no_capacity go away, that had some "interesting"
treatment in update_sd_lb_stats.

> @@ -7677,10 +7723,10 @@ struct sg_lb_stats {

>  struct sd_lb_stats {

>  	struct sched_group *busiest;	/* Busiest group in this sd */

>  	struct sched_group *local;	/* Local group in this sd */

> -	unsigned long total_running;

>  	unsigned long total_load;	/* Total load of all groups in sd */

>  	unsigned long total_capacity;	/* Total capacity of all groups in sd */

>  	unsigned long avg_load;	/* Average load across all groups in sd */

> +	unsigned int prefer_sibling; /* tasks should go to sibling first */

>  

>  	struct sg_lb_stats busiest_stat;/* Statistics of the busiest group */

>  	struct sg_lb_stats local_stat;	/* Statistics of the local group */

> @@ -7691,19 +7737,18 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds)

>  	/*

>  	 * Skimp on the clearing to avoid duplicate work. We can avoid clearing

>  	 * local_stat because update_sg_lb_stats() does a full clear/assignment.

> -	 * We must however clear busiest_stat::avg_load because

> -	 * update_sd_pick_busiest() reads this before assignment.

> +	 * We must however set busiest_stat::group_type and

> +	 * busiest_stat::idle_cpus to the worst busiest group because

> +	 * update_sd_pick_busiest() reads these before assignment.

>  	 */

>  	*sds = (struct sd_lb_stats){

>  		.busiest = NULL,

>  		.local = NULL,

> -		.total_running = 0UL,

>  		.total_load = 0UL,

>  		.total_capacity = 0UL,

>  		.busiest_stat = {

> -			.avg_load = 0UL,

> -			.sum_h_nr_running = 0,

> -			.group_type = group_other,

> +			.idle_cpus = UINT_MAX,

> +			.group_type = group_has_spare,

>  		},

>  	};

>  }

> @@ -7945,19 +7990,26 @@ group_smaller_max_cpu_capacity(struct sched_group *sg, struct sched_group *ref)

>  }

>  

>  static inline enum

> -group_type group_classify(struct sched_group *group,

> +group_type group_classify(struct lb_env *env,

> +			  struct sched_group *group,

>  			  struct sg_lb_stats *sgs)

>  {

> -	if (sgs->group_no_capacity)

> +	if (group_is_overloaded(env, sgs))

>  		return group_overloaded;

>  

>  	if (sg_imbalanced(group))

>  		return group_imbalanced;

>  

> +	if (sgs->group_asym_packing)

> +		return group_asym_packing;

> +

>  	if (sgs->group_misfit_task_load)

>  		return group_misfit_task;

>  

> -	return group_other;

> +	if (!group_has_capacity(env, sgs))

> +		return group_fully_busy;

> +

> +	return group_has_spare;

>  }

>  

>  static bool update_nohz_stats(struct rq *rq, bool force)

> @@ -7994,10 +8046,12 @@ static inline void update_sg_lb_stats(struct lb_env *env,

>  				      struct sg_lb_stats *sgs,

>  				      int *sg_status)

>  {

> -	int i, nr_running;

> +	int i, nr_running, local_group;

>  

>  	memset(sgs, 0, sizeof(*sgs));

>  

> +	local_group = cpumask_test_cpu(env->dst_cpu, sched_group_span(group));

> +

>  	for_each_cpu_and(i, sched_group_span(group), env->cpus) {

>  		struct rq *rq = cpu_rq(i);

>  

> @@ -8022,9 +8076,16 @@ static inline void update_sg_lb_stats(struct lb_env *env,

>  		/*

>  		 * No need to call idle_cpu() if nr_running is not 0

>  		 */

> -		if (!nr_running && idle_cpu(i))

> +		if (!nr_running && idle_cpu(i)) {

>  			sgs->idle_cpus++;

> +			/* Idle cpu can't have misfit task */

> +			continue;

> +		}

> +

> +		if (local_group)

> +			continue;

>  

> +		/* Check for a misfit task on the cpu */

>  		if (env->sd->flags & SD_ASYM_CPUCAPACITY &&

>  		    sgs->group_misfit_task_load < rq->misfit_task_load) {

>  			sgs->group_misfit_task_load = rq->misfit_task_load;


So.... why exactly do we not care about misfit tasks on CPUs in the
local group? I'm not saying you're wrong because you have a clear idea
on how misfit tasks should be treated but it's very non-obvious just
from the code.

> <SNIP>

>

> @@ -8079,62 +8154,80 @@ static bool update_sd_pick_busiest(struct lb_env *env,

>  	if (sgs->group_type < busiest->group_type)

>  		return false;

>  

> -	if (sgs->avg_load <= busiest->avg_load)

> -		return false;

> -

> -	if (!(env->sd->flags & SD_ASYM_CPUCAPACITY))

> -		goto asym_packing;

> -

>  	/*

> -	 * Candidate sg has no more than one task per CPU and

> -	 * has higher per-CPU capacity. Migrating tasks to less

> -	 * capable CPUs may harm throughput. Maximize throughput,

> -	 * power/energy consequences are not considered.

> +	 * The candidate and the current busiest group are the same type of

> +	 * group. Let check which one is the busiest according to the type.

>  	 */

> -	if (sgs->sum_h_nr_running <= sgs->group_weight &&

> -	    group_smaller_min_cpu_capacity(sds->local, sg))

> -		return false;

>  

> -	/*

> -	 * If we have more than one misfit sg go with the biggest misfit.

> -	 */

> -	if (sgs->group_type == group_misfit_task &&

> -	    sgs->group_misfit_task_load < busiest->group_misfit_task_load)

> +	switch (sgs->group_type) {

> +	case group_overloaded:

> +		/* Select the overloaded group with highest avg_load. */

> +		if (sgs->avg_load <= busiest->avg_load)

> +			return false;

> +		break;

> +

> +	case group_imbalanced:

> +		/*

> +		 * Select the 1st imbalanced group as we don't have any way to

> +		 * choose one more than another.

> +		 */

>  		return false;

>  

> -asym_packing:

> -	/* This is the busiest node in its class. */

> -	if (!(env->sd->flags & SD_ASYM_PACKING))

> -		return true;

> +	case group_asym_packing:

> +		/* Prefer to move from lowest priority CPU's work */

> +		if (sched_asym_prefer(sg->asym_prefer_cpu, sds->busiest->asym_prefer_cpu))

> +			return false;

> +		break;

>  


Again, I'm not seeing what prevents a !SD_ASYM_PACKING domain from checking
sched_asym_prefer.

> <SNIP>

> +	case group_fully_busy:

> +		/*

> +		 * Select the fully busy group with highest avg_load. In

> +		 * theory, there is no need to pull task from such kind of

> +		 * group because tasks have all compute capacity that they need

> +		 * but we can still improve the overall throughput by reducing

> +		 * contention when accessing shared HW resources.

> +		 *

> +		 * XXX for now avg_load is not computed and always 0 so we

> +		 * select the 1st one.

> +		 */

> +		if (sgs->avg_load <= busiest->avg_load)

> +			return false;

> +		break;

> +


With the exception that, if we are balancing between NUMA domains and the
tasks were communicating, we have now pulled them apart. That might
increase the CPU resources available at the cost of increased remote
memory access cost.

> +	case group_has_spare:

> +		/*

> +		 * Select not overloaded group with lowest number of

> +		 * idle cpus. We could also compare the spare capacity

> +		 * which is more stable but it can end up that the

> +		 * group has less spare capacity but finally more idle

> +		 * cpus which means less opportunity to pull tasks.

> +		 */

> +		if (sgs->idle_cpus >= busiest->idle_cpus)

> +			return false;

> +		break;

>  	}

>  

> -	return false;

> +	/*

> +	 * Candidate sg has no more than one task per CPU and has higher

> +	 * per-CPU capacity. Migrating tasks to less capable CPUs may harm

> +	 * throughput. Maximize throughput, power/energy consequences are not

> +	 * considered.

> +	 */

> +	if ((env->sd->flags & SD_ASYM_CPUCAPACITY) &&

> +	    (sgs->group_type <= group_fully_busy) &&

> +	    (group_smaller_min_cpu_capacity(sds->local, sg)))

> +		return false;

> +

> +	return true;

>  }

>  

>  #ifdef CONFIG_NUMA_BALANCING

> @@ -8172,13 +8265,13 @@ static inline enum fbq_type fbq_classify_rq(struct rq *rq)

>   * @env: The load balancing environment.

>   * @sds: variable to hold the statistics for this sched_domain.

>   */

> +


Spurious whitespace change.

>  static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sds)

>  {

>  	struct sched_domain *child = env->sd->child;

>  	struct sched_group *sg = env->sd->groups;

>  	struct sg_lb_stats *local = &sds->local_stat;

>  	struct sg_lb_stats tmp_sgs;

> -	bool prefer_sibling = child && child->flags & SD_PREFER_SIBLING;

>  	int sg_status = 0;

>  

>  #ifdef CONFIG_NO_HZ_COMMON

> @@ -8205,22 +8298,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd

>  		if (local_group)

>  			goto next_group;

>  

> -		/*

> -		 * In case the child domain prefers tasks go to siblings

> -		 * first, lower the sg capacity so that we'll try

> -		 * and move all the excess tasks away. We lower the capacity

> -		 * of a group only if the local group has the capacity to fit

> -		 * these excess tasks. The extra check prevents the case where

> -		 * you always pull from the heaviest group when it is already

> -		 * under-utilized (possible with a large weight task outweighs

> -		 * the tasks on the system).

> -		 */

> -		if (prefer_sibling && sds->local &&

> -		    group_has_capacity(env, local) &&

> -		    (sgs->sum_h_nr_running > local->sum_h_nr_running + 1)) {

> -			sgs->group_no_capacity = 1;

> -			sgs->group_type = group_classify(sg, sgs);

> -		}

>  

>  		if (update_sd_pick_busiest(env, sds, sg, sgs)) {

>  			sds->busiest = sg;

> @@ -8229,13 +8306,15 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd

>  

>  next_group:

>  		/* Now, start updating sd_lb_stats */

> -		sds->total_running += sgs->sum_h_nr_running;

>  		sds->total_load += sgs->group_load;

>  		sds->total_capacity += sgs->group_capacity;

>  

>  		sg = sg->next;

>  	} while (sg != env->sd->groups);

>  

> +	/* Tag domain that child domain prefers tasks go to siblings first */

> +	sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING;

> +

>  #ifdef CONFIG_NO_HZ_COMMON

>  	if ((env->flags & LBF_NOHZ_AGAIN) &&

>  	    cpumask_subset(nohz.idle_cpus_mask, sched_domain_span(env->sd))) {

> @@ -8273,69 +8352,149 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd

>   */

>  static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *sds)

>  {

> -	unsigned long max_pull, load_above_capacity = ~0UL;

>  	struct sg_lb_stats *local, *busiest;

>  

>  	local = &sds->local_stat;

>  	busiest = &sds->busiest_stat;

>  

> -	if (busiest->group_asym_packing) {

> -		env->imbalance = busiest->group_load;

> +	if (busiest->group_type == group_misfit_task) {

> +		/* Set imbalance to allow misfit task to be balanced. */

> +		env->migration_type = migrate_misfit;

> +		env->imbalance = busiest->group_misfit_task_load;

> +		return;

> +	}

> +

> +	if (busiest->group_type == group_asym_packing) {

> +		/*

> +		 * In case of asym capacity, we will try to migrate all load to

> +		 * the preferred CPU.

> +		 */

> +		env->migration_type = migrate_task;

> +		env->imbalance = busiest->sum_h_nr_running;

> +		return;

> +	}

> +

> +	if (busiest->group_type == group_imbalanced) {

> +		/*

> +		 * In the group_imb case we cannot rely on group-wide averages

> +		 * to ensure CPU-load equilibrium, try to move any task to fix

> +		 * the imbalance. The next load balance will take care of

> +		 * balancing back the system.

> +		 */

> +		env->migration_type = migrate_task;

> +		env->imbalance = 1;

>  		return;

>  	}

>  

>  	/*

> -	 * Avg load of busiest sg can be less and avg load of local sg can

> -	 * be greater than avg load across all sgs of sd because avg load

> -	 * factors in sg capacity and sgs with smaller group_type are

> -	 * skipped when updating the busiest sg:

> +	 * Try to use spare capacity of local group without overloading it or

> +	 * emptying busiest

>  	 */

> -	if (busiest->group_type != group_misfit_task &&

> -	    (busiest->avg_load <= sds->avg_load ||

> -	     local->avg_load >= sds->avg_load)) {

> -		env->imbalance = 0;

> +	if (local->group_type == group_has_spare) {

> +		if (busiest->group_type > group_fully_busy) {

> +			/*

> +			 * If busiest is overloaded, try to fill spare

> +			 * capacity. This might end up creating spare capacity

> +			 * in busiest or busiest still being overloaded but

> +			 * there is no simple way to directly compute the

> +			 * amount of load to migrate in order to balance the

> +			 * system.

> +			 */


busiest may not be overloaded, it may be imbalanced. Maybe the
distinction is irrelevant though.

> +			env->migration_type = migrate_util;

> +			env->imbalance = max(local->group_capacity, local->group_util) -

> +					 local->group_util;

> +

> +			/*

> +			 * In some case, the group's utilization is max or even

> +			 * higher than capacity because of migrations but the

> +			 * local CPU is (newly) idle. There is at least one

> +			 * waiting task in this overloaded busiest group. Let

> +			 * try to pull it.

> +			 */

> +			if (env->idle != CPU_NOT_IDLE && env->imbalance == 0) {

> +				env->migration_type = migrate_task;

> +				env->imbalance = 1;

> +			}

> +


Not convinced this is a good thing to do across NUMA domains. If it was
tied to the group being definitely overloaded then I could see the logic.

> +			return;

> +		}

> +

> +		if (busiest->group_weight == 1 || sds->prefer_sibling) {

> +			unsigned int nr_diff = busiest->sum_h_nr_running;

> +			/*

> +			 * When prefer sibling, evenly spread running tasks on

> +			 * groups.

> +			 */

> +			env->migration_type = migrate_task;

> +			lsub_positive(&nr_diff, local->sum_h_nr_running);

> +			env->imbalance = nr_diff >> 1;

> +			return;

> +		}


The comment is slightly misleading given that it's not just about preferring
siblings but also applies when balancing across single-CPU domains.

> +

> +		/*

> +		 * If there is no overload, we just want to even the number of

> +		 * idle cpus.

> +		 */

> +		env->migration_type = migrate_task;

> +		env->imbalance = max_t(long, 0, (local->idle_cpus -

> +						 busiest->idle_cpus) >> 1);

>  		return;

>  	}


Why do we want to even out the number of idle CPUs unconditionally? This
goes back to the NUMA domain case. 2 communicating tasks running on a
2-socket system should not be pulled apart just to have 1 task running on
each socket.

I didn't see anything obvious after this point but I also am getting a
bit on the fried side trying to hold this entire patch in my head and
got hung up on the NUMA domain balancing in particular.

-- 
Mel Gorman
SUSE Labs
Valentin Schneider Oct. 30, 2019, 4:16 p.m. UTC | #2
On 30/10/2019 16:45, Mel Gorman wrote:
> On Fri, Oct 18, 2019 at 03:26:31PM +0200, Vincent Guittot wrote:

>> The load_balance algorithm contains some heuristics which have become

>> meaningless since the rework of the scheduler's metrics like the

>> introduction of PELT.

>>

>> Furthermore, load is an ill-suited metric for solving certain task

>> placement imbalance scenarios. For instance, in the presence of idle CPUs,

>> we should simply try to get at least one task per CPU, whereas the current

>> load-based algorithm can actually leave idle CPUs alone simply because the

>> load is somewhat balanced. The current algorithm ends up creating virtual

>> and meaningless value like the avg_load_per_task or tweaks the state of a

>> group to make it overloaded whereas it's not, in order to try to migrate

>> tasks.

>>

> 

> I do not think this is necessarily 100% true. With both the previous

> load-balancing behaviour and the apparent behaviour of this patch, it's

> still possible to pull two communicating tasks apart and across NUMA

> domains when utilisation is low. Specifically, a load difference of less

> than SCHED_CAPACITY_SCALE between NUMA codes can be enough too migrate

> task to level out load.

> 

> So, load might be ill-suited for some cases but that does not make it

> completely useless either.

> 


For sure! What we're trying to get to is that load is much less relevant
at low task count than at high task count.

A pathological case for asymmetric systems (big.LITTLE et al) is when you
spawn as many tasks as you have CPUs and they're not all spread out straight
from wakeup (so you have at least one rq coscheduling 2 tasks): the load
balancer might never solve that imbalance and can leave a CPU idle for the
duration of the workload.

The problem there is that the current LB tries to balance avg_load, which is
screwy when asymmetric capacities are involved.

Take the h960 for instance, which is 4+4 big.LITTLE (LITTLEs are 462 capacity,
bigs 1024). Say you end up with 3 tasks on the LITTLEs (one is idle), and 5
tasks on the bigs (none idle). The tasks are CPU hogs, so they won't go
through new wakeups.

You have something like 3072 load on the LITTLEs vs 5120 on the bigs. The
thing is, a group's avg_load scales inversely with capacity:

	sgs->avg_load = (sgs->group_load * SCHED_CAPACITY_SCALE) /
			sgs->group_capacity;

That means the LITTLEs group ends up with 1702 avg_load, whereas the bigs
group gets 1280 - the LITTLEs group is seen as more "busy" despite having an
idle CPU. This kind of imbalance usually ends up in fix_small_imbalance()
territory, which is random at best.
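
As a quick sanity check of those numbers, a throwaway user-space snippet
(assuming SCHED_CAPACITY_SCALE = 1024 and the capacities above; this is
not kernel code):

	#include <stdio.h>

	#define SCHED_CAPACITY_SCALE 1024UL

	static unsigned long avg_load(unsigned long group_load,
				      unsigned long group_capacity)
	{
		return group_load * SCHED_CAPACITY_SCALE / group_capacity;
	}

	int main(void)
	{
		/* 4 LITTLEs of capacity 462 each, 3 runnable tasks of load 1024 */
		printf("LITTLEs: %lu\n", avg_load(3 * 1024, 4 * 462));  /* ~1702 */
		/* 4 bigs of capacity 1024 each, 5 runnable tasks of load 1024 */
		printf("bigs:    %lu\n", avg_load(5 * 1024, 4 * 1024)); /* 1280  */
		return 0;
	}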

>> @@ -8022,9 +8076,16 @@ static inline void update_sg_lb_stats(struct lb_env *env,

>>  		/*

>>  		 * No need to call idle_cpu() if nr_running is not 0

>>  		 */

>> -		if (!nr_running && idle_cpu(i))

>> +		if (!nr_running && idle_cpu(i)) {

>>  			sgs->idle_cpus++;

>> +			/* Idle cpu can't have misfit task */

>> +			continue;

>> +		}

>> +

>> +		if (local_group)

>> +			continue;

>>  

>> +		/* Check for a misfit task on the cpu */

>>  		if (env->sd->flags & SD_ASYM_CPUCAPACITY &&

>>  		    sgs->group_misfit_task_load < rq->misfit_task_load) {

>>  			sgs->group_misfit_task_load = rq->misfit_task_load;

> 

> So.... why exactly do we not care about misfit tasks on CPUs in the

> local group? I'm not saying you're wrong because you have a clear idea

> on how misfit tasks should be treated but it's very non-obvious just

> from the code.

> 


Misfit tasks need to be migrated away to CPUs of higher capacity. We let that
happen through the usual load-balance pull - there's little point in pulling
from ourselves.

> I didn't see anything obvious after this point but I also am getting a

> bit on the fried side trying to hold this entire patch in my head and

> got hung up on the NUMA domain balancing in particular.

> 


I share the feeling; I've gone through it in chunks but need to have another go at it.
Vincent Guittot Oct. 31, 2019, 9:09 a.m. UTC | #3
On Wed, 30 Oct 2019 at 16:45, Mel Gorman <mgorman@techsingularity.net> wrote:
>

> On Fri, Oct 18, 2019 at 03:26:31PM +0200, Vincent Guittot wrote:

> > The load_balance algorithm contains some heuristics which have become

> > meaningless since the rework of the scheduler's metrics like the

> > introduction of PELT.

> >

> > Furthermore, load is an ill-suited metric for solving certain task

> > placement imbalance scenarios. For instance, in the presence of idle CPUs,

> > we should simply try to get at least one task per CPU, whereas the current

> > load-based algorithm can actually leave idle CPUs alone simply because the

> > load is somewhat balanced. The current algorithm ends up creating virtual

> > and meaningless value like the avg_load_per_task or tweaks the state of a

> > group to make it overloaded whereas it's not, in order to try to migrate

> > tasks.

> >

>

> I do not think this is necessarily 100% true. With both the previous

> load-balancing behaviour and the apparent behaviour of this patch, it's

> still possible to pull two communicating tasks apart and across NUMA

> domains when utilisation is low. Specifically, a load difference of less

> than SCHED_CAPACITY_SCALE between NUMA codes can be enough too migrate

> task to level out load.

>

> So, load might be ill-suited for some cases but that does not make it

> completely useless either.


I fully agree, and we keep using it in some cases.
The goal is only to avoid using it when it is obviously the wrong metric to use.

>

> The type of behaviour can be seen by running netperf via mmtests

> (configuration file configs/config-network-netperf-unbound) on a NUMA

> machine and noting that the local vs remote NUMA hinting faults are roughly

> 50%. I had prototyped some fixes around this that took imbalance_pct into

> account but it was too special-cased and was not a universal win. If

> I was reviewing my own patch I would have naked it on the "you added a

> special-case hack into the load balancer for one load". I didn't get back

> to it before getting cc'd on this series.

>

> > load_balance should better qualify the imbalance of the group and clearly

> > define what has to be moved to fix this imbalance.

> >

> > The type of sched_group has been extended to better reflect the type of

> > imbalance. We now have :

> >       group_has_spare

> >       group_fully_busy

> >       group_misfit_task

> >       group_asym_packing

> >       group_imbalanced

> >       group_overloaded

> >

> > Based on the type of sched_group, load_balance now sets what it wants to

> > move in order to fix the imbalance. It can be some load as before but also

> > some utilization, a number of task or a type of task:

> >       migrate_task

> >       migrate_util

> >       migrate_load

> >       migrate_misfit

> >

> > This new load_balance algorithm fixes several pending wrong tasks

> > placement:

> > - the 1 task per CPU case with asymmetric system

> > - the case of cfs task preempted by other class

> > - the case of tasks not evenly spread on groups with spare capacity

> >

>

> On the last one, spreading tasks evenly across NUMA domains is not

> necessarily a good idea. If I have 2 tasks running on a 2-socket machine

> with 24 logical CPUs per socket, it should not automatically mean that

> one task should move cross-node and I have definitely observed this

> happening. It's probably bad in terms of locality no matter what but it's

> especially bad if the 2 tasks happened to be communicating because then

> load balancing will pull apart the tasks while wake_affine will push

> them together (and potentially NUMA balancing as well). Note that this

> also applies for some IO workloads because, depending on the filesystem,

> the task may be communicating with workqueues (XFS) or a kernel thread

> (ext4 with jbd2).


This rework doesn't touch the NUMA_BALANCING part, and NUMA balancing
still gives guidance with fbq_classify_group/queue.
But the latter could also take advantage of the new group types. For
example, as I did in the fix for find_idlest_group(): check
numa_preferred_nid when the group has capacity and keep the task on its
preferred node if possible. Similar behaviour could also be beneficial
in the periodic load_balance case.

>

> > Also the load balance decisions have been consolidated in the 3 functions

> > below after removing the few bypasses and hacks of the current code:

> > - update_sd_pick_busiest() select the busiest sched_group.

> > - find_busiest_group() checks if there is an imbalance between local and

> >   busiest group.

> > - calculate_imbalance() decides what have to be moved.

> >

> > Finally, the now unused field total_running of struct sd_lb_stats has been

> > removed.

> >

> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

> > ---

> >  kernel/sched/fair.c | 611 ++++++++++++++++++++++++++++++++++------------------

> >  1 file changed, 402 insertions(+), 209 deletions(-)

> >

> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c

> > index e004841..5ae5281 100644

> > --- a/kernel/sched/fair.c

> > +++ b/kernel/sched/fair.c

> > @@ -7068,11 +7068,26 @@ static unsigned long __read_mostly max_load_balance_interval = HZ/10;

> >

> >  enum fbq_type { regular, remote, all };

> >

> > +/*

> > + * group_type describes the group of CPUs at the moment of the load balance.

> > + * The enum is ordered by pulling priority, with the group with lowest priority

> > + * first so the groupe_type can be simply compared when selecting the busiest

> > + * group. see update_sd_pick_busiest().

> > + */

>

> s/groupe_type/group_type/

>

> >  enum group_type {

> > -     group_other = 0,

> > +     group_has_spare = 0,

> > +     group_fully_busy,

> >       group_misfit_task,

> > +     group_asym_packing,

> >       group_imbalanced,

> > -     group_overloaded,

> > +     group_overloaded

> > +};

> > +

>

> While not your fault, it would be nice to comment on the meaning of each

> group type. From a glance, it's not obvious to me why a misfit task should

> be a high priority to move a task than a fully_busy (but not overloaded)

> group given that moving the misfit task might make a group overloaded.

>

> > +enum migration_type {

> > +     migrate_load = 0,

> > +     migrate_util,

> > +     migrate_task,

> > +     migrate_misfit

> >  };

> >

>

> Could do with a comment explaining what migration_type is for because

> the name is unhelpful. I *think* at a glance it's related to what sort

> of imbalance is being addressed which is partially addressed by the

> group_type. That understanding may change as I continue reading the series

> but now I have to figure it out which means it'll be forgotten again in

> 6 months.

>

> >  #define LBF_ALL_PINNED       0x01

> > @@ -7105,7 +7120,7 @@ struct lb_env {

> >       unsigned int            loop_max;

> >

> >       enum fbq_type           fbq_type;

> > -     enum group_type         src_grp_type;

> > +     enum migration_type     migration_type;

> >       struct list_head        tasks;

> >  };

> >

> > @@ -7328,7 +7343,7 @@ static struct task_struct *detach_one_task(struct lb_env *env)

> >  static const unsigned int sched_nr_migrate_break = 32;

> >

> >  /*

> > - * detach_tasks() -- tries to detach up to imbalance runnable load from

> > + * detach_tasks() -- tries to detach up to imbalance load/util/tasks from

> >   * busiest_rq, as part of a balancing operation within domain "sd".

> >   *

> >   * Returns number of detached tasks if successful and 0 otherwise.

> > @@ -7336,8 +7351,8 @@ static const unsigned int sched_nr_migrate_break = 32;

> >  static int detach_tasks(struct lb_env *env)

> >  {

> >       struct list_head *tasks = &env->src_rq->cfs_tasks;

> > +     unsigned long util, load;

> >       struct task_struct *p;

> > -     unsigned long load;

> >       int detached = 0;

> >

> >       lockdep_assert_held(&env->src_rq->lock);

> > @@ -7370,19 +7385,51 @@ static int detach_tasks(struct lb_env *env)

> >               if (!can_migrate_task(p, env))

> >                       goto next;

> >

> > -             load = task_h_load(p);

> > +             switch (env->migration_type) {

> > +             case migrate_load:

> > +                     load = task_h_load(p);

> >

> > -             if (sched_feat(LB_MIN) && load < 16 && !env->sd->nr_balance_failed)

> > -                     goto next;

> > +                     if (sched_feat(LB_MIN) &&

> > +                         load < 16 && !env->sd->nr_balance_failed)

> > +                             goto next;

> >

> > -             if ((load / 2) > env->imbalance)

> > -                     goto next;

> > +                     if ((load / 2) > env->imbalance)

> > +                             goto next;

> > +

> > +                     env->imbalance -= load;

> > +                     break;

> > +

> > +             case migrate_util:

> > +                     util = task_util_est(p);

> > +

> > +                     if (util > env->imbalance)

> > +                             goto next;

> > +

> > +                     env->imbalance -= util;

> > +                     break;

> > +

> > +             case migrate_task:

> > +                     env->imbalance--;

> > +                     break;

> > +

> > +             case migrate_misfit:

> > +                     load = task_h_load(p);

> > +

> > +                     /*

> > +                      * load of misfit task might decrease a bit since it has

> > +                      * been recorded. Be conservative in the condition.

> > +                      */

> > +                     if (load / 2 < env->imbalance)

> > +                             goto next;

> > +

> > +                     env->imbalance = 0;

> > +                     break;

> > +             }

> >

>

> So, no problem with this but it brings up another point. migration_type

> also determines what env->imbalance means (e.g. load, utilisation,

> nr_running etc). That was true before your patch too but now it's much

> more explicit, which is nice, but could do with a comment.

>

> >               detach_task(p, env);

> >               list_add(&p->se.group_node, &env->tasks);

> >

> >               detached++;

> > -             env->imbalance -= load;

> >

> >  #ifdef CONFIG_PREEMPTION

> >               /*

> > @@ -7396,7 +7443,7 @@ static int detach_tasks(struct lb_env *env)

> >

> >               /*

> >                * We only want to steal up to the prescribed amount of

> > -              * runnable load.

> > +              * load/util/tasks.

> >                */

> >               if (env->imbalance <= 0)

> >                       break;

> > @@ -7661,7 +7708,6 @@ struct sg_lb_stats {

> >       unsigned int idle_cpus;

> >       unsigned int group_weight;

> >       enum group_type group_type;

> > -     int group_no_capacity;

> >       unsigned int group_asym_packing; /* Tasks should be moved to preferred CPU */

> >       unsigned long group_misfit_task_load; /* A CPU has a task too big for its capacity */

> >  #ifdef CONFIG_NUMA_BALANCING

>

> Glad to see group_no_capacity go away, that had some "interesting"

> treatment in update_sd_lb_stats.

>

> > @@ -7677,10 +7723,10 @@ struct sg_lb_stats {

> >  struct sd_lb_stats {

> >       struct sched_group *busiest;    /* Busiest group in this sd */

> >       struct sched_group *local;      /* Local group in this sd */

> > -     unsigned long total_running;

> >       unsigned long total_load;       /* Total load of all groups in sd */

> >       unsigned long total_capacity;   /* Total capacity of all groups in sd */

> >       unsigned long avg_load; /* Average load across all groups in sd */

> > +     unsigned int prefer_sibling; /* tasks should go to sibling first */

> >

> >       struct sg_lb_stats busiest_stat;/* Statistics of the busiest group */

> >       struct sg_lb_stats local_stat;  /* Statistics of the local group */

> > @@ -7691,19 +7737,18 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds)

> >       /*

> >        * Skimp on the clearing to avoid duplicate work. We can avoid clearing

> >        * local_stat because update_sg_lb_stats() does a full clear/assignment.

> > -      * We must however clear busiest_stat::avg_load because

> > -      * update_sd_pick_busiest() reads this before assignment.

> > +      * We must however set busiest_stat::group_type and

> > +      * busiest_stat::idle_cpus to the worst busiest group because

> > +      * update_sd_pick_busiest() reads these before assignment.

> >        */

> >       *sds = (struct sd_lb_stats){

> >               .busiest = NULL,

> >               .local = NULL,

> > -             .total_running = 0UL,

> >               .total_load = 0UL,

> >               .total_capacity = 0UL,

> >               .busiest_stat = {

> > -                     .avg_load = 0UL,

> > -                     .sum_h_nr_running = 0,

> > -                     .group_type = group_other,

> > +                     .idle_cpus = UINT_MAX,

> > +                     .group_type = group_has_spare,

> >               },

> >       };

> >  }

> > @@ -7945,19 +7990,26 @@ group_smaller_max_cpu_capacity(struct sched_group *sg, struct sched_group *ref)

> >  }

> >

> >  static inline enum

> > -group_type group_classify(struct sched_group *group,

> > +group_type group_classify(struct lb_env *env,

> > +                       struct sched_group *group,

> >                         struct sg_lb_stats *sgs)

> >  {

> > -     if (sgs->group_no_capacity)

> > +     if (group_is_overloaded(env, sgs))

> >               return group_overloaded;

> >

> >       if (sg_imbalanced(group))

> >               return group_imbalanced;

> >

> > +     if (sgs->group_asym_packing)

> > +             return group_asym_packing;

> > +

> >       if (sgs->group_misfit_task_load)

> >               return group_misfit_task;

> >

> > -     return group_other;

> > +     if (!group_has_capacity(env, sgs))

> > +             return group_fully_busy;

> > +

> > +     return group_has_spare;

> >  }

> >

> >  static bool update_nohz_stats(struct rq *rq, bool force)

> > @@ -7994,10 +8046,12 @@ static inline void update_sg_lb_stats(struct lb_env *env,

> >                                     struct sg_lb_stats *sgs,

> >                                     int *sg_status)

> >  {

> > -     int i, nr_running;

> > +     int i, nr_running, local_group;

> >

> >       memset(sgs, 0, sizeof(*sgs));

> >

> > +     local_group = cpumask_test_cpu(env->dst_cpu, sched_group_span(group));

> > +

> >       for_each_cpu_and(i, sched_group_span(group), env->cpus) {

> >               struct rq *rq = cpu_rq(i);

> >

> > @@ -8022,9 +8076,16 @@ static inline void update_sg_lb_stats(struct lb_env *env,

> >               /*

> >                * No need to call idle_cpu() if nr_running is not 0

> >                */

> > -             if (!nr_running && idle_cpu(i))

> > +             if (!nr_running && idle_cpu(i)) {

> >                       sgs->idle_cpus++;

> > +                     /* Idle cpu can't have misfit task */

> > +                     continue;

> > +             }

> > +

> > +             if (local_group)

> > +                     continue;

> >

> > +             /* Check for a misfit task on the cpu */

> >               if (env->sd->flags & SD_ASYM_CPUCAPACITY &&

> >                   sgs->group_misfit_task_load < rq->misfit_task_load) {

> >                       sgs->group_misfit_task_load = rq->misfit_task_load;

>

> So.... why exactly do we not care about misfit tasks on CPUs in the

> local group? I'm not saying you're wrong because you have a clear idea

> on how misfit tasks should be treated but it's very non-obvious just

> from the code.


The local group can't do anything with its own misfit tasks, so tracking
them doesn't give any additional information compared to overloaded,
fully_busy or has_spare.

>

> > <SNIP>

> >

> > @@ -8079,62 +8154,80 @@ static bool update_sd_pick_busiest(struct lb_env *env,

> >       if (sgs->group_type < busiest->group_type)

> >               return false;

> >

> > -     if (sgs->avg_load <= busiest->avg_load)

> > -             return false;

> > -

> > -     if (!(env->sd->flags & SD_ASYM_CPUCAPACITY))

> > -             goto asym_packing;

> > -

> >       /*

> > -      * Candidate sg has no more than one task per CPU and

> > -      * has higher per-CPU capacity. Migrating tasks to less

> > -      * capable CPUs may harm throughput. Maximize throughput,

> > -      * power/energy consequences are not considered.

> > +      * The candidate and the current busiest group are the same type of

> > +      * group. Let check which one is the busiest according to the type.

> >        */

> > -     if (sgs->sum_h_nr_running <= sgs->group_weight &&

> > -         group_smaller_min_cpu_capacity(sds->local, sg))

> > -             return false;

> >

> > -     /*

> > -      * If we have more than one misfit sg go with the biggest misfit.

> > -      */

> > -     if (sgs->group_type == group_misfit_task &&

> > -         sgs->group_misfit_task_load < busiest->group_misfit_task_load)

> > +     switch (sgs->group_type) {

> > +     case group_overloaded:

> > +             /* Select the overloaded group with highest avg_load. */

> > +             if (sgs->avg_load <= busiest->avg_load)

> > +                     return false;

> > +             break;

> > +

> > +     case group_imbalanced:

> > +             /*

> > +              * Select the 1st imbalanced group as we don't have any way to

> > +              * choose one more than another.

> > +              */

> >               return false;

> >

> > -asym_packing:

> > -     /* This is the busiest node in its class. */

> > -     if (!(env->sd->flags & SD_ASYM_PACKING))

> > -             return true;

> > +     case group_asym_packing:

> > +             /* Prefer to move from lowest priority CPU's work */

> > +             if (sched_asym_prefer(sg->asym_prefer_cpu, sds->busiest->asym_prefer_cpu))

> > +                     return false;

> > +             break;

> >

>

> Again, I'm not seeing what prevents a !SD_ASYM_PACKING domain checking

> sched_asym_prefer.


The test is done when collecting the group's statistics in update_sg_lb_stats():

	/* Check if dst cpu is idle and preferred to this group */
	if (env->sd->flags & SD_ASYM_PACKING &&
	    env->idle != CPU_NOT_IDLE &&
	    sgs->sum_h_nr_running &&
	    sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu)) {
		sgs->group_asym_packing = 1;
	}

Then the group type group_asym_packing is only set if
sgs->group_asym_packing has been set

>

> > <SNIP>

> > +     case group_fully_busy:

> > +             /*

> > +              * Select the fully busy group with highest avg_load. In

> > +              * theory, there is no need to pull task from such kind of

> > +              * group because tasks have all compute capacity that they need

> > +              * but we can still improve the overall throughput by reducing

> > +              * contention when accessing shared HW resources.

> > +              *

> > +              * XXX for now avg_load is not computed and always 0 so we

> > +              * select the 1st one.

> > +              */

> > +             if (sgs->avg_load <= busiest->avg_load)

> > +                     return false;

> > +             break;

> > +

>

> With the exception that if we are balancing between NUMA domains and they

> were communicating tasks that we've now pulled them apart. That might

> increase the CPU resources available at the cost of increased remote

> memory access cost.


I expect the NUMA classification to help and skip those runqueues.

>

> > +     case group_has_spare:

> > +             /*

> > +              * Select not overloaded group with lowest number of

> > +              * idle cpus. We could also compare the spare capacity

> > +              * which is more stable but it can end up that the

> > +              * group has less spare capacity but finally more idle

> > +              * cpus which means less opportunity to pull tasks.

> > +              */

> > +             if (sgs->idle_cpus >= busiest->idle_cpus)

> > +                     return false;

> > +             break;

> >       }

> >

> > -     return false;

> > +     /*

> > +      * Candidate sg has no more than one task per CPU and has higher

> > +      * per-CPU capacity. Migrating tasks to less capable CPUs may harm

> > +      * throughput. Maximize throughput, power/energy consequences are not

> > +      * considered.

> > +      */

> > +     if ((env->sd->flags & SD_ASYM_CPUCAPACITY) &&

> > +         (sgs->group_type <= group_fully_busy) &&

> > +         (group_smaller_min_cpu_capacity(sds->local, sg)))

> > +             return false;

> > +

> > +     return true;

> >  }

> >

> >  #ifdef CONFIG_NUMA_BALANCING

> > @@ -8172,13 +8265,13 @@ static inline enum fbq_type fbq_classify_rq(struct rq *rq)

> >   * @env: The load balancing environment.

> >   * @sds: variable to hold the statistics for this sched_domain.

> >   */

> > +

>

> Spurious whitespace change.

>

> >  static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sds)

> >  {

> >       struct sched_domain *child = env->sd->child;

> >       struct sched_group *sg = env->sd->groups;

> >       struct sg_lb_stats *local = &sds->local_stat;

> >       struct sg_lb_stats tmp_sgs;

> > -     bool prefer_sibling = child && child->flags & SD_PREFER_SIBLING;

> >       int sg_status = 0;

> >

> >  #ifdef CONFIG_NO_HZ_COMMON

> > @@ -8205,22 +8298,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd

> >               if (local_group)

> >                       goto next_group;

> >

> > -             /*

> > -              * In case the child domain prefers tasks go to siblings

> > -              * first, lower the sg capacity so that we'll try

> > -              * and move all the excess tasks away. We lower the capacity

> > -              * of a group only if the local group has the capacity to fit

> > -              * these excess tasks. The extra check prevents the case where

> > -              * you always pull from the heaviest group when it is already

> > -              * under-utilized (possible with a large weight task outweighs

> > -              * the tasks on the system).

> > -              */

> > -             if (prefer_sibling && sds->local &&

> > -                 group_has_capacity(env, local) &&

> > -                 (sgs->sum_h_nr_running > local->sum_h_nr_running + 1)) {

> > -                     sgs->group_no_capacity = 1;

> > -                     sgs->group_type = group_classify(sg, sgs);

> > -             }

> >

> >               if (update_sd_pick_busiest(env, sds, sg, sgs)) {

> >                       sds->busiest = sg;

> > @@ -8229,13 +8306,15 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd

> >

> >  next_group:

> >               /* Now, start updating sd_lb_stats */

> > -             sds->total_running += sgs->sum_h_nr_running;

> >               sds->total_load += sgs->group_load;

> >               sds->total_capacity += sgs->group_capacity;

> >

> >               sg = sg->next;

> >       } while (sg != env->sd->groups);

> >

> > +     /* Tag domain that child domain prefers tasks go to siblings first */

> > +     sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING;

> > +

> >  #ifdef CONFIG_NO_HZ_COMMON

> >       if ((env->flags & LBF_NOHZ_AGAIN) &&

> >           cpumask_subset(nohz.idle_cpus_mask, sched_domain_span(env->sd))) {

> > @@ -8273,69 +8352,149 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd

> >   */

> >  static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *sds)

> >  {

> > -     unsigned long max_pull, load_above_capacity = ~0UL;

> >       struct sg_lb_stats *local, *busiest;

> >

> >       local = &sds->local_stat;

> >       busiest = &sds->busiest_stat;

> >

> > -     if (busiest->group_asym_packing) {

> > -             env->imbalance = busiest->group_load;

> > +     if (busiest->group_type == group_misfit_task) {

> > +             /* Set imbalance to allow misfit task to be balanced. */

> > +             env->migration_type = migrate_misfit;

> > +             env->imbalance = busiest->group_misfit_task_load;

> > +             return;

> > +     }

> > +

> > +     if (busiest->group_type == group_asym_packing) {

> > +             /*

> > +              * In case of asym capacity, we will try to migrate all load to

> > +              * the preferred CPU.

> > +              */

> > +             env->migration_type = migrate_task;

> > +             env->imbalance = busiest->sum_h_nr_running;

> > +             return;

> > +     }

> > +

> > +     if (busiest->group_type == group_imbalanced) {

> > +             /*

> > +              * In the group_imb case we cannot rely on group-wide averages

> > +              * to ensure CPU-load equilibrium, try to move any task to fix

> > +              * the imbalance. The next load balance will take care of

> > +              * balancing back the system.

> > +              */

> > +             env->migration_type = migrate_task;

> > +             env->imbalance = 1;

> >               return;

> >       }

> >

> >       /*

> > -      * Avg load of busiest sg can be less and avg load of local sg can

> > -      * be greater than avg load across all sgs of sd because avg load

> > -      * factors in sg capacity and sgs with smaller group_type are

> > -      * skipped when updating the busiest sg:

> > +      * Try to use spare capacity of local group without overloading it or

> > +      * emptying busiest

> >        */

> > -     if (busiest->group_type != group_misfit_task &&

> > -         (busiest->avg_load <= sds->avg_load ||

> > -          local->avg_load >= sds->avg_load)) {

> > -             env->imbalance = 0;

> > +     if (local->group_type == group_has_spare) {

> > +             if (busiest->group_type > group_fully_busy) {

> > +                     /*

> > +                      * If busiest is overloaded, try to fill spare

> > +                      * capacity. This might end up creating spare capacity

> > +                      * in busiest or busiest still being overloaded but

> > +                      * there is no simple way to directly compute the

> > +                      * amount of load to migrate in order to balance the

> > +                      * system.

> > +                      */

>

> busiest may not be overloaded, it may be imbalanced. Maybe the

> distinction is irrelevant though.


the case busiest->group_type == group_imbalanced has already been
handled earlier in the function

>

> > +                     env->migration_type = migrate_util;

> > +                     env->imbalance = max(local->group_capacity, local->group_util) -

> > +                                      local->group_util;

> > +

> > +                     /*

> > +                      * In some case, the group's utilization is max or even

> > +                      * higher than capacity because of migrations but the

> > +                      * local CPU is (newly) idle. There is at least one

> > +                      * waiting task in this overloaded busiest group. Let

> > +                      * try to pull it.

> > +                      */

> > +                     if (env->idle != CPU_NOT_IDLE && env->imbalance == 0) {

> > +                             env->migration_type = migrate_task;

> > +                             env->imbalance = 1;

> > +                     }

> > +

>

> Not convinced this is a good thing to do across NUMA domains. If it was

> tied to the group being definitely overloaded then I could see the logic.

>

> > +                     return;

> > +             }

> > +

> > +             if (busiest->group_weight == 1 || sds->prefer_sibling) {

> > +                     unsigned int nr_diff = busiest->sum_h_nr_running;

> > +                     /*

> > +                      * When prefer sibling, evenly spread running tasks on

> > +                      * groups.

> > +                      */

> > +                     env->migration_type = migrate_task;

> > +                     lsub_positive(&nr_diff, local->sum_h_nr_running);

> > +                     env->imbalance = nr_diff >> 1;

> > +                     return;

> > +             }

>

> Comment is slightly misleading given that it's not just preferring

> siblings but for when balancing across single CPU domains.

>

> > +

> > +             /*

> > +              * If there is no overload, we just want to even the number of

> > +              * idle cpus.

> > +              */

> > +             env->migration_type = migrate_task;

> > +             env->imbalance = max_t(long, 0, (local->idle_cpus -

> > +                                              busiest->idle_cpus) >> 1);

> >               return;

> >       }

>

> Why do we want an even number of idle CPUs unconditionally? This goes back

> to the NUMA domain case. 2 communicating tasks running on a 2-socket system

> should not be pulled apart just to have 1 task running on each socket.

>

> I didn't see anything obvious after this point but I also am getting a

> bit on the fried side trying to hold this entire patch in my head and

> got hung up on the NUMA domain balancing in particular.

>

> --

> Mel Gorman

> SUSE Labs
Mel Gorman Oct. 31, 2019, 10:15 a.m. UTC | #4
On Thu, Oct 31, 2019 at 10:09:17AM +0100, Vincent Guittot wrote:
> On Wed, 30 Oct 2019 at 16:45, Mel Gorman <mgorman@techsingularity.net> wrote:

> >

> > On Fri, Oct 18, 2019 at 03:26:31PM +0200, Vincent Guittot wrote:

> > > The load_balance algorithm contains some heuristics which have become

> > > meaningless since the rework of the scheduler's metrics like the

> > > introduction of PELT.

> > >

> > > Furthermore, load is an ill-suited metric for solving certain task

> > > placement imbalance scenarios. For instance, in the presence of idle CPUs,

> > > we should simply try to get at least one task per CPU, whereas the current

> > > load-based algorithm can actually leave idle CPUs alone simply because the

> > > load is somewhat balanced. The current algorithm ends up creating virtual

> > > and meaningless value like the avg_load_per_task or tweaks the state of a

> > > group to make it overloaded whereas it's not, in order to try to migrate

> > > tasks.

> > >

> >

> > I do not think this is necessarily 100% true. With both the previous

> > load-balancing behaviour and the apparent behaviour of this patch, it's

> > still possible to pull two communicating tasks apart and across NUMA

> > domains when utilisation is low. Specifically, a load difference of less

> > than SCHED_CAPACITY_SCALE between NUMA nodes can be enough to migrate

> > tasks to level out load.

> >

> > So, load might be ill-suited for some cases but that does not make it

> > completely useless either.

> 

> I fully agree and we keep using it in some cases.

> The goal is only to not use it when it is obviously the wrong metric to be used

> 


Understood. Ideally it'd be explicit why each metric (task/util/load)
is used in each case and why it's the best choice for a given
situation, for future reference. It's not a requirement for the series
as the scheduler does not have a perfect history of explaining itself.
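
To make that concrete for my own notes, the mapping in this version seems
to be roughly as follows, read from the calculate_imbalance() hunks, so
correct me if I've misread any of it:

	group_misfit_task     -> migrate_misfit, imbalance = misfit task load
	group_asym_packing    -> migrate_task,   imbalance = busiest nr_running
	group_imbalanced      -> migrate_task,   imbalance = 1
	local has_spare and
	  busiest overloaded  -> migrate_util,   imbalance = local spare capacity
	                         (or pull 1 task when (newly) idle with no spare)
	local has_spare and
	  prefer_sibling or
	  single-CPU groups   -> migrate_task,   imbalance = nr_running diff / 2
	local has_spare
	  otherwise           -> migrate_task,   imbalance = idle CPUs diff / 2
	remaining cases       -> presumably migrate_load based on avg_load; that
	                         part is not in the hunks quoted here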

> >

> > The type of behaviour can be seen by running netperf via mmtests

> > (configuration file configs/config-network-netperf-unbound) on a NUMA

> > machine and noting that the local vs remote NUMA hinting faults are roughly

> > 50%. I had prototyped some fixes around this that took imbalance_pct into

> > account but it was too special-cased and was not a universal win. If

> > I was reviewing my own patch I would have naked it on the "you added a

> > special-case hack into the load balancer for one load". I didn't get back

> > to it before getting cc'd on this series.

> >

> > > load_balance should better qualify the imbalance of the group and clearly

> > > define what has to be moved to fix this imbalance.

> > >

> > > The type of sched_group has been extended to better reflect the type of

> > > imbalance. We now have :

> > >       group_has_spare

> > >       group_fully_busy

> > >       group_misfit_task

> > >       group_asym_packing

> > >       group_imbalanced

> > >       group_overloaded

> > >

> > > Based on the type of sched_group, load_balance now sets what it wants to

> > > move in order to fix the imbalance. It can be some load as before but also

> > > some utilization, a number of task or a type of task:

> > >       migrate_task

> > >       migrate_util

> > >       migrate_load

> > >       migrate_misfit

> > >

> > > This new load_balance algorithm fixes several pending wrong tasks

> > > placement:

> > > - the 1 task per CPU case with asymmetric system

> > > - the case of cfs task preempted by other class

> > > - the case of tasks not evenly spread on groups with spare capacity

> > >

> >

> > On the last one, spreading tasks evenly across NUMA domains is not

> > necessarily a good idea. If I have 2 tasks running on a 2-socket machine

> > with 24 logical CPUs per socket, it should not automatically mean that

> > one task should move cross-node and I have definitely observed this

> > happening. It's probably bad in terms of locality no matter what but it's

> > especially bad if the 2 tasks happened to be communicating because then

> > load balancing will pull apart the tasks while wake_affine will push

> > them together (and potentially NUMA balancing as well). Note that this

> > also applies for some IO workloads because, depending on the filesystem,

> > the task may be communicating with workqueues (XFS) or a kernel thread

> > (ext4 with jbd2).

> 

> This rework doesn't touch the NUMA_BALANCING part and NUMA balancing

> still gives guidances with fbq_classify_group/queue.


I know the NUMA_BALANCING part is not touched; I'm talking about load
balancing across SD_NUMA domains which happens independently of
NUMA_BALANCING. In fact, there is logic in NUMA_BALANCING that tries to
override the load balancer when it moves tasks away from the preferred
node.

> But the latter could also take advantage of the new type of group. For

> example, what I did in the fix for find_idlest_group : checking

> numa_preferred_nid when the group has capacity and keep the task on

> preferred node if possible. Similar behavior could also be beneficial

> in periodic load_balance case.

> 


And this is the catch -- numa_preferred_nid is not guaranteed to be set at
all. NUMA balancing might be disabled, the task may not have been running
long enough to pick a preferred NID or NUMA balancing might be unable to
pick a preferred NID. The decision to avoid unnecessary migrations across
NUMA domains should be made independently of NUMA balancing. The netperf
configuration from mmtests is great at illustrating the point because it'll
also say what the average local/remote access ratio is. 2 communicating
tasks running on an otherwise idle NUMA machine should not have the load
balancer move the server to one node and the client to another.

Historically, we might have accounted for this with imbalance_pct which
makes sense for load and was special cased in some places, but it does
not make sense to use imbalance_pct for nr_running. Either way, I think
balancing across SD_NUMA should have explicit logic to take into account
that there is an additional cost outside of the scheduler when a task
moves cross-node.

Even if it's a case that this series does not tackle the problem now,
it'd be good to leave a TODO comment behind noting that SD_NUMA may need
to be special cased.
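
To make the idea concrete, I am thinking of something along these lines in
calculate_imbalance(), next to the group_has_spare handling. It is only a
sketch to illustrate the point: it reuses env and busiest from the
surrounding function and the threshold value is made up:

	if (env->sd->flags & SD_NUMA) {
		/*
		 * Illustrative only: tolerate a small difference in the
		 * number of running tasks between NUMA nodes so that a
		 * pair of communicating tasks is not pulled apart just
		 * to even out idle CPUs.
		 */
		unsigned int imbalance_min = 2;	/* made-up threshold */

		if (busiest->sum_h_nr_running <= imbalance_min) {
			env->imbalance = 0;
			return;
		}
	}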

> > > @@ -8022,9 +8076,16 @@ static inline void update_sg_lb_stats(struct lb_env *env,

> > >               /*

> > >                * No need to call idle_cpu() if nr_running is not 0

> > >                */

> > > -             if (!nr_running && idle_cpu(i))

> > > +             if (!nr_running && idle_cpu(i)) {

> > >                       sgs->idle_cpus++;

> > > +                     /* Idle cpu can't have misfit task */

> > > +                     continue;

> > > +             }

> > > +

> > > +             if (local_group)

> > > +                     continue;

> > >

> > > +             /* Check for a misfit task on the cpu */

> > >               if (env->sd->flags & SD_ASYM_CPUCAPACITY &&

> > >                   sgs->group_misfit_task_load < rq->misfit_task_load) {

> > >                       sgs->group_misfit_task_load = rq->misfit_task_load;

> >

> > So.... why exactly do we not care about misfit tasks on CPUs in the

> > local group? I'm not saying you're wrong because you have a clear idea

> > on how misfit tasks should be treated but it's very non-obvious just

> > from the code.

> 

> local_group can't do anything with local misfit tasks so it doesn't

> give any additional information compared to overloaded, fully_busy or

> has_spare

> 


Ok, that's very clear and now I'm feeling a bit stupid because I should
have spotted that. It really could do with a comment so somebody else
does not try "fixing" it :(
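
Something as simple as this above the early exit would do, with the wording
being only a suggestion:

	/*
	 * The local group is the destination, so it can never pull its
	 * own misfit task. Tracking misfit load here would add nothing
	 * on top of overloaded/fully_busy/has_spare.
	 */
	if (local_group)
		continue;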

> > > <SNIP>

> > >

> > > @@ -8079,62 +8154,80 @@ static bool update_sd_pick_busiest(struct lb_env *env,

> > >       if (sgs->group_type < busiest->group_type)

> > >               return false;

> > >

> > > -     if (sgs->avg_load <= busiest->avg_load)

> > > -             return false;

> > > -

> > > -     if (!(env->sd->flags & SD_ASYM_CPUCAPACITY))

> > > -             goto asym_packing;

> > > -

> > >       /*

> > > -      * Candidate sg has no more than one task per CPU and

> > > -      * has higher per-CPU capacity. Migrating tasks to less

> > > -      * capable CPUs may harm throughput. Maximize throughput,

> > > -      * power/energy consequences are not considered.

> > > +      * The candidate and the current busiest group are the same type of

> > > +      * group. Let check which one is the busiest according to the type.

> > >        */

> > > -     if (sgs->sum_h_nr_running <= sgs->group_weight &&

> > > -         group_smaller_min_cpu_capacity(sds->local, sg))

> > > -             return false;

> > >

> > > -     /*

> > > -      * If we have more than one misfit sg go with the biggest misfit.

> > > -      */

> > > -     if (sgs->group_type == group_misfit_task &&

> > > -         sgs->group_misfit_task_load < busiest->group_misfit_task_load)

> > > +     switch (sgs->group_type) {

> > > +     case group_overloaded:

> > > +             /* Select the overloaded group with highest avg_load. */

> > > +             if (sgs->avg_load <= busiest->avg_load)

> > > +                     return false;

> > > +             break;

> > > +

> > > +     case group_imbalanced:

> > > +             /*

> > > +              * Select the 1st imbalanced group as we don't have any way to

> > > +              * choose one more than another.

> > > +              */

> > >               return false;

> > >

> > > -asym_packing:

> > > -     /* This is the busiest node in its class. */

> > > -     if (!(env->sd->flags & SD_ASYM_PACKING))

> > > -             return true;

> > > +     case group_asym_packing:

> > > +             /* Prefer to move from lowest priority CPU's work */

> > > +             if (sched_asym_prefer(sg->asym_prefer_cpu, sds->busiest->asym_prefer_cpu))

> > > +                     return false;

> > > +             break;

> > >

> >

> > Again, I'm not seeing what prevents a !SD_ASYM_PACKING domain checking

> > sched_asym_prefer.

> 

> the test is done when collecting group's statistic in update_sg_lb_stats()

> /* Check if dst cpu is idle and preferred to this group */

> if (env->sd->flags & SD_ASYM_PACKING &&

>     env->idle != CPU_NOT_IDLE &&

>     sgs->sum_h_nr_running &&

>     sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu)) {

> sgs->group_asym_packing = 1;

> }

> 

> Then the group type group_asym_packing is only set if

> sgs->group_asym_packing has been set

> 


Yeah, sorry. I need to get ASYM_PACKING clearer in my head.
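
For anyone else trying to page it back in: sched_asym_prefer() itself is
only a CPU priority comparison, which from memory is roughly

	static inline bool sched_asym_prefer(int a, int b)
	{
		return arch_asym_cpu_priority(a) > arch_asym_cpu_priority(b);
	}

so the group_asym_packing case above just prefers pulling work from the
group whose preferred CPU has the lowest priority, with the SD_ASYM_PACKING
gating done earlier when the group stats are collected.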

> >

> > > <SNIP>

> > > +     case group_fully_busy:

> > > +             /*

> > > +              * Select the fully busy group with highest avg_load. In

> > > +              * theory, there is no need to pull task from such kind of

> > > +              * group because tasks have all compute capacity that they need

> > > +              * but we can still improve the overall throughput by reducing

> > > +              * contention when accessing shared HW resources.

> > > +              *

> > > +              * XXX for now avg_load is not computed and always 0 so we

> > > +              * select the 1st one.

> > > +              */

> > > +             if (sgs->avg_load <= busiest->avg_load)

> > > +                     return false;

> > > +             break;

> > > +

> >

> > With the exception that if we are balancing between NUMA domains and they

> > were communicating tasks that we've now pulled them apart. That might

> > increase the CPU resources available at the cost of increased remote

> > memory access cost.

> 

> I expect the numa classification to help and skip those runqueues

> 


It might but the "canary in the mine" is netperf. A basic pair should
not be pulled apart.
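
For reference, the test I keep referring to is driven from mmtests roughly
like this. This is from memory, so check the project README for the exact
invocation and the name of the comparison helper:

	# from a checkout of https://github.com/gormanm/mmtests
	./run-mmtests.sh --config configs/config-network-netperf-unbound tip
	./run-mmtests.sh --config configs/config-network-netperf-unbound rework
	# the comparison scripts then report the NUMA hinting fault locality
	# alongside the throughput figures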

> > > @@ -8273,69 +8352,149 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd

> > >   */

> > >  static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *sds)

> > >  {

> > > -     unsigned long max_pull, load_above_capacity = ~0UL;

> > >       struct sg_lb_stats *local, *busiest;

> > >

> > >       local = &sds->local_stat;

> > >       busiest = &sds->busiest_stat;

> > >

> > > -     if (busiest->group_asym_packing) {

> > > -             env->imbalance = busiest->group_load;

> > > +     if (busiest->group_type == group_misfit_task) {

> > > +             /* Set imbalance to allow misfit task to be balanced. */

> > > +             env->migration_type = migrate_misfit;

> > > +             env->imbalance = busiest->group_misfit_task_load;

> > > +             return;

> > > +     }

> > > +

> > > +     if (busiest->group_type == group_asym_packing) {

> > > +             /*

> > > +              * In case of asym capacity, we will try to migrate all load to

> > > +              * the preferred CPU.

> > > +              */

> > > +             env->migration_type = migrate_task;

> > > +             env->imbalance = busiest->sum_h_nr_running;

> > > +             return;

> > > +     }

> > > +

> > > +     if (busiest->group_type == group_imbalanced) {

> > > +             /*

> > > +              * In the group_imb case we cannot rely on group-wide averages

> > > +              * to ensure CPU-load equilibrium, try to move any task to fix

> > > +              * the imbalance. The next load balance will take care of

> > > +              * balancing back the system.

> > > +              */

> > > +             env->migration_type = migrate_task;

> > > +             env->imbalance = 1;

> > >               return;

> > >       }

> > >

> > >       /*

> > > -      * Avg load of busiest sg can be less and avg load of local sg can

> > > -      * be greater than avg load across all sgs of sd because avg load

> > > -      * factors in sg capacity and sgs with smaller group_type are

> > > -      * skipped when updating the busiest sg:

> > > +      * Try to use spare capacity of local group without overloading it or

> > > +      * emptying busiest

> > >        */

> > > -     if (busiest->group_type != group_misfit_task &&

> > > -         (busiest->avg_load <= sds->avg_load ||

> > > -          local->avg_load >= sds->avg_load)) {

> > > -             env->imbalance = 0;

> > > +     if (local->group_type == group_has_spare) {

> > > +             if (busiest->group_type > group_fully_busy) {

> > > +                     /*

> > > +                      * If busiest is overloaded, try to fill spare

> > > +                      * capacity. This might end up creating spare capacity

> > > +                      * in busiest or busiest still being overloaded but

> > > +                      * there is no simple way to directly compute the

> > > +                      * amount of load to migrate in order to balance the

> > > +                      * system.

> > > +                      */

> >

> > busiest may not be overloaded, it may be imbalanced. Maybe the

> > distinction is irrelevant though.

> 

> the case busiest->group_type == group_imbalanced has already been

> handled earlier in the function

> 


Bah, of course.

-- 
Mel Gorman
SUSE Labs
Vincent Guittot Oct. 31, 2019, 11:13 a.m. UTC | #5
On Thu, 31 Oct 2019 at 11:15, Mel Gorman <mgorman@techsingularity.net> wrote:
>

> On Thu, Oct 31, 2019 at 10:09:17AM +0100, Vincent Guittot wrote:

> > On Wed, 30 Oct 2019 at 16:45, Mel Gorman <mgorman@techsingularity.net> wrote:

> > >

> > > On Fri, Oct 18, 2019 at 03:26:31PM +0200, Vincent Guittot wrote:

> > > > The load_balance algorithm contains some heuristics which have become

> > > > meaningless since the rework of the scheduler's metrics like the

> > > > introduction of PELT.

> > > >

> > > > Furthermore, load is an ill-suited metric for solving certain task

> > > > placement imbalance scenarios. For instance, in the presence of idle CPUs,

> > > > we should simply try to get at least one task per CPU, whereas the current

> > > > load-based algorithm can actually leave idle CPUs alone simply because the

> > > > load is somewhat balanced. The current algorithm ends up creating virtual

> > > > and meaningless value like the avg_load_per_task or tweaks the state of a

> > > > group to make it overloaded whereas it's not, in order to try to migrate

> > > > tasks.

> > > >

> > >

> > > I do not think this is necessarily 100% true. With both the previous

> > > load-balancing behaviour and the apparent behaviour of this patch, it's

> > > still possible to pull two communicating tasks apart and across NUMA

> > > domains when utilisation is low. Specifically, a load difference of less

> > > than SCHED_CAPACITY_SCALE between NUMA nodes can be enough to migrate

> > > tasks to level out load.

> > >

> > > So, load might be ill-suited for some cases but that does not make it

> > > completely useless either.

> >

> > I fully agree and we keep using it in some cases.

> > The goal is only to not use it when it is obviously the wrong metric to be used

> >

>

> Understood, ideally it's be explicit why each metric (task/util/load)

> is used each time for future reference and why it's the best for a given

> situation. It's not a requirement for the series as the scheduler does

> not have a perfect history of explaining itself.

>

> > >

> > > The type of behaviour can be seen by running netperf via mmtests

> > > (configuration file configs/config-network-netperf-unbound) on a NUMA

> > > machine and noting that the local vs remote NUMA hinting faults are roughly

> > > 50%. I had prototyped some fixes around this that took imbalance_pct into

> > > account but it was too special-cased and was not a universal win. If

> > > I was reviewing my own patch I would have naked it on the "you added a

> > > special-case hack into the load balancer for one load". I didn't get back

> > > to it before getting cc'd on this series.

> > >

> > > > load_balance should better qualify the imbalance of the group and clearly

> > > > define what has to be moved to fix this imbalance.

> > > >

> > > > The type of sched_group has been extended to better reflect the type of

> > > > imbalance. We now have :

> > > >       group_has_spare

> > > >       group_fully_busy

> > > >       group_misfit_task

> > > >       group_asym_packing

> > > >       group_imbalanced

> > > >       group_overloaded

> > > >

> > > > Based on the type of sched_group, load_balance now sets what it wants to

> > > > move in order to fix the imbalance. It can be some load as before but also

> > > > some utilization, a number of task or a type of task:

> > > >       migrate_task

> > > >       migrate_util

> > > >       migrate_load

> > > >       migrate_misfit

> > > >

> > > > This new load_balance algorithm fixes several pending wrong tasks

> > > > placement:

> > > > - the 1 task per CPU case with asymmetric system

> > > > - the case of cfs task preempted by other class

> > > > - the case of tasks not evenly spread on groups with spare capacity

> > > >

> > >

> > > On the last one, spreading tasks evenly across NUMA domains is not

> > > necessarily a good idea. If I have 2 tasks running on a 2-socket machine

> > > with 24 logical CPUs per socket, it should not automatically mean that

> > > one task should move cross-node and I have definitely observed this

> > > happening. It's probably bad in terms of locality no matter what but it's

> > > especially bad if the 2 tasks happened to be communicating because then

> > > load balancing will pull apart the tasks while wake_affine will push

> > > them together (and potentially NUMA balancing as well). Note that this

> > > also applies for some IO workloads because, depending on the filesystem,

> > > the task may be communicating with workqueues (XFS) or a kernel thread

> > > (ext4 with jbd2).

> >

> > This rework doesn't touch the NUMA_BALANCING part and NUMA balancing

> > still gives guidances with fbq_classify_group/queue.

>

> I know the NUMA_BALANCING part is not touched, I'm talking about load

> balancing across SD_NUMA domains which happens independently of

> NUMA_BALANCING. In fact, there is logic in NUMA_BALANCING that tries to

> override the load balancer when it moves tasks away from the preferred

> node.


Yes, this patchset relies on this override for now to prevent moving tasks away.
I agree that additional patches are probably needed to improve load
balance at the NUMA level and I expect that this rework will make them
simpler to add.
I just wanted to get the output of some real use cases before defining
more NUMA-level specific conditions. Some want to spread across their NUMA
nodes but others want to keep everything together. The preferred node
and fbq_classify_group were the only sensible metrics to me when I
wrote this patchset, but changes can be added if they make sense.
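
For reference, the guidance I mean comes from the fbq_type classification
used when picking the busiest group and queue. From memory it is roughly
the following, so treat the exact fields as approximate:

	static inline enum fbq_type fbq_classify_group(struct sg_lb_stats *sgs)
	{
		if (sgs->sum_h_nr_running > sgs->nr_numa_running)
			return regular;	/* some non-NUMA tasks in the group */
		if (sgs->sum_h_nr_running > sgs->nr_preferred_running)
			return remote;	/* some tasks run off their preferred node */
		return all;		/* every task sits on its preferred node */
	}

and load_balance() then prefers pulling from runqueues whose tasks are not
on their preferred node.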

>

> > But the latter could also take advantage of the new type of group. For

> > example, what I did in the fix for find_idlest_group : checking

> > numa_preferred_nid when the group has capacity and keep the task on

> > preferred node if possible. Similar behavior could also be beneficial

> > in periodic load_balance case.

> >

>

> And this is the catch -- numa_preferred_nid is not guaranteed to be set at

> all. NUMA balancing might be disabled, the task may not have been running

> long enough to pick a preferred NID or NUMA balancing might be unable to

> pick a preferred NID. The decision to avoid unnecessary migrations across

> NUMA domains should be made independently of NUMA balancing. The netperf

> configuration from mmtests is great at illustrating the point because it'll

> also say what the average local/remote access ratio is. 2 communicating

> tasks running on an otherwise idle NUMA machine should not have the load

> balancer move the server to one node and the client to another.


I'm going to give it a try on my setup and see the results

>

> Historically, we might have accounted for this with imbalance_pct which

> makes sense for load and was special cased in some places, but it does

> not make sense to use imbalance_pct for nr_running. Either way, I think

> balancing across SD_NUMA should have explicit logic to take into account

> that there is an additional cost outside of the scheduler when a task

> moves cross-node.

>

> Even if it's a case that this series does not tackle the problem now,

> it'd be good to leave a TODO comment behind noting that SD_NUMA may need

> to be special cased.

>

> > > > @@ -8022,9 +8076,16 @@ static inline void update_sg_lb_stats(struct lb_env *env,

> > > >               /*

> > > >                * No need to call idle_cpu() if nr_running is not 0

> > > >                */

> > > > -             if (!nr_running && idle_cpu(i))

> > > > +             if (!nr_running && idle_cpu(i)) {

> > > >                       sgs->idle_cpus++;

> > > > +                     /* Idle cpu can't have misfit task */

> > > > +                     continue;

> > > > +             }

> > > > +

> > > > +             if (local_group)

> > > > +                     continue;

> > > >

> > > > +             /* Check for a misfit task on the cpu */

> > > >               if (env->sd->flags & SD_ASYM_CPUCAPACITY &&

> > > >                   sgs->group_misfit_task_load < rq->misfit_task_load) {

> > > >                       sgs->group_misfit_task_load = rq->misfit_task_load;

> > >

> > > So.... why exactly do we not care about misfit tasks on CPUs in the

> > > local group? I'm not saying you're wrong because you have a clear idea

> > > on how misfit tasks should be treated but it's very non-obvious just

> > > from the code.

> >

> > local_group can't do anything with local misfit tasks so it doesn't

> > give any additional information compared to overloaded, fully_busy or

> > has_spare

> >

>

> Ok, that's very clear and now I'm feeling a bit stupid because I should

> have spotted that. It really could do with a comment so somebody else

> does not try "fixing" it :(

>

> > > > <SNIP>

> > > >

> > > > @@ -8079,62 +8154,80 @@ static bool update_sd_pick_busiest(struct lb_env *env,

> > > >       if (sgs->group_type < busiest->group_type)

> > > >               return false;

> > > >

> > > > -     if (sgs->avg_load <= busiest->avg_load)

> > > > -             return false;

> > > > -

> > > > -     if (!(env->sd->flags & SD_ASYM_CPUCAPACITY))

> > > > -             goto asym_packing;

> > > > -

> > > >       /*

> > > > -      * Candidate sg has no more than one task per CPU and

> > > > -      * has higher per-CPU capacity. Migrating tasks to less

> > > > -      * capable CPUs may harm throughput. Maximize throughput,

> > > > -      * power/energy consequences are not considered.

> > > > +      * The candidate and the current busiest group are the same type of

> > > > +      * group. Let check which one is the busiest according to the type.

> > > >        */

> > > > -     if (sgs->sum_h_nr_running <= sgs->group_weight &&

> > > > -         group_smaller_min_cpu_capacity(sds->local, sg))

> > > > -             return false;

> > > >

> > > > -     /*

> > > > -      * If we have more than one misfit sg go with the biggest misfit.

> > > > -      */

> > > > -     if (sgs->group_type == group_misfit_task &&

> > > > -         sgs->group_misfit_task_load < busiest->group_misfit_task_load)

> > > > +     switch (sgs->group_type) {

> > > > +     case group_overloaded:

> > > > +             /* Select the overloaded group with highest avg_load. */

> > > > +             if (sgs->avg_load <= busiest->avg_load)

> > > > +                     return false;

> > > > +             break;

> > > > +

> > > > +     case group_imbalanced:

> > > > +             /*

> > > > +              * Select the 1st imbalanced group as we don't have any way to

> > > > +              * choose one more than another.

> > > > +              */

> > > >               return false;

> > > >

> > > > -asym_packing:

> > > > -     /* This is the busiest node in its class. */

> > > > -     if (!(env->sd->flags & SD_ASYM_PACKING))

> > > > -             return true;

> > > > +     case group_asym_packing:

> > > > +             /* Prefer to move from lowest priority CPU's work */

> > > > +             if (sched_asym_prefer(sg->asym_prefer_cpu, sds->busiest->asym_prefer_cpu))

> > > > +                     return false;

> > > > +             break;

> > > >

> > >

> > > Again, I'm not seeing what prevents a !SD_ASYM_PACKING domain checking

> > > sched_asym_prefer.

> >

> > the test is done when collecting group's statistic in update_sg_lb_stats()

> > /* Check if dst cpu is idle and preferred to this group */

> > if (env->sd->flags & SD_ASYM_PACKING &&

> >     env->idle != CPU_NOT_IDLE &&

> >     sgs->sum_h_nr_running &&

> >     sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu)) {

> > sgs->group_asym_packing = 1;

> > }

> >

> > Then the group type group_asym_packing is only set if

> > sgs->group_asym_packing has been set

> >

>

> Yeah, sorry. I need to get ASYM_PACKING clearer in my head.

>

> > >

> > > > <SNIP>

> > > > +     case group_fully_busy:

> > > > +             /*

> > > > +              * Select the fully busy group with highest avg_load. In

> > > > +              * theory, there is no need to pull task from such kind of

> > > > +              * group because tasks have all compute capacity that they need

> > > > +              * but we can still improve the overall throughput by reducing

> > > > +              * contention when accessing shared HW resources.

> > > > +              *

> > > > +              * XXX for now avg_load is not computed and always 0 so we

> > > > +              * select the 1st one.

> > > > +              */

> > > > +             if (sgs->avg_load <= busiest->avg_load)

> > > > +                     return false;

> > > > +             break;

> > > > +

> > >

> > > With the exception that if we are balancing between NUMA domains and they

> > > were communicating tasks that we've now pulled them apart. That might

> > > increase the CPU resources available at the cost of increased remote

> > > memory access cost.

> >

> > I expect the numa classification to help and skip those runqueue

> >

>

> It might but the "canary in the mine" is netperf. A basic pair should

> not be pulled apart.

>

> > > > @@ -8273,69 +8352,149 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd

> > > >   */

> > > >  static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *sds)

> > > >  {

> > > > -     unsigned long max_pull, load_above_capacity = ~0UL;

> > > >       struct sg_lb_stats *local, *busiest;

> > > >

> > > >       local = &sds->local_stat;

> > > >       busiest = &sds->busiest_stat;

> > > >

> > > > -     if (busiest->group_asym_packing) {

> > > > -             env->imbalance = busiest->group_load;

> > > > +     if (busiest->group_type == group_misfit_task) {

> > > > +             /* Set imbalance to allow misfit task to be balanced. */

> > > > +             env->migration_type = migrate_misfit;

> > > > +             env->imbalance = busiest->group_misfit_task_load;

> > > > +             return;

> > > > +     }

> > > > +

> > > > +     if (busiest->group_type == group_asym_packing) {

> > > > +             /*

> > > > +              * In case of asym capacity, we will try to migrate all load to

> > > > +              * the preferred CPU.

> > > > +              */

> > > > +             env->migration_type = migrate_task;

> > > > +             env->imbalance = busiest->sum_h_nr_running;

> > > > +             return;

> > > > +     }

> > > > +

> > > > +     if (busiest->group_type == group_imbalanced) {

> > > > +             /*

> > > > +              * In the group_imb case we cannot rely on group-wide averages

> > > > +              * to ensure CPU-load equilibrium, try to move any task to fix

> > > > +              * the imbalance. The next load balance will take care of

> > > > +              * balancing back the system.

> > > > +              */

> > > > +             env->migration_type = migrate_task;

> > > > +             env->imbalance = 1;

> > > >               return;

> > > >       }

> > > >

> > > >       /*

> > > > -      * Avg load of busiest sg can be less and avg load of local sg can

> > > > -      * be greater than avg load across all sgs of sd because avg load

> > > > -      * factors in sg capacity and sgs with smaller group_type are

> > > > -      * skipped when updating the busiest sg:

> > > > +      * Try to use spare capacity of local group without overloading it or

> > > > +      * emptying busiest

> > > >        */

> > > > -     if (busiest->group_type != group_misfit_task &&

> > > > -         (busiest->avg_load <= sds->avg_load ||

> > > > -          local->avg_load >= sds->avg_load)) {

> > > > -             env->imbalance = 0;

> > > > +     if (local->group_type == group_has_spare) {

> > > > +             if (busiest->group_type > group_fully_busy) {

> > > > +                     /*

> > > > +                      * If busiest is overloaded, try to fill spare

> > > > +                      * capacity. This might end up creating spare capacity

> > > > +                      * in busiest or busiest still being overloaded but

> > > > +                      * there is no simple way to directly compute the

> > > > +                      * amount of load to migrate in order to balance the

> > > > +                      * system.

> > > > +                      */

> > >

> > > busiest may not be overloaded, it may be imbalanced. Maybe the

> > > distinction is irrelevant though.

> >

> > the case busiest->group_type == group_imbalanced has already been

> > handled earlier in the function

> >

>

> Bah, of course.

>

> --

> Mel Gorman

> SUSE Labs
Mel Gorman Oct. 31, 2019, 11:40 a.m. UTC | #6
On Thu, Oct 31, 2019 at 12:13:09PM +0100, Vincent Guittot wrote:
> > > > On the last one, spreading tasks evenly across NUMA domains is not

> > > > necessarily a good idea. If I have 2 tasks running on a 2-socket machine

> > > > with 24 logical CPUs per socket, it should not automatically mean that

> > > > one task should move cross-node and I have definitely observed this

> > > > happening. It's probably bad in terms of locality no matter what but it's

> > > > especially bad if the 2 tasks happened to be communicating because then

> > > > load balancing will pull apart the tasks while wake_affine will push

> > > > them together (and potentially NUMA balancing as well). Note that this

> > > > also applies for some IO workloads because, depending on the filesystem,

> > > > the task may be communicating with workqueues (XFS) or a kernel thread

> > > > (ext4 with jbd2).

> > >

> > > This rework doesn't touch the NUMA_BALANCING part and NUMA balancing

> > > still gives guidances with fbq_classify_group/queue.

> >

> > I know the NUMA_BALANCING part is not touched, I'm talking about load

> > balancing across SD_NUMA domains which happens independently of

> > NUMA_BALANCING. In fact, there is logic in NUMA_BALANCING that tries to

> > override the load balancer when it moves tasks away from the preferred

> > node.

> 

> Yes. this patchset relies on this override for now to prevent moving task away.


Fair enough, netperf hits the corner case where it does not work but
that is also true without your series.

> I agree that additional patches are probably needed to improve load

> balance at NUMA level and I expect that this rework will make it

> simpler to add.

> I just wanted to get the output of some real use cases before defining

> more numa level specific conditions. Some want to spread on there numa

> nodes but other want to keep everything together. The preferred node

> and fbq_classify_group was the only sensible metrics to me when he

> wrote this patchset but changes can be added if they make sense.

> 


That's fair. While it was possible to address the case before your
series, it was a hatchet job. If the changelog simply notes that some
special casing may still be required for SD_NUMA but it's outside the
scope of the series, then I'd be happy. At least there is then a good
chance that any follow-up work won't be interpreted as an attempt to
reintroduce hacky heuristics.

> >

> > > But the latter could also take advantage of the new type of group. For

> > > example, what I did in the fix for find_idlest_group : checking

> > > numa_preferred_nid when the group has capacity and keep the task on

> > > preferred node if possible. Similar behavior could also be beneficial

> > > in periodic load_balance case.

> > >

> >

> > And this is the catch -- numa_preferred_nid is not guaranteed to be set at

> > all. NUMA balancing might be disabled, the task may not have been running

> > long enough to pick a preferred NID or NUMA balancing might be unable to

> > pick a preferred NID. The decision to avoid unnecessary migrations across

> > NUMA domains should be made independently of NUMA balancing. The netperf

> > configuration from mmtests is great at illustrating the point because it'll

> > also say what the average local/remote access ratio is. 2 communicating

> > tasks running on an otherwise idle NUMA machine should not have the load

> > balancer move the server to one node and the client to another.

> 

> I'm going to make it a try on my setup to see the results

> 


Thanks.

-- 
Mel Gorman
SUSE Labs
Vincent Guittot Nov. 8, 2019, 4:35 p.m. UTC | #7
On Thursday 31 Oct 2019 at 11:40:20 (+0000), Mel Gorman wrote:
> On Thu, Oct 31, 2019 at 12:13:09PM +0100, Vincent Guittot wrote:

> > > > > On the last one, spreading tasks evenly across NUMA domains is not

> > > > > necessarily a good idea. If I have 2 tasks running on a 2-socket machine

> > > > > with 24 logical CPUs per socket, it should not automatically mean that

> > > > > one task should move cross-node and I have definitely observed this

> > > > > happening. It's probably bad in terms of locality no matter what but it's

> > > > > especially bad if the 2 tasks happened to be communicating because then

> > > > > load balancing will pull apart the tasks while wake_affine will push

> > > > > them together (and potentially NUMA balancing as well). Note that this

> > > > > also applies for some IO workloads because, depending on the filesystem,

> > > > > the task may be communicating with workqueues (XFS) or a kernel thread

> > > > > (ext4 with jbd2).

> > > >

> > > > This rework doesn't touch the NUMA_BALANCING part and NUMA balancing

> > > > still gives guidances with fbq_classify_group/queue.

> > >

> > > I know the NUMA_BALANCING part is not touched, I'm talking about load

> > > balancing across SD_NUMA domains which happens independently of

> > > NUMA_BALANCING. In fact, there is logic in NUMA_BALANCING that tries to

> > > override the load balancer when it moves tasks away from the preferred

> > > node.

> > 

> > Yes. this patchset relies on this override for now to prevent moving task away.

> 

> Fair enough, netperf hits the corner case where it does not work but

> that is also true without your series.


I ran the mmtests/netperf tests on my setup. It's a mix of small positive and
negative differences (see below)

netperf-udp
                                    5.3-rc2                5.3-rc2
                                        tip               +rwk+fix
Hmean     send-64          95.06 (   0.00%)       94.12 *  -0.99%*
Hmean     send-128        191.71 (   0.00%)      189.94 *  -0.93%*
Hmean     send-256        379.05 (   0.00%)      370.96 *  -2.14%*
Hmean     send-1024      1485.24 (   0.00%)     1476.64 *  -0.58%*
Hmean     send-2048      2894.80 (   0.00%)     2887.00 *  -0.27%*
Hmean     send-3312      4580.27 (   0.00%)     4555.91 *  -0.53%*
Hmean     send-4096      5592.99 (   0.00%)     5517.31 *  -1.35%*
Hmean     send-8192      9117.00 (   0.00%)     9497.06 *   4.17%*
Hmean     send-16384    15824.59 (   0.00%)    15824.30 *  -0.00%*
Hmean     recv-64          95.06 (   0.00%)       94.08 *  -1.04%*
Hmean     recv-128        191.68 (   0.00%)      189.89 *  -0.93%*
Hmean     recv-256        378.94 (   0.00%)      370.87 *  -2.13%*
Hmean     recv-1024      1485.24 (   0.00%)     1476.20 *  -0.61%*
Hmean     recv-2048      2893.52 (   0.00%)     2885.25 *  -0.29%*
Hmean     recv-3312      4580.27 (   0.00%)     4553.48 *  -0.58%*
Hmean     recv-4096      5592.99 (   0.00%)     5517.27 *  -1.35%*
Hmean     recv-8192      9115.69 (   0.00%)     9495.69 *   4.17%*
Hmean     recv-16384    15824.36 (   0.00%)    15818.36 *  -0.04%*
Stddev    send-64           0.15 (   0.00%)        1.17 (-688.29%)
Stddev    send-128          1.56 (   0.00%)        1.15 (  25.96%)
Stddev    send-256          4.20 (   0.00%)        5.27 ( -25.63%)
Stddev    send-1024        20.11 (   0.00%)        5.68 (  71.74%)
Stddev    send-2048        11.06 (   0.00%)       21.74 ( -96.50%)
Stddev    send-3312        61.10 (   0.00%)       48.03 (  21.39%)
Stddev    send-4096        71.84 (   0.00%)       31.99 (  55.46%)
Stddev    send-8192       165.14 (   0.00%)      159.99 (   3.12%)
Stddev    send-16384       81.30 (   0.00%)      188.65 (-132.05%)
Stddev    recv-64           0.15 (   0.00%)        1.15 (-673.42%)
Stddev    recv-128          1.58 (   0.00%)        1.14 (  28.27%)
Stddev    recv-256          4.29 (   0.00%)        5.19 ( -21.05%)
Stddev    recv-1024        20.11 (   0.00%)        5.70 (  71.67%)
Stddev    recv-2048        10.43 (   0.00%)       21.41 (-105.22%)
Stddev    recv-3312        61.10 (   0.00%)       46.92 (  23.20%)
Stddev    recv-4096        71.84 (   0.00%)       31.97 (  55.50%)
Stddev    recv-8192       163.90 (   0.00%)      160.88 (   1.84%)
Stddev    recv-16384       81.41 (   0.00%)      187.01 (-129.71%)

                     5.3-rc2     5.3-rc2
                         tip    +rwk+fix
Duration User          38.90       39.13
Duration System      1311.29     1311.10
Duration Elapsed     1892.82     1892.86
 
netperf-tcp
                               5.3-rc2                5.3-rc2
                                   tip               +rwk+fix
Hmean     64         871.30 (   0.00%)      860.90 *  -1.19%*
Hmean     128       1689.39 (   0.00%)     1679.31 *  -0.60%*
Hmean     256       3199.59 (   0.00%)     3241.98 *   1.32%*
Hmean     1024      9390.47 (   0.00%)     9268.47 *  -1.30%*
Hmean     2048     13373.95 (   0.00%)    13395.61 *   0.16%*
Hmean     3312     16701.30 (   0.00%)    17165.96 *   2.78%*
Hmean     4096     15831.03 (   0.00%)    15544.66 *  -1.81%*
Hmean     8192     19720.01 (   0.00%)    20188.60 *   2.38%*
Hmean     16384    23925.90 (   0.00%)    23914.50 *  -0.05%*
Stddev    64           7.38 (   0.00%)        4.23 (  42.67%)
Stddev    128         11.62 (   0.00%)       10.13 (  12.85%)
Stddev    256         34.33 (   0.00%)        7.94 (  76.88%)
Stddev    1024        35.61 (   0.00%)      116.34 (-226.66%)
Stddev    2048       285.30 (   0.00%)       80.50 (  71.78%)
Stddev    3312       304.74 (   0.00%)      449.08 ( -47.36%)
Stddev    4096       668.11 (   0.00%)      569.30 (  14.79%)
Stddev    8192       733.23 (   0.00%)      944.38 ( -28.80%)
Stddev    16384      553.03 (   0.00%)      299.44 (  45.86%)

                     5.3-rc2     5.3-rc2
                         tip    +rwk+fix
Duration User         138.05      140.95
Duration System      1210.60     1208.45
Duration Elapsed     1352.86     1352.90


> 

> > I agree that additional patches are probably needed to improve load

> > balance at NUMA level and I expect that this rework will make it

> > simpler to add.

> > I just wanted to get the output of some real use cases before defining

> > more numa level specific conditions. Some want to spread on there numa

> > nodes but other want to keep everything together. The preferred node

> > and fbq_classify_group was the only sensible metrics to me when he

> > wrote this patchset but changes can be added if they make sense.

> > 

> 

> That's fair. While it was possible to address the case before your

> series, it was a hatchet job. If the changelog simply notes that some

> special casing may still be required for SD_NUMA but it's outside the

> scope of the series, then I'd be happy. At least there is a good chance

> then if there is follow-up work that it won't be interpreted as an

> attempt to reintroduce hacky heuristics.

>


Would the additional comment below about the work to be done for
SD_NUMA make sense to you?

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0ad4b21..7e4cb65 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6960,11 +6960,34 @@ enum fbq_type { regular, remote, all };
  * group. see update_sd_pick_busiest().
  */
 enum group_type {
+	/*
+	 * The group has spare capacity that can be used to process more work.
+	 */
 	group_has_spare = 0,
+	/*
+	 * The group is fully used and the tasks don't compete for more CPU
+	 * cycles. Nevertheless, some tasks might wait before running.
+	 */
 	group_fully_busy,
+	/*
+	 * One task doesn't fit with the CPU's capacity and must be migrated to a
+	 * more powerful CPU.
+	 */
 	group_misfit_task,
+	/*
+	 * One local CPU with higher capacity is available and the task should be
+	 * migrated to it instead of the current CPU.
+	 */
 	group_asym_packing,
+	/*
+	 * The tasks' affinity prevents the scheduler from balancing the load across
+	 * the system.
+	 */
 	group_imbalanced,
+	/*
+	 * The CPU is overloaded and can't provide expected CPU cycles to all
+	 * tasks.
+	 */
 	group_overloaded
 };
 
@@ -8563,7 +8586,11 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 
 	/*
 	 * Try to use spare capacity of local group without overloading it or
-	 * emptying busiest
+	 * emptying busiest.
+	 * XXX Spreading tasks across numa nodes is not always the best policy
+	 * and special care should be taken at the SD_NUMA domain level before
+	 * spreading the tasks. For now, load_balance() fully relies on
+	 * NUMA_BALANCING and fbq_classify_group/rq to override the decision.
 	 */
 	if (local->group_type == group_has_spare) {
 		if (busiest->group_type > group_fully_busy) {
-- 
2.7.4

> 

> > >

> > > > But the latter could also take advantage of the new type of group. For

> > > > example, what I did in the fix for find_idlest_group : checking

> > > > numa_preferred_nid when the group has capacity and keep the task on

> > > > preferred node if possible. Similar behavior could also be beneficial

> > > > in periodic load_balance case.

> > > >

> > >

> > > And this is the catch -- numa_preferred_nid is not guaranteed to be set at

> > > all. NUMA balancing might be disabled, the task may not have been running

> > > long enough to pick a preferred NID or NUMA balancing might be unable to

> > > pick a preferred NID. The decision to avoid unnecessary migrations across

> > > NUMA domains should be made independently of NUMA balancing. The netperf

> > > configuration from mmtests is great at illustrating the point because it'll

> > > also say what the average local/remote access ratio is. 2 communicating

> > > tasks running on an otherwise idle NUMA machine should not have the load

> > > balancer move the server to one node and the client to another.

> > 

> > I'm going to make it a try on my setup to see the results

> > 

> 

> Thanks.

> 

> -- 

> Mel Gorman

> SUSE Labs
Mel Gorman Nov. 8, 2019, 6:37 p.m. UTC | #8
On Fri, Nov 08, 2019 at 05:35:01PM +0100, Vincent Guittot wrote:
> > Fair enough, netperf hits the corner case where it does not work but

> > that is also true without your series.

> 

> I run mmtest/netperf test on my setup. It's a mix of small positive or

> negative differences (see below)

> 

> <SNIP>

>

> netperf-tcp

>                                5.3-rc2                5.3-rc2

>                                    tip               +rwk+fix

> Hmean     64         871.30 (   0.00%)      860.90 *  -1.19%*

> Hmean     128       1689.39 (   0.00%)     1679.31 *  -0.60%*

> Hmean     256       3199.59 (   0.00%)     3241.98 *   1.32%*

> Hmean     1024      9390.47 (   0.00%)     9268.47 *  -1.30%*

> Hmean     2048     13373.95 (   0.00%)    13395.61 *   0.16%*

> Hmean     3312     16701.30 (   0.00%)    17165.96 *   2.78%*

> Hmean     4096     15831.03 (   0.00%)    15544.66 *  -1.81%*

> Hmean     8192     19720.01 (   0.00%)    20188.60 *   2.38%*

> Hmean     16384    23925.90 (   0.00%)    23914.50 *  -0.05%*

> Stddev    64           7.38 (   0.00%)        4.23 (  42.67%)

> Stddev    128         11.62 (   0.00%)       10.13 (  12.85%)

> Stddev    256         34.33 (   0.00%)        7.94 (  76.88%)

> Stddev    1024        35.61 (   0.00%)      116.34 (-226.66%)

> Stddev    2048       285.30 (   0.00%)       80.50 (  71.78%)

> Stddev    3312       304.74 (   0.00%)      449.08 ( -47.36%)

> Stddev    4096       668.11 (   0.00%)      569.30 (  14.79%)

> Stddev    8192       733.23 (   0.00%)      944.38 ( -28.80%)

> Stddev    16384      553.03 (   0.00%)      299.44 (  45.86%)

> 

>                      5.3-rc2     5.3-rc2

>                          tip    +rwk+fix

> Duration User         138.05      140.95

> Duration System      1210.60     1208.45

> Duration Elapsed     1352.86     1352.90

> 


This roughly matches what I've seen. The interesting part to me for
netperf is the next section of the report, which shows the locality of
the NUMA hinting faults. With netperf on a 2-socket machine, it's generally around
50% as the client/server are pulled apart. Because netperf is not
heavily memory bound, it doesn't have much impact on the overall
performance but it's good at catching the cross-node migrations.

> > 

> > > I agree that additional patches are probably needed to improve load

> > > balance at NUMA level and I expect that this rework will make it

> > > simpler to add.

> > > I just wanted to get the output of some real use cases before defining

> > > more numa level specific conditions. Some want to spread on there numa

> > > nodes but other want to keep everything together. The preferred node

> > > and fbq_classify_group was the only sensible metrics to me when he

> > > wrote this patchset but changes can be added if they make sense.

> > > 

> > 

> > That's fair. While it was possible to address the case before your

> > series, it was a hatchet job. If the changelog simply notes that some

> > special casing may still be required for SD_NUMA but it's outside the

> > scope of the series, then I'd be happy. At least there is a good chance

> > then if there is follow-up work that it won't be interpreted as an

> > attempt to reintroduce hacky heuristics.

> >

> 

> Would the additional comment make sense for you about work to be done

> for SD_NUMA ?

> 

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c

> index 0ad4b21..7e4cb65 100644

> --- a/kernel/sched/fair.c

> +++ b/kernel/sched/fair.c

> @@ -6960,11 +6960,34 @@ enum fbq_type { regular, remote, all };

>   * group. see update_sd_pick_busiest().

>   */

>  enum group_type {

> +	/*

> +	 * The group has spare capacity that can be used to process more work.

> +	 */

>  	group_has_spare = 0,

> +	/*

> +	 * The group is fully used and the tasks don't compete for more CPU

> +	 * cycles. Nevertheless, some tasks might wait before running.

> +	 */

>  	group_fully_busy,

> +	/*

> +	 * One task doesn't fit with the CPU's capacity and must be migrated to a

> +	 * more powerful CPU.

> +	 */

>  	group_misfit_task,

> +	/*

> +	 * One local CPU with higher capacity is available and the task should be

> +	 * migrated to it instead of the current CPU.

> +	 */

>  	group_asym_packing,

> +	/*

> +	 * The tasks' affinity prevents the scheduler from balancing the load across

> +	 * the system.

> +	 */

>  	group_imbalanced,

> +	/*

> +	 * The CPU is overloaded and can't provide expected CPU cycles to all

> +	 * tasks.

> +	 */

>  	group_overloaded

>  };


Looks good.

>  

> @@ -8563,7 +8586,11 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s

>  

>  	/*

>  	 * Try to use spare capacity of local group without overloading it or

> -	 * emptying busiest

> +	 * emptying busiest.

> +	 * XXX Spreading tasks across numa nodes is not always the best policy

> +	 * and special care should be taken at the SD_NUMA domain level before

> +	 * spreading the tasks. For now, load_balance() fully relies on

> +	 * NUMA_BALANCING and fbq_classify_group/rq to override the decision.

>  	 */

>  	if (local->group_type == group_has_spare) {

>  		if (busiest->group_type > group_fully_busy) {


Perfect. Any patch in that area can then update the comment and it
should be semi-obvious to the next reviewer that it's expected.

Thanks Vincent.

-- 
Mel Gorman
SUSE Labs
Vincent Guittot Nov. 12, 2019, 10:58 a.m. UTC | #9
On Friday 08 Nov 2019 at 18:37:30 (+0000), Mel Gorman wrote:
> On Fri, Nov 08, 2019 at 05:35:01PM +0100, Vincent Guittot wrote:

> > > Fair enough, netperf hits the corner case where it does not work but

> > > that is also true without your series.

> > 

> > I run mmtest/netperf test on my setup. It's a mix of small positive or

> > negative differences (see below)

> > 

> > <SNIP>

> >

> > netperf-tcp

> >                                5.3-rc2                5.3-rc2

> >                                    tip               +rwk+fix

> > Hmean     64         871.30 (   0.00%)      860.90 *  -1.19%*

> > Hmean     128       1689.39 (   0.00%)     1679.31 *  -0.60%*

> > Hmean     256       3199.59 (   0.00%)     3241.98 *   1.32%*

> > Hmean     1024      9390.47 (   0.00%)     9268.47 *  -1.30%*

> > Hmean     2048     13373.95 (   0.00%)    13395.61 *   0.16%*

> > Hmean     3312     16701.30 (   0.00%)    17165.96 *   2.78%*

> > Hmean     4096     15831.03 (   0.00%)    15544.66 *  -1.81%*

> > Hmean     8192     19720.01 (   0.00%)    20188.60 *   2.38%*

> > Hmean     16384    23925.90 (   0.00%)    23914.50 *  -0.05%*

> > Stddev    64           7.38 (   0.00%)        4.23 (  42.67%)

> > Stddev    128         11.62 (   0.00%)       10.13 (  12.85%)

> > Stddev    256         34.33 (   0.00%)        7.94 (  76.88%)

> > Stddev    1024        35.61 (   0.00%)      116.34 (-226.66%)

> > Stddev    2048       285.30 (   0.00%)       80.50 (  71.78%)

> > Stddev    3312       304.74 (   0.00%)      449.08 ( -47.36%)

> > Stddev    4096       668.11 (   0.00%)      569.30 (  14.79%)

> > Stddev    8192       733.23 (   0.00%)      944.38 ( -28.80%)

> > Stddev    16384      553.03 (   0.00%)      299.44 (  45.86%)

> > 

> >                      5.3-rc2     5.3-rc2

> >                          tip    +rwk+fix

> > Duration User         138.05      140.95

> > Duration System      1210.60     1208.45

> > Duration Elapsed     1352.86     1352.90

> > 

> 

> This roughly matches what I've seen. The interesting part to me for

> netperf is the next section of the report that reports the locality of

> numa hints. With netperf on a 2-socket machine, it's generally around

> 50% as the client/server are pulled apart. Because netperf is not

> heavily memory bound, it doesn't have much impact on the overall

> performance but it's good at catching the cross-node migrations.


Ok. I didn't want to make my reply too long. I have put them below for 
the netperf-tcp results:
                                        5.3-rc2        5.3-rc2
                                            tip      +rwk+fix
Ops NUMA alloc hit                  60077762.00    60387907.00
Ops NUMA alloc miss                        0.00           0.00
Ops NUMA interleave hit                    0.00           0.00
Ops NUMA alloc local                60077571.00    60387798.00
Ops NUMA base-page range updates        5948.00       17223.00
Ops NUMA PTE updates                    5948.00       17223.00
Ops NUMA PMD updates                       0.00           0.00
Ops NUMA hint faults                    4639.00       14050.00
Ops NUMA hint local faults %            2073.00        6515.00
Ops NUMA hint local percent               44.69          46.37
Ops NUMA pages migrated                 1528.00        4306.00
Ops AutoNUMA cost                         23.27          70.45


>

> > > 

> > > > I agree that additional patches are probably needed to improve load

> > > > balance at NUMA level and I expect that this rework will make it

> > > > simpler to add.

> > > > I just wanted to get the output of some real use cases before defining

> > > > more numa level specific conditions. Some want to spread on their numa

> > > > nodes but others want to keep everything together. The preferred node

> > > > and fbq_classify_group were the only sensible metrics to me when I

> > > > wrote this patchset but changes can be added if they make sense.

> > > > 

> > > 

> > > That's fair. While it was possible to address the case before your

> > > series, it was a hatchet job. If the changelog simply notes that some

> > > special casing may still be required for SD_NUMA but it's outside the

> > > scope of the series, then I'd be happy. At least there is a good chance

> > > then if there is follow-up work that it won't be interpreted as an

> > > attempt to reintroduce hacky heuristics.

> > >

> > 

> > Would the additional comment make sense to you about the work to be done

> > for SD_NUMA?

> > 

> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c

> > index 0ad4b21..7e4cb65 100644

> > --- a/kernel/sched/fair.c

> > +++ b/kernel/sched/fair.c

> > @@ -6960,11 +6960,34 @@ enum fbq_type { regular, remote, all };

> >   * group. see update_sd_pick_busiest().

> >   */

> >  enum group_type {

> > +	/*

> > +	 * The group has spare capacity that can be used to process more work.

> > +	 */

> >  	group_has_spare = 0,

> > +	/*

> > +	 * The group is fully used and the tasks don't compete for more CPU

> > +	 * cycles. Nevertheless, some tasks might wait before running.

> > +	 */

> >  	group_fully_busy,

> > +	/*

> > +	 * One task doesn't fit the CPU's capacity and must be migrated to a

> > +	 * more powerful CPU.

> > +	 */

> >  	group_misfit_task,

> > +	/*

> > +	 * One local CPU with higher capacity is available and the task should be

> > +	 * migrated to it instead of the current CPU.

> > +	 */

> >  	group_asym_packing,

> > +	/*

> > +	 * The tasks' affinity prevents the scheduler from balancing the load

> > +	 * across the system.

> > +	 */

> >  	group_imbalanced,

> > +	/*

> > +	 * The CPU is overloaded and can't provide the expected CPU cycles to

> > +	 * all tasks.

> > +	 */

> >  	group_overloaded

> >  };

> 

> Looks good.

> 

> >  

> > @@ -8563,7 +8586,11 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s

> >  

> >  	/*

> >  	 * Try to use spare capacity of local group without overloading it or

> > -	 * emptying busiest

> > +	 * emptying busiest.

> > +	 * XXX Spreading tasks across numa nodes is not always the best policy

> > +	 * and special care should be taken at the SD_NUMA domain level before

> > +	 * spreading the tasks. For now, load_balance() fully relies on

> > +	 * NUMA_BALANCING and fbq_classify_group/rq to override the decision.

> >  	 */

> >  	if (local->group_type == group_has_spare) {

> >  		if (busiest->group_type > group_fully_busy) {

> 

> Perfect. Any patch in that area can then update the comment and it

> should be semi-obvious to the next reviewer that it's expected.

> 

> Thanks Vincent.

> 

> -- 

> Mel Gorman

> SUSE Labs
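
As an illustration of the point behind the comments discussed above:
group_type is ordered by pulling priority, so update_sd_pick_busiest()
can decide "is the candidate group in a worse state than the current
busiest" with a plain integer comparison before applying the per-type
tie-breaks. A minimal standalone sketch of that ordering property
(user-space code for illustration only, not the kernel implementation):

#include <stdio.h>

/*
 * Standalone illustration: because group_type is ordered by pulling
 * priority, "which group is worse off" reduces to an integer
 * comparison, which is what update_sd_pick_busiest() relies on before
 * falling back to type-specific tie-breaks.
 */
enum group_type {
	group_has_spare = 0,
	group_fully_busy,
	group_misfit_task,
	group_asym_packing,
	group_imbalanced,
	group_overloaded
};

static int candidate_is_busier(enum group_type candidate, enum group_type busiest)
{
	if (candidate > busiest)
		return 1;	/* strictly worse state: pick the candidate */
	if (candidate < busiest)
		return 0;	/* strictly better state: keep current busiest */
	return -1;		/* same type: use the type-specific rules */
}

int main(void)
{
	printf("%d\n", candidate_is_busier(group_overloaded, group_has_spare));	/* 1 */
	printf("%d\n", candidate_is_busier(group_fully_busy, group_imbalanced));	/* 0 */
	printf("%d\n", candidate_is_busier(group_misfit_task, group_misfit_task));	/* -1 */
	return 0;
}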
Mel Gorman Nov. 12, 2019, 3:06 p.m. UTC | #10
On Tue, Nov 12, 2019 at 11:58:30AM +0100, Vincent Guittot wrote:
> > This roughly matches what I've seen. The interesting part to me for

> > netperf is the next section of the report that reports the locality of

> > numa hints. With netperf on a 2-socket machine, it's generally around

> > 50% as the client/server are pulled apart. Because netperf is not

> > heavily memory bound, it doesn't have much impact on the overall

> > performance but it's good at catching the cross-node migrations.

> 

> Ok. I didn't want to make my reply too long. I have put them below for 

> the netperf-tcp results:

>                                         5.3-rc2        5.3-rc2

>                                             tip      +rwk+fix

> Ops NUMA alloc hit                  60077762.00    60387907.00

> Ops NUMA alloc miss                        0.00           0.00

> Ops NUMA interleave hit                    0.00           0.00

> Ops NUMA alloc local                60077571.00    60387798.00

> Ops NUMA base-page range updates        5948.00       17223.00

> Ops NUMA PTE updates                    5948.00       17223.00

> Ops NUMA PMD updates                       0.00           0.00

> Ops NUMA hint faults                    4639.00       14050.00

> Ops NUMA hint local faults %            2073.00        6515.00

> Ops NUMA hint local percent               44.69          46.37

> Ops NUMA pages migrated                 1528.00        4306.00

> Ops AutoNUMA cost                         23.27          70.45

> 


Thanks -- it was "NUMA hint local percent" I was interested in and the
46.37% local hinting faults is likely indicative of the client/server
being load balanced across SD_NUMA domains without NUMA Balancing being
aggressive enough to fix it. At least I know I am not just seriously
unlucky or testing magical machines!

-- 
Mel Gorman
SUSE Labs
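
For readers not familiar with the mmtests report: "NUMA hint local
percent" above appears to be nothing more than the local hint faults
divided by the total hint faults. A tiny standalone check against the
numbers Vincent posted (illustrative user-space code, not part of
mmtests):

#include <stdio.h>

int main(void)
{
	/* Values copied from the netperf-tcp NUMA stats above: tip, +rwk+fix */
	double hint_faults[2]       = {  4639.0, 14050.0 };
	double hint_faults_local[2] = {  2073.0,  6515.0 };
	int i;

	for (i = 0; i < 2; i++)
		printf("NUMA hint local percent: %.2f\n",
		       100.0 * hint_faults_local[i] / hint_faults[i]);

	return 0;	/* prints 44.69 and 46.37 */
}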
Vincent Guittot Nov. 12, 2019, 3:40 p.m. UTC | #11
On Tue, 12 Nov 2019 at 16:06, Mel Gorman <mgorman@techsingularity.net> wrote:
>

> On Tue, Nov 12, 2019 at 11:58:30AM +0100, Vincent Guittot wrote:

> > > This roughly matches what I've seen. The interesting part to me for

> > > netperf is the next section of the report that reports the locality of

> > > numa hints. With netperf on a 2-socket machine, it's generally around

> > > 50% as the client/server are pulled apart. Because netperf is not

> > > heavily memory bound, it doesn't have much impact on the overall

> > > performance but it's good at catching the cross-node migrations.

> >

> > Ok. I didn't want to make my reply too long. I have put them below for

> > the netperf-tcp results:

> >                                         5.3-rc2        5.3-rc2

> >                                             tip      +rwk+fix

> > Ops NUMA alloc hit                  60077762.00    60387907.00

> > Ops NUMA alloc miss                        0.00           0.00

> > Ops NUMA interleave hit                    0.00           0.00

> > Ops NUMA alloc local                60077571.00    60387798.00

> > Ops NUMA base-page range updates        5948.00       17223.00

> > Ops NUMA PTE updates                    5948.00       17223.00

> > Ops NUMA PMD updates                       0.00           0.00

> > Ops NUMA hint faults                    4639.00       14050.00

> > Ops NUMA hint local faults %            2073.00        6515.00

> > Ops NUMA hint local percent               44.69          46.37

> > Ops NUMA pages migrated                 1528.00        4306.00

> > Ops AutoNUMA cost                         23.27          70.45

> >

>

> Thanks -- it was "NUMA hint local percent" I was interested in and the

> 46.37% local hinting faults is likely indicative of the client/server

> being load balanced across SD_NUMA domains without NUMA Balancing being

> aggressive enough to fix it. At least I know I am not just seriously

> unlucky or testing magical machines!


I agree that the collaboration between load balancing at the SD_NUMA
level and NUMA balancing should be improved.

It's also interesting to notice that the patchset doesn't seem to do
worse than the baseline: 46.37% vs 44.69%

Vincent

>

> --

> Mel Gorman

> SUSE Labs
Mel Gorman Nov. 12, 2019, 5:45 p.m. UTC | #12
On Tue, Nov 12, 2019 at 04:40:20PM +0100, Vincent Guittot wrote:
> On Tue, 12 Nov 2019 at 16:06, Mel Gorman <mgorman@techsingularity.net> wrote:

> >

> > On Tue, Nov 12, 2019 at 11:58:30AM +0100, Vincent Guittot wrote:

> > > > This roughly matches what I've seen. The interesting part to me for

> > > > netperf is the next section of the report that reports the locality of

> > > > numa hints. With netperf on a 2-socket machine, it's generally around

> > > > 50% as the client/server are pulled apart. Because netperf is not

> > > > heavily memory bound, it doesn't have much impact on the overall

> > > > performance but it's good at catching the cross-node migrations.

> > >

> > > Ok. I didn't want to make my reply too long. I have put them below for

> > > the netperf-tcp results:

> > >                                         5.3-rc2        5.3-rc2

> > >                                             tip      +rwk+fix

> > > Ops NUMA alloc hit                  60077762.00    60387907.00

> > > Ops NUMA alloc miss                        0.00           0.00

> > > Ops NUMA interleave hit                    0.00           0.00

> > > Ops NUMA alloc local                60077571.00    60387798.00

> > > Ops NUMA base-page range updates        5948.00       17223.00

> > > Ops NUMA PTE updates                    5948.00       17223.00

> > > Ops NUMA PMD updates                       0.00           0.00

> > > Ops NUMA hint faults                    4639.00       14050.00

> > > Ops NUMA hint local faults %            2073.00        6515.00

> > > Ops NUMA hint local percent               44.69          46.37

> > > Ops NUMA pages migrated                 1528.00        4306.00

> > > Ops AutoNUMA cost                         23.27          70.45

> > >

> >

> > Thanks -- it was "NUMA hint local percent" I was interested in and the

> > 46.37% local hinting faults is likely indicative of the client/server

> > being load balanced across SD_NUMA domains without NUMA Balancing being

> > aggressive enough to fix it. At least I know I am not just seriously

> > unlucky or testing magical machines!

> 

> I agree that the collaboration between load balancing at the SD_NUMA

> level and NUMA balancing should be improved.

> 

> It's also interesting to notice that the patchset doesn't seem to do

> worse than the baseline: 46.37% vs 44.69%

> 


Yes, I should have highlighted that. The series appears to improve a
number of areas while being performance neutral with respect to SD_NUMA.
If this turns out to be wrong in some case, it should be semi-obvious even
if the locality looks ok. It'll be a headline regression with increased
NUMA pte scanning and increased frequency of migrations indicating that
NUMA balancing is taking excessive corrective action. I'll know it when
I see it :P

-- 
Mel Gorman
SUSE Labs
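
For anyone wanting to watch for that kind of regression outside of
mmtests: the counters Mel mentions are exported in /proc/vmstat on
kernels built with CONFIG_NUMA_BALANCING. A small user-space sketch that
dumps them (illustrative only, not part of any existing tool; the
counter selection is an assumption based on the discussion above):

#include <stdio.h>
#include <string.h>

int main(void)
{
	/* NUMA balancing counters exposed by /proc/vmstat */
	static const char *keys[] = {
		"numa_pte_updates", "numa_hint_faults",
		"numa_hint_faults_local", "numa_pages_migrated",
	};
	char name[64];
	unsigned long long val;
	unsigned int i;
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f)
		return 1;

	/* Print only the counters relevant to NUMA balancing activity */
	while (fscanf(f, "%63s %llu", name, &val) == 2) {
		for (i = 0; i < sizeof(keys) / sizeof(keys[0]); i++) {
			if (!strcmp(name, keys[i]))
				printf("%s %llu\n", name, val);
		}
	}

	fclose(f);
	return 0;
}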
Ingo Molnar Nov. 18, 2019, 1:50 p.m. UTC | #13
* Mel Gorman <mgorman@techsingularity.net> wrote:

> s/groupe_type/group_type/

> 

> >  enum group_type {

> > -	group_other = 0,

> > +	group_has_spare = 0,

> > +	group_fully_busy,

> >  	group_misfit_task,

> > +	group_asym_packing,

> >  	group_imbalanced,

> > -	group_overloaded,

> > +	group_overloaded

> > +};

> > +

> 

> While not your fault, it would be nice to comment on the meaning of each

> group type. From a glance, it's not obvious to me why a misfit task should

> be a higher priority to move than a fully_busy (but not overloaded)

> group given that moving the misfit task might make a group overloaded.


This part of your feedback should now be addressed in the scheduler tree 
via:

  a9723389cc75: sched/fair: Add comments for group_type and balancing at SD_NUMA level

> > +enum migration_type {

> > +	migrate_load = 0,

> > +	migrate_util,

> > +	migrate_task,

> > +	migrate_misfit

> >  };

> >  

> 

> Could do with a comment explaining what migration_type is for because

> the name is unhelpful. I *think* at a glance it's related to what sort

> of imbalance is being addressed which is partially addressed by the

> group_type. That understanding may change as I continue reading the series

> but now I have to figure it out which means it'll be forgotten again in

> 6 months.


Agreed. Vincent, is any patch brewing here, or should I take a stab?

Thanks,

	Ingo
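
For illustration, this is the kind of comment block being asked for,
with the semantics taken from how calculate_imbalance() and
detach_tasks() use each value in the patch below; the wording is only a
sketch, not the actual follow-up commit:

/*
 * Illustrative sketch only. migration_type tells load_balance() what
 * quantity the computed imbalance is expressed in, which detach_tasks()
 * then decrements as tasks are detached:
 */
enum migration_type {
	migrate_load = 0,	/* imbalance is an amount of load to move */
	migrate_util,		/* imbalance is an amount of utilization to move */
	migrate_task,		/* imbalance is a number of tasks to move */
	migrate_misfit		/* imbalance is the load of one misfit task */
};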
Vincent Guittot Nov. 18, 2019, 1:57 p.m. UTC | #14
On Mon, 18 Nov 2019 at 14:50, Ingo Molnar <mingo@kernel.org> wrote:
>

>

> * Mel Gorman <mgorman@techsingularity.net> wrote:

>

> > s/groupe_type/group_type/

> >

> > >  enum group_type {

> > > -   group_other = 0,

> > > +   group_has_spare = 0,

> > > +   group_fully_busy,

> > >     group_misfit_task,

> > > +   group_asym_packing,

> > >     group_imbalanced,

> > > -   group_overloaded,

> > > +   group_overloaded

> > > +};

> > > +

> >

> > While not your fault, it would be nice to comment on the meaning of each

> > group type. From a glance, it's not obvious to me why a misfit task should

> > be a high priority to move a task than a fully_busy (but not overloaded)

> > group given that moving the misfit task might make a group overloaded.

>

> This part of your feedback should now be addressed in the scheduler tree

> via:

>

>   a9723389cc75: sched/fair: Add comments for group_type and balancing at SD_NUMA level

>

> > > +enum migration_type {

> > > +   migrate_load = 0,

> > > +   migrate_util,

> > > +   migrate_task,

> > > +   migrate_misfit

> > >  };

> > >

> >

> > Could do with a comment explaining what migration_type is for because

> > the name is unhelpful. I *think* at a glance it's related to what sort

> > of imbalance is being addressed which is partially addressed by the

> > group_type. That understanding may change as I continue reading the series

> > but now I have to figure it out which means it'll be forgotten again in

> > 6 months.

>

> Agreed. Vincent, is any patch brewing here, or should I take a stab?

>


No, I don't have a patch under preparation for this,
so you can go ahead.

Thanks,
Vincent

> Thanks,

>

>         Ingo
Mel Gorman Nov. 18, 2019, 2:51 p.m. UTC | #15
On Mon, Nov 18, 2019 at 02:50:17PM +0100, Ingo Molnar wrote:
> 

> * Mel Gorman <mgorman@techsingularity.net> wrote:

> 

> > s/groupe_type/group_type/

> > 

> > >  enum group_type {

> > > -	group_other = 0,

> > > +	group_has_spare = 0,

> > > +	group_fully_busy,

> > >  	group_misfit_task,

> > > +	group_asym_packing,

> > >  	group_imbalanced,

> > > -	group_overloaded,

> > > +	group_overloaded

> > > +};

> > > +

> > 

> > While not your fault, it would be nice to comment on the meaning of each

> > group type. From a glance, it's not obvious to me why a misfit task should

> > be a higher priority to move than a fully_busy (but not overloaded)

> > group given that moving the misfit task might make a group overloaded.

> 

> This part of your feedback should now be addressed in the scheduler tree 

> via:

> 

>   a9723389cc75: sched/fair: Add comments for group_type and balancing at SD_NUMA level

> 


While I can't see that commit ID yet, the discussed version of the patch
was fine by me.

-- 
Mel Gorman
SUSE Labs
diff mbox series

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e004841..5ae5281 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7068,11 +7068,26 @@  static unsigned long __read_mostly max_load_balance_interval = HZ/10;
 
 enum fbq_type { regular, remote, all };
 
+/*
+ * group_type describes the group of CPUs at the moment of the load balance.
+ * The enum is ordered by pulling priority, with the group with lowest priority
+ * first so the groupe_type can be simply compared when selecting the busiest
+ * group. see update_sd_pick_busiest().
+ */
 enum group_type {
-	group_other = 0,
+	group_has_spare = 0,
+	group_fully_busy,
 	group_misfit_task,
+	group_asym_packing,
 	group_imbalanced,
-	group_overloaded,
+	group_overloaded
+};
+
+enum migration_type {
+	migrate_load = 0,
+	migrate_util,
+	migrate_task,
+	migrate_misfit
 };
 
 #define LBF_ALL_PINNED	0x01
@@ -7105,7 +7120,7 @@  struct lb_env {
 	unsigned int		loop_max;
 
 	enum fbq_type		fbq_type;
-	enum group_type		src_grp_type;
+	enum migration_type	migration_type;
 	struct list_head	tasks;
 };
 
@@ -7328,7 +7343,7 @@  static struct task_struct *detach_one_task(struct lb_env *env)
 static const unsigned int sched_nr_migrate_break = 32;
 
 /*
- * detach_tasks() -- tries to detach up to imbalance runnable load from
+ * detach_tasks() -- tries to detach up to imbalance load/util/tasks from
  * busiest_rq, as part of a balancing operation within domain "sd".
  *
  * Returns number of detached tasks if successful and 0 otherwise.
@@ -7336,8 +7351,8 @@  static const unsigned int sched_nr_migrate_break = 32;
 static int detach_tasks(struct lb_env *env)
 {
 	struct list_head *tasks = &env->src_rq->cfs_tasks;
+	unsigned long util, load;
 	struct task_struct *p;
-	unsigned long load;
 	int detached = 0;
 
 	lockdep_assert_held(&env->src_rq->lock);
@@ -7370,19 +7385,51 @@  static int detach_tasks(struct lb_env *env)
 		if (!can_migrate_task(p, env))
 			goto next;
 
-		load = task_h_load(p);
+		switch (env->migration_type) {
+		case migrate_load:
+			load = task_h_load(p);
 
-		if (sched_feat(LB_MIN) && load < 16 && !env->sd->nr_balance_failed)
-			goto next;
+			if (sched_feat(LB_MIN) &&
+			    load < 16 && !env->sd->nr_balance_failed)
+				goto next;
 
-		if ((load / 2) > env->imbalance)
-			goto next;
+			if ((load / 2) > env->imbalance)
+				goto next;
+
+			env->imbalance -= load;
+			break;
+
+		case migrate_util:
+			util = task_util_est(p);
+
+			if (util > env->imbalance)
+				goto next;
+
+			env->imbalance -= util;
+			break;
+
+		case migrate_task:
+			env->imbalance--;
+			break;
+
+		case migrate_misfit:
+			load = task_h_load(p);
+
+			/*
+			 * load of misfit task might decrease a bit since it has
+			 * been recorded. Be conservative in the condition.
+			 */
+			if (load / 2 < env->imbalance)
+				goto next;
+
+			env->imbalance = 0;
+			break;
+		}
 
 		detach_task(p, env);
 		list_add(&p->se.group_node, &env->tasks);
 
 		detached++;
-		env->imbalance -= load;
 
 #ifdef CONFIG_PREEMPTION
 		/*
@@ -7396,7 +7443,7 @@  static int detach_tasks(struct lb_env *env)
 
 		/*
 		 * We only want to steal up to the prescribed amount of
-		 * runnable load.
+		 * load/util/tasks.
 		 */
 		if (env->imbalance <= 0)
 			break;
@@ -7661,7 +7708,6 @@  struct sg_lb_stats {
 	unsigned int idle_cpus;
 	unsigned int group_weight;
 	enum group_type group_type;
-	int group_no_capacity;
 	unsigned int group_asym_packing; /* Tasks should be moved to preferred CPU */
 	unsigned long group_misfit_task_load; /* A CPU has a task too big for its capacity */
 #ifdef CONFIG_NUMA_BALANCING
@@ -7677,10 +7723,10 @@  struct sg_lb_stats {
 struct sd_lb_stats {
 	struct sched_group *busiest;	/* Busiest group in this sd */
 	struct sched_group *local;	/* Local group in this sd */
-	unsigned long total_running;
 	unsigned long total_load;	/* Total load of all groups in sd */
 	unsigned long total_capacity;	/* Total capacity of all groups in sd */
 	unsigned long avg_load;	/* Average load across all groups in sd */
+	unsigned int prefer_sibling; /* tasks should go to sibling first */
 
 	struct sg_lb_stats busiest_stat;/* Statistics of the busiest group */
 	struct sg_lb_stats local_stat;	/* Statistics of the local group */
@@ -7691,19 +7737,18 @@  static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
 	/*
 	 * Skimp on the clearing to avoid duplicate work. We can avoid clearing
 	 * local_stat because update_sg_lb_stats() does a full clear/assignment.
-	 * We must however clear busiest_stat::avg_load because
-	 * update_sd_pick_busiest() reads this before assignment.
+	 * We must however set busiest_stat::group_type and
+	 * busiest_stat::idle_cpus to the worst busiest group because
+	 * update_sd_pick_busiest() reads these before assignment.
 	 */
 	*sds = (struct sd_lb_stats){
 		.busiest = NULL,
 		.local = NULL,
-		.total_running = 0UL,
 		.total_load = 0UL,
 		.total_capacity = 0UL,
 		.busiest_stat = {
-			.avg_load = 0UL,
-			.sum_h_nr_running = 0,
-			.group_type = group_other,
+			.idle_cpus = UINT_MAX,
+			.group_type = group_has_spare,
 		},
 	};
 }
@@ -7945,19 +7990,26 @@  group_smaller_max_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
 }
 
 static inline enum
-group_type group_classify(struct sched_group *group,
+group_type group_classify(struct lb_env *env,
+			  struct sched_group *group,
 			  struct sg_lb_stats *sgs)
 {
-	if (sgs->group_no_capacity)
+	if (group_is_overloaded(env, sgs))
 		return group_overloaded;
 
 	if (sg_imbalanced(group))
 		return group_imbalanced;
 
+	if (sgs->group_asym_packing)
+		return group_asym_packing;
+
 	if (sgs->group_misfit_task_load)
 		return group_misfit_task;
 
-	return group_other;
+	if (!group_has_capacity(env, sgs))
+		return group_fully_busy;
+
+	return group_has_spare;
 }
 
 static bool update_nohz_stats(struct rq *rq, bool force)
@@ -7994,10 +8046,12 @@  static inline void update_sg_lb_stats(struct lb_env *env,
 				      struct sg_lb_stats *sgs,
 				      int *sg_status)
 {
-	int i, nr_running;
+	int i, nr_running, local_group;
 
 	memset(sgs, 0, sizeof(*sgs));
 
+	local_group = cpumask_test_cpu(env->dst_cpu, sched_group_span(group));
+
 	for_each_cpu_and(i, sched_group_span(group), env->cpus) {
 		struct rq *rq = cpu_rq(i);
 
@@ -8022,9 +8076,16 @@  static inline void update_sg_lb_stats(struct lb_env *env,
 		/*
 		 * No need to call idle_cpu() if nr_running is not 0
 		 */
-		if (!nr_running && idle_cpu(i))
+		if (!nr_running && idle_cpu(i)) {
 			sgs->idle_cpus++;
+			/* Idle cpu can't have misfit task */
+			continue;
+		}
+
+		if (local_group)
+			continue;
 
+		/* Check for a misfit task on the cpu */
 		if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
 		    sgs->group_misfit_task_load < rq->misfit_task_load) {
 			sgs->group_misfit_task_load = rq->misfit_task_load;
@@ -8032,14 +8093,24 @@  static inline void update_sg_lb_stats(struct lb_env *env,
 		}
 	}
 
-	/* Adjust by relative CPU capacity of the group */
+	/* Check if dst cpu is idle and preferred to this group */
+	if (env->sd->flags & SD_ASYM_PACKING &&
+	    env->idle != CPU_NOT_IDLE &&
+	    sgs->sum_h_nr_running &&
+	    sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu)) {
+		sgs->group_asym_packing = 1;
+	}
+
 	sgs->group_capacity = group->sgc->capacity;
-	sgs->avg_load = (sgs->group_load*SCHED_CAPACITY_SCALE) / sgs->group_capacity;
 
 	sgs->group_weight = group->group_weight;
 
-	sgs->group_no_capacity = group_is_overloaded(env, sgs);
-	sgs->group_type = group_classify(group, sgs);
+	sgs->group_type = group_classify(env, group, sgs);
+
+	/* Computing avg_load makes sense only when group is overloaded */
+	if (sgs->group_type == group_overloaded)
+		sgs->avg_load = (sgs->group_load * SCHED_CAPACITY_SCALE) /
+				sgs->group_capacity;
 }
 
 /**
@@ -8062,6 +8133,10 @@  static bool update_sd_pick_busiest(struct lb_env *env,
 {
 	struct sg_lb_stats *busiest = &sds->busiest_stat;
 
+	/* Make sure that there is at least one task to pull */
+	if (!sgs->sum_h_nr_running)
+		return false;
+
 	/*
 	 * Don't try to pull misfit tasks we can't help.
 	 * We can use max_capacity here as reduction in capacity on some
@@ -8070,7 +8145,7 @@  static bool update_sd_pick_busiest(struct lb_env *env,
 	 */
 	if (sgs->group_type == group_misfit_task &&
 	    (!group_smaller_max_cpu_capacity(sg, sds->local) ||
-	     !group_has_capacity(env, &sds->local_stat)))
+	     sds->local_stat.group_type != group_has_spare))
 		return false;
 
 	if (sgs->group_type > busiest->group_type)
@@ -8079,62 +8154,80 @@  static bool update_sd_pick_busiest(struct lb_env *env,
 	if (sgs->group_type < busiest->group_type)
 		return false;
 
-	if (sgs->avg_load <= busiest->avg_load)
-		return false;
-
-	if (!(env->sd->flags & SD_ASYM_CPUCAPACITY))
-		goto asym_packing;
-
 	/*
-	 * Candidate sg has no more than one task per CPU and
-	 * has higher per-CPU capacity. Migrating tasks to less
-	 * capable CPUs may harm throughput. Maximize throughput,
-	 * power/energy consequences are not considered.
+	 * The candidate and the current busiest group are the same type of
+	 * group. Let check which one is the busiest according to the type.
 	 */
-	if (sgs->sum_h_nr_running <= sgs->group_weight &&
-	    group_smaller_min_cpu_capacity(sds->local, sg))
-		return false;
 
-	/*
-	 * If we have more than one misfit sg go with the biggest misfit.
-	 */
-	if (sgs->group_type == group_misfit_task &&
-	    sgs->group_misfit_task_load < busiest->group_misfit_task_load)
+	switch (sgs->group_type) {
+	case group_overloaded:
+		/* Select the overloaded group with highest avg_load. */
+		if (sgs->avg_load <= busiest->avg_load)
+			return false;
+		break;
+
+	case group_imbalanced:
+		/*
+		 * Select the 1st imbalanced group as we don't have any way to
+		 * choose one more than another.
+		 */
 		return false;
 
-asym_packing:
-	/* This is the busiest node in its class. */
-	if (!(env->sd->flags & SD_ASYM_PACKING))
-		return true;
+	case group_asym_packing:
+		/* Prefer to move from lowest priority CPU's work */
+		if (sched_asym_prefer(sg->asym_prefer_cpu, sds->busiest->asym_prefer_cpu))
+			return false;
+		break;
 
-	/* No ASYM_PACKING if target CPU is already busy */
-	if (env->idle == CPU_NOT_IDLE)
-		return true;
-	/*
-	 * ASYM_PACKING needs to move all the work to the highest
-	 * prority CPUs in the group, therefore mark all groups
-	 * of lower priority than ourself as busy.
-	 *
-	 * This is primarily intended to used at the sibling level.  Some
-	 * cores like POWER7 prefer to use lower numbered SMT threads.  In the
-	 * case of POWER7, it can move to lower SMT modes only when higher
-	 * threads are idle.  When in lower SMT modes, the threads will
-	 * perform better since they share less core resources.  Hence when we
-	 * have idle threads, we want them to be the higher ones.
-	 */
-	if (sgs->sum_h_nr_running &&
-	    sched_asym_prefer(env->dst_cpu, sg->asym_prefer_cpu)) {
-		sgs->group_asym_packing = 1;
-		if (!sds->busiest)
-			return true;
+	case group_misfit_task:
+		/*
+		 * If we have more than one misfit sg go with the biggest
+		 * misfit.
+		 */
+		if (sgs->group_misfit_task_load < busiest->group_misfit_task_load)
+			return false;
+		break;
 
-		/* Prefer to move from lowest priority CPU's work */
-		if (sched_asym_prefer(sds->busiest->asym_prefer_cpu,
-				      sg->asym_prefer_cpu))
-			return true;
+	case group_fully_busy:
+		/*
+		 * Select the fully busy group with highest avg_load. In
+		 * theory, there is no need to pull task from such kind of
+		 * group because tasks have all compute capacity that they need
+		 * but we can still improve the overall throughput by reducing
+		 * contention when accessing shared HW resources.
+		 *
+		 * XXX for now avg_load is not computed and always 0 so we
+		 * select the 1st one.
+		 */
+		if (sgs->avg_load <= busiest->avg_load)
+			return false;
+		break;
+
+	case group_has_spare:
+		/*
+		 * Select not overloaded group with lowest number of
+		 * idle cpus. We could also compare the spare capacity
+		 * which is more stable but it can end up that the
+		 * group has less spare capacity but finally more idle
+		 * cpus which means less opportunity to pull tasks.
+		 */
+		if (sgs->idle_cpus >= busiest->idle_cpus)
+			return false;
+		break;
 	}
 
-	return false;
+	/*
+	 * Candidate sg has no more than one task per CPU and has higher
+	 * per-CPU capacity. Migrating tasks to less capable CPUs may harm
+	 * throughput. Maximize throughput, power/energy consequences are not
+	 * considered.
+	 */
+	if ((env->sd->flags & SD_ASYM_CPUCAPACITY) &&
+	    (sgs->group_type <= group_fully_busy) &&
+	    (group_smaller_min_cpu_capacity(sds->local, sg)))
+		return false;
+
+	return true;
 }
 
 #ifdef CONFIG_NUMA_BALANCING
@@ -8172,13 +8265,13 @@  static inline enum fbq_type fbq_classify_rq(struct rq *rq)
  * @env: The load balancing environment.
  * @sds: variable to hold the statistics for this sched_domain.
  */
+
 static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sds)
 {
 	struct sched_domain *child = env->sd->child;
 	struct sched_group *sg = env->sd->groups;
 	struct sg_lb_stats *local = &sds->local_stat;
 	struct sg_lb_stats tmp_sgs;
-	bool prefer_sibling = child && child->flags & SD_PREFER_SIBLING;
 	int sg_status = 0;
 
 #ifdef CONFIG_NO_HZ_COMMON
@@ -8205,22 +8298,6 @@  static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 		if (local_group)
 			goto next_group;
 
-		/*
-		 * In case the child domain prefers tasks go to siblings
-		 * first, lower the sg capacity so that we'll try
-		 * and move all the excess tasks away. We lower the capacity
-		 * of a group only if the local group has the capacity to fit
-		 * these excess tasks. The extra check prevents the case where
-		 * you always pull from the heaviest group when it is already
-		 * under-utilized (possible with a large weight task outweighs
-		 * the tasks on the system).
-		 */
-		if (prefer_sibling && sds->local &&
-		    group_has_capacity(env, local) &&
-		    (sgs->sum_h_nr_running > local->sum_h_nr_running + 1)) {
-			sgs->group_no_capacity = 1;
-			sgs->group_type = group_classify(sg, sgs);
-		}
 
 		if (update_sd_pick_busiest(env, sds, sg, sgs)) {
 			sds->busiest = sg;
@@ -8229,13 +8306,15 @@  static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 
 next_group:
 		/* Now, start updating sd_lb_stats */
-		sds->total_running += sgs->sum_h_nr_running;
 		sds->total_load += sgs->group_load;
 		sds->total_capacity += sgs->group_capacity;
 
 		sg = sg->next;
 	} while (sg != env->sd->groups);
 
+	/* Tag domain that child domain prefers tasks go to siblings first */
+	sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING;
+
 #ifdef CONFIG_NO_HZ_COMMON
 	if ((env->flags & LBF_NOHZ_AGAIN) &&
 	    cpumask_subset(nohz.idle_cpus_mask, sched_domain_span(env->sd))) {
@@ -8273,69 +8352,149 @@  static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
  */
 static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
 {
-	unsigned long max_pull, load_above_capacity = ~0UL;
 	struct sg_lb_stats *local, *busiest;
 
 	local = &sds->local_stat;
 	busiest = &sds->busiest_stat;
 
-	if (busiest->group_asym_packing) {
-		env->imbalance = busiest->group_load;
+	if (busiest->group_type == group_misfit_task) {
+		/* Set imbalance to allow misfit task to be balanced. */
+		env->migration_type = migrate_misfit;
+		env->imbalance = busiest->group_misfit_task_load;
+		return;
+	}
+
+	if (busiest->group_type == group_asym_packing) {
+		/*
+		 * In case of asym capacity, we will try to migrate all load to
+		 * the preferred CPU.
+		 */
+		env->migration_type = migrate_task;
+		env->imbalance = busiest->sum_h_nr_running;
+		return;
+	}
+
+	if (busiest->group_type == group_imbalanced) {
+		/*
+		 * In the group_imb case we cannot rely on group-wide averages
+		 * to ensure CPU-load equilibrium, try to move any task to fix
+		 * the imbalance. The next load balance will take care of
+		 * balancing back the system.
+		 */
+		env->migration_type = migrate_task;
+		env->imbalance = 1;
 		return;
 	}
 
 	/*
-	 * Avg load of busiest sg can be less and avg load of local sg can
-	 * be greater than avg load across all sgs of sd because avg load
-	 * factors in sg capacity and sgs with smaller group_type are
-	 * skipped when updating the busiest sg:
+	 * Try to use spare capacity of local group without overloading it or
+	 * emptying busiest
 	 */
-	if (busiest->group_type != group_misfit_task &&
-	    (busiest->avg_load <= sds->avg_load ||
-	     local->avg_load >= sds->avg_load)) {
-		env->imbalance = 0;
+	if (local->group_type == group_has_spare) {
+		if (busiest->group_type > group_fully_busy) {
+			/*
+			 * If busiest is overloaded, try to fill spare
+			 * capacity. This might end up creating spare capacity
+			 * in busiest or busiest still being overloaded but
+			 * there is no simple way to directly compute the
+			 * amount of load to migrate in order to balance the
+			 * system.
+			 */
+			env->migration_type = migrate_util;
+			env->imbalance = max(local->group_capacity, local->group_util) -
+					 local->group_util;
+
+			/*
+			 * In some case, the group's utilization is max or even
+			 * higher than capacity because of migrations but the
+			 * local CPU is (newly) idle. There is at least one
+			 * waiting task in this overloaded busiest group. Let
+			 * try to pull it.
+			 */
+			if (env->idle != CPU_NOT_IDLE && env->imbalance == 0) {
+				env->migration_type = migrate_task;
+				env->imbalance = 1;
+			}
+
+			return;
+		}
+
+		if (busiest->group_weight == 1 || sds->prefer_sibling) {
+			unsigned int nr_diff = busiest->sum_h_nr_running;
+			/*
+			 * When prefer sibling, evenly spread running tasks on
+			 * groups.
+			 */
+			env->migration_type = migrate_task;
+			lsub_positive(&nr_diff, local->sum_h_nr_running);
+			env->imbalance = nr_diff >> 1;
+			return;
+		}
+
+		/*
+		 * If there is no overload, we just want to even the number of
+		 * idle cpus.
+		 */
+		env->migration_type = migrate_task;
+		env->imbalance = max_t(long, 0, (local->idle_cpus -
+						 busiest->idle_cpus) >> 1);
 		return;
 	}
 
 	/*
-	 * If there aren't any idle CPUs, avoid creating some.
+	 * Local is fully busy but has to take more load to relieve the
+	 * busiest group
 	 */
-	if (busiest->group_type == group_overloaded &&
-	    local->group_type   == group_overloaded) {
-		load_above_capacity = busiest->sum_h_nr_running * SCHED_CAPACITY_SCALE;
-		if (load_above_capacity > busiest->group_capacity) {
-			load_above_capacity -= busiest->group_capacity;
-			load_above_capacity *= scale_load_down(NICE_0_LOAD);
-			load_above_capacity /= busiest->group_capacity;
-		} else
-			load_above_capacity = ~0UL;
+	if (local->group_type < group_overloaded) {
+		/*
+		 * Local will become overloaded so the avg_load metrics are
+		 * finally needed.
+		 */
+
+		local->avg_load = (local->group_load * SCHED_CAPACITY_SCALE) /
+				  local->group_capacity;
+
+		sds->avg_load = (sds->total_load * SCHED_CAPACITY_SCALE) /
+				sds->total_capacity;
 	}
 
 	/*
-	 * We're trying to get all the CPUs to the average_load, so we don't
-	 * want to push ourselves above the average load, nor do we wish to
-	 * reduce the max loaded CPU below the average load. At the same time,
-	 * we also don't want to reduce the group load below the group
-	 * capacity. Thus we look for the minimum possible imbalance.
+	 * Both group are or will become overloaded and we're trying to get all
+	 * the CPUs to the average_load, so we don't want to push ourselves
+	 * above the average load, nor do we wish to reduce the max loaded CPU
+	 * below the average load. At the same time, we also don't want to
+	 * reduce the group load below the group capacity. Thus we look for
+	 * the minimum possible imbalance.
 	 */
-	max_pull = min(busiest->avg_load - sds->avg_load, load_above_capacity);
-
-	/* How much load to actually move to equalise the imbalance */
+	env->migration_type = migrate_load;
 	env->imbalance = min(
-		max_pull * busiest->group_capacity,
+		(busiest->avg_load - sds->avg_load) * busiest->group_capacity,
 		(sds->avg_load - local->avg_load) * local->group_capacity
 	) / SCHED_CAPACITY_SCALE;
-
-	/* Boost imbalance to allow misfit task to be balanced. */
-	if (busiest->group_type == group_misfit_task) {
-		env->imbalance = max_t(long, env->imbalance,
-				       busiest->group_misfit_task_load);
-	}
-
 }
 
 /******* find_busiest_group() helpers end here *********************/
 
+/*
+ * Decision matrix according to the local and busiest group type
+ *
+ * busiest \ local has_spare fully_busy misfit asym imbalanced overloaded
+ * has_spare        nr_idle   balanced   N/A    N/A  balanced   balanced
+ * fully_busy       nr_idle   nr_idle    N/A    N/A  balanced   balanced
+ * misfit_task      force     N/A        N/A    N/A  force      force
+ * asym_packing     force     force      N/A    N/A  force      force
+ * imbalanced       force     force      N/A    N/A  force      force
+ * overloaded       force     force      N/A    N/A  force      avg_load
+ *
+ * N/A :      Not Applicable because already filtered while updating
+ *            statistics.
+ * balanced : The system is balanced for these 2 groups.
+ * force :    Calculate the imbalance as load migration is probably needed.
+ * avg_load : Only if imbalance is significant enough.
+ * nr_idle :  dst_cpu is not busy and the number of idle cpus is quite
+ *            different in groups.
+ */
+
 /**
  * find_busiest_group - Returns the busiest group within the sched_domain
  * if there is an imbalance.
@@ -8370,17 +8529,17 @@  static struct sched_group *find_busiest_group(struct lb_env *env)
 	local = &sds.local_stat;
 	busiest = &sds.busiest_stat;
 
-	/* ASYM feature bypasses nice load balance check */
-	if (busiest->group_asym_packing)
-		goto force_balance;
-
 	/* There is no busy sibling group to pull tasks from */
-	if (!sds.busiest || busiest->sum_h_nr_running == 0)
+	if (!sds.busiest)
 		goto out_balanced;
 
-	/* XXX broken for overlapping NUMA groups */
-	sds.avg_load = (SCHED_CAPACITY_SCALE * sds.total_load)
-						/ sds.total_capacity;
+	/* Misfit tasks should be dealt with regardless of the avg load */
+	if (busiest->group_type == group_misfit_task)
+		goto force_balance;
+
+	/* ASYM feature bypasses nice load balance check */
+	if (busiest->group_type == group_asym_packing)
+		goto force_balance;
 
 	/*
 	 * If the busiest group is imbalanced the below checks don't
@@ -8391,55 +8550,64 @@  static struct sched_group *find_busiest_group(struct lb_env *env)
 		goto force_balance;
 
 	/*
-	 * When dst_cpu is idle, prevent SMP nice and/or asymmetric group
-	 * capacities from resulting in underutilization due to avg_load.
-	 */
-	if (env->idle != CPU_NOT_IDLE && group_has_capacity(env, local) &&
-	    busiest->group_no_capacity)
-		goto force_balance;
-
-	/* Misfit tasks should be dealt with regardless of the avg load */
-	if (busiest->group_type == group_misfit_task)
-		goto force_balance;
-
-	/*
 	 * If the local group is busier than the selected busiest group
 	 * don't try and pull any tasks.
 	 */
-	if (local->avg_load >= busiest->avg_load)
+	if (local->group_type > busiest->group_type)
 		goto out_balanced;
 
 	/*
-	 * Don't pull any tasks if this group is already above the domain
-	 * average load.
+	 * When groups are overloaded, use the avg_load to ensure fairness
+	 * between tasks.
 	 */
-	if (local->avg_load >= sds.avg_load)
-		goto out_balanced;
+	if (local->group_type == group_overloaded) {
+		/*
+		 * If the local group is more loaded than the selected
+		 * busiest group don't try and pull any tasks.
+		 */
+		if (local->avg_load >= busiest->avg_load)
+			goto out_balanced;
+
+		/* XXX broken for overlapping NUMA groups */
+		sds.avg_load = (sds.total_load * SCHED_CAPACITY_SCALE) /
+				sds.total_capacity;
 
-	if (env->idle == CPU_IDLE) {
 		/*
-		 * This CPU is idle. If the busiest group is not overloaded
-		 * and there is no imbalance between this and busiest group
-		 * wrt idle CPUs, it is balanced. The imbalance becomes
-		 * significant if the diff is greater than 1 otherwise we
-		 * might end up to just move the imbalance on another group
+		 * Don't pull any tasks if this group is already above the
+		 * domain average load.
 		 */
-		if ((busiest->group_type != group_overloaded) &&
-				(local->idle_cpus <= (busiest->idle_cpus + 1)))
+		if (local->avg_load >= sds.avg_load)
 			goto out_balanced;
-	} else {
+
 		/*
-		 * In the CPU_NEWLY_IDLE, CPU_NOT_IDLE cases, use
-		 * imbalance_pct to be conservative.
+		 * If the busiest group is more loaded, use imbalance_pct to be
+		 * conservative.
 		 */
 		if (100 * busiest->avg_load <=
 				env->sd->imbalance_pct * local->avg_load)
 			goto out_balanced;
 	}
 
+	/* Try to move all excess tasks to child's sibling domain */
+	if (sds.prefer_sibling && local->group_type == group_has_spare &&
+	    busiest->sum_h_nr_running > local->sum_h_nr_running + 1)
+		goto force_balance;
+
+	if (busiest->group_type != group_overloaded &&
+	     (env->idle == CPU_NOT_IDLE ||
+	      local->idle_cpus <= (busiest->idle_cpus + 1)))
+		/*
+		 * If the busiest group is not overloaded
+		 * and there is no imbalance between this and busiest group
+		 * wrt idle CPUs, it is balanced. The imbalance
+		 * becomes significant if the diff is greater than 1 otherwise
+		 * we might end up to just move the imbalance on another
+		 * group.
+		 */
+		goto out_balanced;
+
 force_balance:
 	/* Looks like there is an imbalance. Compute it */
-	env->src_grp_type = busiest->group_type;
 	calculate_imbalance(env, &sds);
 	return env->imbalance ? sds.busiest : NULL;
 
@@ -8455,11 +8623,13 @@  static struct rq *find_busiest_queue(struct lb_env *env,
 				     struct sched_group *group)
 {
 	struct rq *busiest = NULL, *rq;
-	unsigned long busiest_load = 0, busiest_capacity = 1;
+	unsigned long busiest_util = 0, busiest_load = 0, busiest_capacity = 1;
+	unsigned int busiest_nr = 0;
 	int i;
 
 	for_each_cpu_and(i, sched_group_span(group), env->cpus) {
-		unsigned long capacity, load;
+		unsigned long capacity, load, util;
+		unsigned int nr_running;
 		enum fbq_type rt;
 
 		rq = cpu_rq(i);
@@ -8487,20 +8657,8 @@  static struct rq *find_busiest_queue(struct lb_env *env,
 		if (rt > env->fbq_type)
 			continue;
 
-		/*
-		 * For ASYM_CPUCAPACITY domains with misfit tasks we simply
-		 * seek the "biggest" misfit task.
-		 */
-		if (env->src_grp_type == group_misfit_task) {
-			if (rq->misfit_task_load > busiest_load) {
-				busiest_load = rq->misfit_task_load;
-				busiest = rq;
-			}
-
-			continue;
-		}
-
 		capacity = capacity_of(i);
+		nr_running = rq->cfs.h_nr_running;
 
 		/*
 		 * For ASYM_CPUCAPACITY domains, don't pick a CPU that could
@@ -8510,35 +8668,70 @@  static struct rq *find_busiest_queue(struct lb_env *env,
 		 */
 		if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
 		    capacity_of(env->dst_cpu) < capacity &&
-		    rq->nr_running == 1)
+		    nr_running == 1)
 			continue;
 
-		load = cpu_runnable_load(rq);
+		switch (env->migration_type) {
+		case migrate_load:
+			/*
+			 * When comparing with load imbalance, use
+			 * cpu_runnable_load() which is not scaled with the CPU
+			 * capacity.
+			 */
+			load = cpu_runnable_load(rq);
 
-		/*
-		 * When comparing with imbalance, use cpu_runnable_load()
-		 * which is not scaled with the CPU capacity.
-		 */
+			if (nr_running == 1 && load > env->imbalance &&
+			    !check_cpu_capacity(rq, env->sd))
+				break;
 
-		if (rq->nr_running == 1 && load > env->imbalance &&
-		    !check_cpu_capacity(rq, env->sd))
-			continue;
+			/*
+			 * For the load comparisons with the other CPU's,
+			 * consider the cpu_runnable_load() scaled with the CPU
+			 * capacity, so that the load can be moved away from
+			 * the CPU that is potentially running at a lower
+			 * capacity.
+			 *
+			 * Thus we're looking for max(load_i / capacity_i),
+			 * crosswise multiplication to rid ourselves of the
+			 * division works out to:
+			 * load_i * capacity_j > load_j * capacity_i;
+			 * where j is our previous maximum.
+			 */
+			if (load * busiest_capacity > busiest_load * capacity) {
+				busiest_load = load;
+				busiest_capacity = capacity;
+				busiest = rq;
+			}
+			break;
+
+		case migrate_util:
+			util = cpu_util(cpu_of(rq));
+
+			if (busiest_util < util) {
+				busiest_util = util;
+				busiest = rq;
+			}
+			break;
+
+		case migrate_task:
+			if (busiest_nr < nr_running) {
+				busiest_nr = nr_running;
+				busiest = rq;
+			}
+			break;
+
+		case migrate_misfit:
+			/*
+			 * For ASYM_CPUCAPACITY domains with misfit tasks we
+			 * simply seek the "biggest" misfit task.
+			 */
+			if (rq->misfit_task_load > busiest_load) {
+				busiest_load = rq->misfit_task_load;
+				busiest = rq;
+			}
+
+			break;
 
-		/*
-		 * For the load comparisons with the other CPU's, consider
-		 * the cpu_runnable_load() scaled with the CPU capacity, so
-		 * that the load can be moved away from the CPU that is
-		 * potentially running at a lower capacity.
-		 *
-		 * Thus we're looking for max(load_i / capacity_i), crosswise
-		 * multiplication to rid ourselves of the division works out
-		 * to: load_i * capacity_j > load_j * capacity_i;  where j is
-		 * our previous maximum.
-		 */
-		if (load * busiest_capacity > busiest_load * capacity) {
-			busiest_load = load;
-			busiest_capacity = capacity;
-			busiest = rq;
 		}
 	}
 
@@ -8584,7 +8777,7 @@  voluntary_active_balance(struct lb_env *env)
 			return 1;
 	}
 
-	if (env->src_grp_type == group_misfit_task)
+	if (env->migration_type == migrate_misfit)
 		return 1;
 
 	return 0;
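
To make the two most common calculate_imbalance() paths above concrete,
here is a small standalone example with made-up numbers (illustrative
user-space code, not kernel code): when the local group has spare
capacity and the busiest group is overloaded, the imbalance is expressed
in utilization; in the prefer_sibling / single-CPU-busiest case it is
expressed as a task count.

#include <stdio.h>

int main(void)
{
	/* Hypothetical numbers, only to show the shape of the formulas. */
	unsigned long local_capacity = 2048, local_util = 1536;
	unsigned int busiest_nr_running = 6, local_nr_running = 2;

	/* local has spare capacity, busiest is overloaded -> migrate_util:
	 * fill the spare capacity of the local group. */
	unsigned long util_imbalance =
		(local_capacity > local_util ? local_capacity : local_util) - local_util;

	/* prefer_sibling or single-CPU busiest group -> migrate_task:
	 * even out the number of running tasks. */
	unsigned int task_imbalance = (busiest_nr_running - local_nr_running) >> 1;

	printf("migrate_util imbalance: %lu\n", util_imbalance);	/* 512 */
	printf("migrate_task imbalance: %u\n", task_imbalance);	/* 2 */
	return 0;
}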