[3/7] sched/fair: Correct unit of load_above_capacity

Message ID: 20160503145641.GA759@e105550-lin.cambridge.arm.com
State: New

Commit Message

Morten Rasmussen May 3, 2016, 2:56 p.m. UTC
On Tue, May 03, 2016 at 12:52:33PM +0200, Peter Zijlstra wrote:
> On Fri, Apr 29, 2016 at 08:32:40PM +0100, Dietmar Eggemann wrote:
> > From: Morten Rasmussen <morten.rasmussen@arm.com>
> > 
> > In calculate_imbalance() load_above_capacity currently has the unit
> > [load] while it is used as being [load/capacity]. Not only is it wrong,
> > it also makes it unlikely that load_above_capacity is ever used, as the
> > subsequent code picks the smaller of load_above_capacity and the
> > avg_load difference between the local and the busiest sched_groups.
> > 
> > This patch ensures that load_above_capacity has the right unit
> > [load/capacity].
> > 
> > Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
> > ---
> >  kernel/sched/fair.c | 6 ++++--
> >  1 file changed, 4 insertions(+), 2 deletions(-)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index bc19fee94f39..c066574cff04 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7016,9 +7016,11 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> >  	    local->group_type   == group_overloaded) {
> >  		load_above_capacity = busiest->sum_nr_running *
> >  					SCHED_LOAD_SCALE;
> 
> Given smpnice and cgroups, this starting point seems somewhat random.
> Then again, without cgroups and assuming all tasks are nice-0 it is
> still true. But I think the cgroup muck is rather universal these days.

It makes sense for non-grouped always-running nice-0 tasks, and not so
much for all other scenarios. We ignore smpnice, PELT, and cgroups :-(

> 
> The above SCHED_LOAD_SCALE is the 1 -> load metric, right?

Yes. For clarity I would probably have used NICE_0_LOAD instead. We are
establishing how much capacity sum_nr_running tasks would consume if they
are all non-grouped always-running nice-0 tasks.
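
A made-up example to make the units concrete: with
busiest->sum_nr_running = 6 and SCHED_LOAD_SCALE = 1024 the tasks would
consume 6 * 1024 = 6144 worth of capacity, so with
busiest->group_capacity = 4096 the excess is 6144 - 4096 = 2048.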

> 
> > -		if (load_above_capacity > busiest->group_capacity)
> > +		if (load_above_capacity > busiest->group_capacity) {
> >  			load_above_capacity -= busiest->group_capacity;

If they would consume more than we have, we return how much more;
otherwise ULONG_MAX (~0UL).

> > -		else
> > +			load_above_capacity *= SCHED_LOAD_SCALE;
> > +			load_above_capacity /= busiest->group_capacity;
> 
> And this one does load->capacity

It does load->[load/capacity]

The subsequent code does [load/capacity]->load:

	env->imbalance = min(
-->		max_pull * busiest->group_capacity,
		(sds->avg_load - local->avg_load) * local->group_capacity
	) / SCHED_CAPACITY_SCALE;
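
Carrying the made-up numbers through: the excess of 2048 [load] becomes
2048 * 1024 / 4096 = 512 [load/capacity], and if that ends up as
max_pull, the imbalance comes back out as 512 * 4096 / 1024 = 2048
[load] again.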

> 
> > +		} else
> >  			load_above_capacity = ~0UL;
> >  	}
> 
> Is there anything better we can do? I know there's a fair bit of cruft
> in; but these assumptions that SCHED_LOAD_SCALE is one task just bug
> the hell out of me.
> 
> Would it make sense to make load_balance()->detach_tasks() ensure
> busiest->sum_nr_running >= busiest->group_weight or so?

That would make a lot of sense, because that is essentially what the
code does for non-SMT systems. The story is a bit different for SMT
though since:

	busiest->group_capacity < busiest->group_weight * SCHED_CAPACITY_SCALE.
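
(For scale: with the default smt_gain of 1178, a two-hw-thread core ends
up with a group_capacity of roughly 1178, well below
group_weight * SCHED_CAPACITY_SCALE = 2048.)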

This could lead us to try pulling more tasks to vacate SMT hw-threads.
However, I'm not convinced it has any effect, as:

	max_pull = min(busiest->avg_load - sds->avg_load, load_above_capacity);

is the _minimum_ of the actual load difference and the 'excess' load, so
load_above_capacity can't cause us to pull more load anyway. I don't see
why it shouldn't be okay.

Something like the below should implement your proposal, I think (only
boot-tested on arm64):

---8<---

-- 
1.9.1

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0fe30e6..37fcbec 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5777,6 +5777,7 @@ struct lb_env {
 	int			new_dst_cpu;
 	enum cpu_idle_type	idle;
 	long			imbalance;
+	unsigned long		imbalance_tasks;
 	/* The set of CPUs under consideration for load-balancing */
 	struct cpumask		*cpus;
 
@@ -6018,7 +6019,7 @@ static int detach_tasks(struct lb_env *env)
 
 	lockdep_assert_held(&env->src_rq->lock);
 
-	if (env->imbalance <= 0)
+	if (env->imbalance <= 0 || env->imbalance_tasks == 0)
 		return 0;
 
 	while (!list_empty(tasks)) {
@@ -6072,9 +6073,9 @@ static int detach_tasks(struct lb_env *env)
 
 		/*
 		 * We only want to steal up to the prescribed amount of
-		 * weighted load.
+		 * weighted load or max number of tasks.
 		 */
-		if (env->imbalance <= 0)
+		if (env->imbalance <= 0 || env->imbalance_tasks <= detached)
 			break;
 
 		continue;
@@ -6873,12 +6874,14 @@ void fix_small_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
  */
 static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
 {
-	unsigned long max_pull, load_above_capacity = ~0UL;
+	unsigned long max_pull;
 	struct sg_lb_stats *local, *busiest;
 
 	local = &sds->local_stat;
 	busiest = &sds->busiest_stat;
 
+	env->imbalance_tasks = ~0UL;
+
 	if (busiest->group_type == group_imbalanced) {
 		/*
 		 * In the group_imb case we cannot rely on group-wide averages
@@ -6904,23 +6907,17 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	 */
 	if (busiest->group_type == group_overloaded &&
 	    local->group_type   == group_overloaded) {
-		load_above_capacity = busiest->sum_nr_running *
-					SCHED_LOAD_SCALE;
-		if (load_above_capacity > busiest->group_capacity)
-			load_above_capacity -= busiest->group_capacity;
-		else
-			load_above_capacity = ~0UL;
+		if (busiest->sum_nr_running > busiest->group_weight)
+			env->imbalance_tasks =
+				busiest->sum_nr_running - busiest->group_weight;
 	}
 
 	/*
 	 * We're trying to get all the cpus to the average_load, so we don't
 	 * want to push ourselves above the average load, nor do we wish to
-	 * reduce the max loaded cpu below the average load. At the same time,
-	 * we also don't want to reduce the group load below the group capacity
-	 * (so that we can implement power-savings policies etc). Thus we look
-	 * for the minimum possible imbalance.
+	 * reduce the max loaded cpu below the average load.
 	 */
-	max_pull = min(busiest->avg_load - sds->avg_load, load_above_capacity);
+	max_pull = busiest->avg_load - sds->avg_load;
 
 	/* How much load to actually move to equalise the imbalance */
 	env->imbalance = min(