[4/7] sched/fair: Clean up the logic in fix_small_imbalance()

Message ID 1461958364-675-5-git-send-email-dietmar.eggemann@arm.com
State New

Commit Message

Dietmar Eggemann April 29, 2016, 7:32 p.m. UTC
Avoid the need to add scaled_busy_load_per_task on both sides of the if
condition to determine whether imbalance has to be set to
busiest->load_per_task or not.

The imbn variable was introduced with commit 2dd73a4f09be ("[PATCH]
sched: implement smpnice") and the original if condition was

  if (max_load - this_load >= busiest_load_per_task * imbn)

which over time changed into the current version where
scaled_busy_load_per_task is to be found on both sides of
the if condition.
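
For reference, the condition as it stands today (the removed lines in the
diff below) reads:

  if (busiest->avg_load + scaled_busy_load_per_task >=
      local->avg_load + (scaled_busy_load_per_task * imbn))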

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>

---

The original smpnice implementation sets the imbalance to the avg task
load of the busiest sched group (sg) if the difference between the avg
load of the busiest sg and the local sg is greater than or equal to imbn
times the avg task load of the busiest sg. The factor imbn is 2 if there
is no task running in the local sg or if the avg task load of the busiest
sg is smaller than or equal to the avg task load of the local sg;
otherwise imbn is lowered to 1.
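
Roughly, the relevant part of the original smpnice code (paraphrased from
the 2.6-era find_busiest_group(), so details may differ) looked like:

  imbn = 2;
  if (this_nr_running) {
          this_load_per_task /= this_nr_running;
          if (busiest_load_per_task > this_load_per_task)
                  imbn = 1;
  } else
          this_load_per_task = SCHED_LOAD_SCALE;

  if (max_load - this_load >= busiest_load_per_task * imbn) {
          *imbalance = busiest_load_per_task;
          return busiest;
  }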

imbn set to 2 makes sense when all tasks have the same priority: if there
are n tasks on the local sg there have to be at least n+2 tasks on the
busiest sg to give load balancing the chance to pull over one task (see
the illustration below).
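
As an illustration, assume equal-priority tasks of weight w and ignore
capacity scaling: with n tasks on the local sg and n+2 tasks on the
busiest sg,

  max_load - this_load = (n+2)*w - n*w = 2*w >= busiest_load_per_task * 2

holds exactly, so the imbalance is set to w and one task can be pulled over.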

imbn set to 1 makes sense when the avg task load of the busiest sg is
greater than that of the local sg, since in that case there could be at
least one task with a smaller load on the busiest sg which can be pulled
over to the local sg (see the example below).
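
For example (illustrative weights, again ignoring capacity scaling), let
the local sg run one task of weight 256 and the busiest sg two tasks of
weight 1024 and 256, so busiest_load_per_task = 640 > 256:

  max_load - this_load = 1280 - 256 = 1024 >= 640 * 1

The imbalance is set to 640 and the smaller 256-weight task can be pulled
over, whereas imbn = 2 (1280) would have prevented this.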

The current version has effectively lowered imbn by one: by adding
scaled_busy_load_per_task on both sides of the condition we effectively
use 1 instead of 2 in case the avg task load of sg_busiest isn't greater
than that of sg_local, and 0 instead of 1 in the other case. This
behaviour is no longer in sync with the explanation above of why imbn is
set to those values.
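
This can be seen by subtracting scaled_busy_load_per_task from both sides
of the current condition:

  busiest->avg_load + scaled_busy_load_per_task >=
      local->avg_load + scaled_busy_load_per_task * imbn

  <=>  busiest->avg_load >=
       local->avg_load + scaled_busy_load_per_task * (imbn - 1)

i.e. imbn = 2 and imbn = 1 in the current code act like 1 and 0 in the
original formulation, which is what this patch makes explicit.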

 kernel/sched/fair.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

-- 
1.9.1

Comments

Dietmar Eggemann May 3, 2016, 4:53 p.m. UTC | #1
On 03/05/16 11:12, Peter Zijlstra wrote:
> On Fri, Apr 29, 2016 at 08:32:41PM +0100, Dietmar Eggemann wrote:
>> Avoid the need to add scaled_busy_load_per_task on both sides of the if
>> condition to determine whether imbalance has to be set to
>> busiest->load_per_task or not.
>>
>> The imbn variable was introduced with commit 2dd73a4f09be ("[PATCH]
>> sched: implement smpnice") and the original if condition was
>>
>>   if (max_load - this_load >= busiest_load_per_task * imbn)
>>
>> which over time changed into the current version where
>> scaled_busy_load_per_task is to be found on both sides of
>> the if condition.
>
> This appears to have started with:
>
>   dd41f596cda0 ("sched: cfs core code")
>
> which for unexplained reasons does:
>
> -               if (max_load - this_load >= busiest_load_per_task * imbn) {
> +               if (max_load - this_load + SCHED_LOAD_SCALE_FUZZ >=
> +                                       busiest_load_per_task * imbn) {
>
> And later patches (by me) change that FUZZ into a variable metric,
> because a fixed fuzz like that didn't at all work for the small loads
> that result from cgroup tasks.
>
> Now fix_small_imbalance() always hurt my head; it originated in the
> original sched_domain balancer from Nick which wasn't smpnice aware; and
> lives on until today.

I see, all this code is already in the history.git kernel.

> Its purpose is to determine if moving one task over is beneficial.
> However over time -- and smpnice started this -- the idea of _one_ task
> became quite muddled.
>
> With the fine grained load accounting of today; does it even make sense
> to ask this question? IOW. what does fix_small_imbalance() really gain
> us -- other than a head-ache?

So task priority breaks the assumption that 1 task is equivalent to
SCHED_LOAD_SCALE and so does fine grained load accounting.
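
(For example, the nice-level weight table maps a nice 0 task to weight
1024 (SCHED_LOAD_SCALE), a nice 19 task to 15 and a nice -20 task to
88761, so "one task" can be almost anything in load terms.)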

fix_small_imbalance() is called twice from calculate_imbalance(). If we
were to get rid of it, I don't know whether we should bail out of lb in
case the avg load values don't align nicely (busiest > sd avg > local) or
just continue w/ lb.

In the second case, where the imbalance value is raised (to
busiest->load_per_task), we probably can just continue w/ lb, hoping
that there is a task on the src rq which fits the smaller imbalance value.
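
For reference, the two call sites in calculate_imbalance() look roughly
like this (paraphrased, so details may differ):

  if (busiest->avg_load <= sds->avg_load ||
      local->avg_load >= sds->avg_load) {
          env->imbalance = 0;
          return fix_small_imbalance(env, sds);
  }

  ...

  /* imbalance ended up smaller than a single task's load on sg_busiest */
  if (env->imbalance < busiest->load_per_task)
          return fix_small_imbalance(env, sds);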

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c066574cff04..dc4828bbe50d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6915,7 +6915,7 @@  static inline
 void fix_small_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
 {
 	unsigned long tmp, capa_now = 0, capa_move = 0;
-	unsigned int imbn = 2;
+	unsigned int imbn = 1;
 	unsigned long scaled_busy_load_per_task;
 	struct sg_lb_stats *local, *busiest;
 
@@ -6925,13 +6925,13 @@  void fix_small_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
 	if (!local->sum_nr_running)
 		local->load_per_task = cpu_avg_load_per_task(env->dst_cpu);
 	else if (busiest->load_per_task > local->load_per_task)
-		imbn = 1;
+		imbn = 0;
 
 	scaled_busy_load_per_task =
 		(busiest->load_per_task * SCHED_CAPACITY_SCALE) /
 		busiest->group_capacity;
 
-	if (busiest->avg_load + scaled_busy_load_per_task >=
+	if (busiest->avg_load >=
 	    local->avg_load + (scaled_busy_load_per_task * imbn)) {
 		env->imbalance = busiest->load_per_task;
 		return;