
sched/fair: Do not decay new task load on first enqueue

Message ID 20160928131309.GA5483@vingu-laptop
State New

Commit Message

Vincent Guittot Sept. 28, 2016, 1:13 p.m. UTC
On Wednesday 28 Sep 2016 at 05:27:54 (-0700), Vincent Guittot wrote:
> On 28 September 2016 at 04:31, Dietmar Eggemann
> <dietmar.eggemann@arm.com> wrote:
> > On 28/09/16 12:19, Peter Zijlstra wrote:
> >> On Wed, Sep 28, 2016 at 12:06:43PM +0100, Dietmar Eggemann wrote:
> >>> On 28/09/16 11:14, Peter Zijlstra wrote:
> >>>> On Fri, Sep 23, 2016 at 12:58:08PM +0100, Matt Fleming wrote:
> >
> > [...]
> >
> >>> Not sure what you mean by 'after fixing' but the se is initialized with
> >>> a possibly stale 'now' value in post_init_entity_util_avg()->
> >>> attach_entity_load_avg() before the clock is updated in
> >>> activate_task()->enqueue_task().
> >>
> >> I meant that after I fix the above issue of calling post_init with a
> >> stale clock. So the + update_rq_clock(rq) in the patch.
> >
> > OK.
> >
> > [...]
> >
> >>>> While staring at this, I don't think we can still hit
> >>>> vruntime_normalized() with a new task, so I _think_ we can remove that
> >>>> !se->sum_exec_runtime clause there (and rejoice), no?
> >>>
> >>> I'm afraid that with accurate timing we will get the same situation that
> >>> we add and subtract the same amount of load (probably 1024 now and not
> >>> 1002 (or less)) to/from cfs_rq->runnable_load_avg for the initial (fork)
> >>> hackbench run.
> >>> After all, it's 'runnable' based.
> >>
> >> The idea was that since we now update rq clock before post_init and then
> >> leave it be, both post_init and enqueue see the exact same timestamp,
> >> and the delta is 0, resulting in no aging.
> >>
> >> Or did I fail to make that happen?
> >
> > No, but IMHO what Matt wants is ageing for the hackbench tasks at the end
> > of their fork phase so there is a tiny amount of
> > cfs_rq->runnable_load_avg left on cpuX after the fork related dequeue so
> > the (load-based) fork-balancer chooses cpuY for the next hackbench task.
> > That's why he wanted to avoid the __update_load_avg(se) on enqueue (thus
> > adding 1024 to cfs_rq->runnable_load_avg) and do the ageing only on
> > dequeue (removing <1024 from cfs_rq->runnable_load_avg).
>
> Wanting cfs_rq->runnable_load_avg to be non-null when nothing is
> runnable on the cfs_rq seems a bit odd.
> We should rather take cfs_rq->avg.load_avg or cfs_rq->avg.util_avg
> into account in select_idlest_group in this case.


IIUC the problem raised by Matt: he sees a regression because we now remove
on dequeue exactly the same load that we added on enqueue, so
cfs_rq->runnable_load_avg is null and we may select a cfs_rq that already
has a lot of blocked hackbench threads.
The fact that runnable_load_avg is null when the cfs_rq has no runnable
task is correct and we should keep it. But when we look for the idlest
group, we have to take the blocked threads into account.

That's what I have tried to do below.


---
 kernel/sched/fair.c | 30 +++++++++++++++++++++++-------
 1 file changed, 23 insertions(+), 7 deletions(-)
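
To make the intent easier to see at a glance, the selection rule in the
patch can be illustrated with a small standalone C program (the cluster
names and load numbers below are invented for the example; the patch at
the bottom of this page is the authoritative code):

/*
 * Standalone illustration (not kernel code) of the selection rule:
 * prefer the group with the smallest runnable load and, when runnable
 * loads are equal, fall back to the blocked/average load.
 */
#include <limits.h>
#include <stdio.h>

struct group_stats {
	const char *name;
	unsigned long runnable_load;	/* sum of runnable load in the group */
	unsigned long avg_load;		/* includes blocked load (cfs_rq_load_avg) */
};

int main(void)
{
	struct group_stats groups[] = {
		{ "cluster0",   0, 2048 },	/* idle but full of blocked tasks */
		{ "cluster1",   0,  128 },	/* idle with almost no blocked load */
		{ "cluster2", 512,  600 },	/* has a runnable task */
	};
	unsigned long min_runnable_load = ULONG_MAX, min_avg_load = ULONG_MAX;
	const char *idlest = NULL;

	for (unsigned int i = 0; i < sizeof(groups) / sizeof(groups[0]); i++) {
		unsigned long runnable_load = groups[i].runnable_load;
		unsigned long avg_load = groups[i].avg_load;

		if (runnable_load < min_runnable_load) {
			min_runnable_load = runnable_load;
			min_avg_load = avg_load;
			idlest = groups[i].name;
		} else if (runnable_load == min_runnable_load &&
			   avg_load < min_avg_load) {
			/* same runnable load: pick the smallest blocked load */
			min_avg_load = avg_load;
			idlest = groups[i].name;
		}
	}

	/* Comparing only runnable load, cluster0 and cluster1 would tie at 0 */
	printf("idlest group: %s\n", idlest);	/* prints "cluster1" */
	return 0;
}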



Comments

Dietmar Eggemann Sept. 29, 2016, 4:15 p.m. UTC | #1
On 28/09/16 14:13, Vincent Guittot wrote:
> On Wednesday 28 Sep 2016 at 05:27:54 (-0700), Vincent Guittot wrote:
>> On 28 September 2016 at 04:31, Dietmar Eggemann
>> <dietmar.eggemann@arm.com> wrote:
>>> On 28/09/16 12:19, Peter Zijlstra wrote:
>>>> On Wed, Sep 28, 2016 at 12:06:43PM +0100, Dietmar Eggemann wrote:
>>>>> On 28/09/16 11:14, Peter Zijlstra wrote:
>>>>>> On Fri, Sep 23, 2016 at 12:58:08PM +0100, Matt Fleming wrote:

[...]

> IIUC the problem raised by Matt: he sees a regression because we now remove
> on dequeue exactly the same load that we added on enqueue, so
> cfs_rq->runnable_load_avg is null and we may select a cfs_rq that already
> has a lot of blocked hackbench threads.


This is my understanding as well.

> The fact that runnable_load_avg is null when the cfs_rq has no runnable
> task is correct and we should keep it. But when we look for the idlest
> group, we have to take the blocked threads into account.
>
> That's what I have tried to do below.


[...]

> +		/*
> +		 * In case that we have same runnable load (especially null
> +		 *  runnable load), we select the group with smallest blocked
> +		 *  load
> +		 */
> +			min_avg_load = avg_load;
> +			min_runnable_load = runnable_load;

Setting 'min_runnable_load' wouldn't be necessary here.

>  			idlest = group;
>  		}
> +
>  	} while (group = group->next, group != sd->groups);
>  
> -	if (!idlest || 100*this_load < imbalance*min_load)
> +	if (!idlest || 100*this_load < imbalance*min_runnable_load)
>  		return NULL;
>  	return idlest;


On the Hikey board (ARM64, 2 clusters with 4 CPUs each, so MC and DIE), the
first f_i_g (on DIE) is still based on rbl_load. So if the first hackbench
task (the one spawning all the worker tasks) runs on cluster1 and a former
worker p_X is already blocked, f_i_g returns cluster2; if p_X is still
running, it returns idlest=NULL and we continue with cluster1 for the second
f_i_g on MC.

The additional 'else if' condition doesn't seem to help much. There are
occurrences where an idle cpu (which never took a worker) still has a small
rbl_load value (which shouldn't actually happen, weighted_cpuload() should
be 0), so it is never chosen. It can even have a negative impact when an
idle cpu (which never took a worker) is not chosen because its load
(cfs->avg.load_avg) hasn't been updated for a long time, so another cpu with
rbl_load = 0 and a smaller load is used instead (even though a lot of
workers were already placed on it).

There are also episodes where we 'pack' workers onto the cpu which was
initially picked in f_i_c (on DIE) because (100*this_load <
imbalance*min_load) is true in f_i_g on MC. Maybe we can get rid of this
check for !sd->child?

[...]
Vincent Guittot Oct. 3, 2016, 1:05 p.m. UTC | #2
On 29 September 2016 at 18:15, Dietmar Eggemann
<dietmar.eggemann@arm.com> wrote:
> On 28/09/16 14:13, Vincent Guittot wrote:
>> On Wednesday 28 Sep 2016 at 05:27:54 (-0700), Vincent Guittot wrote:
>>> On 28 September 2016 at 04:31, Dietmar Eggemann
>>> <dietmar.eggemann@arm.com> wrote:
>>>> On 28/09/16 12:19, Peter Zijlstra wrote:
>>>>> On Wed, Sep 28, 2016 at 12:06:43PM +0100, Dietmar Eggemann wrote:
>>>>>> On 28/09/16 11:14, Peter Zijlstra wrote:
>>>>>>> On Fri, Sep 23, 2016 at 12:58:08PM +0100, Matt Fleming wrote:
>
> [...]
>
>> IIUC the problem raised by Matt: he sees a regression because we now remove
>> on dequeue exactly the same load that we added on enqueue, so
>> cfs_rq->runnable_load_avg is null and we may select a cfs_rq that already
>> has a lot of blocked hackbench threads.
>
> This is my understanding as well.
>
>> The fact that runnable_load_avg is null when the cfs_rq has no runnable
>> task is correct and we should keep it. But when we look for the idlest
>> group, we have to take the blocked threads into account.
>>
>> That's what I have tried to do below.
>
> [...]
>
>> +             /*
>> +              * In case that we have same runnable load (especially null
>> +              *  runnable load), we select the group with smallest blocked
>> +              *  load
>> +              */
>> +                     min_avg_load = avg_load;
>> +                     min_runnable_load = runnable_load;
>
> Setting 'min_runnable_load' wouldn't be necessary here.


Fair enough.

>
>>                       idlest = group;
>>               }
>> +
>>       } while (group = group->next, group != sd->groups);
>>
>> -     if (!idlest || 100*this_load < imbalance*min_load)
>> +     if (!idlest || 100*this_load < imbalance*min_runnable_load)
>>               return NULL;
>>       return idlest;
>

> On the Hikey board (ARM64, 2 clusters with 4 CPUs each, so MC and DIE), the
> first f_i_g (on DIE) is still based on rbl_load. So if the first hackbench
> task (the one spawning all the worker tasks) runs on cluster1 and a former
> worker p_X is already blocked, f_i_g returns cluster2; if p_X is still
> running, it returns idlest=NULL and we continue with cluster1 for the
> second f_i_g on MC.
>
> The additional 'else if' condition doesn't seem to help much. There are
> occurrences where an idle cpu (which never took a worker) still has a small
> rbl_load value (which shouldn't actually happen, weighted_cpuload() should
> be 0), so it is never chosen. It can even have a negative impact when an
> idle cpu (which never took a worker) is not chosen because its load
> (cfs->avg.load_avg) hasn't been updated for a long time, so another cpu
> with rbl_load = 0 and a smaller load is used instead (even though a lot of
> workers were already placed on it).


So the 'else if' part is there to take care of the regression raised by
Matt, where runnable_load_avg is null because the workers are blocked and
the same cpu gets selected again.
This can be extended with a threshold in order to absorb small differences
that come from computation rounding, as sketched below.
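
As a rough idea of what such a threshold could look like (purely
illustrative: RUNNABLE_LOAD_MARGIN and prefer_group() are made-up names,
not part of the posted patch or of the kernel):

#include <stdbool.h>

/* Made-up margin: treat runnable loads within ~2.5% of a 1024 unit as equal */
#define RUNNABLE_LOAD_MARGIN	26UL

/*
 * Should a candidate group with the given loads replace the current
 * idlest one?  Runnable load decides first; when the candidate is within
 * the margin, the blocked/average load is used as tie-breaker.
 */
static bool prefer_group(unsigned long runnable_load, unsigned long avg_load,
			 unsigned long min_runnable_load,
			 unsigned long min_avg_load)
{
	if (runnable_load + RUNNABLE_LOAD_MARGIN < min_runnable_load)
		return true;

	if (runnable_load < min_runnable_load + RUNNABLE_LOAD_MARGIN &&
	    avg_load < min_avg_load)
		return true;

	return false;
}

The loop in find_idlest_group() would then update min_runnable_load,
min_avg_load and idlest whenever the helper returns true.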

>
> There are also episodes where we 'pack' workers onto the cpu which was
> initially picked in f_i_c (on DIE) because (100*this_load <
> imbalance*min_load) is true in f_i_g on MC. Maybe we can get rid of this
> check for !sd->child?


This threshold is there to filter out small variations that are not
relevant. I'm going to extend the use of cfs_rq_load_avg() to all
conditions so we take blocked load into account everywhere, instead of
only when runnable_load_avg is null; a rough sketch follows.
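
For illustration, the end of find_idlest_group() might then compare both
metrics before deciding to stay on the local group (again a made-up
helper, not the posted patch; this_runnable_load and this_avg_load would
be the local group's two sums):

/*
 * Hypothetical sketch: keep the task on this_cpu's group only if the
 * local group wins on the runnable load *and* on the blocked-inclusive
 * load, scaled by the imbalance percentage.
 */
static int keep_local_group(unsigned long this_runnable_load,
			    unsigned long this_avg_load,
			    unsigned long min_runnable_load,
			    unsigned long min_avg_load,
			    int imbalance)
{
	return 100 * this_runnable_load < imbalance * min_runnable_load &&
	       100 * this_avg_load < imbalance * min_avg_load;
}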

>
> [...]

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 06b3c47..702915e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5353,7 +5353,8 @@  find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 		  int this_cpu, int sd_flag)
 {
 	struct sched_group *idlest = NULL, *group = sd->groups;
-	unsigned long min_load = ULONG_MAX, this_load = 0;
+	unsigned long min_runnable_load = ULONG_MAX, this_load = 0;
+	unsigned long min_avg_load = ULONG_MAX;
 	int load_idx = sd->forkexec_idx;
 	int imbalance = 100 + (sd->imbalance_pct-100)/2;
 
@@ -5361,7 +5362,7 @@  find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 		load_idx = sd->wake_idx;
 
 	do {
-		unsigned long load, avg_load;
+		unsigned long load, avg_load, runnable_load;
 		int local_group;
 		int i;
 
@@ -5375,6 +5376,7 @@  find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 
 		/* Tally up the load of all CPUs in the group */
 		avg_load = 0;
+		runnable_load = 0;
 
 		for_each_cpu(i, sched_group_cpus(group)) {
 			/* Bias balancing toward cpus of our domain */
@@ -5383,21 +5385,35 @@  find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 			else
 				load = target_load(i, load_idx);
 
-			avg_load += load;
+			runnable_load += load;
+
+			avg_load += cfs_rq_load_avg(&cpu_rq(i)->cfs);
 		}
 
 		/* Adjust by relative CPU capacity of the group */
 		avg_load = (avg_load * SCHED_CAPACITY_SCALE) / group->sgc->capacity;
+		runnable_load = (runnable_load * SCHED_CAPACITY_SCALE) / group->sgc->capacity;
 
 		if (local_group) {
-			this_load = avg_load;
-		} else if (avg_load < min_load) {
-			min_load = avg_load;
+			this_load = runnable_load;
+		} else if (runnable_load < min_runnable_load) {
+			min_runnable_load = runnable_load;
+			min_avg_load = avg_load;
+			idlest = group;
+		} else if ((runnable_load == min_runnable_load) && (avg_load < min_avg_load)) {
+			/*
+			 * In case that we have same runnable load
+			 * (especially null runnable load), we select the
+			 * group with smallest blocked load
+			 */
+			min_avg_load = avg_load;
+			min_runnable_load = runnable_load;
 			idlest = group;
 		}
+
 	} while (group = group->next, group != sd->groups);
 
-	if (!idlest || 100*this_load < imbalance*min_load)
+	if (!idlest || 100*this_load < imbalance*min_runnable_load)
 		return NULL;
 	return idlest;
 }