
[rfc] sched/fair: Use instantaneous load for fork/exec balancing

Message ID 5760115C.7040306@arm.com
State New

Commit Message

Dietmar Eggemann June 14, 2016, 2:14 p.m. UTC
On 14/06/16 08:58, Mike Galbraith wrote:
> SUSE's regression testing noticed that...
> 
> 0905f04eb21f sched/fair: Fix new task's load avg removed from source CPU in wake_up_new_task()
> 
> ...introduced a hackbench regression, and indeed it does.  I think this
> regression has more to do with randomness than anything else, but in
> general...
> 
> While averaging calms down load balancing, helping to keep migrations
> down to a dull roar, it's not completely wonderful when it comes to
> things that live in the here and now, hackbench being one such.
> 
> time sh -c 'for i in `seq 1000`; do hackbench -p -P > /dev/null; done'
> 
> real    0m55.397s
> user    0m8.320s
> sys     5m40.789s
> 
> echo LB_INSTANTANEOUS_LOAD > /sys/kernel/debug/sched_features
> 
> real    0m48.049s
> user    0m6.510s
> sys     5m6.291s
> 
> Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com>


I see similar values on ARM64 (Juno r0: 2x Cortex-A57, 4x Cortex-A53). OK,
1000 invocations of hackbench take a little bit longer here, but I guess
it's the forks we're after.

- echo NO_LB_INSTANTANEOUS_LOAD > /sys/kernel/debug/sched_features

time sh -c 'for i in `seq 1000`; do hackbench -p -P > /dev/null; done'

root@juno:~# time sh -c 'for i in `seq 1000`; do hackbench -p -P > /dev/null; done'

real	10m17.155s
user	2m56.976s
sys	38m0.324s

- echo LB_INSTANTANEOUS_LOAD > /sys/kernel/debug/sched_features

time sh -c 'for i in `seq 1000`; do hackbench -p -P > /dev/null; done'

real	9m49.832s
user	2m42.896s
sys	34m51.452s

- But I get a similar effect if I initialize se->avg.load_avg w/ 0:

root@juno:~# time sh -c 'for i in `seq 1000`; do hackbench -p -P > /dev/null; done'

real	9m55.396s
user	2m41.192s
sys	35m6.196s


IMHO, the hackbench performance "boost" w/o 0905f04eb21f is due to the
fact that a new task gets all its load decayed (making it a small task)
in the __update_load_avg() call in remove_entity_load_avg(), because its
se->avg.last_update_time value is 0, which creates a huge time difference
compared to cfs_rq->avg.last_update_time. The patch 0905f04eb21f
avoids this and thus the task stays big, se->avg.load_avg = 1024.

It can't be a difference in the value of cfs_rq->removed_load_avg
because w/o the patch 0905f04eb21f, we atomic_long_add 0 and with the
patch we bail before the atomic_long_add().
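
For illustration, a standalone C sketch of this effect (the decay model,
time values and helper names are simplified assumptions, not the real PELT
maths in __update_load_avg()):

#include <stdint.h>
#include <stdio.h>

/* Toy decay: halve the contribution every 32ms, roughly what PELT does. */
static uint64_t toy_decay(uint64_t load, uint64_t ms)
{
	uint64_t halvings = ms / 32;

	return halvings >= 64 ? 0 : load >> halvings;
}

int main(void)
{
	/* Values are invented for illustration. */
	uint64_t cfs_rq_last_update_ns = 5ULL * 1000000000;	/* cfs_rq->avg.last_update_time */
	uint64_t se_last_update_ns = 0;				/* new task, never updated      */
	uint64_t se_load_avg = 1024;				/* init_entity_runnable_average() */

	uint64_t delta_ms = (cfs_rq_last_update_ns - se_last_update_ns) / 1000000;

	se_load_avg = toy_decay(se_load_avg, delta_ms);

	/* w/o 0905f04eb21f this prints 0, and that 0 is what gets
	 * atomic_long_add()'ed to cfs_rq->removed_load_avg; w/ the patch we
	 * bail out before the decay and the task keeps its load_avg of 1024. */
	printf("se->avg.load_avg after remove_entity_load_avg(): %llu\n",
	       (unsigned long long)se_load_avg);
	return 0;
}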

[...]

Comments

Dietmar Eggemann June 15, 2016, 3:32 p.m. UTC | #1
On 14/06/16 17:40, Mike Galbraith wrote:
> On Tue, 2016-06-14 at 15:14 +0100, Dietmar Eggemann wrote:
> 
>> IMHO, the hackbench performance "boost" w/o 0905f04eb21f is due to the
>> fact that a new task gets all its load decayed (making it a small task)
>> in the __update_load_avg() call in remove_entity_load_avg(), because its
>> se->avg.last_update_time value is 0, which creates a huge time difference
>> compared to cfs_rq->avg.last_update_time. The patch 0905f04eb21f
>> avoids this and thus the task stays big, se->avg.load_avg = 1024.
> 
> I don't care much at all about the hackbench "regression" in its own
> right, and what causes it, for me, bottom line is that there are cases
> where we need to be able to resolve, and can't, simply because we're
> looking at a fuzzy (rippling) reflection.

Understood. I just thought it would be nice to know why 0905f04eb21f
makes this problem even more visible. But so far I haven't been able to
figure out why this diff in se->avg.load_avg [1024 versus 0] has this
effect on cfs_rq->runnable_load_avg, making it even less suitable in
find idlest*. enqueue_entity_load_avg()'s cfs_rq->runnable_load_* +=
sa->load_* looks suspicious though.
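
A minimal toy model of that suspicion (struct and function names are made
up, only the += is taken from the kernel code): whatever se->avg.load_avg
happens to be at fork time is added verbatim to cfs_rq->runnable_load_avg
on enqueue, which is exactly what find idlest* then compares:

#include <stdio.h>

struct toy_cfs_rq { unsigned long runnable_load_avg; };
struct toy_se { unsigned long load_avg; };

/* What enqueue_entity_load_avg() effectively does with the task's load. */
static void toy_enqueue(struct toy_cfs_rq *cfs_rq, struct toy_se *se)
{
	cfs_rq->runnable_load_avg += se->load_avg;
}

int main(void)
{
	struct toy_cfs_rq rq_a = { 0 }, rq_b = { 0 };
	struct toy_se big = { 1024 };	/* w/ 0905f04eb21f: load kept        */
	struct toy_se tiny = { 0 };	/* w/o the patch: load fully decayed */

	toy_enqueue(&rq_a, &big);
	toy_enqueue(&rq_b, &tiny);

	/* find idlest* compares weighted_cpuload(), i.e. exactly these two
	 * numbers for an otherwise idle cpu. */
	printf("runnable_load_avg: %lu vs %lu\n",
	       rq_a.runnable_load_avg, rq_b.runnable_load_avg);
	return 0;
}
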
> 
> In general, the fuzz helps us to not be so spastic.  I'm not sure that
> we really really need to care all that much, because I strongly suspect
> that it's only gonna make any difference at all in corner cases, but
> there are real world cases that matter.  I know for fact that schbench
> (facebook) which is at least based on a real world load fails early due
> to us stacking tasks due to that fuzzy view of reality.  In that case,
> it's because the fuzz consists of a high amplitude aging sawtooth..

... only for fork/exec? Which then would be related to the initial value
of se->avg.load_avg. Otherwise we could go back to pre b92486cbf2aa
"sched: Compute runnable load avg in cpu_load and cpu_avg_load_per_task".

> find idlest* sees a collection of pseudo-random numbers, effectively,
> the fates pick idlest via lottery, get it wrong often enough that a big
> box _never_ reaches full utilization before we stack tasks, putting an
> end to the latency game.  For generic loads, the smoothing works, but
> for some corners, it blows chunks.  Fork/exec seemed like a spot where
> you really can't go wrong by looking at clear unadulterated reality.
> 
> 	-Mike
>
Dietmar Eggemann June 15, 2016, 7:03 p.m. UTC | #2
On 15/06/16 17:03, Mike Galbraith wrote:
> On Wed, 2016-06-15 at 16:32 +0100, Dietmar Eggemann wrote:
> 
>>> In general, the fuzz helps us to not be so spastic.  I'm not sure that
>>> we really really need to care all that much, because I strongly suspect
>>> that it's only gonna make any difference at all in corner cases, but
>>> there are real world cases that matter.  I know for fact that schbench
>>> (facebook) which is at least based on a real world load fails early due
>>> to us stacking tasks due to that fuzzy view of reality.  In that case,
>>> it's because the fuzz consists of a high amplitude aging sawtooth..
>>
>> ... only for fork/exec?
> 
> No.  Identical workers had longish work/sleep cycle, aging resulted in
> weights that ranged from roughly 300-700(ish), depending on when you
> peeked at them.
> 
> 	-Mike
>

Isn't there a theoretical problem with scale_load() on CONFIG_64BIT
machines on tip/sched/core? load.weight has a higher resolution than
runnable_load_avg (and than the values in the rq->cpu_load[] array).
It's only theoretical because [forkexec|wake]_idx is 0, so
[target|source]_load() is nothing else than weighted_cpuload().
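
A userspace sketch of that resolution gap, assuming the usual 10-bit shift
behind scale_load()/scale_load_down() on CONFIG_64BIT (the constant name
and value are assumptions about that kernel version, not copied from it):

#include <stdio.h>

#define SCHED_FIXEDPOINT_SHIFT 10	/* assumed extra resolution on 64-bit */

#define scale_load(w)		((unsigned long)(w) << SCHED_FIXEDPOINT_SHIFT)
#define scale_load_down(w)	((unsigned long)(w) >> SCHED_FIXEDPOINT_SHIFT)

int main(void)
{
	unsigned long nice_0_load = 1024;

	/* On CONFIG_64BIT, se->load.weight carries the extra bits ...     */
	unsigned long weight = scale_load(nice_0_load);
	/* ... while runnable_load_avg and rq->cpu_load[] stay low-res.    */
	unsigned long load_avg = nice_0_load;

	printf("load.weight=%lu runnable_load_avg=%lu ratio=%lu\n",
	       weight, load_avg, weight / load_avg);

	/* Mixing the two without scale_load_down() is off by 2^10. */
	printf("scaled back down: %lu\n", scale_load_down(weight));
	return 0;
}
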
Dietmar Eggemann June 16, 2016, 9:01 a.m. UTC | #3
On 16/06/16 04:33, Mike Galbraith wrote:
> On Wed, 2016-06-15 at 20:03 +0100, Dietmar Eggemann wrote:
> 
>> Isn't there a theoretical problem with scale_load() on CONFIG_64BIT
>> machines on tip/sched/core? load.weight has a higher resolution than
>> runnable_load_avg (and than the values in the rq->cpu_load[] array).
>> It's only theoretical because [forkexec|wake]_idx is 0, so
>> [target|source]_load() is nothing else than weighted_cpuload().
> 
> I see a not so theoretical problem with my rfc in that I forgot to
> scale_load_down() if that's what you mean.

Yup. Theoretical in the sense that this_load and min_load will both be
affected the same way as long as load_idx = 0.
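
As a toy illustration (simplified, not the kernel's source_load() and
target_load(), and the per-cpu numbers are invented), with load_idx == 0
the cpu_load[] history is never consulted and both sides of the comparison
reduce to the runnable load average:

#include <stdio.h>

#define NR_CPUS 2

static unsigned long runnable_load_avg[NR_CPUS] = { 2048, 512 };
static unsigned long cpu_load_1[NR_CPUS] = { 1800, 700 };	/* decayed history */

static unsigned long weighted_cpuload(int cpu)
{
	return runnable_load_avg[cpu];
}

static unsigned long source_load(int cpu, int type)
{
	unsigned long total = weighted_cpuload(cpu);

	if (type == 0)
		return total;	/* load_idx == 0: history never enters the picture */

	return cpu_load_1[cpu] < total ? cpu_load_1[cpu] : total;
}

int main(void)
{
	printf("cpu0: idx=0 -> %lu, idx=1 -> %lu\n",
	       source_load(0, 0), source_load(0, 1));
	return 0;
}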

> 
> (changes nothing, reality was just extra special unadulterated;)

Agreed.

> 
> 	-Mike
>
Dietmar Eggemann July 11, 2016, 8:58 a.m. UTC | #4
On 04/07/16 16:04, Matt Fleming wrote:
> On Wed, 15 Jun, at 04:32:58PM, Dietmar Eggemann wrote:
>> On 14/06/16 17:40, Mike Galbraith wrote:
>>> On Tue, 2016-06-14 at 15:14 +0100, Dietmar Eggemann wrote:
>>>
>>>> IMHO, the hackbench performance "boost" w/o 0905f04eb21f is due to the
>>>> fact that a new task gets all its load decayed (making it a small task)
>>>> in the __update_load_avg() call in remove_entity_load_avg(), because its
>>>> se->avg.last_update_time value is 0, which creates a huge time difference
>>>> compared to cfs_rq->avg.last_update_time. The patch 0905f04eb21f
>>>> avoids this and thus the task stays big, se->avg.load_avg = 1024.
>>>
>>> I don't care much at all about the hackbench "regression" in its own
>>> right, and what causes it, for me, bottom line is that there are cases
>>> where we need to be able to resolve, and can't, simply because we're
>>> looking at a fuzzy (rippling) reflection.
>>
>> Understood. I just thought it would be nice to know why 0905f04eb21f
>> makes this problem even more visible. But so far I haven't been able to
>> figure out why this diff in se->avg.load_avg [1024 versus 0] has this
>> effect on cfs_rq->runnable_load_avg, making it even less suitable in
>> find idlest*. enqueue_entity_load_avg()'s cfs_rq->runnable_load_* +=
>> sa->load_* looks suspicious though.
> 
> In my testing without 0905f04eb21f I saw that se->avg.load_avg
> actually managed to skip being decayed at all before the task was
> dequeued, which meant that cfs_rq->runnable_load_avg was more likely
> to be zero after dequeue, for those workloads like hackbench that
> essentially are just a fork bomb.

Do you mean the first dequeue when the task is forked?

These are the PELT-related functions which are called when the task is
forked:

detach_entity_load_avg
attach_entity_load_avg
remove_entity_load_avg <-- se->avg.load_avg is set to 0 w/o 0905f04eb21f
                           se->avg.load_avg stays 1024 w/ 0905f04eb21f
enqueue_entity_load_avg
attach_entity_load_avg (double attach is fixed on tip/sched/core)
dequeue_entity_load_avg

> se->avg.load_avg evaded decay because se->avg.period_contrib was being
> zero'd in __update_load_avg().

I don't see the relation to se->avg.period_contrib here. IMHO,
se->avg.period_contrib is purely there to manage the 3 different update
phases in __update_load_avg().
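
A rough, userspace-only model of those three phases (invented names, the
decay step deliberately left out), just to show what period_contrib keeps
track of between two updates:

#include <stdint.h>
#include <stdio.h>

#define PERIOD_US 1024U

struct toy_avg {
	uint64_t load_sum;
	uint32_t period_contrib;	/* us already accumulated in the current period */
};

static void toy_update(struct toy_avg *sa, uint32_t delta_us, uint32_t weight)
{
	uint32_t gap = PERIOD_US - sa->period_contrib;

	if (delta_us < gap) {
		/* Still inside the same period: only phase 3 happens. */
		sa->load_sum += (uint64_t)weight * delta_us;
		sa->period_contrib += delta_us;
		return;
	}

	/* Phase 1: top up the old partial period. */
	sa->load_sum += (uint64_t)weight * gap;
	delta_us -= gap;

	/* Phase 2: full periods (the real code decays load_sum here). */
	sa->load_sum += (uint64_t)weight * (delta_us / PERIOD_US) * PERIOD_US;

	/* Phase 3: the remainder opens the next partial period. */
	sa->period_contrib = delta_us % PERIOD_US;
	sa->load_sum += (uint64_t)weight * sa->period_contrib;
}

int main(void)
{
	/* init_entity_runnable_average() starts with period_contrib = 1023. */
	struct toy_avg sa = { .load_sum = 0, .period_contrib = 1023 };

	toy_update(&sa, 3000, 1024);
	printf("load_sum=%llu period_contrib=%u\n",
	       (unsigned long long)sa.load_sum, sa.period_contrib);
	return 0;
}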

This difference in the initial se->avg.load_avg value [0 or 1024] also
influences wake_affine() [weight = p->se.avg.load_avg;] during the wakeup
handling of the hackbench tasks in the 'send/receive data' phase.

There are a couple of patches on tip/sched/core which might change the
behaviour of this: fork path, no double attach_entity_load_avg for new
task, no remove_entity_load_avg for new task, changes in effective_load ...

> With 0905f04eb21f applied, it's less likely (though not impossible)
> that ->period_contrib will be zero'd and so we usually end up with
> some residual load in cfs_rq->runnable_load_avg on dequeue, and hence,
> 
> 	cfs_rq->runnable_load_avg > se->avg.load_avg
> 
> even if 'se' is the only task on the runqueue.
> 
> FYI, below is my quick and dirty hack that restored hackbench
> performance for the few machines I checked. I didn't try schbench with
> it.
> 
> ---
> 
> From 4e9856ea3dc56e356195ca035dab7302754ce59b Mon Sep 17 00:00:00 2001
> From: Matt Fleming <matt@codeblueprint.co.uk>
> Date: Thu, 9 Jun 2016 19:48:14 +0100
> Subject: [PATCH] sched/fair: Reset ::runnable_load_avg when dequeueing last
>  entity
> 
> The task and runqueue load averages maintained in p->se.avg.load_avg
> and cfs_rq->runnable_load_avg respectively, can decay at different
> wall clock rates, which means that enqueueing and then dequeueing a
> task on an otherwise empty runqueue doesn't always leave
> ::runnable_load_avg with its initial value.
> 
> This can lead to the situation where cfs_rq->runnable_load_avg has a
> non-zero value even though there are no runnable entities on the
> runqueue. Assuming no entity is enqueued on this runqueue for some
> time this residual load average will decay gradually as the load
> averages are updated.
> 
> But we can optimise the special case of dequeueing the last entity and
> reset ::runnable_load_avg early, which gives a performance improvement
> to workloads that trigger the load balancer, such as fork-heavy
> applications when SD_BALANCE_FORK is set, because it gives a more up
> to date view of how busy the cpu is.
> 
> Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
> ---
>  kernel/sched/fair.c | 14 ++++++++++++--
>  1 file changed, 12 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c6dd8bab010c..408ee90c7ea8 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3007,10 +3007,20 @@ enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  static inline void
>  dequeue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
> +	unsigned long load_avg = 0;
> +
>  	update_load_avg(se, 1);
>  
> -	cfs_rq->runnable_load_avg =
> -		max_t(long, cfs_rq->runnable_load_avg - se->avg.load_avg, 0);
> +	/*
> +	 * If we're about to dequeue the last runnable entity we can
> +	 * reset the runnable load average to zero instead of waiting
> +	 * for it to decay naturally. This gives the load balancer a
> +	 * more timely and accurate view of how busy this cpu is.
> +	 */
> +	if (cfs_rq->nr_running > 1)
> +		load_avg = max_t(long, cfs_rq->runnable_load_avg - se->avg.load_avg, 0);
> +
> +	cfs_rq->runnable_load_avg = load_avg;
>  	cfs_rq->runnable_load_sum =
>  		max_t(s64,  cfs_rq->runnable_load_sum - se->avg.load_sum, 0);
>  }
>
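
To make the residue described above concrete, here is a toy calculation
(numbers invented, the decay offset simply asserted) of why the max_t()
clamp can leave a non-zero runnable_load_avg behind after the last dequeue
and what resetting it on nr_running <= 1 changes:

#include <stdio.h>

int main(void)
{
	long runnable_load_avg = 0;
	long se_load_avg = 1024;
	int nr_running;

	/* enqueue: the rq picks up the task's current average */
	runnable_load_avg += se_load_avg;
	nr_running = 1;

	/* pretend the task's own average caught one more decay step than
	 * the rq signal did before the task blocks (invented number) */
	se_load_avg = 1003;

	/* dequeue as it is today: subtract and clamp at zero */
	long residual = runnable_load_avg - se_load_avg;
	if (residual < 0)
		residual = 0;
	printf("pre-patch residual runnable_load_avg = %ld\n", residual);

	/* with the quoted hack: the last runnable entity leaving resets it */
	if (nr_running <= 1)
		residual = 0;
	printf("patched runnable_load_avg = %ld\n", residual);
	return 0;
}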

Patch

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -680,7 +680,7 @@ void init_entity_runnable_average(struct sched_entity *se)
         * will definitely be update (after enqueue).
         */
        sa->period_contrib = 1023;
-       sa->load_avg = scale_load_down(se->load.weight);
+       sa->load_avg = scale_load_down(0);
        sa->load_sum = sa->load_avg * LOAD_AVG_MAX;