[v2] sched/cfs: make util/load_avg more stable

Message ID 1492620844-30979-1-git-send-email-vincent.guittot@linaro.org
State Superseded
Headers show

Commit Message

Vincent Guittot April 19, 2017, 4:54 p.m.
In the current implementation of load/util_avg, we assume that the ongoing
time segment has fully elapsed, and util/load_sum is divided by LOAD_AVG_MAX,
even if part of the time segment still remains to run. As a consequence, this
remaining part is considered as idle time and generates unexpected variations
of util_avg of a busy CPU in the range ]1002..1024[ whereas util_avg should
stay at 1023.

In order to keep the metric stable, we should not consider the ongoing time
segment when computing load/util_avg but only the segments that have already
fully elapsed. Bu to not consider the current time segment adds unwanted
latency in the load/util_avg responsivness especially when the time is scaled
instead of the contribution. Instead of waiting for the current time segment
to have fully elapsed before accounting it in load/util_avg, we can already
account the elapsed part but change the range used to compute load/util_avg
accordingly.

At the very beginning of a new time segment, the past segments have been
decayed and the max value is MAX_LOAD_AVG*y. At the very end of the current
time segment, the max value becomes 1024(us) + MAX_LOAD_AVG*y which is equal
to MAX_LOAD_AVG. In fact, the max value is
sa->period_contrib + MAX_LOAD_AVG*y at any time in the time segment.

Taking advantage of the fact that MAX_LOAD_AVG*y == MAX_LOAD_AVG-1024, the
range becomes [0..MAX_LOAD_AVG-1024+sa->period_contrib].

As the elapsed part is already accounted in load/util_sum, we update the max
value according to the current position in the time segment instead of
removing its contribution.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

---

Fold both patches in one

 kernel/sched/fair.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

-- 
2.7.4

Comments

Dietmar Eggemann April 25, 2017, 11:05 a.m. | #1
On 19/04/17 17:54, Vincent Guittot wrote:
> In the current implementation of load/util_avg, we assume that the ongoing

> time segment has fully elapsed, and util/load_sum is divided by LOAD_AVG_MAX,

> even if part of the time segment still remains to run. As a consequence, this

> remaining part is considered as idle time and generates unexpected variations

> of util_avg of a busy CPU in the range ]1002..1024[ whereas util_avg should


Why do you use the square brackets the other way around? Just curious.

1002 stands for 1024*y^1 w/ y = 4008/4096 or y^32 = 0.5, right ? Might
be worth mentioning.

> stay at 1023.

> 

> In order to keep the metric stable, we should not consider the ongoing time

> segment when computing load/util_avg but only the segments that have already

> fully elapsed. Bu to not consider the current time segment adds unwanted

> latency in the load/util_avg responsivness especially when the time is scaled

> instead of the contribution. Instead of waiting for the current time segment

> to have fully elapsed before accounting it in load/util_avg, we can already

> account the elapsed part but change the range used to compute load/util_avg

> accordingly.

> 

> At the very beginning of a new time segment, the past segments have been

> decayed and the max value is MAX_LOAD_AVG*y. At the very end of the current

> time segment, the max value becomes 1024(us) + MAX_LOAD_AVG*y which is equal

> to MAX_LOAD_AVG. In fact, the max value is

> sa->period_contrib + MAX_LOAD_AVG*y at any time in the time segment.

> 

> Taking advantage of the fact that MAX_LOAD_AVG*y == MAX_LOAD_AVG-1024, the

> range becomes [0..MAX_LOAD_AVG-1024+sa->period_contrib].

> 

> As the elapsed part is already accounted in load/util_sum, we update the max

> value according to the current position in the time segment instead of

> removing its contribution.


Removing its contribution stands for '- 1024' of 'LOAD_AVG_MAX - 1024'
which was added in patch 1/2?

> 

> Suggested-by: Peter Zijlstra <peterz@infradead.org>

> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

> ---

> 

> Fold both patches in one

> 

>  kernel/sched/fair.c | 6 +++---

>  1 file changed, 3 insertions(+), 3 deletions(-)

> 

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c

> index 3f83a35..c3b8f0f 100644

> --- a/kernel/sched/fair.c

> +++ b/kernel/sched/fair.c

> @@ -3017,12 +3017,12 @@ ___update_load_avg(u64 now, int cpu, struct sched_avg *sa,

>  	/*

>  	 * Step 2: update *_avg.

>  	 */

> -	sa->load_avg = div_u64(sa->load_sum, LOAD_AVG_MAX);

> +	sa->load_avg = div_u64(sa->load_sum, LOAD_AVG_MAX - 1024 + sa->period_contrib);

>  	if (cfs_rq) {

>  		cfs_rq->runnable_load_avg =

> -			div_u64(cfs_rq->runnable_load_sum, LOAD_AVG_MAX);

> +			div_u64(cfs_rq->runnable_load_sum, LOAD_AVG_MAX - 1024 + sa->period_contrib);

>  	}

> -	sa->util_avg = sa->util_sum / LOAD_AVG_MAX;

> +	sa->util_avg = sa->util_sum / (LOAD_AVG_MAX - 1024 + sa->period_contrib);

>  

>  	return 1;

>  }

>
Vincent Guittot April 25, 2017, 12:40 p.m. | #2
On 25 April 2017 at 13:05, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> On 19/04/17 17:54, Vincent Guittot wrote:

>> In the current implementation of load/util_avg, we assume that the ongoing

>> time segment has fully elapsed, and util/load_sum is divided by LOAD_AVG_MAX,

>> even if part of the time segment still remains to run. As a consequence, this

>> remaining part is considered as idle time and generates unexpected variations

>> of util_avg of a busy CPU in the range ]1002..1024[ whereas util_avg should

>

> Why do you use the square brackets the other way around? Just curious.


This refers to the very beginning and very end of time segment formulas below.
That being said, 1024 is not reachable because at very end we have :
 1024*MAX_LOAD_AVG*y+1024*1023 = 1023,9997

1002 is not reachable because at very beg we have
1024*MAX_LOAD_AVG*y+ 1024*0 = 1002,0577

But we are working with integer so [1002..1024[ is probably more correct

>

> 1002 stands for 1024*y^1 w/ y = 4008/4096 or y^32 = 0.5, right ? Might

> be worth mentioning.

>

>> stay at 1023.

>>

>> In order to keep the metric stable, we should not consider the ongoing time

>> segment when computing load/util_avg but only the segments that have already

>> fully elapsed. Bu to not consider the current time segment adds unwanted

>> latency in the load/util_avg responsivness especially when the time is scaled

>> instead of the contribution. Instead of waiting for the current time segment

>> to have fully elapsed before accounting it in load/util_avg, we can already

>> account the elapsed part but change the range used to compute load/util_avg

>> accordingly.

>>

>> At the very beginning of a new time segment, the past segments have been

>> decayed and the max value is MAX_LOAD_AVG*y. At the very end of the current

>> time segment, the max value becomes 1024(us) + MAX_LOAD_AVG*y which is equal

>> to MAX_LOAD_AVG. In fact, the max value is

>> sa->period_contrib + MAX_LOAD_AVG*y at any time in the time segment.

>>

>> Taking advantage of the fact that MAX_LOAD_AVG*y == MAX_LOAD_AVG-1024, the

>> range becomes [0..MAX_LOAD_AVG-1024+sa->period_contrib].

>>

>> As the elapsed part is already accounted in load/util_sum, we update the max

>> value according to the current position in the time segment instead of

>> removing its contribution.

>

> Removing its contribution stands for '- 1024' of 'LOAD_AVG_MAX - 1024'

> which was added in patch 1/2?


removing its contribution refers to  "- sa->period_contrib * weight"
and "- (running * sa->period_contrib << SCHED_CAPACITY_SHIFT))"  in
patch 1/2 of the previous version

>

>>

>> Suggested-by: Peter Zijlstra <peterz@infradead.org>

>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

>> ---

>>

>> Fold both patches in one

>>

>>  kernel/sched/fair.c | 6 +++---

>>  1 file changed, 3 insertions(+), 3 deletions(-)

>>

>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c

>> index 3f83a35..c3b8f0f 100644

>> --- a/kernel/sched/fair.c

>> +++ b/kernel/sched/fair.c

>> @@ -3017,12 +3017,12 @@ ___update_load_avg(u64 now, int cpu, struct sched_avg *sa,

>>       /*

>>        * Step 2: update *_avg.

>>        */

>> -     sa->load_avg = div_u64(sa->load_sum, LOAD_AVG_MAX);

>> +     sa->load_avg = div_u64(sa->load_sum, LOAD_AVG_MAX - 1024 + sa->period_contrib);

>>       if (cfs_rq) {

>>               cfs_rq->runnable_load_avg =

>> -                     div_u64(cfs_rq->runnable_load_sum, LOAD_AVG_MAX);

>> +                     div_u64(cfs_rq->runnable_load_sum, LOAD_AVG_MAX - 1024 + sa->period_contrib);

>>       }

>> -     sa->util_avg = sa->util_sum / LOAD_AVG_MAX;

>> +     sa->util_avg = sa->util_sum / (LOAD_AVG_MAX - 1024 + sa->period_contrib);

>>

>>       return 1;

>>  }

>>
Dietmar Eggemann April 25, 2017, 2:53 p.m. | #3
On 25/04/17 13:40, Vincent Guittot wrote:
> On 25 April 2017 at 13:05, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:

>> On 19/04/17 17:54, Vincent Guittot wrote:

>>> In the current implementation of load/util_avg, we assume that the ongoing

>>> time segment has fully elapsed, and util/load_sum is divided by LOAD_AVG_MAX,

>>> even if part of the time segment still remains to run. As a consequence, this

>>> remaining part is considered as idle time and generates unexpected variations

>>> of util_avg of a busy CPU in the range ]1002..1024[ whereas util_avg should

>>

>> Why do you use the square brackets the other way around? Just curious.

> 

> This refers to the very beginning and very end of time segment formulas below.

> That being said, 1024 is not reachable because at very end we have :

>  1024*MAX_LOAD_AVG*y+1024*1023 = 1023,9997

> 

> 1002 is not reachable because at very beg we have

> 1024*MAX_LOAD_AVG*y+ 1024*0 = 1002,0577

> 

> But we are working with integer so [1002..1024[ is probably more correct


OK, this is with y = 32nd-rt(0.5) exactly, understood.

I assume you mean LOAD_AVG_MAX instead of MAX_LOAD_AVG.

>> 1002 stands for 1024*y^1 w/ y = 4008/4096 or y^32 = 0.5, right ? Might

>> be worth mentioning.

>>

>>> stay at 1023.

>>>

>>> In order to keep the metric stable, we should not consider the ongoing time

>>> segment when computing load/util_avg but only the segments that have already

>>> fully elapsed. Bu to not consider the current time segment adds unwanted

>>> latency in the load/util_avg responsivness especially when the time is scaled

>>> instead of the contribution. Instead of waiting for the current time segment

>>> to have fully elapsed before accounting it in load/util_avg, we can already

>>> account the elapsed part but change the range used to compute load/util_avg

>>> accordingly.

>>>

>>> At the very beginning of a new time segment, the past segments have been

>>> decayed and the max value is MAX_LOAD_AVG*y. At the very end of the current

>>> time segment, the max value becomes 1024(us) + MAX_LOAD_AVG*y which is equal

>>> to MAX_LOAD_AVG. In fact, the max value is

>>> sa->period_contrib + MAX_LOAD_AVG*y at any time in the time segment.


s/MAX_LOAD_AVG/LOAD_AVG_MAX

>>>

>>> Taking advantage of the fact that MAX_LOAD_AVG*y == MAX_LOAD_AVG-1024, the

>>> range becomes [0..MAX_LOAD_AVG-1024+sa->period_contrib].

>>>

>>> As the elapsed part is already accounted in load/util_sum, we update the max

>>> value according to the current position in the time segment instead of

>>> removing its contribution.

>>

>> Removing its contribution stands for '- 1024' of 'LOAD_AVG_MAX - 1024'

>> which was added in patch 1/2?

> 

> removing its contribution refers to  "- sa->period_contrib * weight"

> and "- (running * sa->period_contrib << SCHED_CAPACITY_SHIFT))"  in

> patch 1/2 of the previous version


Yup, makes sense, so the '-1024' is the influence of the current 'time
segment' (n = 0) then.

IMHO, the removing of contribution in patch 1/2 wouldn't take freq and
cpu scaling of contribution (which is still in accumulate_sum()) into
consideration.

>>> Suggested-by: Peter Zijlstra <peterz@infradead.org>

>>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

>>> ---

>>>

>>> Fold both patches in one

>>>

>>>  kernel/sched/fair.c | 6 +++---

>>>  1 file changed, 3 insertions(+), 3 deletions(-)

>>>

>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c

>>> index 3f83a35..c3b8f0f 100644

>>> --- a/kernel/sched/fair.c

>>> +++ b/kernel/sched/fair.c

>>> @@ -3017,12 +3017,12 @@ ___update_load_avg(u64 now, int cpu, struct sched_avg *sa,

>>>       /*

>>>        * Step 2: update *_avg.

>>>        */

>>> -     sa->load_avg = div_u64(sa->load_sum, LOAD_AVG_MAX);

>>> +     sa->load_avg = div_u64(sa->load_sum, LOAD_AVG_MAX - 1024 + sa->period_contrib);

>>>       if (cfs_rq) {

>>>               cfs_rq->runnable_load_avg =

>>> -                     div_u64(cfs_rq->runnable_load_sum, LOAD_AVG_MAX);

>>> +                     div_u64(cfs_rq->runnable_load_sum, LOAD_AVG_MAX - 1024 + sa->period_contrib);

>>>       }

>>> -     sa->util_avg = sa->util_sum / LOAD_AVG_MAX;

>>> +     sa->util_avg = sa->util_sum / (LOAD_AVG_MAX - 1024 + sa->period_contrib);

>>>

>>>       return 1;

>>>  }

>>>
Vincent Guittot April 25, 2017, 3:17 p.m. | #4
On 25 April 2017 at 16:53, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> On 25/04/17 13:40, Vincent Guittot wrote:

>> On 25 April 2017 at 13:05, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:

>>> On 19/04/17 17:54, Vincent Guittot wrote:

>>>> In the current implementation of load/util_avg, we assume that the ongoing

>>>> time segment has fully elapsed, and util/load_sum is divided by LOAD_AVG_MAX,

>>>> even if part of the time segment still remains to run. As a consequence, this

>>>> remaining part is considered as idle time and generates unexpected variations

>>>> of util_avg of a busy CPU in the range ]1002..1024[ whereas util_avg should

>>>

>>> Why do you use the square brackets the other way around? Just curious.

>>

>> This refers to the very beginning and very end of time segment formulas below.

>> That being said, 1024 is not reachable because at very end we have :

>>  1024*MAX_LOAD_AVG*y+1024*1023 = 1023,9997

>>

>> 1002 is not reachable because at very beg we have

>> 1024*MAX_LOAD_AVG*y+ 1024*0 = 1002,0577

>>

>> But we are working with integer so [1002..1024[ is probably more correct

>

> OK, this is with y = 32nd-rt(0.5) exactly, understood.

>

> I assume you mean LOAD_AVG_MAX instead of MAX_LOAD_AVG.


correct

>

>>> 1002 stands for 1024*y^1 w/ y = 4008/4096 or y^32 = 0.5, right ? Might

>>> be worth mentioning.

>>>

>>>> stay at 1023.

>>>>

>>>> In order to keep the metric stable, we should not consider the ongoing time

>>>> segment when computing load/util_avg but only the segments that have already

>>>> fully elapsed. Bu to not consider the current time segment adds unwanted

>>>> latency in the load/util_avg responsivness especially when the time is scaled

>>>> instead of the contribution. Instead of waiting for the current time segment

>>>> to have fully elapsed before accounting it in load/util_avg, we can already

>>>> account the elapsed part but change the range used to compute load/util_avg

>>>> accordingly.

>>>>

>>>> At the very beginning of a new time segment, the past segments have been

>>>> decayed and the max value is MAX_LOAD_AVG*y. At the very end of the current

>>>> time segment, the max value becomes 1024(us) + MAX_LOAD_AVG*y which is equal

>>>> to MAX_LOAD_AVG. In fact, the max value is

>>>> sa->period_contrib + MAX_LOAD_AVG*y at any time in the time segment.

>

> s/MAX_LOAD_AVG/LOAD_AVG_MAX


ditto

>

>>>>

>>>> Taking advantage of the fact that MAX_LOAD_AVG*y == MAX_LOAD_AVG-1024, the

>>>> range becomes [0..MAX_LOAD_AVG-1024+sa->period_contrib].

>>>>

>>>> As the elapsed part is already accounted in load/util_sum, we update the max

>>>> value according to the current position in the time segment instead of

>>>> removing its contribution.

>>>

>>> Removing its contribution stands for '- 1024' of 'LOAD_AVG_MAX - 1024'

>>> which was added in patch 1/2?

>>

>> removing its contribution refers to  "- sa->period_contrib * weight"

>> and "- (running * sa->period_contrib << SCHED_CAPACITY_SHIFT))"  in

>> patch 1/2 of the previous version

>

> Yup, makes sense, so the '-1024' is the influence of the current 'time

> segment' (n = 0) then.


Yes

>

> IMHO, the removing of contribution in patch 1/2 wouldn't take freq and

> cpu scaling of contribution (which is still in accumulate_sum()) into

> consideration.


Yes you're right but as everything has been put in 1 single patch in
v2, it doesn't make any difference now

>

>>>> Suggested-by: Peter Zijlstra <peterz@infradead.org>

>>>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

>>>> ---

>>>>

>>>> Fold both patches in one

>>>>

>>>>  kernel/sched/fair.c | 6 +++---

>>>>  1 file changed, 3 insertions(+), 3 deletions(-)

>>>>

>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c

>>>> index 3f83a35..c3b8f0f 100644

>>>> --- a/kernel/sched/fair.c

>>>> +++ b/kernel/sched/fair.c

>>>> @@ -3017,12 +3017,12 @@ ___update_load_avg(u64 now, int cpu, struct sched_avg *sa,

>>>>       /*

>>>>        * Step 2: update *_avg.

>>>>        */

>>>> -     sa->load_avg = div_u64(sa->load_sum, LOAD_AVG_MAX);

>>>> +     sa->load_avg = div_u64(sa->load_sum, LOAD_AVG_MAX - 1024 + sa->period_contrib);

>>>>       if (cfs_rq) {

>>>>               cfs_rq->runnable_load_avg =

>>>> -                     div_u64(cfs_rq->runnable_load_sum, LOAD_AVG_MAX);

>>>> +                     div_u64(cfs_rq->runnable_load_sum, LOAD_AVG_MAX - 1024 + sa->period_contrib);

>>>>       }

>>>> -     sa->util_avg = sa->util_sum / LOAD_AVG_MAX;

>>>> +     sa->util_avg = sa->util_sum / (LOAD_AVG_MAX - 1024 + sa->period_contrib);

>>>>

>>>>       return 1;

>>>>  }

>>>>

Patch hide | download patch | download mbox

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3f83a35..c3b8f0f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3017,12 +3017,12 @@  ___update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 	/*
 	 * Step 2: update *_avg.
 	 */
-	sa->load_avg = div_u64(sa->load_sum, LOAD_AVG_MAX);
+	sa->load_avg = div_u64(sa->load_sum, LOAD_AVG_MAX - 1024 + sa->period_contrib);
 	if (cfs_rq) {
 		cfs_rq->runnable_load_avg =
-			div_u64(cfs_rq->runnable_load_sum, LOAD_AVG_MAX);
+			div_u64(cfs_rq->runnable_load_sum, LOAD_AVG_MAX - 1024 + sa->period_contrib);
 	}
-	sa->util_avg = sa->util_sum / LOAD_AVG_MAX;
+	sa->util_avg = sa->util_sum / (LOAD_AVG_MAX - 1024 + sa->period_contrib);
 
 	return 1;
 }