Message ID | 1492620844-30979-1-git-send-email-vincent.guittot@linaro.org |
---|---|

State | Superseded |

Headers | show |

On 19/04/17 17:54, Vincent Guittot wrote: > In the current implementation of load/util_avg, we assume that the ongoing > time segment has fully elapsed, and util/load_sum is divided by LOAD_AVG_MAX, > even if part of the time segment still remains to run. As a consequence, this > remaining part is considered as idle time and generates unexpected variations > of util_avg of a busy CPU in the range ]1002..1024[ whereas util_avg should Why do you use the square brackets the other way around? Just curious. 1002 stands for 1024*y^1 w/ y = 4008/4096 or y^32 = 0.5, right ? Might be worth mentioning. > stay at 1023. > > In order to keep the metric stable, we should not consider the ongoing time > segment when computing load/util_avg but only the segments that have already > fully elapsed. Bu to not consider the current time segment adds unwanted > latency in the load/util_avg responsivness especially when the time is scaled > instead of the contribution. Instead of waiting for the current time segment > to have fully elapsed before accounting it in load/util_avg, we can already > account the elapsed part but change the range used to compute load/util_avg > accordingly. > > At the very beginning of a new time segment, the past segments have been > decayed and the max value is MAX_LOAD_AVG*y. At the very end of the current > time segment, the max value becomes 1024(us) + MAX_LOAD_AVG*y which is equal > to MAX_LOAD_AVG. In fact, the max value is > sa->period_contrib + MAX_LOAD_AVG*y at any time in the time segment. > > Taking advantage of the fact that MAX_LOAD_AVG*y == MAX_LOAD_AVG-1024, the > range becomes [0..MAX_LOAD_AVG-1024+sa->period_contrib]. > > As the elapsed part is already accounted in load/util_sum, we update the max > value according to the current position in the time segment instead of > removing its contribution. Removing its contribution stands for '- 1024' of 'LOAD_AVG_MAX - 1024' which was added in patch 1/2? > > Suggested-by: Peter Zijlstra <peterz@infradead.org> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> > --- > > Fold both patches in one > > kernel/sched/fair.c | 6 +++--- > 1 file changed, 3 insertions(+), 3 deletions(-) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index 3f83a35..c3b8f0f 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -3017,12 +3017,12 @@ ___update_load_avg(u64 now, int cpu, struct sched_avg *sa, > /* > * Step 2: update *_avg. > */ > - sa->load_avg = div_u64(sa->load_sum, LOAD_AVG_MAX); > + sa->load_avg = div_u64(sa->load_sum, LOAD_AVG_MAX - 1024 + sa->period_contrib); > if (cfs_rq) { > cfs_rq->runnable_load_avg = > - div_u64(cfs_rq->runnable_load_sum, LOAD_AVG_MAX); > + div_u64(cfs_rq->runnable_load_sum, LOAD_AVG_MAX - 1024 + sa->period_contrib); > } > - sa->util_avg = sa->util_sum / LOAD_AVG_MAX; > + sa->util_avg = sa->util_sum / (LOAD_AVG_MAX - 1024 + sa->period_contrib); > > return 1; > } >

On 25 April 2017 at 13:05, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote: > On 19/04/17 17:54, Vincent Guittot wrote: >> In the current implementation of load/util_avg, we assume that the ongoing >> time segment has fully elapsed, and util/load_sum is divided by LOAD_AVG_MAX, >> even if part of the time segment still remains to run. As a consequence, this >> remaining part is considered as idle time and generates unexpected variations >> of util_avg of a busy CPU in the range ]1002..1024[ whereas util_avg should > > Why do you use the square brackets the other way around? Just curious. This refers to the very beginning and very end of time segment formulas below. That being said, 1024 is not reachable because at very end we have : 1024*MAX_LOAD_AVG*y+1024*1023 = 1023,9997 1002 is not reachable because at very beg we have 1024*MAX_LOAD_AVG*y+ 1024*0 = 1002,0577 But we are working with integer so [1002..1024[ is probably more correct > > 1002 stands for 1024*y^1 w/ y = 4008/4096 or y^32 = 0.5, right ? Might > be worth mentioning. > >> stay at 1023. >> >> In order to keep the metric stable, we should not consider the ongoing time >> segment when computing load/util_avg but only the segments that have already >> fully elapsed. Bu to not consider the current time segment adds unwanted >> latency in the load/util_avg responsivness especially when the time is scaled >> instead of the contribution. Instead of waiting for the current time segment >> to have fully elapsed before accounting it in load/util_avg, we can already >> account the elapsed part but change the range used to compute load/util_avg >> accordingly. >> >> At the very beginning of a new time segment, the past segments have been >> decayed and the max value is MAX_LOAD_AVG*y. At the very end of the current >> time segment, the max value becomes 1024(us) + MAX_LOAD_AVG*y which is equal >> to MAX_LOAD_AVG. In fact, the max value is >> sa->period_contrib + MAX_LOAD_AVG*y at any time in the time segment. >> >> Taking advantage of the fact that MAX_LOAD_AVG*y == MAX_LOAD_AVG-1024, the >> range becomes [0..MAX_LOAD_AVG-1024+sa->period_contrib]. >> >> As the elapsed part is already accounted in load/util_sum, we update the max >> value according to the current position in the time segment instead of >> removing its contribution. > > Removing its contribution stands for '- 1024' of 'LOAD_AVG_MAX - 1024' > which was added in patch 1/2? removing its contribution refers to "- sa->period_contrib * weight" and "- (running * sa->period_contrib << SCHED_CAPACITY_SHIFT))" in patch 1/2 of the previous version > >> >> Suggested-by: Peter Zijlstra <peterz@infradead.org> >> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> >> --- >> >> Fold both patches in one >> >> kernel/sched/fair.c | 6 +++--- >> 1 file changed, 3 insertions(+), 3 deletions(-) >> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >> index 3f83a35..c3b8f0f 100644 >> --- a/kernel/sched/fair.c >> +++ b/kernel/sched/fair.c >> @@ -3017,12 +3017,12 @@ ___update_load_avg(u64 now, int cpu, struct sched_avg *sa, >> /* >> * Step 2: update *_avg. >> */ >> - sa->load_avg = div_u64(sa->load_sum, LOAD_AVG_MAX); >> + sa->load_avg = div_u64(sa->load_sum, LOAD_AVG_MAX - 1024 + sa->period_contrib); >> if (cfs_rq) { >> cfs_rq->runnable_load_avg = >> - div_u64(cfs_rq->runnable_load_sum, LOAD_AVG_MAX); >> + div_u64(cfs_rq->runnable_load_sum, LOAD_AVG_MAX - 1024 + sa->period_contrib); >> } >> - sa->util_avg = sa->util_sum / LOAD_AVG_MAX; >> + sa->util_avg = sa->util_sum / (LOAD_AVG_MAX - 1024 + sa->period_contrib); >> >> return 1; >> } >>

On 25/04/17 13:40, Vincent Guittot wrote: > On 25 April 2017 at 13:05, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote: >> On 19/04/17 17:54, Vincent Guittot wrote: >>> In the current implementation of load/util_avg, we assume that the ongoing >>> time segment has fully elapsed, and util/load_sum is divided by LOAD_AVG_MAX, >>> even if part of the time segment still remains to run. As a consequence, this >>> remaining part is considered as idle time and generates unexpected variations >>> of util_avg of a busy CPU in the range ]1002..1024[ whereas util_avg should >> >> Why do you use the square brackets the other way around? Just curious. > > This refers to the very beginning and very end of time segment formulas below. > That being said, 1024 is not reachable because at very end we have : > 1024*MAX_LOAD_AVG*y+1024*1023 = 1023,9997 > > 1002 is not reachable because at very beg we have > 1024*MAX_LOAD_AVG*y+ 1024*0 = 1002,0577 > > But we are working with integer so [1002..1024[ is probably more correct OK, this is with y = 32nd-rt(0.5) exactly, understood. I assume you mean LOAD_AVG_MAX instead of MAX_LOAD_AVG. >> 1002 stands for 1024*y^1 w/ y = 4008/4096 or y^32 = 0.5, right ? Might >> be worth mentioning. >> >>> stay at 1023. >>> >>> In order to keep the metric stable, we should not consider the ongoing time >>> segment when computing load/util_avg but only the segments that have already >>> fully elapsed. Bu to not consider the current time segment adds unwanted >>> latency in the load/util_avg responsivness especially when the time is scaled >>> instead of the contribution. Instead of waiting for the current time segment >>> to have fully elapsed before accounting it in load/util_avg, we can already >>> account the elapsed part but change the range used to compute load/util_avg >>> accordingly. >>> >>> At the very beginning of a new time segment, the past segments have been >>> decayed and the max value is MAX_LOAD_AVG*y. At the very end of the current >>> time segment, the max value becomes 1024(us) + MAX_LOAD_AVG*y which is equal >>> to MAX_LOAD_AVG. In fact, the max value is >>> sa->period_contrib + MAX_LOAD_AVG*y at any time in the time segment. s/MAX_LOAD_AVG/LOAD_AVG_MAX >>> >>> Taking advantage of the fact that MAX_LOAD_AVG*y == MAX_LOAD_AVG-1024, the >>> range becomes [0..MAX_LOAD_AVG-1024+sa->period_contrib]. >>> >>> As the elapsed part is already accounted in load/util_sum, we update the max >>> value according to the current position in the time segment instead of >>> removing its contribution. >> >> Removing its contribution stands for '- 1024' of 'LOAD_AVG_MAX - 1024' >> which was added in patch 1/2? > > removing its contribution refers to "- sa->period_contrib * weight" > and "- (running * sa->period_contrib << SCHED_CAPACITY_SHIFT))" in > patch 1/2 of the previous version Yup, makes sense, so the '-1024' is the influence of the current 'time segment' (n = 0) then. IMHO, the removing of contribution in patch 1/2 wouldn't take freq and cpu scaling of contribution (which is still in accumulate_sum()) into consideration. >>> Suggested-by: Peter Zijlstra <peterz@infradead.org> >>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> >>> --- >>> >>> Fold both patches in one >>> >>> kernel/sched/fair.c | 6 +++--- >>> 1 file changed, 3 insertions(+), 3 deletions(-) >>> >>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >>> index 3f83a35..c3b8f0f 100644 >>> --- a/kernel/sched/fair.c >>> +++ b/kernel/sched/fair.c >>> @@ -3017,12 +3017,12 @@ ___update_load_avg(u64 now, int cpu, struct sched_avg *sa, >>> /* >>> * Step 2: update *_avg. >>> */ >>> - sa->load_avg = div_u64(sa->load_sum, LOAD_AVG_MAX); >>> + sa->load_avg = div_u64(sa->load_sum, LOAD_AVG_MAX - 1024 + sa->period_contrib); >>> if (cfs_rq) { >>> cfs_rq->runnable_load_avg = >>> - div_u64(cfs_rq->runnable_load_sum, LOAD_AVG_MAX); >>> + div_u64(cfs_rq->runnable_load_sum, LOAD_AVG_MAX - 1024 + sa->period_contrib); >>> } >>> - sa->util_avg = sa->util_sum / LOAD_AVG_MAX; >>> + sa->util_avg = sa->util_sum / (LOAD_AVG_MAX - 1024 + sa->period_contrib); >>> >>> return 1; >>> } >>>

On 25 April 2017 at 16:53, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote: > On 25/04/17 13:40, Vincent Guittot wrote: >> On 25 April 2017 at 13:05, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote: >>> On 19/04/17 17:54, Vincent Guittot wrote: >>>> In the current implementation of load/util_avg, we assume that the ongoing >>>> time segment has fully elapsed, and util/load_sum is divided by LOAD_AVG_MAX, >>>> even if part of the time segment still remains to run. As a consequence, this >>>> remaining part is considered as idle time and generates unexpected variations >>>> of util_avg of a busy CPU in the range ]1002..1024[ whereas util_avg should >>> >>> Why do you use the square brackets the other way around? Just curious. >> >> This refers to the very beginning and very end of time segment formulas below. >> That being said, 1024 is not reachable because at very end we have : >> 1024*MAX_LOAD_AVG*y+1024*1023 = 1023,9997 >> >> 1002 is not reachable because at very beg we have >> 1024*MAX_LOAD_AVG*y+ 1024*0 = 1002,0577 >> >> But we are working with integer so [1002..1024[ is probably more correct > > OK, this is with y = 32nd-rt(0.5) exactly, understood. > > I assume you mean LOAD_AVG_MAX instead of MAX_LOAD_AVG. correct > >>> 1002 stands for 1024*y^1 w/ y = 4008/4096 or y^32 = 0.5, right ? Might >>> be worth mentioning. >>> >>>> stay at 1023. >>>> >>>> In order to keep the metric stable, we should not consider the ongoing time >>>> segment when computing load/util_avg but only the segments that have already >>>> fully elapsed. Bu to not consider the current time segment adds unwanted >>>> latency in the load/util_avg responsivness especially when the time is scaled >>>> instead of the contribution. Instead of waiting for the current time segment >>>> to have fully elapsed before accounting it in load/util_avg, we can already >>>> account the elapsed part but change the range used to compute load/util_avg >>>> accordingly. >>>> >>>> At the very beginning of a new time segment, the past segments have been >>>> decayed and the max value is MAX_LOAD_AVG*y. At the very end of the current >>>> time segment, the max value becomes 1024(us) + MAX_LOAD_AVG*y which is equal >>>> to MAX_LOAD_AVG. In fact, the max value is >>>> sa->period_contrib + MAX_LOAD_AVG*y at any time in the time segment. > > s/MAX_LOAD_AVG/LOAD_AVG_MAX ditto > >>>> >>>> Taking advantage of the fact that MAX_LOAD_AVG*y == MAX_LOAD_AVG-1024, the >>>> range becomes [0..MAX_LOAD_AVG-1024+sa->period_contrib]. >>>> >>>> As the elapsed part is already accounted in load/util_sum, we update the max >>>> value according to the current position in the time segment instead of >>>> removing its contribution. >>> >>> Removing its contribution stands for '- 1024' of 'LOAD_AVG_MAX - 1024' >>> which was added in patch 1/2? >> >> removing its contribution refers to "- sa->period_contrib * weight" >> and "- (running * sa->period_contrib << SCHED_CAPACITY_SHIFT))" in >> patch 1/2 of the previous version > > Yup, makes sense, so the '-1024' is the influence of the current 'time > segment' (n = 0) then. Yes > > IMHO, the removing of contribution in patch 1/2 wouldn't take freq and > cpu scaling of contribution (which is still in accumulate_sum()) into > consideration. Yes you're right but as everything has been put in 1 single patch in v2, it doesn't make any difference now > >>>> Suggested-by: Peter Zijlstra <peterz@infradead.org> >>>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> >>>> --- >>>> >>>> Fold both patches in one >>>> >>>> kernel/sched/fair.c | 6 +++--- >>>> 1 file changed, 3 insertions(+), 3 deletions(-) >>>> >>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >>>> index 3f83a35..c3b8f0f 100644 >>>> --- a/kernel/sched/fair.c >>>> +++ b/kernel/sched/fair.c >>>> @@ -3017,12 +3017,12 @@ ___update_load_avg(u64 now, int cpu, struct sched_avg *sa, >>>> /* >>>> * Step 2: update *_avg. >>>> */ >>>> - sa->load_avg = div_u64(sa->load_sum, LOAD_AVG_MAX); >>>> + sa->load_avg = div_u64(sa->load_sum, LOAD_AVG_MAX - 1024 + sa->period_contrib); >>>> if (cfs_rq) { >>>> cfs_rq->runnable_load_avg = >>>> - div_u64(cfs_rq->runnable_load_sum, LOAD_AVG_MAX); >>>> + div_u64(cfs_rq->runnable_load_sum, LOAD_AVG_MAX - 1024 + sa->period_contrib); >>>> } >>>> - sa->util_avg = sa->util_sum / LOAD_AVG_MAX; >>>> + sa->util_avg = sa->util_sum / (LOAD_AVG_MAX - 1024 + sa->period_contrib); >>>> >>>> return 1; >>>> } >>>>

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 3f83a35..c3b8f0f 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3017,12 +3017,12 @@ ___update_load_avg(u64 now, int cpu, struct sched_avg *sa, /* * Step 2: update *_avg. */ - sa->load_avg = div_u64(sa->load_sum, LOAD_AVG_MAX); + sa->load_avg = div_u64(sa->load_sum, LOAD_AVG_MAX - 1024 + sa->period_contrib); if (cfs_rq) { cfs_rq->runnable_load_avg = - div_u64(cfs_rq->runnable_load_sum, LOAD_AVG_MAX); + div_u64(cfs_rq->runnable_load_sum, LOAD_AVG_MAX - 1024 + sa->period_contrib); } - sa->util_avg = sa->util_sum / LOAD_AVG_MAX; + sa->util_avg = sa->util_sum / (LOAD_AVG_MAX - 1024 + sa->period_contrib); return 1; }

`In the current implementation of load/util_avg, we assume that the ongoing time segment has fully elapsed, and util/load_sum is divided by LOAD_AVG_MAX, even if part of the time segment still remains to run. As a consequence, this remaining part is considered as idle time and generates unexpected variations of util_avg of a busy CPU in the range ]1002..1024[ whereas util_avg should stay at 1023. In order to keep the metric stable, we should not consider the ongoing time segment when computing load/util_avg but only the segments that have already fully elapsed. Bu to not consider the current time segment adds unwanted latency in the load/util_avg responsivness especially when the time is scaled instead of the contribution. Instead of waiting for the current time segment to have fully elapsed before accounting it in load/util_avg, we can already account the elapsed part but change the range used to compute load/util_avg accordingly. At the very beginning of a new time segment, the past segments have been decayed and the max value is MAX_LOAD_AVG*y. At the very end of the current time segment, the max value becomes 1024(us) + MAX_LOAD_AVG*y which is equal to MAX_LOAD_AVG. In fact, the max value is sa->period_contrib + MAX_LOAD_AVG*y at any time in the time segment. Taking advantage of the fact that MAX_LOAD_AVG*y == MAX_LOAD_AVG-1024, the range becomes [0..MAX_LOAD_AVG-1024+sa->period_contrib]. As the elapsed part is already accounted in load/util_sum, we update the max value according to the current position in the time segment instead of removing its contribution. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> --- Fold both patches in one kernel/sched/fair.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) -- 2.7.4`