Message ID | 1490715345-13803-1-git-send-email-vincent.guittot@linaro.org |
---|---|

State | New |

Headers | show |

On 28 March 2017 at 17:35, Vincent Guittot <vincent.guittot@linaro.org> wrote: > The current implementation of load tracking invariance scales the contribution > with current frequency and uarch performance (only for utilization) of the > CPU. One main result of this formula is that the figures are capped by current > capacity of CPU. Another one is that the load_avg is not invariant because not > scaled with uarch. > > The util_avg of a periodic task that runs r time slots every p time slots > varies in the range : > > U * (1-y^r)/(1-y^p) * y^i < Utilization < U * (1-y^r)/(1-y^p) > > with U is the max util_avg value = SCHED_CAPACITY_SCALE > > At a lower capacity, the range becomes: > > U * C * (1-y^r')/(1-y^p) * y^i' < Utilization < U * C * (1-y^r')/(1-y^p) > > with C reflecting the compute capacity ratio between current capacity and > max capacity. > > so C tries to compensate changes in (1-y^r') but it can't be accurate. > > Instead of scaling the contribution value of PELT algo, we should scale the > running time. The PELT signal aims to track the amount of computation of tasks > and/or rq so it seems more correct to scale the running time to reflect the > effective amount of computation done since the last update. > > In order to be fully invariant, we need to apply the same amount of running > time and idle time whatever the current capacity. Because running at lower > capacity implies that the task will run longer, we have to track the amount of > "stolen" idle time and to apply it when task becomes idle. > > But once we have reached the maximum utilization value (SCHED_CAPACITY_SCALE), > it means that the task is seen as an always-running task whatever the capacity of > the cpu (even at max compute capacity). In this case, we can discard the "stolen" > idle times which becomes meaningless. In order to cope with rounding effect of > PELT algo we take a margin and consider task with utilization greater than 1000 > (vs 1024 max) as an always-running task. > > Then, we can use the same algorithm for both utilization and load and > simplify __update_load_avg now that the load of a task doesn't have to be > capped by CPU uarch. > > The responsivness of PELT is improved when CPU is not running at max > capacity with this new algorithm. I have put below some examples of > duration to reach some typical load values according to the capacity of the > CPU with current implementation and with this patch. > > Util (%) max capacity half capacity(mainline) half capacity(w/ patch) > 972 (95%) 138ms not reachable 276ms > 486 (47.5%) 30ms 138ms 60ms > 256 (25%) 13ms 32ms 26ms > > On my hikey (octo ARM platform) with schedutil governor, the time to reach > max OPP when starting from a null utilization, decreases from 223ms with > current scale invariance down to 121ms with the new algorithm. For this > test, i have enable arch_scale_freq for arm64. > > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> > --- I just noticed that i forgot to mentioned that this patch was a v2 and the discussion about v1 can be found here: https://lkml.org/lkml/2015/11/24/432 Change since v1: * track stolen idle time to make PELT fully invariant Vincent > include/linux/sched.h | 1 + > kernel/sched/fair.c | 49 ++++++++++++++++++++++++++++++++++--------------- > 2 files changed, 35 insertions(+), 15 deletions(-) > > diff --git a/include/linux/sched.h b/include/linux/sched.h > index d67eee8..ca9d00f 100644 > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -313,6 +313,7 @@ struct load_weight { > */ > struct sched_avg { > u64 last_update_time; > + u64 stolen_idle_time; > u64 load_sum; > u32 util_sum; > u32 period_contrib; > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index 31453d5..d1514cb 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -735,6 +735,7 @@ void init_entity_runnable_average(struct sched_entity *se) > struct sched_avg *sa = &se->avg; > > sa->last_update_time = 0; > + sa->stolen_idle_time = 0; > /* > * sched_avg's period_contrib should be strictly less then 1024, so > * we give it 1023 to make sure it is almost a period (1024us), and > @@ -2852,10 +2853,9 @@ static __always_inline int > __update_load_avg(u64 now, int cpu, struct sched_avg *sa, > unsigned long weight, int running, struct cfs_rq *cfs_rq) > { > - u64 delta, scaled_delta, periods; > + u64 delta, periods; > u32 contrib; > - unsigned int delta_w, scaled_delta_w, decayed = 0; > - unsigned long scale_freq, scale_cpu; > + unsigned int delta_w, decayed = 0; > > delta = now - sa->last_update_time; > /* > @@ -2876,8 +2876,30 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa, > return 0; > sa->last_update_time = now; > > - scale_freq = arch_scale_freq_capacity(NULL, cpu); > - scale_cpu = arch_scale_cpu_capacity(NULL, cpu); > + if (running) { > + sa->stolen_idle_time += delta; > + /* > + * scale the elapsed time to reflect the real amount of > + * computation > + */ > + delta = cap_scale(delta, arch_scale_freq_capacity(NULL, cpu)); > + delta = cap_scale(delta, arch_scale_cpu_capacity(NULL, cpu)); > + > + /* > + * Track the amount of stolen idle time due to running at > + * lower capacity > + */ > + sa->stolen_idle_time -= delta; > + } else if (!weight) { > + if (sa->util_sum < (LOAD_AVG_MAX * 1000)) { > + /* > + * Add the idle time stolen by running at lower compute > + * capacity > + */ > + delta += sa->stolen_idle_time; > + } > + sa->stolen_idle_time = 0; > + } > > /* delta_w is the amount already accumulated against our next period */ > delta_w = sa->period_contrib; > @@ -2893,16 +2915,15 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa, > * period and accrue it. > */ > delta_w = 1024 - delta_w; > - scaled_delta_w = cap_scale(delta_w, scale_freq); > if (weight) { > - sa->load_sum += weight * scaled_delta_w; > + sa->load_sum += weight * delta_w; > if (cfs_rq) { > cfs_rq->runnable_load_sum += > - weight * scaled_delta_w; > + weight * delta_w; > } > } > if (running) > - sa->util_sum += scaled_delta_w * scale_cpu; > + sa->util_sum += delta_w << SCHED_CAPACITY_SHIFT; > > delta -= delta_w; > > @@ -2919,25 +2940,23 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa, > > /* Efficiently calculate \sum (1..n_period) 1024*y^i */ > contrib = __compute_runnable_contrib(periods); > - contrib = cap_scale(contrib, scale_freq); > if (weight) { > sa->load_sum += weight * contrib; > if (cfs_rq) > cfs_rq->runnable_load_sum += weight * contrib; > } > if (running) > - sa->util_sum += contrib * scale_cpu; > + sa->util_sum += contrib << SCHED_CAPACITY_SHIFT; > } > > /* Remainder of delta accrued against u_0` */ > - scaled_delta = cap_scale(delta, scale_freq); > if (weight) { > - sa->load_sum += weight * scaled_delta; > + sa->load_sum += weight * delta; > if (cfs_rq) > - cfs_rq->runnable_load_sum += weight * scaled_delta; > + cfs_rq->runnable_load_sum += weight * delta; > } > if (running) > - sa->util_sum += scaled_delta * scale_cpu; > + sa->util_sum += delta << SCHED_CAPACITY_SHIFT; > > sa->period_contrib += delta; > > -- > 2.7.4 >

diff --git a/include/linux/sched.h b/include/linux/sched.h index d67eee8..ca9d00f 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -313,6 +313,7 @@ struct load_weight { */ struct sched_avg { u64 last_update_time; + u64 stolen_idle_time; u64 load_sum; u32 util_sum; u32 period_contrib; diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 31453d5..d1514cb 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -735,6 +735,7 @@ void init_entity_runnable_average(struct sched_entity *se) struct sched_avg *sa = &se->avg; sa->last_update_time = 0; + sa->stolen_idle_time = 0; /* * sched_avg's period_contrib should be strictly less then 1024, so * we give it 1023 to make sure it is almost a period (1024us), and @@ -2852,10 +2853,9 @@ static __always_inline int __update_load_avg(u64 now, int cpu, struct sched_avg *sa, unsigned long weight, int running, struct cfs_rq *cfs_rq) { - u64 delta, scaled_delta, periods; + u64 delta, periods; u32 contrib; - unsigned int delta_w, scaled_delta_w, decayed = 0; - unsigned long scale_freq, scale_cpu; + unsigned int delta_w, decayed = 0; delta = now - sa->last_update_time; /* @@ -2876,8 +2876,30 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa, return 0; sa->last_update_time = now; - scale_freq = arch_scale_freq_capacity(NULL, cpu); - scale_cpu = arch_scale_cpu_capacity(NULL, cpu); + if (running) { + sa->stolen_idle_time += delta; + /* + * scale the elapsed time to reflect the real amount of + * computation + */ + delta = cap_scale(delta, arch_scale_freq_capacity(NULL, cpu)); + delta = cap_scale(delta, arch_scale_cpu_capacity(NULL, cpu)); + + /* + * Track the amount of stolen idle time due to running at + * lower capacity + */ + sa->stolen_idle_time -= delta; + } else if (!weight) { + if (sa->util_sum < (LOAD_AVG_MAX * 1000)) { + /* + * Add the idle time stolen by running at lower compute + * capacity + */ + delta += sa->stolen_idle_time; + } + sa->stolen_idle_time = 0; + } /* delta_w is the amount already accumulated against our next period */ delta_w = sa->period_contrib; @@ -2893,16 +2915,15 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa, * period and accrue it. */ delta_w = 1024 - delta_w; - scaled_delta_w = cap_scale(delta_w, scale_freq); if (weight) { - sa->load_sum += weight * scaled_delta_w; + sa->load_sum += weight * delta_w; if (cfs_rq) { cfs_rq->runnable_load_sum += - weight * scaled_delta_w; + weight * delta_w; } } if (running) - sa->util_sum += scaled_delta_w * scale_cpu; + sa->util_sum += delta_w << SCHED_CAPACITY_SHIFT; delta -= delta_w; @@ -2919,25 +2940,23 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa, /* Efficiently calculate \sum (1..n_period) 1024*y^i */ contrib = __compute_runnable_contrib(periods); - contrib = cap_scale(contrib, scale_freq); if (weight) { sa->load_sum += weight * contrib; if (cfs_rq) cfs_rq->runnable_load_sum += weight * contrib; } if (running) - sa->util_sum += contrib * scale_cpu; + sa->util_sum += contrib << SCHED_CAPACITY_SHIFT; } /* Remainder of delta accrued against u_0` */ - scaled_delta = cap_scale(delta, scale_freq); if (weight) { - sa->load_sum += weight * scaled_delta; + sa->load_sum += weight * delta; if (cfs_rq) - cfs_rq->runnable_load_sum += weight * scaled_delta; + cfs_rq->runnable_load_sum += weight * delta; } if (running) - sa->util_sum += scaled_delta * scale_cpu; + sa->util_sum += delta << SCHED_CAPACITY_SHIFT; sa->period_contrib += delta;

`The current implementation of load tracking invariance scales the contribution with current frequency and uarch performance (only for utilization) of the CPU. One main result of this formula is that the figures are capped by current capacity of CPU. Another one is that the load_avg is not invariant because not scaled with uarch. The util_avg of a periodic task that runs r time slots every p time slots varies in the range : U * (1-y^r)/(1-y^p) * y^i < Utilization < U * (1-y^r)/(1-y^p) with U is the max util_avg value = SCHED_CAPACITY_SCALE At a lower capacity, the range becomes: U * C * (1-y^r')/(1-y^p) * y^i' < Utilization < U * C * (1-y^r')/(1-y^p) with C reflecting the compute capacity ratio between current capacity and max capacity. so C tries to compensate changes in (1-y^r') but it can't be accurate. Instead of scaling the contribution value of PELT algo, we should scale the running time. The PELT signal aims to track the amount of computation of tasks and/or rq so it seems more correct to scale the running time to reflect the effective amount of computation done since the last update. In order to be fully invariant, we need to apply the same amount of running time and idle time whatever the current capacity. Because running at lower capacity implies that the task will run longer, we have to track the amount of "stolen" idle time and to apply it when task becomes idle. But once we have reached the maximum utilization value (SCHED_CAPACITY_SCALE), it means that the task is seen as an always-running task whatever the capacity of the cpu (even at max compute capacity). In this case, we can discard the "stolen" idle times which becomes meaningless. In order to cope with rounding effect of PELT algo we take a margin and consider task with utilization greater than 1000 (vs 1024 max) as an always-running task. Then, we can use the same algorithm for both utilization and load and simplify __update_load_avg now that the load of a task doesn't have to be capped by CPU uarch. The responsivness of PELT is improved when CPU is not running at max capacity with this new algorithm. I have put below some examples of duration to reach some typical load values according to the capacity of the CPU with current implementation and with this patch. Util (%) max capacity half capacity(mainline) half capacity(w/ patch) 972 (95%) 138ms not reachable 276ms 486 (47.5%) 30ms 138ms 60ms 256 (25%) 13ms 32ms 26ms On my hikey (octo ARM platform) with schedutil governor, the time to reach max OPP when starting from a null utilization, decreases from 223ms with current scale invariance down to 121ms with the new algorithm. For this test, i have enable arch_scale_freq for arm64. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> --- include/linux/sched.h | 1 + kernel/sched/fair.c | 49 ++++++++++++++++++++++++++++++++++--------------- 2 files changed, 35 insertions(+), 15 deletions(-) -- 2.7.4`