Message ID | 1459528717-17339-1-git-send-email-leo.yan@linaro.org |
---|---|
State | New |
Headers | show |
On 04/01/2016 12:49 PM, Peter Zijlstra wrote: > On Sat, Apr 02, 2016 at 12:38:37AM +0800, Leo Yan wrote: >> When task is migrated from CPU_A to CPU_B, scheduler will decrease >> the task's load/util from the task's cfs_rq and also add them into >> migrated cfs_rq. But if kernel enables CONFIG_FAIR_GROUP_SCHED then this >> cfs_rq is not the same one with cpu's cfs_rq. As a result, after task is >> migrated to CPU_B, then CPU_A still have task's stale value for >> load/util; on the other hand CPU_B also cannot reflect new load/util >> which introduced by the task. >> >> So this patch is to operate the task's load/util to cpu's cfs_rq, so >> finally cpu's cfs_rq can really reflect task's migration. > > Sorry but that is unintelligible. > > What is the problem? Why do we care? How did you fix it? and at what > cost? I think I follow - Leo please correct me if I mangle your intentions. It's an issue that Morten and Dietmar had mentioned to me as well. Assume CONFIG_FAIR_GROUP_SCHED is enabled and a task is running in a task group other than the root. The task migrates from one CPU to another. The cfs_rq.avg structures on the src and dest CPUs corresponding to the group containing the task are updated immediately via remove_entity_load_avg()/update_cfs_rq_load_avg() and attach_entity_load_avg(). But the cfs_rq.avg corresponding to the root on the src and dest are not immediately updated. The root cfs_rq.avg must catch up over time with PELT. As to why we care, there's at least one issue which may or may not be Leo's - the root cfs_rq is the one whose avg structure we read from to determine the frequency with schedutil. If you have a cpu-bound task in a non-root cgroup which periodically migrates among CPUs on an otherwise idle system, I believe each time it migrates the frequency would drop to fmin and have to ramp up again with PELT. Leo I noticed you did not modify detach_entity_load_average(). I think this would be needed to avoid the task's stats being double counted for a while after switched_from_fair() or task_move_group_fair().
On Fri, Apr 01, 2016 at 03:28:49PM -0700, Steve Muckle wrote: > On 04/01/2016 12:49 PM, Peter Zijlstra wrote: > > On Sat, Apr 02, 2016 at 12:38:37AM +0800, Leo Yan wrote: > >> When task is migrated from CPU_A to CPU_B, scheduler will decrease > >> the task's load/util from the task's cfs_rq and also add them into > >> migrated cfs_rq. But if kernel enables CONFIG_FAIR_GROUP_SCHED then this > >> cfs_rq is not the same one with cpu's cfs_rq. As a result, after task is > >> migrated to CPU_B, then CPU_A still have task's stale value for > >> load/util; on the other hand CPU_B also cannot reflect new load/util > >> which introduced by the task. > >> > >> So this patch is to operate the task's load/util to cpu's cfs_rq, so > >> finally cpu's cfs_rq can really reflect task's migration. > > > > Sorry but that is unintelligible. > > > > What is the problem? Why do we care? How did you fix it? and at what > > cost? Sorry. I should take time to write commit with quality. > I think I follow - Leo please correct me if I mangle your intentions. > It's an issue that Morten and Dietmar had mentioned to me as well. > > Assume CONFIG_FAIR_GROUP_SCHED is enabled and a task is running in a > task group other than the root. The task migrates from one CPU to > another. The cfs_rq.avg structures on the src and dest CPUs > corresponding to the group containing the task are updated immediately > via remove_entity_load_avg()/update_cfs_rq_load_avg() and > attach_entity_load_avg(). But the cfs_rq.avg corresponding to the root > on the src and dest are not immediately updated. The root cfs_rq.avg > must catch up over time with PELT. > > As to why we care, there's at least one issue which may or may not be > Leo's - the root cfs_rq is the one whose avg structure we read from to > determine the frequency with schedutil. If you have a cpu-bound task in > a non-root cgroup which periodically migrates among CPUs on an otherwise > idle system, I believe each time it migrates the frequency would drop to > fmin and have to ramp up again with PELT. Steve, thanks for explanation and totally agree. My initial purpose is not from schedutil's perspective, but just did some basic analysis for utilization. So my case is: fixed cpu to maximum frequency 1.2GHz, then launch a periodic task (period: 100ms, duty cycle: 40%) and limit its affinity to only two CPUs. So observed the same result like you said. After applied this patch, I can get much better result for the CPU's utilization after task's migration. Please see the parsed result for CPU's utilization: http://people.linaro.org/~leo.yan/eas_profiling/cpu1_util_update_cfs_rq_avg.png http://people.linaro.org/~leo.yan/eas_profiling/cpu2_util_update_cfs_rq_avg.png > Leo I noticed you did not modify detach_entity_load_average(). I think > this would be needed to avoid the task's stats being double counted for > a while after switched_from_fair() or task_move_group_fair(). Will add it. Thanks, Leo Yan
On Sat, Apr 02, 2016 at 03:11:54PM +0800, Leo Yan wrote: > On Fri, Apr 01, 2016 at 03:28:49PM -0700, Steve Muckle wrote: > > I think I follow - Leo please correct me if I mangle your intentions. > > It's an issue that Morten and Dietmar had mentioned to me as well. Yes. We have been working on this issue for a while without getting to a nice solution yet. > > Assume CONFIG_FAIR_GROUP_SCHED is enabled and a task is running in a > > task group other than the root. The task migrates from one CPU to > > another. The cfs_rq.avg structures on the src and dest CPUs > > corresponding to the group containing the task are updated immediately > > via remove_entity_load_avg()/update_cfs_rq_load_avg() and > > attach_entity_load_avg(). But the cfs_rq.avg corresponding to the root > > on the src and dest are not immediately updated. The root cfs_rq.avg > > must catch up over time with PELT. Yes. The problem is that it is only the cfs_rq.avg of the cfs_rq where the is enqueued/dequeued that gets immediately updated. If the cfs_rq is a group cfs_rq its group entity se.avg doesn't get updated immediately. It has to adapt over time at the pace defined by the geometric series. The impact of a task migration therefore doesn't trickle through to the root cfs_rq.avg. This behaviour is one of the fundamental changes Yuyang introduced with his rewrite of PELT. > > As to why we care, there's at least one issue which may or may not be > > Leo's - the root cfs_rq is the one whose avg structure we read from to > > determine the frequency with schedutil. If you have a cpu-bound task in > > a non-root cgroup which periodically migrates among CPUs on an otherwise > > idle system, I believe each time it migrates the frequency would drop to > > fmin and have to ramp up again with PELT. It makes any scheduling decision based on utilization difficult if fair group scheduling is used as cpu_util() doesn't give an up-to-date picture of any utilization caused by task in task groups. For the energy-aware scheduling patches and patches we have in the pipeline for improving capacity awareness in the scheduler we rely on cpu_util(). > Steve, thanks for explanation and totally agree. My initial purpose is > not from schedutil's perspective, but just did some basic analysis for > utilization. So my case is: fixed cpu to maximum frequency 1.2GHz, then > launch a periodic task (period: 100ms, duty cycle: 40%) and limit its > affinity to only two CPUs. So observed the same result like you said. > > After applied this patch, I can get much better result for the CPU's > utilization after task's migration. Please see the parsed result for > CPU's utilization: > http://people.linaro.org/~leo.yan/eas_profiling/cpu1_util_update_cfs_rq_avg.png > http://people.linaro.org/~leo.yan/eas_profiling/cpu2_util_update_cfs_rq_avg.png > > > Leo I noticed you did not modify detach_entity_load_average(). I think > > this would be needed to avoid the task's stats being double counted for > > a while after switched_from_fair() or task_move_group_fair(). I'm afraid that the solution to problem is more complicated than that :-( You are adding/removing a contribution from the root cfs_rq.avg which isn't part of the signal in the first place. The root cfs_rq.avg only contains the sum of the load/util of the sched_entities on the cfs_rq. If you remove the contribution of the tasks from there you may end up double-accounting for the task migration. Once due to you patch and then again slowly over time as the group sched_entity starts reflecting that the task has migrated. Furthermore, for group scheduling to make sense it has to be the task_h_load() you add/remove otherwise the group weighting is completely lost. Or am I completely misreading your patch? I don't think the slow response time for _load_ is necessarily a big problem. Otherwise we would have had people complaining already about group scheduling being broken. It is however a problem for all the initiatives that built on utilization. We have to either make the updates trickle through the group hierarchy for utilization, which is difficult with making a lot of changes to the current code structure, or introduce a new avg structure at root level which contains the sum of task utilization for all _tasks_ (not groups) in the group hierarchy and maintain it separately. None of those two are particularly pretty. Better suggestions?
On Sat, Apr 02, 2016 at 12:38:37AM +0800, Leo Yan wrote: > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index 0fe30e6..10ca1a9 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -2825,12 +2825,24 @@ static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq); > static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq) > { > struct sched_avg *sa = &cfs_rq->avg; > + struct sched_avg *cpu_sa = NULL; > int decayed, removed = 0; > + int cpu = cpu_of(rq_of(cfs_rq)); > + > + if (&cpu_rq(cpu)->cfs != cfs_rq) > + cpu_sa = &cpu_rq(cpu)->cfs.avg; > > if (atomic_long_read(&cfs_rq->removed_load_avg)) { > s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0); > sa->load_avg = max_t(long, sa->load_avg - r, 0); > sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0); > + > + if (cpu_sa) { > + cpu_sa->load_avg = max_t(long, cpu_sa->load_avg - r, 0); > + cpu_sa->load_sum = max_t(s64, > + cpu_sa->load_sum - r * LOAD_AVG_MAX, 0); I think this is not quite right. 'r' is the load contribution removed from a group cfs_rq. Unless the task is the only task in the group and the group shares are default, this value is different from the load contribution of the task at root level (task_h_load()). And how about nested groups? I think you may end up removing the contribution of a task multiple times from the root cfs_rq if it is in a nested group. You will need to either redesign group scheduling so you get the load to trickle down automatically and with the right contribution, or maintain a new sum at the root bypassing the existing design. I'm not sure if the latter makes sense for load though. It would make more sense for utilization only. > @@ -2919,6 +2939,13 @@ skip_aging: > cfs_rq->avg.load_sum += se->avg.load_sum; > cfs_rq->avg.util_avg += se->avg.util_avg; > cfs_rq->avg.util_sum += se->avg.util_sum; > + > + if (&cpu_rq(cpu)->cfs != cfs_rq) { > + cpu_rq(cpu)->cfs.avg.load_avg += se->avg.load_avg; > + cpu_rq(cpu)->cfs.avg.load_sum += se->avg.load_sum; > + cpu_rq(cpu)->cfs.avg.util_avg += se->avg.util_avg; > + cpu_rq(cpu)->cfs.avg.util_sum += se->avg.util_sum; > + } Same problem as above. You should be adding the right amount of task contribution here. Something similar to task_h_load(). The problem with using task_h_load() is that it is not updated immediately either, so maintaining a stable signal based on that might be difficult.
On Mon, Apr 04, 2016 at 09:48:23AM +0100, Morten Rasmussen wrote: > On Sat, Apr 02, 2016 at 03:11:54PM +0800, Leo Yan wrote: > > On Fri, Apr 01, 2016 at 03:28:49PM -0700, Steve Muckle wrote: > > > I think I follow - Leo please correct me if I mangle your intentions. > > > It's an issue that Morten and Dietmar had mentioned to me as well. > > Yes. We have been working on this issue for a while without getting to a > nice solution yet. Good to know this. This patch is mainly for discussion purpose. [...] > > > Leo I noticed you did not modify detach_entity_load_average(). I think > > > this would be needed to avoid the task's stats being double counted for > > > a while after switched_from_fair() or task_move_group_fair(). > > I'm afraid that the solution to problem is more complicated than that > :-( > > You are adding/removing a contribution from the root cfs_rq.avg which > isn't part of the signal in the first place. The root cfs_rq.avg only > contains the sum of the load/util of the sched_entities on the cfs_rq. > If you remove the contribution of the tasks from there you may end up > double-accounting for the task migration. Once due to you patch and then > again slowly over time as the group sched_entity starts reflecting that > the task has migrated. Furthermore, for group scheduling to make sense > it has to be the task_h_load() you add/remove otherwise the group > weighting is completely lost. Or am I completely misreading your patch? Here have one thing want to confirm firstly: though CFS has maintained task group's hierarchy, but between task group's cfs_rq.avg and root cfs_rq.avg, CFS updates these signals independently rather than accouting them by crossing the hierarchy. So currently CFS decreases the group's cfs_rq.avg for task's migration, but it don't iterate task group's hierarchy to root cfs_rq.avg. I don't understand your meantioned the second accounting by "then again slowly over time as the group sched_entity starts reflecting that the task has migrated." Another question is: does cfs_rq.avg _ONLY_ signal historic behavior but not present behavior? so even the task has been migrated we still need decay it slowly? Or this will be different between load and util? > I don't think the slow response time for _load_ is necessarily a big > problem. Otherwise we would have had people complaining already about > group scheduling being broken. It is however a problem for all the > initiatives that built on utilization. Or maybe we need seperate utilization and load, these two signals have different semantics and purpose. Thanks, Leo Yan
On Tue, Apr 05, 2016 at 02:30:03AM +0800, Yuyang Du wrote: > On Mon, Apr 04, 2016 at 09:48:23AM +0100, Morten Rasmussen wrote: > > On Sat, Apr 02, 2016 at 03:11:54PM +0800, Leo Yan wrote: > > > On Fri, Apr 01, 2016 at 03:28:49PM -0700, Steve Muckle wrote: > > > > I think I follow - Leo please correct me if I mangle your intentions. > > > > It's an issue that Morten and Dietmar had mentioned to me as well. > > > > Yes. We have been working on this issue for a while without getting to a > > nice solution yet. > > So do you want a "flat hirarchy" for util_avg - just do util_avg for > rq and task respectively? Seems it is what you want, and it is even easier? Pretty much, yes. I can't think of a good reason why we need the utilization of groups as long as we have the task utilization and the sum of those for the root cfs_rq. I'm not saying it can't be implemented, just saying that it will make utilization tracking for groups redundant and possibly duplicate or hack some the existing code to implement the new root utilization sum.
On Tue, Apr 05, 2016 at 02:56:44PM +0800, Leo Yan wrote: > On Mon, Apr 04, 2016 at 09:48:23AM +0100, Morten Rasmussen wrote: > > On Sat, Apr 02, 2016 at 03:11:54PM +0800, Leo Yan wrote: > > > On Fri, Apr 01, 2016 at 03:28:49PM -0700, Steve Muckle wrote: > > > > I think I follow - Leo please correct me if I mangle your intentions. > > > > It's an issue that Morten and Dietmar had mentioned to me as well. > > > > Yes. We have been working on this issue for a while without getting to a > > nice solution yet. > > Good to know this. This patch is mainly for discussion purpose. > > [...] > > > > > Leo I noticed you did not modify detach_entity_load_average(). I think > > > > this would be needed to avoid the task's stats being double counted for > > > > a while after switched_from_fair() or task_move_group_fair(). > > > > I'm afraid that the solution to problem is more complicated than that > > :-( > > > > You are adding/removing a contribution from the root cfs_rq.avg which > > isn't part of the signal in the first place. The root cfs_rq.avg only > > contains the sum of the load/util of the sched_entities on the cfs_rq. > > If you remove the contribution of the tasks from there you may end up > > double-accounting for the task migration. Once due to you patch and then > > again slowly over time as the group sched_entity starts reflecting that > > the task has migrated. Furthermore, for group scheduling to make sense > > it has to be the task_h_load() you add/remove otherwise the group > > weighting is completely lost. Or am I completely misreading your patch? > > Here have one thing want to confirm firstly: though CFS has maintained > task group's hierarchy, but between task group's cfs_rq.avg and root > cfs_rq.avg, CFS updates these signals independently rather than accouting > them by crossing the hierarchy. > > So currently CFS decreases the group's cfs_rq.avg for task's migration, > but it don't iterate task group's hierarchy to root cfs_rq.avg. I > don't understand your meantioned the second accounting by "then again > slowly over time as the group sched_entity starts reflecting that the > task has migrated." The problem is that there is direct link between a group sched_entity's se->avg and se->my_q.avg. The latter is the sum of PELT load/util of the sched_entities (tasks or nested groups) on the group cfs_rq, while the former is the PELT load/util of the group entity which is not based on cfs_rq sum, but basically just tracks whether that group entity has been running/runnable or not, but weighted by group load code which is updating the weight occasionally. In other words, we do go up/down the hierarchy when tasks migrate, but we only update the se->my_q.avg (cfs_rq), not the se->avg which is the load of the group seen by the parent cfs_rq. So the immediate update of the group cfs_rq.avg where the task sched_entity is enqueued/dequeued doesn't trickle through the hierarchy instantaneously. > Another question is: does cfs_rq.avg _ONLY_ signal historic behavior but > not present behavior? so even the task has been migrated we still need > decay it slowly? Or this will be different between load and util? cfs_rq.avg is instantaneously updated on task migration as it is the sum of the PELT contributions of the sched_entities associated with that cfs_rq. The group se->avg is not a sum, it behaves just as if it a task which has a variable load_weight which is determined by group weighting code, but otherwise identical. No adding/removing of contributions when tasks migrate. > > > I don't think the slow response time for _load_ is necessarily a big > > problem. Otherwise we would have had people complaining already about > > group scheduling being broken. It is however a problem for all the > > initiatives that built on utilization. > > Or maybe we need seperate utilization and load, these two signals > have different semantics and purpose. I think that is up for discussion. People might have different views on the semantics of utilization. I see them as very similar in the non-group scheduling case, one is based on running time and not priority weighted, the other is based on runnable time and has priority weighting. Otherwise they are the same. However, in the group scheduling case, I think they should behave somewhat differently. Load is priority scaled and is designed to ensure fair scheduling when the system is fully utilized, where utilization provides a metric the estimates the actual busy time of the cpus. Group load is scaled such that is capped no matter how much actual cpu time the group gets across the system. I don't think it makes sense to do the same for utilization as it would not represent the actual compute demand. It should be treated as a 'flat hierarchy' as Yuyang mentions in his reply, so the sum at the root cfs_rq is a proper estimate of the utilization of the cpu regardless of whether tasks are grouped or not.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 0fe30e6..10ca1a9 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -2825,12 +2825,24 @@ static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq); static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq) { struct sched_avg *sa = &cfs_rq->avg; + struct sched_avg *cpu_sa = NULL; int decayed, removed = 0; + int cpu = cpu_of(rq_of(cfs_rq)); + + if (&cpu_rq(cpu)->cfs != cfs_rq) + cpu_sa = &cpu_rq(cpu)->cfs.avg; if (atomic_long_read(&cfs_rq->removed_load_avg)) { s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0); sa->load_avg = max_t(long, sa->load_avg - r, 0); sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0); + + if (cpu_sa) { + cpu_sa->load_avg = max_t(long, cpu_sa->load_avg - r, 0); + cpu_sa->load_sum = max_t(s64, + cpu_sa->load_sum - r * LOAD_AVG_MAX, 0); + } + removed = 1; } @@ -2838,6 +2850,12 @@ static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq) long r = atomic_long_xchg(&cfs_rq->removed_util_avg, 0); sa->util_avg = max_t(long, sa->util_avg - r, 0); sa->util_sum = max_t(s32, sa->util_sum - r * LOAD_AVG_MAX, 0); + + if (cpu_sa) { + cpu_sa->util_avg = max_t(long, cpu_sa->util_avg - r, 0); + cpu_sa->util_sum = max_t(s64, + cpu_sa->util_sum - r * LOAD_AVG_MAX, 0); + } } decayed = __update_load_avg(now, cpu_of(rq_of(cfs_rq)), sa, @@ -2896,6 +2914,8 @@ static inline void update_load_avg(struct sched_entity *se, int update_tg) static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) { + int cpu = cpu_of(rq_of(cfs_rq)); + if (!sched_feat(ATTACH_AGE_LOAD)) goto skip_aging; @@ -2919,6 +2939,13 @@ skip_aging: cfs_rq->avg.load_sum += se->avg.load_sum; cfs_rq->avg.util_avg += se->avg.util_avg; cfs_rq->avg.util_sum += se->avg.util_sum; + + if (&cpu_rq(cpu)->cfs != cfs_rq) { + cpu_rq(cpu)->cfs.avg.load_avg += se->avg.load_avg; + cpu_rq(cpu)->cfs.avg.load_sum += se->avg.load_sum; + cpu_rq(cpu)->cfs.avg.util_avg += se->avg.util_avg; + cpu_rq(cpu)->cfs.avg.util_sum += se->avg.util_sum; + } } static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
When task is migrated from CPU_A to CPU_B, scheduler will decrease the task's load/util from the task's cfs_rq and also add them into migrated cfs_rq. But if kernel enables CONFIG_FAIR_GROUP_SCHED then this cfs_rq is not the same one with cpu's cfs_rq. As a result, after task is migrated to CPU_B, then CPU_A still have task's stale value for load/util; on the other hand CPU_B also cannot reflect new load/util which introduced by the task. So this patch is to operate the task's load/util to cpu's cfs_rq, so finally cpu's cfs_rq can really reflect task's migration. Signed-off-by: Leo Yan <leo.yan@linaro.org> --- kernel/sched/fair.c | 27 +++++++++++++++++++++++++++ 1 file changed, 27 insertions(+) -- 1.9.1