[RFC] sched/fair: let cpu's cfs_rq to reflect task migration

Message ID 1459528717-17339-1-git-send-email-leo.yan@linaro.org
State New
Headers show

Commit Message

Leo Yan April 1, 2016, 4:38 p.m.
When task is migrated from CPU_A to CPU_B, scheduler will decrease
the task's load/util from the task's cfs_rq and also add them into
migrated cfs_rq. But if kernel enables CONFIG_FAIR_GROUP_SCHED then this
cfs_rq is not the same one with cpu's cfs_rq. As a result, after task is
migrated to CPU_B, then CPU_A still have task's stale value for
load/util; on the other hand CPU_B also cannot reflect new load/util
which introduced by the task.

So this patch is to operate the task's load/util to cpu's cfs_rq, so
finally cpu's cfs_rq can really reflect task's migration.

Signed-off-by: Leo Yan <leo.yan@linaro.org>

---
 kernel/sched/fair.c | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

-- 
1.9.1

Comments

Steve Muckle April 1, 2016, 10:28 p.m. | #1
On 04/01/2016 12:49 PM, Peter Zijlstra wrote:
> On Sat, Apr 02, 2016 at 12:38:37AM +0800, Leo Yan wrote:

>> When task is migrated from CPU_A to CPU_B, scheduler will decrease

>> the task's load/util from the task's cfs_rq and also add them into

>> migrated cfs_rq. But if kernel enables CONFIG_FAIR_GROUP_SCHED then this

>> cfs_rq is not the same one with cpu's cfs_rq. As a result, after task is

>> migrated to CPU_B, then CPU_A still have task's stale value for

>> load/util; on the other hand CPU_B also cannot reflect new load/util

>> which introduced by the task.

>>

>> So this patch is to operate the task's load/util to cpu's cfs_rq, so

>> finally cpu's cfs_rq can really reflect task's migration.

> 

> Sorry but that is unintelligible.

> 

> What is the problem? Why do we care? How did you fix it? and at what

> cost?


I think I follow - Leo please correct me if I mangle your intentions.
It's an issue that Morten and Dietmar had mentioned to me as well.

Assume CONFIG_FAIR_GROUP_SCHED is enabled and a task is running in a
task group other than the root. The task migrates from one CPU to
another. The cfs_rq.avg structures on the src and dest CPUs
corresponding to the group containing the task are updated immediately
via remove_entity_load_avg()/update_cfs_rq_load_avg() and
attach_entity_load_avg(). But the cfs_rq.avg corresponding to the root
on the src and dest are not immediately updated. The root cfs_rq.avg
must catch up over time with PELT.

As to why we care, there's at least one issue which may or may not be
Leo's - the root cfs_rq is the one whose avg structure we read from to
determine the frequency with schedutil. If you have a cpu-bound task in
a non-root cgroup which periodically migrates among CPUs on an otherwise
idle system, I believe each time it migrates the frequency would drop to
fmin and have to ramp up again with PELT.

Leo I noticed you did not modify detach_entity_load_average(). I think
this would be needed to avoid the task's stats being double counted for
a while after switched_from_fair() or task_move_group_fair().
Leo Yan April 2, 2016, 7:11 a.m. | #2
On Fri, Apr 01, 2016 at 03:28:49PM -0700, Steve Muckle wrote:
> On 04/01/2016 12:49 PM, Peter Zijlstra wrote:

> > On Sat, Apr 02, 2016 at 12:38:37AM +0800, Leo Yan wrote:

> >> When task is migrated from CPU_A to CPU_B, scheduler will decrease

> >> the task's load/util from the task's cfs_rq and also add them into

> >> migrated cfs_rq. But if kernel enables CONFIG_FAIR_GROUP_SCHED then this

> >> cfs_rq is not the same one with cpu's cfs_rq. As a result, after task is

> >> migrated to CPU_B, then CPU_A still have task's stale value for

> >> load/util; on the other hand CPU_B also cannot reflect new load/util

> >> which introduced by the task.

> >>

> >> So this patch is to operate the task's load/util to cpu's cfs_rq, so

> >> finally cpu's cfs_rq can really reflect task's migration.

> > 

> > Sorry but that is unintelligible.

> > 

> > What is the problem? Why do we care? How did you fix it? and at what

> > cost?


Sorry. I should take time to write commit with quality.

> I think I follow - Leo please correct me if I mangle your intentions.

> It's an issue that Morten and Dietmar had mentioned to me as well.

> 

> Assume CONFIG_FAIR_GROUP_SCHED is enabled and a task is running in a

> task group other than the root. The task migrates from one CPU to

> another. The cfs_rq.avg structures on the src and dest CPUs

> corresponding to the group containing the task are updated immediately

> via remove_entity_load_avg()/update_cfs_rq_load_avg() and

> attach_entity_load_avg(). But the cfs_rq.avg corresponding to the root

> on the src and dest are not immediately updated. The root cfs_rq.avg

> must catch up over time with PELT.

> 

> As to why we care, there's at least one issue which may or may not be

> Leo's - the root cfs_rq is the one whose avg structure we read from to

> determine the frequency with schedutil. If you have a cpu-bound task in

> a non-root cgroup which periodically migrates among CPUs on an otherwise

> idle system, I believe each time it migrates the frequency would drop to

> fmin and have to ramp up again with PELT.


Steve, thanks for explanation and totally agree. My initial purpose is
not from schedutil's perspective, but just did some basic analysis for
utilization. So my case is: fixed cpu to maximum frequency 1.2GHz, then
launch a periodic task (period: 100ms, duty cycle: 40%) and limit its
affinity to only two CPUs. So observed the same result like you said.

After applied this patch, I can get much better result for the CPU's
utilization after task's migration. Please see the parsed result for
CPU's utilization:
http://people.linaro.org/~leo.yan/eas_profiling/cpu1_util_update_cfs_rq_avg.png
http://people.linaro.org/~leo.yan/eas_profiling/cpu2_util_update_cfs_rq_avg.png

> Leo I noticed you did not modify detach_entity_load_average(). I think

> this would be needed to avoid the task's stats being double counted for

> a while after switched_from_fair() or task_move_group_fair().


Will add it.

Thanks,
Leo Yan
Morten Rasmussen April 4, 2016, 8:48 a.m. | #3
On Sat, Apr 02, 2016 at 03:11:54PM +0800, Leo Yan wrote:
> On Fri, Apr 01, 2016 at 03:28:49PM -0700, Steve Muckle wrote:

> > I think I follow - Leo please correct me if I mangle your intentions.

> > It's an issue that Morten and Dietmar had mentioned to me as well.


Yes. We have been working on this issue for a while without getting to a
nice solution yet.

> > Assume CONFIG_FAIR_GROUP_SCHED is enabled and a task is running in a

> > task group other than the root. The task migrates from one CPU to

> > another. The cfs_rq.avg structures on the src and dest CPUs

> > corresponding to the group containing the task are updated immediately

> > via remove_entity_load_avg()/update_cfs_rq_load_avg() and

> > attach_entity_load_avg(). But the cfs_rq.avg corresponding to the root

> > on the src and dest are not immediately updated. The root cfs_rq.avg

> > must catch up over time with PELT.


Yes. The problem is that it is only the cfs_rq.avg of the cfs_rq where
the is enqueued/dequeued that gets immediately updated. If the cfs_rq is
a group cfs_rq its group entity se.avg doesn't get updated immediately.
It has to adapt over time at the pace defined by the geometric series.
The impact of a task migration therefore doesn't trickle through to the
root cfs_rq.avg. This behaviour is one of the fundamental changes Yuyang
introduced with his rewrite of PELT.

> > As to why we care, there's at least one issue which may or may not be

> > Leo's - the root cfs_rq is the one whose avg structure we read from to

> > determine the frequency with schedutil. If you have a cpu-bound task in

> > a non-root cgroup which periodically migrates among CPUs on an otherwise

> > idle system, I believe each time it migrates the frequency would drop to

> > fmin and have to ramp up again with PELT.


It makes any scheduling decision based on utilization difficult if fair
group scheduling is used as cpu_util() doesn't give an up-to-date
picture of any utilization caused by task in task groups.

For the energy-aware scheduling patches and patches we have in the
pipeline for improving capacity awareness in the scheduler we rely on
cpu_util().

> Steve, thanks for explanation and totally agree. My initial purpose is

> not from schedutil's perspective, but just did some basic analysis for

> utilization. So my case is: fixed cpu to maximum frequency 1.2GHz, then

> launch a periodic task (period: 100ms, duty cycle: 40%) and limit its

> affinity to only two CPUs. So observed the same result like you said.

> 

> After applied this patch, I can get much better result for the CPU's

> utilization after task's migration. Please see the parsed result for

> CPU's utilization:

> http://people.linaro.org/~leo.yan/eas_profiling/cpu1_util_update_cfs_rq_avg.png

> http://people.linaro.org/~leo.yan/eas_profiling/cpu2_util_update_cfs_rq_avg.png

> 

> > Leo I noticed you did not modify detach_entity_load_average(). I think

> > this would be needed to avoid the task's stats being double counted for

> > a while after switched_from_fair() or task_move_group_fair().


I'm afraid that the solution to problem is more complicated than that
:-(

You are adding/removing a contribution from the root cfs_rq.avg which
isn't part of the signal in the first place. The root cfs_rq.avg only
contains the sum of the load/util of the sched_entities on the cfs_rq.
If you remove the contribution of the tasks from there you may end up
double-accounting for the task migration. Once due to you patch and then
again slowly over time as the group sched_entity starts reflecting that
the task has migrated. Furthermore, for group scheduling to make sense
it has to be the task_h_load() you add/remove otherwise the group
weighting is completely lost. Or am I completely misreading your patch?

I don't think the slow response time for _load_ is necessarily a big
problem. Otherwise we would have had people complaining already about
group scheduling being broken. It is however a problem for all the
initiatives that built on utilization.

We have to either make the updates trickle through the group hierarchy
for utilization, which is difficult with making a lot of changes to the
current code structure, or introduce a new avg structure at root level
which contains the sum of task utilization for all _tasks_ (not groups)
in the group hierarchy and maintain it separately.

None of those two are particularly pretty. Better suggestions?
Morten Rasmussen April 4, 2016, 9:01 a.m. | #4
On Sat, Apr 02, 2016 at 12:38:37AM +0800, Leo Yan wrote:
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c

> index 0fe30e6..10ca1a9 100644

> --- a/kernel/sched/fair.c

> +++ b/kernel/sched/fair.c

> @@ -2825,12 +2825,24 @@ static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq);

>  static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)

>  {

>  	struct sched_avg *sa = &cfs_rq->avg;

> +	struct sched_avg *cpu_sa = NULL;

>  	int decayed, removed = 0;

> +	int cpu = cpu_of(rq_of(cfs_rq));

> +

> +	if (&cpu_rq(cpu)->cfs != cfs_rq)

> +		cpu_sa = &cpu_rq(cpu)->cfs.avg;

>  

>  	if (atomic_long_read(&cfs_rq->removed_load_avg)) {

>  		s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);

>  		sa->load_avg = max_t(long, sa->load_avg - r, 0);

>  		sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);

> +

> +		if (cpu_sa) {

> +			cpu_sa->load_avg = max_t(long, cpu_sa->load_avg - r, 0);

> +			cpu_sa->load_sum = max_t(s64,

> +					cpu_sa->load_sum - r * LOAD_AVG_MAX, 0);


I think this is not quite right. 'r' is the load contribution removed
from a group cfs_rq. Unless the task is the only task in the group and
the group shares are default, this value is different from the load
contribution of the task at root level (task_h_load()).

And how about nested groups? I think you may end up removing the
contribution of a task multiple times from the root cfs_rq if it is in a
nested group.

You will need to either redesign group scheduling so you get the load to
trickle down automatically and with the right contribution, or maintain
a new sum at the root bypassing the existing design. I'm not sure if the
latter makes sense for load though. It would make more sense for
utilization only.

> @@ -2919,6 +2939,13 @@ skip_aging:

>  	cfs_rq->avg.load_sum += se->avg.load_sum;

>  	cfs_rq->avg.util_avg += se->avg.util_avg;

>  	cfs_rq->avg.util_sum += se->avg.util_sum;

> +

> +	if (&cpu_rq(cpu)->cfs != cfs_rq) {

> +		cpu_rq(cpu)->cfs.avg.load_avg += se->avg.load_avg;

> +		cpu_rq(cpu)->cfs.avg.load_sum += se->avg.load_sum;

> +		cpu_rq(cpu)->cfs.avg.util_avg += se->avg.util_avg;

> +		cpu_rq(cpu)->cfs.avg.util_sum += se->avg.util_sum;

> +	}


Same problem as above. You should be adding the right amount of task
contribution here. Something similar to task_h_load(). The problem with
using task_h_load() is that it is not updated immediately either, so
maintaining a stable signal based on that might be difficult.
Leo Yan April 5, 2016, 6:56 a.m. | #5
On Mon, Apr 04, 2016 at 09:48:23AM +0100, Morten Rasmussen wrote:
> On Sat, Apr 02, 2016 at 03:11:54PM +0800, Leo Yan wrote:

> > On Fri, Apr 01, 2016 at 03:28:49PM -0700, Steve Muckle wrote:

> > > I think I follow - Leo please correct me if I mangle your intentions.

> > > It's an issue that Morten and Dietmar had mentioned to me as well.

> 

> Yes. We have been working on this issue for a while without getting to a

> nice solution yet.


Good to know this. This patch is mainly for discussion purpose.

[...]

> > > Leo I noticed you did not modify detach_entity_load_average(). I think

> > > this would be needed to avoid the task's stats being double counted for

> > > a while after switched_from_fair() or task_move_group_fair().

> 

> I'm afraid that the solution to problem is more complicated than that

> :-(

> 

> You are adding/removing a contribution from the root cfs_rq.avg which

> isn't part of the signal in the first place. The root cfs_rq.avg only

> contains the sum of the load/util of the sched_entities on the cfs_rq.

> If you remove the contribution of the tasks from there you may end up

> double-accounting for the task migration. Once due to you patch and then

> again slowly over time as the group sched_entity starts reflecting that

> the task has migrated. Furthermore, for group scheduling to make sense

> it has to be the task_h_load() you add/remove otherwise the group

> weighting is completely lost. Or am I completely misreading your patch?


Here have one thing want to confirm firstly: though CFS has maintained
task group's hierarchy, but between task group's cfs_rq.avg and root
cfs_rq.avg, CFS updates these signals independently rather than accouting
them by crossing the hierarchy.

So currently CFS decreases the group's cfs_rq.avg for task's migration,
but it don't iterate task group's hierarchy to root cfs_rq.avg. I
don't understand your meantioned the second accounting by "then again
slowly over time as the group sched_entity starts reflecting that the
task has migrated."

Another question is: does cfs_rq.avg _ONLY_ signal historic behavior but
not present behavior? so even the task has been migrated we still need
decay it slowly? Or this will be different between load and util?

> I don't think the slow response time for _load_ is necessarily a big

> problem. Otherwise we would have had people complaining already about

> group scheduling being broken. It is however a problem for all the

> initiatives that built on utilization.


Or maybe we need seperate utilization and load, these two signals
have different semantics and purpose.

Thanks,
Leo Yan
Morten Rasmussen April 5, 2016, 7:51 a.m. | #6
On Tue, Apr 05, 2016 at 02:30:03AM +0800, Yuyang Du wrote:
> On Mon, Apr 04, 2016 at 09:48:23AM +0100, Morten Rasmussen wrote:

> > On Sat, Apr 02, 2016 at 03:11:54PM +0800, Leo Yan wrote:

> > > On Fri, Apr 01, 2016 at 03:28:49PM -0700, Steve Muckle wrote:

> > > > I think I follow - Leo please correct me if I mangle your intentions.

> > > > It's an issue that Morten and Dietmar had mentioned to me as well.

> > 

> > Yes. We have been working on this issue for a while without getting to a

> > nice solution yet.

>  

> So do you want a "flat hirarchy" for util_avg - just do util_avg for

> rq and task respectively? Seems it is what you want, and it is even easier?


Pretty much, yes. I can't think of a good reason why we need the
utilization of groups as long as we have the task utilization and the
sum of those for the root cfs_rq.

I'm not saying it can't be implemented, just saying that it will make
utilization tracking for groups redundant and possibly duplicate or hack
some the existing code to implement the new root utilization sum.
Morten Rasmussen April 5, 2016, 9:13 a.m. | #7
On Tue, Apr 05, 2016 at 02:56:44PM +0800, Leo Yan wrote:
> On Mon, Apr 04, 2016 at 09:48:23AM +0100, Morten Rasmussen wrote:

> > On Sat, Apr 02, 2016 at 03:11:54PM +0800, Leo Yan wrote:

> > > On Fri, Apr 01, 2016 at 03:28:49PM -0700, Steve Muckle wrote:

> > > > I think I follow - Leo please correct me if I mangle your intentions.

> > > > It's an issue that Morten and Dietmar had mentioned to me as well.

> > 

> > Yes. We have been working on this issue for a while without getting to a

> > nice solution yet.

> 

> Good to know this. This patch is mainly for discussion purpose.

> 

> [...]

> 

> > > > Leo I noticed you did not modify detach_entity_load_average(). I think

> > > > this would be needed to avoid the task's stats being double counted for

> > > > a while after switched_from_fair() or task_move_group_fair().

> > 

> > I'm afraid that the solution to problem is more complicated than that

> > :-(

> > 

> > You are adding/removing a contribution from the root cfs_rq.avg which

> > isn't part of the signal in the first place. The root cfs_rq.avg only

> > contains the sum of the load/util of the sched_entities on the cfs_rq.

> > If you remove the contribution of the tasks from there you may end up

> > double-accounting for the task migration. Once due to you patch and then

> > again slowly over time as the group sched_entity starts reflecting that

> > the task has migrated. Furthermore, for group scheduling to make sense

> > it has to be the task_h_load() you add/remove otherwise the group

> > weighting is completely lost. Or am I completely misreading your patch?

> 

> Here have one thing want to confirm firstly: though CFS has maintained

> task group's hierarchy, but between task group's cfs_rq.avg and root

> cfs_rq.avg, CFS updates these signals independently rather than accouting

> them by crossing the hierarchy.

> 

> So currently CFS decreases the group's cfs_rq.avg for task's migration,

> but it don't iterate task group's hierarchy to root cfs_rq.avg. I

> don't understand your meantioned the second accounting by "then again

> slowly over time as the group sched_entity starts reflecting that the

> task has migrated."


The problem is that there is direct link between a group sched_entity's
se->avg and se->my_q.avg. The latter is the sum of PELT load/util of the
sched_entities (tasks or nested groups) on the group cfs_rq, while the
former is the PELT load/util of the group entity which is not based on
cfs_rq sum, but basically just tracks whether that group entity has been
running/runnable or not, but weighted by group load code which is
updating the weight occasionally.

In other words, we do go up/down the hierarchy when tasks migrate, but
we only update the se->my_q.avg (cfs_rq), not the se->avg which is the
load of the group seen by the parent cfs_rq. So the immediate update of
the group cfs_rq.avg where the task sched_entity is enqueued/dequeued
doesn't trickle through the hierarchy instantaneously.

> Another question is: does cfs_rq.avg _ONLY_ signal historic behavior but

> not present behavior? so even the task has been migrated we still need

> decay it slowly? Or this will be different between load and util?


cfs_rq.avg is instantaneously updated on task migration as it is the sum
of the PELT contributions of the sched_entities associated with that
cfs_rq. The group se->avg is not a sum, it behaves just as if it a task
which has a variable load_weight which is determined by group weighting
code, but otherwise identical. No adding/removing of contributions when
tasks migrate.

> 

> > I don't think the slow response time for _load_ is necessarily a big

> > problem. Otherwise we would have had people complaining already about

> > group scheduling being broken. It is however a problem for all the

> > initiatives that built on utilization.

> 

> Or maybe we need seperate utilization and load, these two signals

> have different semantics and purpose.


I think that is up for discussion. People might have different views on
the semantics of utilization. I see them as very similar in the
non-group scheduling case, one is based on running time and not priority
weighted, the other is based on runnable time and has priority
weighting. Otherwise they are the same.

However, in the group scheduling case, I think they should behave
somewhat differently. Load is priority scaled and is designed to ensure
fair scheduling when the system is fully utilized, where utilization
provides a metric the estimates the actual busy time of the cpus. Group
load is scaled such that is capped no matter how much actual cpu time
the group gets across the system. I don't think it makes sense to do the
same for utilization as it would not represent the actual compute
demand. It should be treated as a 'flat hierarchy' as Yuyang mentions in
his reply, so the sum at the root cfs_rq is a proper estimate of the
utilization of the cpu regardless of whether tasks are grouped or not.

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0fe30e6..10ca1a9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2825,12 +2825,24 @@  static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq);
 static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
 {
 	struct sched_avg *sa = &cfs_rq->avg;
+	struct sched_avg *cpu_sa = NULL;
 	int decayed, removed = 0;
+	int cpu = cpu_of(rq_of(cfs_rq));
+
+	if (&cpu_rq(cpu)->cfs != cfs_rq)
+		cpu_sa = &cpu_rq(cpu)->cfs.avg;
 
 	if (atomic_long_read(&cfs_rq->removed_load_avg)) {
 		s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
 		sa->load_avg = max_t(long, sa->load_avg - r, 0);
 		sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
+
+		if (cpu_sa) {
+			cpu_sa->load_avg = max_t(long, cpu_sa->load_avg - r, 0);
+			cpu_sa->load_sum = max_t(s64,
+					cpu_sa->load_sum - r * LOAD_AVG_MAX, 0);
+		}
+
 		removed = 1;
 	}
 
@@ -2838,6 +2850,12 @@  static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
 		long r = atomic_long_xchg(&cfs_rq->removed_util_avg, 0);
 		sa->util_avg = max_t(long, sa->util_avg - r, 0);
 		sa->util_sum = max_t(s32, sa->util_sum - r * LOAD_AVG_MAX, 0);
+
+		if (cpu_sa) {
+			cpu_sa->util_avg = max_t(long, cpu_sa->util_avg - r, 0);
+			cpu_sa->util_sum = max_t(s64,
+					cpu_sa->util_sum - r * LOAD_AVG_MAX, 0);
+		}
 	}
 
 	decayed = __update_load_avg(now, cpu_of(rq_of(cfs_rq)), sa,
@@ -2896,6 +2914,8 @@  static inline void update_load_avg(struct sched_entity *se, int update_tg)
 
 static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
+	int cpu = cpu_of(rq_of(cfs_rq));
+
 	if (!sched_feat(ATTACH_AGE_LOAD))
 		goto skip_aging;
 
@@ -2919,6 +2939,13 @@  skip_aging:
 	cfs_rq->avg.load_sum += se->avg.load_sum;
 	cfs_rq->avg.util_avg += se->avg.util_avg;
 	cfs_rq->avg.util_sum += se->avg.util_sum;
+
+	if (&cpu_rq(cpu)->cfs != cfs_rq) {
+		cpu_rq(cpu)->cfs.avg.load_avg += se->avg.load_avg;
+		cpu_rq(cpu)->cfs.avg.load_sum += se->avg.load_sum;
+		cpu_rq(cpu)->cfs.avg.util_avg += se->avg.util_avg;
+		cpu_rq(cpu)->cfs.avg.util_sum += se->avg.util_sum;
+	}
 }
 
 static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)