[RFC,6/7] sched: cfs: cpu frequency scaling based on task placement

Message ID 1413958051-7103-7-git-send-email-mturquette@linaro.org
State New
Headers show

Commit Message

Mike Turquette Oct. 22, 2014, 6:07 a.m.
{en,de}queue_task_fair are updated to track which cpus will have changed
utilization values as function of task queueing. The affected cpus are
passed on to arch_eval_cpu_freq for further machine-specific processing
based on a selectable policy.

arch_scale_cpu_freq is called from run_rebalance_domains as a way to
kick off the scaling process (via wake_up_process), so as to prevent
re-entering the {en,de}queue code.

All of the call sites in this patch are up for discussion. Does it make
sense to track which cpus have updated statistics in enqueue_fair_task?
I chose this because I wanted to gather statistics for all cpus affected
in the event CONFIG_FAIR_GROUP_SCHED is enabled. As agreed at LPC14 the
next version of this patch will focus on the simpler case of not using
scheduler cgroups, which should remove a good chunk of this code,
including the cpumask stuff.

Also discussed at LPC14 is that fact that load_balance is a very
interesting place to do this as frequency can be considered in concert
with task placement. Please put forth any ideas on a sensible way to do
this.

Is run_rebalance_domains a logical place to change cpu frequency? What
other call sites make sense?

Even for platforms that can target a cpu frequency without sleeping
(x86, some ARM platforms with PM microcontrollers) it is currently
necessary to always kick the frequency target work out into a kthread.
This is because of the rw_sem usage in the cpufreq core which might
sleep. Replacing that lock type is probably a good idea.

Not-signed-off-by: Mike Turquette <mturquette@linaro.org>
---
 kernel/sched/fair.c | 39 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)

Comments

Preeti U Murthy Oct. 23, 2014, 4:03 a.m. | #1
Hi Mike,

On 10/22/2014 11:37 AM, Mike Turquette wrote:
> {en,de}queue_task_fair are updated to track which cpus will have changed
> utilization values as function of task queueing. The affected cpus are
> passed on to arch_eval_cpu_freq for further machine-specific processing
> based on a selectable policy.
> 
> arch_scale_cpu_freq is called from run_rebalance_domains as a way to
> kick off the scaling process (via wake_up_process), so as to prevent
> re-entering the {en,de}queue code.
> 
> All of the call sites in this patch are up for discussion. Does it make
> sense to track which cpus have updated statistics in enqueue_fair_task?
> I chose this because I wanted to gather statistics for all cpus affected
> in the event CONFIG_FAIR_GROUP_SCHED is enabled. As agreed at LPC14 the

Can you explain how pstate selection can get affected by the presence of
task groups? We are after all concerned with the cpu load. So when we
enqueue/dequeue a task, we update the cpu load and pass it on for cpu
pstate scaling. How does this change if we have task groups?
I know that this issue was brought up during LPC, but I have yet not
managed to gain clarity here.

> next version of this patch will focus on the simpler case of not using
> scheduler cgroups, which should remove a good chunk of this code,
> including the cpumask stuff.
> 
> Also discussed at LPC14 is that fact that load_balance is a very
> interesting place to do this as frequency can be considered in concert
> with task placement. Please put forth any ideas on a sensible way to do
> this.
> 
> Is run_rebalance_domains a logical place to change cpu frequency? What
> other call sites make sense?
> 
> Even for platforms that can target a cpu frequency without sleeping
> (x86, some ARM platforms with PM microcontrollers) it is currently
> necessary to always kick the frequency target work out into a kthread.
> This is because of the rw_sem usage in the cpufreq core which might
> sleep. Replacing that lock type is probably a good idea.
> 
> Not-signed-off-by: Mike Turquette <mturquette@linaro.org>
> ---
>  kernel/sched/fair.c | 39 +++++++++++++++++++++++++++++++++++++++
>  1 file changed, 39 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1af6f6d..3619f63 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3999,6 +3999,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  {
>  	struct cfs_rq *cfs_rq;
>  	struct sched_entity *se = &p->se;
> +	struct cpumask update_cpus;
> +
> +	cpumask_clear(&update_cpus);
> 
>  	for_each_sched_entity(se) {
>  		if (se->on_rq)
> @@ -4028,12 +4031,27 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> 
>  		update_cfs_shares(cfs_rq);
>  		update_entity_load_avg(se, 1);
> +		/* track cpus that need to be re-evaluated */
> +		cpumask_set_cpu(cpu_of(rq_of(cfs_rq)), &update_cpus);

All the cfs_rqs that you iterate through here will belong to the same
rq/cpu right?

Regards
Preeti U Murthy
Peter Zijlstra Oct. 27, 2014, 3:55 p.m. | #2
On Tue, Oct 21, 2014 at 11:07:30PM -0700, Mike Turquette wrote:
> {en,de}queue_task_fair are updated to track which cpus will have changed
> utilization values as function of task queueing. The affected cpus are
> passed on to arch_eval_cpu_freq for further machine-specific processing
> based on a selectable policy.

Yeah, I'm not sure about the arch eval hook, ideally it'd be all
integrated with the energy model.

> arch_scale_cpu_freq is called from run_rebalance_domains as a way to
> kick off the scaling process (via wake_up_process), so as to prevent
> re-entering the {en,de}queue code.

We might want a better name for that :-) dvfs_set_freq() or whatnot, or
maybe preserve the cpufreq_*() namespace, people seen to know that that
is the linux dvfs name.

> All of the call sites in this patch are up for discussion. Does it make
> sense to track which cpus have updated statistics in enqueue_fair_task?

Like I said, I don't think so, we guestimate and approximate everything
anyhow, don't bother trying to be 'perfect' here, its excessively
expensive.

> I chose this because I wanted to gather statistics for all cpus affected
> in the event CONFIG_FAIR_GROUP_SCHED is enabled. As agreed at LPC14 the
> next version of this patch will focus on the simpler case of not using
> scheduler cgroups, which should remove a good chunk of this code,
> including the cpumask stuff.

Yes please, make the cpumask stuff go away :-)

> Also discussed at LPC14 is that fact that load_balance is a very
> interesting place to do this as frequency can be considered in concert
> with task placement. Please put forth any ideas on a sensible way to do
> this.

Ideally it'd be natural fallout of Morten's energy model.

If you take a multi-core energy model, find its bifurcations and map its
solution spaces I suspect there to be a fairly small set of actual
behaviours.

The problem is, nobody seems to have done this yet so we don't know.

Once you've done this, you can try and minimize the model by proving you
retain all behaviour modes, but for now Morten has a rather full
parameter space (not complete though, and the impact of the missing
parameters might or might not be relevant, impossible to prove until we
have the above done).

> Is run_rebalance_domains a logical place to change cpu frequency? What
> other call sites make sense?

For the legacy systems, maybe.

> Even for platforms that can target a cpu frequency without sleeping
> (x86, some ARM platforms with PM microcontrollers) it is currently
> necessary to always kick the frequency target work out into a kthread.
> This is because of the rw_sem usage in the cpufreq core which might
> sleep. Replacing that lock type is probably a good idea.

I think it would be best to start with this, ideally we'd be able to RCU
free the thing such that either holding the rwsem or rcu_read_lock is
sufficient for usage, that way the sleeping muck can grab the rwsem, the
non-sleeping stuff can grab rcu_read_lock.

But I've not looked at the cpufreq stuff at all.
Dietmar Eggemann Oct. 27, 2014, 5:42 p.m. | #3
Hi Mike,

On 22/10/14 07:07, Mike Turquette wrote:
> {en,de}queue_task_fair are updated to track which cpus will have changed
> utilization values as function of task queueing.

The sentence is a little bit misleading. We update the se utilization
contrib and the cfs_rq utilization in {en,de}queue_task_fair for a
specific se and a specific cpu = rq_of(cfs_rq_of(se))->cpu .

> The affected cpus are
> passed on to arch_eval_cpu_freq for further machine-specific processing
> based on a selectable policy.

I'm not sure if separating the evaluation and the setting of the cpu
frequency makes sense. You could evaluate and possibly set the cpu
frequency in one go. Right now you evaluate if the cfs_rq utilization
exceeds the thresholds for the current index every time a task is
enqueued or dequeued but that's not necessary since you only try to set
the cpu frequency in the softirq. The history (and the future if we
consider blocked utilization) is already captured in the cfs_rq
utilization itself.

> 
> arch_scale_cpu_freq is called from run_rebalance_domains as a way to
> kick off the scaling process (via wake_up_process), so as to prevent
> re-entering the {en,de}queue code.

The name is misleading from the viewpoint of the CFS sched class. The
original scaling function of the CFS scheduler
(arch_scale_{freq,smt/cpu,rt}_capacity) scale capacity based on
frequency, uarch or rt. So your function should be call
arch_scale_util_cpu_freq or even better arch_set_cpu_freq.

> 
> All of the call sites in this patch are up for discussion. Does it make
> sense to track which cpus have updated statistics in enqueue_fair_task?

Not really because cfs_rq utilization tracks the history/(future) of cpu
utilization and you can evaluate the signal when you want to set the cpu
frequency.

> I chose this because I wanted to gather statistics for all cpus affected
> in the event CONFIG_FAIR_GROUP_SCHED is enabled. As agreed at LPC14 the
> next version of this patch will focus on the simpler case of not using
> scheduler cgroups, which should remove a good chunk of this code,
> including the cpumask stuff.

I don't understand why you should care about task groups at all. The
task groups contribution to the utilization of a cpu should be already
encountered for in the appropriate cpu's cfs_rq utilization signal.

But I can see a dependency to the fact that there is a difference
between systems with per-cluster (package) or per-cpu frequency scaling.
But there is no SD_SHARE_FREQDOMAIN (sched domain flag) today which
applied to the SD level MC could tell you tahts we deal with per-cluster
frequency scaling though.
On systems with per-cpu frequency scaling you can set the frequency for
individual cpus by hooking into the scheduler but for systems with
per-cluster frequency scaling, you would have to respect the maximum cpu
utilization of all cpus in the cluster.

A similar problem occurs with hardware threads (SMT sd level).

But I don't know right now how the sd topology hierarchy can become
handy here.

> 
> Also discussed at LPC14 is that fact that load_balance is a very
> interesting place to do this as frequency can be considered in concert
> with task placement. Please put forth any ideas on a sensible way to do
> this.
> 
> Is run_rebalance_domains a logical place to change cpu frequency? What
> other call sites make sense?

At least it's a good place to test this feature for now.

> 
> Even for platforms that can target a cpu frequency without sleeping
> (x86, some ARM platforms with PM microcontrollers) it is currently
> necessary to always kick the frequency target work out into a kthread.
> This is because of the rw_sem usage in the cpufreq core which might
> sleep. Replacing that lock type is probably a good idea.
> 
> Not-signed-off-by: Mike Turquette <mturquette@linaro.org>
> ---
>  kernel/sched/fair.c | 39 +++++++++++++++++++++++++++++++++++++++
>  1 file changed, 39 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1af6f6d..3619f63 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3999,6 +3999,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  {
>  	struct cfs_rq *cfs_rq;
>  	struct sched_entity *se = &p->se;
> +	struct cpumask update_cpus;
> +
> +	cpumask_clear(&update_cpus);
>  
>  	for_each_sched_entity(se) {
>  		if (se->on_rq)
> @@ -4028,12 +4031,27 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  
>  		update_cfs_shares(cfs_rq);
>  		update_entity_load_avg(se, 1);
> +		/* track cpus that need to be re-evaluated */
> +		cpumask_set_cpu(cpu_of(rq_of(cfs_rq)), &update_cpus);
>  	}
>  
> +	/* !CONFIG_FAIR_GROUP_SCHED */
>  	if (!se) {
>  		update_rq_runnable_avg(rq, rq->nr_running);
>  		add_nr_running(rq, 1);
> +
> +		/*
> +		 * FIXME for !CONFIG_FAIR_GROUP_SCHED it might be nice to
> +		 * typedef update_cpus into an int and skip all of the cpumask
> +		 * stuff
> +		 */
> +		cpumask_set_cpu(cpu_of(rq), &update_cpus);
>  	}
> +
> +	if (energy_aware())
> +		if (!cpumask_empty(&update_cpus))
> +			arch_eval_cpu_freq(&update_cpus);
> +
>  	hrtick_update(rq);
>  }
>  
> @@ -4049,6 +4067,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  	struct cfs_rq *cfs_rq;
>  	struct sched_entity *se = &p->se;
>  	int task_sleep = flags & DEQUEUE_SLEEP;
> +	struct cpumask update_cpus;
> +
> +	cpumask_clear(&update_cpus);
>  
>  	for_each_sched_entity(se) {
>  		cfs_rq = cfs_rq_of(se);
> @@ -4089,12 +4110,27 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  
>  		update_cfs_shares(cfs_rq);
>  		update_entity_load_avg(se, 1);
> +		/* track runqueues/cpus that need to be re-evaluated */
> +		cpumask_set_cpu(cpu_of(rq_of(cfs_rq)), &update_cpus);
>  	}
>  
> +	/* !CONFIG_FAIR_GROUP_SCHED */
>  	if (!se) {
>  		sub_nr_running(rq, 1);
>  		update_rq_runnable_avg(rq, 1);
> +
> +		/*
> +		 * FIXME for !CONFIG_FAIR_GROUP_SCHED it might be nice to
> +		 * typedef update_cpus into an int and skip all of the cpumask
> +		 * stuff
> +		 */
> +		cpumask_set_cpu(cpu_of(rq), &update_cpus);
>  	}
> +
> +	if (energy_aware())
> +		if (!cpumask_empty(&update_cpus))
> +			arch_eval_cpu_freq(&update_cpus);
> +
>  	hrtick_update(rq);
>  }
>  
> @@ -7536,6 +7572,9 @@ static void run_rebalance_domains(struct softirq_action *h)
>  	 * stopped.
>  	 */
>  	nohz_idle_balance(this_rq, idle);
> +
> +	if (energy_aware())
> +		arch_scale_cpu_freq();
>  }
>  
>  /*
>

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1af6f6d..3619f63 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3999,6 +3999,9 @@  enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 {
 	struct cfs_rq *cfs_rq;
 	struct sched_entity *se = &p->se;
+	struct cpumask update_cpus;
+
+	cpumask_clear(&update_cpus);
 
 	for_each_sched_entity(se) {
 		if (se->on_rq)
@@ -4028,12 +4031,27 @@  enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 
 		update_cfs_shares(cfs_rq);
 		update_entity_load_avg(se, 1);
+		/* track cpus that need to be re-evaluated */
+		cpumask_set_cpu(cpu_of(rq_of(cfs_rq)), &update_cpus);
 	}
 
+	/* !CONFIG_FAIR_GROUP_SCHED */
 	if (!se) {
 		update_rq_runnable_avg(rq, rq->nr_running);
 		add_nr_running(rq, 1);
+
+		/*
+		 * FIXME for !CONFIG_FAIR_GROUP_SCHED it might be nice to
+		 * typedef update_cpus into an int and skip all of the cpumask
+		 * stuff
+		 */
+		cpumask_set_cpu(cpu_of(rq), &update_cpus);
 	}
+
+	if (energy_aware())
+		if (!cpumask_empty(&update_cpus))
+			arch_eval_cpu_freq(&update_cpus);
+
 	hrtick_update(rq);
 }
 
@@ -4049,6 +4067,9 @@  static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	struct cfs_rq *cfs_rq;
 	struct sched_entity *se = &p->se;
 	int task_sleep = flags & DEQUEUE_SLEEP;
+	struct cpumask update_cpus;
+
+	cpumask_clear(&update_cpus);
 
 	for_each_sched_entity(se) {
 		cfs_rq = cfs_rq_of(se);
@@ -4089,12 +4110,27 @@  static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 
 		update_cfs_shares(cfs_rq);
 		update_entity_load_avg(se, 1);
+		/* track runqueues/cpus that need to be re-evaluated */
+		cpumask_set_cpu(cpu_of(rq_of(cfs_rq)), &update_cpus);
 	}
 
+	/* !CONFIG_FAIR_GROUP_SCHED */
 	if (!se) {
 		sub_nr_running(rq, 1);
 		update_rq_runnable_avg(rq, 1);
+
+		/*
+		 * FIXME for !CONFIG_FAIR_GROUP_SCHED it might be nice to
+		 * typedef update_cpus into an int and skip all of the cpumask
+		 * stuff
+		 */
+		cpumask_set_cpu(cpu_of(rq), &update_cpus);
 	}
+
+	if (energy_aware())
+		if (!cpumask_empty(&update_cpus))
+			arch_eval_cpu_freq(&update_cpus);
+
 	hrtick_update(rq);
 }
 
@@ -7536,6 +7572,9 @@  static void run_rebalance_domains(struct softirq_action *h)
 	 * stopped.
 	 */
 	nohz_idle_balance(this_rq, idle);
+
+	if (energy_aware())
+		arch_scale_cpu_freq();
 }
 
 /*