[RFC,2/2] sched: Enqueue tasks on a cpu with only SCHED_IDLE tasks

Message ID 71a578a0c5b39169fe74ad378ee41eaf546844ac.1543229820.git.viresh.kumar@linaro.org
State New
Headers show
Series
  • sched: Power optimizations with SCHED_IDLE
Related show

Commit Message

Viresh Kumar Nov. 26, 2018, 11:20 a.m.
The scheduler tries to schedule a newly wakeup task on an idle CPU to
make sure the new task gets chance to run as soon as possible, for
performance reasons.

The SCHED_IDLE scheduling policy is used for tasks which have the lowest
priority and there is no hurry in running them. If all the tasks
currently enqueued on a CPU have their policy set to SCHED_IDLE, then
any new task (non SCHED_IDLE) enqueued on that CPU should normally get a
chance to run immediately. This patch takes advantage of this to save
power in some cases by avoiding waking up an idle CPU (which may be in
some deep idle state) and enqueuing the new task on a CPU which only has
SCHED_IDLE tasks.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>

---
 kernel/sched/core.c  | 23 ++++++++++++++++++++
 kernel/sched/fair.c  | 50 +++++++++++++++++++++++++++++++-------------
 kernel/sched/sched.h |  3 +++
 3 files changed, 62 insertions(+), 14 deletions(-)

-- 
2.19.1.568.g152ad8e3369a

Comments

Quentin Perret Nov. 26, 2018, 12:37 p.m. | #1
Hi Viresh,

On Monday 26 Nov 2018 at 16:50:24 (+0530), Viresh Kumar wrote:
> The scheduler tries to schedule a newly wakeup task on an idle CPU to

> make sure the new task gets chance to run as soon as possible, for

> performance reasons.

> 

> The SCHED_IDLE scheduling policy is used for tasks which have the lowest

> priority and there is no hurry in running them. If all the tasks

> currently enqueued on a CPU have their policy set to SCHED_IDLE, then

> any new task (non SCHED_IDLE) enqueued on that CPU should normally get a

> chance to run immediately. This patch takes advantage of this to save

> power in some cases by avoiding waking up an idle CPU (which may be in

> some deep idle state) and enqueuing the new task on a CPU which only has

> SCHED_IDLE tasks.


So, avoiding to wake-up a CPU isn't always good for energy. You may
prefer to spread tasks in order to keep the OPP low, for example. What
you're trying to achieve here can be actively harmful for both energy
and performance in some cases, I think.

Also, packing will reduce your chances to go cluster idle (yes you're
not guaranteed to go cluster idle either if you spread depending how
the tasks align in time, but at least there's a chance). So, even from
the idle perspective it's not obvious we actually want to do that.

And finally, the placement that this patch tries to achieve is
inherently unbalanced IIUC. So, unless you hide this behind the EAS
static key, you'll need to make sure the periodic/idle load balance code
doesn't kill all the work you do in the wake-up path. So I'm not sure
this patch really works in practice in its current state.

Now, I think you have a point by saying we could possibly be a bit
smarter with the way we deal with SCHED_IDLE tasks, especially if they
are going to be used more (is that a certainty BTW ?), I'm just not
entirely convinced with the 'power' argument yet.

Maybe there is something we could do if, say we need to schedule a
SCHED_NORMAL task and all CPUs have roughly the same load and/or
utilization numbers, then if a CPU is busy running SCHED_IDLE tasks we
should select it in priority since we know for a fact it's not running
anything important.

What do you think ?

Thanks,
Quentin
Viresh Kumar Nov. 27, 2018, 10:24 a.m. | #2
Hi Quentin,

On 26-11-18, 12:37, Quentin Perret wrote:
> On Monday 26 Nov 2018 at 16:50:24 (+0530), Viresh Kumar wrote:

> > The scheduler tries to schedule a newly wakeup task on an idle CPU to

> > make sure the new task gets chance to run as soon as possible, for

> > performance reasons.

> > 

> > The SCHED_IDLE scheduling policy is used for tasks which have the lowest

> > priority and there is no hurry in running them. If all the tasks

> > currently enqueued on a CPU have their policy set to SCHED_IDLE, then

> > any new task (non SCHED_IDLE) enqueued on that CPU should normally get a

> > chance to run immediately. This patch takes advantage of this to save

> > power in some cases by avoiding waking up an idle CPU (which may be in

> > some deep idle state) and enqueuing the new task on a CPU which only has

> > SCHED_IDLE tasks.

> 

> So, avoiding to wake-up a CPU isn't always good for energy. You may

> prefer to spread tasks in order to keep the OPP low, for example. What

> you're trying to achieve here can be actively harmful for both energy

> and performance in some cases, I think.


Yeah, we may end up packing SCHED_IDLE tasks to a single CPU in this case.

We know that dynamic energy is significantly more than static energy and that is
what we should care more about. Yes, higher OPP should be avoided (apart from
performance reasons), but isn't it better (for power) to run a single CPU at
somewhat higher OPP (1GHz ?) instead of running four of them at lower OPPs (500
MHz) ?

> Also, packing will reduce your chances to go cluster idle (yes you're

> not guaranteed to go cluster idle either if you spread depending how

> the tasks align in time, but at least there's a chance). So, even from

> the idle perspective it's not obvious we actually want to do that.


But do we really want to fire all CPUs of a cluster to finish the work earlier
and go cluster idle ? We don't really believe in race-to-idle as that's why we
have the whole DVFS thing here, right ?

> And finally, the placement that this patch tries to achieve is

> inherently unbalanced IIUC. So, unless you hide this behind the EAS

> static key, you'll need to make sure the periodic/idle load balance code

> doesn't kill all the work you do in the wake-up path. So I'm not sure

> this patch really works in practice in its current state.


True, I intentionally left the load-balancer code as is to avoid larger diff for
now. The idea was to get more feedback on the whole thing before investing too
much on it.

> Now, I think you have a point by saying we could possibly be a bit

> smarter with the way we deal with SCHED_IDLE tasks, especially if they

> are going to be used more (is that a certainty BTW ?), I'm just not

> entirely convinced with the 'power' argument yet.


Todd confirmed earlier (privately) that most (?) of the android background tasks
can actually be migrated to use SCHED_IDLE stuff as there is no urgency in
scheduling them normally.

@Todd, can you please provide some inputs here as well ?

> Maybe there is something we could do if, say we need to schedule a

> SCHED_NORMAL task and all CPUs have roughly the same load and/or

> utilization numbers, then if a CPU is busy running SCHED_IDLE tasks we

> should select it in priority since we know for a fact it's not running

> anything important.

> 

> What do you think ?


Sure, I am not saying that the approach taken by this patch is the best or the
worst. We need to come up with better policy on how we can benefit from the
SCHED_IDLE policy and that's where I am looking for inputs from all of you.

Thanks for the feedback.

-- 
viresh
Quentin Perret Nov. 27, 2018, 11 a.m. | #3
Hi Viresh,

On Tuesday 27 Nov 2018 at 15:54:42 (+0530), Viresh Kumar wrote:
> Hi Quentin,

> 

> On 26-11-18, 12:37, Quentin Perret wrote:

> > On Monday 26 Nov 2018 at 16:50:24 (+0530), Viresh Kumar wrote:

> > > The scheduler tries to schedule a newly wakeup task on an idle CPU to

> > > make sure the new task gets chance to run as soon as possible, for

> > > performance reasons.

> > > 

> > > The SCHED_IDLE scheduling policy is used for tasks which have the lowest

> > > priority and there is no hurry in running them. If all the tasks

> > > currently enqueued on a CPU have their policy set to SCHED_IDLE, then

> > > any new task (non SCHED_IDLE) enqueued on that CPU should normally get a

> > > chance to run immediately. This patch takes advantage of this to save

> > > power in some cases by avoiding waking up an idle CPU (which may be in

> > > some deep idle state) and enqueuing the new task on a CPU which only has

> > > SCHED_IDLE tasks.

> > 

> > So, avoiding to wake-up a CPU isn't always good for energy. You may

> > prefer to spread tasks in order to keep the OPP low, for example. What

> > you're trying to achieve here can be actively harmful for both energy

> > and performance in some cases, I think.

> 

> Yeah, we may end up packing SCHED_IDLE tasks to a single CPU in this case.

> 

> We know that dynamic energy is significantly more than static energy and that is

> what we should care more about. Yes, higher OPP should be avoided (apart from

> performance reasons), but isn't it better (for power) to run a single CPU at

> somewhat higher OPP (1GHz ?) instead of running four of them at lower OPPs (500

> MHz) ?


I guess that really depends on your platform (which is why EAS is using
an Energy Model BTW, it's pretty hard to find one heuristic that works
well for all topologies out there). But your example is a bit unfair I
think. You should compare 1 CPU at 1GHz vs 4 CPUs at 250MHz. Otherwise
you're not resourcing the tasks adequately in one of the cases.

> 

> > Also, packing will reduce your chances to go cluster idle (yes you're

> > not guaranteed to go cluster idle either if you spread depending how

> > the tasks align in time, but at least there's a chance). So, even from

> > the idle perspective it's not obvious we actually want to do that.

> 

> But do we really want to fire all CPUs of a cluster to finish the work earlier

> and go cluster idle ? We don't really believe in race-to-idle as that's why we

> have the whole DVFS thing here, right ?


Right, I'm certainly not advocating for a race-to-idle policy here. What
I'm saying is that, if you can avoid to raise the OPP by spreading, it's
often a good thing from an energy standpoint, because as you said the
dynamic energy is generally the most expensive part. But even if you can
pack the tasks on a single CPU without having to raise the OPP, it's not
obvious this is the right thing to do either since that will prevent you
from going cluster idle (or reduce the time you _could_ spend cluster
idle at least).

That kind of packing vs spreading energy assessment is really hard to
do in general, especially without knowing the costs of running at
different idle states.

> > And finally, the placement that this patch tries to achieve is

> > inherently unbalanced IIUC. So, unless you hide this behind the EAS

> > static key, you'll need to make sure the periodic/idle load balance code

> > doesn't kill all the work you do in the wake-up path. So I'm not sure

> > this patch really works in practice in its current state.

> 

> True, I intentionally left the load-balancer code as is to avoid larger diff for

> now. The idea was to get more feedback on the whole thing before investing too

> much on it.


OK I see :-)

> 

> > Now, I think you have a point by saying we could possibly be a bit

> > smarter with the way we deal with SCHED_IDLE tasks, especially if they

> > are going to be used more (is that a certainty BTW ?), I'm just not

> > entirely convinced with the 'power' argument yet.

> 

> Todd confirmed earlier (privately) that most (?) of the android background tasks

> can actually be migrated to use SCHED_IDLE stuff as there is no urgency in

> scheduling them normally.


Ah, that's interesting ... If there are tasks in Android that we really
don't care about (that is it's actually fine to starve them), then maybe
we should put those in SCHED_IDLE indeed ... That'll leave the stage for
the other tasks that do have stronger requirements.

So yeah, I agree it's worth investigating.

> @Todd, can you please provide some inputs here as well ?

> 

> > Maybe there is something we could do if, say we need to schedule a

> > SCHED_NORMAL task and all CPUs have roughly the same load and/or

> > utilization numbers, then if a CPU is busy running SCHED_IDLE tasks we

> > should select it in priority since we know for a fact it's not running

> > anything important.

> > 

> > What do you think ?

> 

> Sure, I am not saying that the approach taken by this patch is the best or the

> worst. We need to come up with better policy on how we can benefit from the

> SCHED_IDLE policy and that's where I am looking for inputs from all of you.


Right so my overall advice would be to try an avoid to hard-code a pure
packing heuristic like that (unless you have loads of numbers to backup
the idea and show it works well), but perhaps to use the policy of the
tasks to try and break the tie between CPU candidates that cannot be
differentiated otherwise because the other metrics (load / utilization)
are roughly equivalent).

We really ought to make sure, however, that we have a strong use case
for SCHED_IDLE tasks in Android or elsewhere first before adding any
kind infrastructure for it.

Anyways, just my two cents ...

> Thanks for the feedback.


I hope that's useful :-)

Thanks,
Quentin

Patch

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3d87a28da378..176eed77b18e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4020,6 +4020,29 @@  int available_idle_cpu(int cpu)
 	return 1;
 }
 
+/* CPU only has SCHED_IDLE tasks enqueued */
+int cpu_only_has_sched_idle_tasks(int cpu)
+{
+	struct rq *rq = cpu_rq(cpu);
+
+	return unlikely(rq->nr_running &&
+			rq->nr_running == rq->cfs.idle_h_nr_running);
+}
+
+int available_sched_idle_cpu(int cpu)
+{
+	if (vcpu_is_preempted(cpu))
+		return 0;
+
+	if (idle_cpu(cpu))
+		return 1;
+
+	if (cpu_only_has_sched_idle_tasks(cpu))
+		return 1;
+
+	return 0;
+}
+
 /**
  * idle_task - return the idle task for a given CPU.
  * @cpu: the processor in question.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ad0b09ddddc0..3a029c740d51 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5623,9 +5623,10 @@  wake_affine_idle(int this_cpu, int prev_cpu, int sync)
 	 * on one CPU.
 	 */
 	if (available_idle_cpu(this_cpu) && cpus_share_cache(this_cpu, prev_cpu))
-		return available_idle_cpu(prev_cpu) ? prev_cpu : this_cpu;
+		return available_sched_idle_cpu(prev_cpu) ? prev_cpu : this_cpu;
 
-	if (sync && cpu_rq(this_cpu)->nr_running == 1)
+	if ((sync && cpu_rq(this_cpu)->nr_running == 1) ||
+	    cpu_only_has_sched_idle_tasks(this_cpu))
 		return this_cpu;
 
 	return nr_cpumask_bits;
@@ -5888,6 +5889,9 @@  find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
 				latest_idle_timestamp = rq->idle_stamp;
 				shallowest_idle_cpu = i;
 			}
+		} else if (cpu_only_has_sched_idle_tasks(i) && !vcpu_is_preempted(i)) {
+			/* Prefer CPU with only SCHED_IDLE tasks */
+			return i;
 		} else if (shallowest_idle_cpu == -1) {
 			load = weighted_cpuload(cpu_rq(i));
 			if (load < min_load) {
@@ -6049,7 +6053,7 @@  static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
  */
 static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
 {
-	int cpu;
+	int cpu, last_idle_cpu = -1;
 
 	if (!static_branch_likely(&sched_smt_present))
 		return -1;
@@ -6057,11 +6061,18 @@  static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int t
 	for_each_cpu(cpu, cpu_smt_mask(target)) {
 		if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
 			continue;
-		if (available_idle_cpu(cpu))
-			return cpu;
+		if (!vcpu_is_preempted(cpu)) {
+			if (idle_cpu(cpu)) {
+				/* Prefer CPU with only SCHED_IDLE tasks */
+				last_idle_cpu = cpu;
+				continue;
+			}
+			if (cpu_only_has_sched_idle_tasks(cpu))
+				return cpu;
+		}
 	}
 
-	return -1;
+	return last_idle_cpu;
 }
 
 #else /* CONFIG_SCHED_SMT */
@@ -6089,7 +6100,7 @@  static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	u64 avg_cost, avg_idle;
 	u64 time, cost;
 	s64 delta;
-	int cpu, nr = INT_MAX;
+	int cpu, nr = INT_MAX, last_idle_cpu = -1;
 
 	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
 	if (!this_sd)
@@ -6116,12 +6127,23 @@  static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	time = local_clock();
 
 	for_each_cpu_wrap(cpu, sched_domain_span(sd), target) {
-		if (!--nr)
-			return -1;
+		if (!--nr) {
+			if (last_idle_cpu == -1)
+				return -1;
+			cpu = last_idle_cpu;
+			break;
+		}
 		if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
 			continue;
-		if (available_idle_cpu(cpu))
-			break;
+		if (!vcpu_is_preempted(cpu)) {
+			if (idle_cpu(cpu)) {
+				/* Prefer CPU with only SCHED_IDLE tasks */
+				last_idle_cpu = cpu;
+				continue;
+			}
+			if (cpu_only_has_sched_idle_tasks(cpu))
+				break;
+		}
 	}
 
 	time = local_clock() - time;
@@ -6140,13 +6162,13 @@  static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	struct sched_domain *sd;
 	int i, recent_used_cpu;
 
-	if (available_idle_cpu(target))
+	if (available_sched_idle_cpu(target))
 		return target;
 
 	/*
 	 * If the previous CPU is cache affine and idle, don't be stupid:
 	 */
-	if (prev != target && cpus_share_cache(prev, target) && available_idle_cpu(prev))
+	if (prev != target && cpus_share_cache(prev, target) && available_sched_idle_cpu(prev))
 		return prev;
 
 	/* Check a recently used CPU as a potential idle candidate: */
@@ -6154,7 +6176,7 @@  static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	if (recent_used_cpu != prev &&
 	    recent_used_cpu != target &&
 	    cpus_share_cache(recent_used_cpu, target) &&
-	    available_idle_cpu(recent_used_cpu) &&
+	    available_sched_idle_cpu(recent_used_cpu) &&
 	    cpumask_test_cpu(p->recent_used_cpu, &p->cpus_allowed)) {
 		/*
 		 * Replace recent_used_cpu with prev as it is a potential
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 86a388c506ac..ecd016c64ee2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1828,6 +1828,9 @@  extern void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags);
 extern const_debug unsigned int sysctl_sched_nr_migrate;
 extern const_debug unsigned int sysctl_sched_migration_cost;
 
+extern int cpu_only_has_sched_idle_tasks(int cpu);
+extern int available_sched_idle_cpu(int cpu);
+
 #ifdef CONFIG_SCHED_HRTICK
 
 /*