[RFC,V2,2/2] sched/fair: Fallback to sched-idle CPU if idle CPU isn't found

Message ID 59b37c56b8fcb834f7d3234e776eaeff74ad117f.1556182965.git.viresh.kumar@linaro.org
State New
Headers show
Series
  • sched/fair: Fallback to sched-idle CPU for better performance
Related show

Commit Message

Viresh Kumar April 25, 2019, 9:37 a.m.
We target for an idle CPU in select_idle_sibling() to run the next task,
but in case we don't find idle CPUs it is better to pick a CPU which
will run the task the soonest, for performance reason. A CPU which isn't
idle but has only SCHED_IDLE activity queued on it should be a good
target based on this criteria as any normal fair task will most likely
preempt the currently running SCHED_IDLE task immediately. In fact,
choosing a SCHED_IDLE CPU shall give better results as it should be able
to run the task sooner than an idle CPU (which requires to be woken up
from an idle state).

This patch updates the fast path to fallback to a sched-idle CPU if the
idle CPU isn't found, the slow path can be updated separately later.

Following is the order in which select_idle_sibling() picks up next CPU
to run the task now:

1. idle_cpu(target) OR sched_idle_cpu(target)
2. idle_cpu(prev) OR sched_idle_cpu(prev)
3. idle_cpu(recent_used_cpu) OR sched_idle_cpu(recent_used_cpu)
4. idle core(sd)
5. idle_cpu(sd)
6. sched_idle_cpu(sd)
7. idle_cpu(p) - smt
8. sched_idle_cpu(p)- smt

Though the policy can be tweaked a bit if we want to have different
priorities.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>

---
 kernel/sched/fair.c | 28 +++++++++++++++++++++-------
 1 file changed, 21 insertions(+), 7 deletions(-)

-- 
2.21.0.rc0.269.g1a574e7a288b

Comments

Peter Zijlstra May 10, 2019, 7:21 a.m. | #1
On Thu, Apr 25, 2019 at 03:07:40PM +0530, Viresh Kumar wrote:
> We target for an idle CPU in select_idle_sibling() to run the next task,

> but in case we don't find idle CPUs it is better to pick a CPU which

> will run the task the soonest, for performance reason. A CPU which isn't

> idle but has only SCHED_IDLE activity queued on it should be a good

> target based on this criteria as any normal fair task will most likely

> preempt the currently running SCHED_IDLE task immediately. In fact,

> choosing a SCHED_IDLE CPU shall give better results as it should be able

> to run the task sooner than an idle CPU (which requires to be woken up

> from an idle state).

> 

> This patch updates the fast path to fallback to a sched-idle CPU if the

> idle CPU isn't found, the slow path can be updated separately later.

> 

> Following is the order in which select_idle_sibling() picks up next CPU

> to run the task now:

> 

> 1. idle_cpu(target) OR sched_idle_cpu(target)

> 2. idle_cpu(prev) OR sched_idle_cpu(prev)

> 3. idle_cpu(recent_used_cpu) OR sched_idle_cpu(recent_used_cpu)

> 4. idle core(sd)

> 5. idle_cpu(sd)

> 6. sched_idle_cpu(sd)

> 7. idle_cpu(p) - smt

> 8. sched_idle_cpu(p)- smt

> 

> Though the policy can be tweaked a bit if we want to have different

> priorities.


I don't hate his per se; but the whole select_idle_sibling() thing is
something that needs looking at.

There was the task stealing thing from Steve that looked interesting and
that would render your apporach unfeasible.
Viresh Kumar May 13, 2019, 9:34 a.m. | #2
On 10-05-19, 09:21, Peter Zijlstra wrote:
> On Thu, Apr 25, 2019 at 03:07:40PM +0530, Viresh Kumar wrote:

> > We target for an idle CPU in select_idle_sibling() to run the next task,

> > but in case we don't find idle CPUs it is better to pick a CPU which

> > will run the task the soonest, for performance reason. A CPU which isn't

> > idle but has only SCHED_IDLE activity queued on it should be a good

> > target based on this criteria as any normal fair task will most likely

> > preempt the currently running SCHED_IDLE task immediately. In fact,

> > choosing a SCHED_IDLE CPU shall give better results as it should be able

> > to run the task sooner than an idle CPU (which requires to be woken up

> > from an idle state).

> > 

> > This patch updates the fast path to fallback to a sched-idle CPU if the

> > idle CPU isn't found, the slow path can be updated separately later.

> > 

> > Following is the order in which select_idle_sibling() picks up next CPU

> > to run the task now:

> > 

> > 1. idle_cpu(target) OR sched_idle_cpu(target)

> > 2. idle_cpu(prev) OR sched_idle_cpu(prev)

> > 3. idle_cpu(recent_used_cpu) OR sched_idle_cpu(recent_used_cpu)

> > 4. idle core(sd)

> > 5. idle_cpu(sd)

> > 6. sched_idle_cpu(sd)

> > 7. idle_cpu(p) - smt

> > 8. sched_idle_cpu(p)- smt

> > 

> > Though the policy can be tweaked a bit if we want to have different

> > priorities.

> 

> I don't hate his per se; but the whole select_idle_sibling() thing is

> something that needs looking at.

> 

> There was the task stealing thing from Steve that looked interesting and

> that would render your apporach unfeasible.


I am surely missing something as I don't see how that patchset will
make this patchset perform badly, than what it already does.

The idea of this patchset is to find a CPU which can run the task the
soonest if no other CPU is idle. If a CPU is idle we still want to run
the task on that one to finish work asap. This patchset only updates
the fast path right now and doesn't touch slow-path and periodic/idle
load-balance path. That would be the next step for sure though.

Steve's patchset (IIUC) adds a new fast way of doing idle-load-balance
at the LLC level, that is no different than normal idle-load-balancing
for this patchset. In fact, I will say that Steve's patchset makes our
work easier to extend going forward as we can capitalize on the new
*fast* infrastructure to pull tasks even when a CPU isn't fully idle
but only has sched-idle stuff on it.

Does this makes sense ?

@Song: Thanks for giving this a try and I am really happy to see your
results. I do see that we still don't get the performance we wanted,
perhaps because we only touch the fast path. Maybe load-balance screws
it up for us at a later point of time and CPUs are left with only
sched-idle tasks. Not sure though.

-- 
viresh
Peter Zijlstra May 13, 2019, 11:35 a.m. | #3
On Mon, May 13, 2019 at 03:04:18PM +0530, Viresh Kumar wrote:
> On 10-05-19, 09:21, Peter Zijlstra wrote:


> > I don't hate his per se; but the whole select_idle_sibling() thing is

> > something that needs looking at.

> > 

> > There was the task stealing thing from Steve that looked interesting and

> > that would render your apporach unfeasible.

> 

> I am surely missing something as I don't see how that patchset will

> make this patchset perform badly, than what it already does.


Nah; I just misremembered. I know Oracle has a patch set poking at
select_idle_siblings() _somewhere_ (as do I), and I just found the wrong
one.

Basically everybody is complaining select_idle_sibling() is too
expensive for checking the entire LLC domain, except for FB (and thus
likely some other workloads too) that depend on it to kill their tail
latency.

But I suppose we could still do this, even if we scan only a subset of
the LLC, just keep track of the last !idle CPU running only SCHED_IDLE
tasks and pick that if you do not (in your limited scan) find a better
candidate.
Steven Sistare May 14, 2019, 4:03 p.m. | #4
On 5/13/2019 7:35 AM, Peter Zijlstra wrote:
> On Mon, May 13, 2019 at 03:04:18PM +0530, Viresh Kumar wrote:

>> On 10-05-19, 09:21, Peter Zijlstra wrote:

> 

>>> I don't hate his per se; but the whole select_idle_sibling() thing is

>>> something that needs looking at.

>>>

>>> There was the task stealing thing from Steve that looked interesting and

>>> that would render your apporach unfeasible.

>>

>> I am surely missing something as I don't see how that patchset will

>> make this patchset perform badly, than what it already does.

> 

> Nah; I just misremembered. I know Oracle has a patch set poking at

> select_idle_siblings() _somewhere_ (as do I), and I just found the wrong

> one.

> 

> Basically everybody is complaining select_idle_sibling() is too

> expensive for checking the entire LLC domain, except for FB (and thus

> likely some other workloads too) that depend on it to kill their tail

> latency.

> 

> But I suppose we could still do this, even if we scan only a subset of

> the LLC, just keep track of the last !idle CPU running only SCHED_IDLE

> tasks and pick that if you do not (in your limited scan) find a better

> candidate.


Subhra posted a patch that incrementally searches for an idle CPU in the LLC,
remembering the last CPU examined, and searching a fixed number of CPUs from there.
That technique is compatible with the one that Viresh suggests; the incremental
search would stop if a SCHED_IDLE cpu was found.

I also fiddled with select_idle_sibling, maintaining a per-LLC bitmap of idle CPUs,
updated with atomic operations.  Performance was basically unchanged for the
workloads I tested, and I inserted timers around the idle search showing it was
a very small fraction of time both before and after my changes.  That led me to
ignore the push side and optimize the pull side with task stealing.

I would be very interested in hearing from folks that have workloads that demonstrate
that select_idle_sibling is too expensive.

- Steve
Subhra Mazumdar May 14, 2019, 5:27 p.m. | #5
On 5/14/19 9:03 AM, Steven Sistare wrote:
> On 5/13/2019 7:35 AM, Peter Zijlstra wrote:

>> On Mon, May 13, 2019 at 03:04:18PM +0530, Viresh Kumar wrote:

>>> On 10-05-19, 09:21, Peter Zijlstra wrote:

>>>> I don't hate his per se; but the whole select_idle_sibling() thing is

>>>> something that needs looking at.

>>>>

>>>> There was the task stealing thing from Steve that looked interesting and

>>>> that would render your apporach unfeasible.

>>> I am surely missing something as I don't see how that patchset will

>>> make this patchset perform badly, than what it already does.

>> Nah; I just misremembered. I know Oracle has a patch set poking at

>> select_idle_siblings() _somewhere_ (as do I), and I just found the wrong

>> one.

>>

>> Basically everybody is complaining select_idle_sibling() is too

>> expensive for checking the entire LLC domain, except for FB (and thus

>> likely some other workloads too) that depend on it to kill their tail

>> latency.

>>

>> But I suppose we could still do this, even if we scan only a subset of

>> the LLC, just keep track of the last !idle CPU running only SCHED_IDLE

>> tasks and pick that if you do not (in your limited scan) find a better

>> candidate.

> Subhra posted a patch that incrementally searches for an idle CPU in the LLC,

> remembering the last CPU examined, and searching a fixed number of CPUs from there.

> That technique is compatible with the one that Viresh suggests; the incremental

> search would stop if a SCHED_IDLE cpu was found.

This was the last version of patchset I sent:
https://lkml.org/lkml/2018/6/28/810

Also select_idle_core is a net -ve for certain workloads like OLTP. So I
had put a SCHED_FEAT to be able to disable it.

Thanks,
Subhra
>

> I also fiddled with select_idle_sibling, maintaining a per-LLC bitmap of idle CPUs,

> updated with atomic operations.  Performance was basically unchanged for the

> workloads I tested, and I inserted timers around the idle search showing it was

> a very small fraction of time both before and after my changes.  That led me to

> ignore the push side and optimize the pull side with task stealing.

>

> I would be very interested in hearing from folks that have workloads that demonstrate

> that select_idle_sibling is too expensive.

>

> - Steve
Subhra Mazumdar May 14, 2019, 5:36 p.m. | #6
On 5/14/19 10:27 AM, Subhra Mazumdar wrote:
>

> On 5/14/19 9:03 AM, Steven Sistare wrote:

>> On 5/13/2019 7:35 AM, Peter Zijlstra wrote:

>>> On Mon, May 13, 2019 at 03:04:18PM +0530, Viresh Kumar wrote:

>>>> On 10-05-19, 09:21, Peter Zijlstra wrote:

>>>>> I don't hate his per se; but the whole select_idle_sibling() thing is

>>>>> something that needs looking at.

>>>>>

>>>>> There was the task stealing thing from Steve that looked 

>>>>> interesting and

>>>>> that would render your apporach unfeasible.

>>>> I am surely missing something as I don't see how that patchset will

>>>> make this patchset perform badly, than what it already does.

>>> Nah; I just misremembered. I know Oracle has a patch set poking at

>>> select_idle_siblings() _somewhere_ (as do I), and I just found the 

>>> wrong

>>> one.

>>>

>>> Basically everybody is complaining select_idle_sibling() is too

>>> expensive for checking the entire LLC domain, except for FB (and thus

>>> likely some other workloads too) that depend on it to kill their tail

>>> latency.

>>>

>>> But I suppose we could still do this, even if we scan only a subset of

>>> the LLC, just keep track of the last !idle CPU running only SCHED_IDLE

>>> tasks and pick that if you do not (in your limited scan) find a better

>>> candidate.

>> Subhra posted a patch that incrementally searches for an idle CPU in 

>> the LLC,

>> remembering the last CPU examined, and searching a fixed number of 

>> CPUs from there.

>> That technique is compatible with the one that Viresh suggests; the 

>> incremental

>> search would stop if a SCHED_IDLE cpu was found.

> This was the last version of patchset I sent:

> https://lkml.org/lkml/2018/6/28/810

>

> Also select_idle_core is a net -ve for certain workloads like OLTP. So I

> had put a SCHED_FEAT to be able to disable it.

Forgot to add, the cpumask_weight computation may not be O(1) with large
number of CPUs, so needs to be precomputed in a per-cpu variable to further
optimize. That part is missing from the above patchset.
>

> Thanks,

> Subhra

>>

>> I also fiddled with select_idle_sibling, maintaining a per-LLC bitmap 

>> of idle CPUs,

>> updated with atomic operations.  Performance was basically unchanged 

>> for the

>> workloads I tested, and I inserted timers around the idle search 

>> showing it was

>> a very small fraction of time both before and after my changes.  That 

>> led me to

>> ignore the push side and optimize the pull side with task stealing.

>>

>> I would be very interested in hearing from folks that have workloads 

>> that demonstrate

>> that select_idle_sibling is too expensive.

>>

>> - Steve
Viresh Kumar June 4, 2019, 2:58 a.m. | #7
On 25-04-19, 15:07, Viresh Kumar wrote:
> We target for an idle CPU in select_idle_sibling() to run the next task,

> but in case we don't find idle CPUs it is better to pick a CPU which

> will run the task the soonest, for performance reason. A CPU which isn't

> idle but has only SCHED_IDLE activity queued on it should be a good

> target based on this criteria as any normal fair task will most likely

> preempt the currently running SCHED_IDLE task immediately. In fact,

> choosing a SCHED_IDLE CPU shall give better results as it should be able

> to run the task sooner than an idle CPU (which requires to be woken up

> from an idle state).

> 

> This patch updates the fast path to fallback to a sched-idle CPU if the

> idle CPU isn't found, the slow path can be updated separately later.

> 

> Following is the order in which select_idle_sibling() picks up next CPU

> to run the task now:

> 

> 1. idle_cpu(target) OR sched_idle_cpu(target)

> 2. idle_cpu(prev) OR sched_idle_cpu(prev)

> 3. idle_cpu(recent_used_cpu) OR sched_idle_cpu(recent_used_cpu)

> 4. idle core(sd)

> 5. idle_cpu(sd)

> 6. sched_idle_cpu(sd)

> 7. idle_cpu(p) - smt

> 8. sched_idle_cpu(p)- smt

> 

> Though the policy can be tweaked a bit if we want to have different

> priorities.

> 

> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>

> ---

>  kernel/sched/fair.c | 28 +++++++++++++++++++++-------

>  1 file changed, 21 insertions(+), 7 deletions(-)


Hi Peter,

I was looking to send V3 with the changes you suggested for the patch 1/2, are
there any changes that I should be doing in this patch along with it ?

-- 
viresh

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6511cb57acdd..fbaefb9a9296 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6057,6 +6057,15 @@  static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
 	return new_cpu;
 }
 
+/* CPU only has SCHED_IDLE tasks enqueued */
+static int sched_idle_cpu(int cpu)
+{
+	struct rq *rq = cpu_rq(cpu);
+
+	return unlikely(rq->nr_running == rq->cfs.idle_h_nr_running &&
+			rq->nr_running);
+}
+
 #ifdef CONFIG_SCHED_SMT
 DEFINE_STATIC_KEY_FALSE(sched_smt_present);
 EXPORT_SYMBOL_GPL(sched_smt_present);
@@ -6154,7 +6163,7 @@  static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
  */
 static int select_idle_smt(struct task_struct *p, int target)
 {
-	int cpu;
+	int cpu, si_cpu = -1;
 
 	if (!static_branch_likely(&sched_smt_present))
 		return -1;
@@ -6164,9 +6173,11 @@  static int select_idle_smt(struct task_struct *p, int target)
 			continue;
 		if (available_idle_cpu(cpu))
 			return cpu;
+		if (si_cpu == -1 && sched_idle_cpu(cpu))
+			si_cpu = cpu;
 	}
 
-	return -1;
+	return si_cpu;
 }
 
 #else /* CONFIG_SCHED_SMT */
@@ -6194,7 +6205,7 @@  static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	u64 avg_cost, avg_idle;
 	u64 time, cost;
 	s64 delta;
-	int cpu, nr = INT_MAX;
+	int cpu, nr = INT_MAX, si_cpu = -1;
 
 	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
 	if (!this_sd)
@@ -6222,11 +6233,13 @@  static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 
 	for_each_cpu_wrap(cpu, sched_domain_span(sd), target) {
 		if (!--nr)
-			return -1;
+			return si_cpu;
 		if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
 			continue;
 		if (available_idle_cpu(cpu))
 			break;
+		if (si_cpu == -1 && sched_idle_cpu(cpu))
+			si_cpu = cpu;
 	}
 
 	time = local_clock() - time;
@@ -6245,13 +6258,14 @@  static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	struct sched_domain *sd;
 	int i, recent_used_cpu;
 
-	if (available_idle_cpu(target))
+	if (available_idle_cpu(target) || sched_idle_cpu(target))
 		return target;
 
 	/*
 	 * If the previous CPU is cache affine and idle, don't be stupid:
 	 */
-	if (prev != target && cpus_share_cache(prev, target) && available_idle_cpu(prev))
+	if (prev != target && cpus_share_cache(prev, target) &&
+	    (available_idle_cpu(prev) || sched_idle_cpu(prev)))
 		return prev;
 
 	/* Check a recently used CPU as a potential idle candidate: */
@@ -6259,7 +6273,7 @@  static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	if (recent_used_cpu != prev &&
 	    recent_used_cpu != target &&
 	    cpus_share_cache(recent_used_cpu, target) &&
-	    available_idle_cpu(recent_used_cpu) &&
+	    (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) &&
 	    cpumask_test_cpu(p->recent_used_cpu, &p->cpus_allowed)) {
 		/*
 		 * Replace recent_used_cpu with prev as it is a potential