
[v3,09/10] sched/fair: use load instead of runnable load in wakeup path

Message ID 1568878421-12301-10-git-send-email-vincent.guittot@linaro.org
State New
Series sched/fair: rework the CFS load balance

Commit Message

Vincent Guittot Sept. 19, 2019, 7:33 a.m. UTC
Runnable load has been introduced to take into account the case where
blocked load biases the wake up path, which may end up selecting an
overloaded CPU with a large number of runnable tasks instead of an
underutilized CPU with a huge blocked load.

The wake up path now looks for idle CPUs before comparing runnable
load, so it's worth aligning the wake up path with load_balance.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

---
 kernel/sched/fair.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

-- 
2.7.4
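
For context, the two rq-level metrics being swapped by this patch look
roughly like this (a simplified sketch of the kernel/sched/fair.c helpers
as of this series, shown for reference only):

static unsigned long cpu_runnable_load(struct rq *rq)
{
	/* PELT load of the tasks currently runnable on this rq. */
	return cfs_rq_runnable_load_avg(&rq->cfs);
}

static unsigned long cpu_load(struct rq *rq)
{
	/* PELT load including the contribution of blocked (sleeping) tasks. */
	return cfs_rq_load_avg(&rq->cfs);
}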

Comments

Rik van Riel Oct. 7, 2019, 3:14 p.m. UTC | #1
On Thu, 2019-09-19 at 09:33 +0200, Vincent Guittot wrote:
> runnable load has been introduced to take into account the case where
> blocked load biases the wake up path which may end to select an
> overloaded
> CPU with a large number of runnable tasks instead of an underutilized
> CPU with a huge blocked load.
>
> Tha wake up path now starts to looks for idle CPUs before comparing
> runnable load and it's worth aligning the wake up path with the
> load_balance.
>
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

On a single socket system, patches 9 & 10 have the
result of driving a woken up task (when wake_wide is
true) to the CPU core with the lowest blocked load,
even when there is an idle core the task could run on
right now.

With the whole series applied, I see a 1-2% regression
in CPU use due to that issue.

With only patches 1-8 applied, I see a 1% improvement in
CPU use for that same workload.

Given that it looks like select_idle_sibling and
find_idlest_group_cpu do roughly the same thing, I
wonder if it is enough to simply add an additional
test to find_idlest_group to have it return the
LLC sg, if it is called on the LLC sd on a single
socket system.
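
A minimal sketch of that kind of test (illustrative only; the exact
flag to check and the single-socket condition are assumptions, not
something taken from this series):

static inline bool sd_is_single_socket_llc(struct sched_domain *sd)
{
	/*
	 * An LLC-level domain with no parent is taken here to mean a
	 * single socket system; find_idlest_group() could then return
	 * its local group right away and leave the idle-core search to
	 * find_idlest_group_cpu().
	 */
	return (sd->flags & SD_SHARE_PKG_RESOURCES) && !sd->parent;
}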

That way find_idlest_group_cpu can still find an
idle core like it does today.

Does that seem like a reasonable thing?

I can run tests with that :)

-- 
All Rights Reversed.
Vincent Guittot Oct. 7, 2019, 3:27 p.m. UTC | #2
On Mon, 7 Oct 2019 at 17:14, Rik van Riel <riel@surriel.com> wrote:
>
> On Thu, 2019-09-19 at 09:33 +0200, Vincent Guittot wrote:
> > runnable load has been introduced to take into account the case where
> > blocked load biases the wake up path which may end to select an
> > overloaded
> > CPU with a large number of runnable tasks instead of an underutilized
> > CPU with a huge blocked load.
> >
> > Tha wake up path now starts to looks for idle CPUs before comparing
> > runnable load and it's worth aligning the wake up path with the
> > load_balance.
> >
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>
> On a single socket system, patches 9 & 10 have the
> result of driving a woken up task (when wake_wide is
> true) to the CPU core with the lowest blocked load,
> even when there is an idle core the task could run on
> right now.
>
> With the whole series applied, I see a 1-2% regression
> in CPU use due to that issue.
>
> With only patches 1-8 applied, I see a 1% improvement in
> CPU use for that same workload.

Thanks for testing.
Patches 8-9 have just replaced runnable load with load (which includes
the blocked load) and then removed the duplicated metrics in
find_idlest_group.
I'm preparing an additional patch that reworks find_idlest_group() to
behave similarly to find_busiest_group(): it gathers statistics as it
already does, then classifies the groups and finally selects the
idlest one. This should fix the problem you mention above, where a
group with the lowest blocked load is selected even though there are
idle CPUs in another group with a higher blocked load.
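
As a rough outline, the planned structure would mirror
find_busiest_group() along these lines (hypothetical sketch; all names
and group types below are placeholders, not the actual patch):

enum idlest_group_type {
	group_has_spare,	/* idle CPUs or spare capacity left */
	group_fully_busy,
	group_overloaded,	/* more load than the group can serve */
};

struct sg_wakeup_stats {
	unsigned long group_load;	/* sum of cpu_load() over the group */
	unsigned long group_util;	/* sum of utilization */
	unsigned int idle_cpus;
	enum idlest_group_type group_type;
};

/* 1) fill one sg_wakeup_stats per group, 2) classify it into one of the
 * types above, 3) prefer group_has_spare, then the lowest load. */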

>
> Given that it looks like select_idle_sibling and
> find_idlest_group_cpu do roughly the same thing, I
> wonder if it is enough to simply add an additional
> test to find_idlest_group to have it return the
> LLC sg, if it is called on the LLC sd on a single
> socket system.

That makes sense to me.

>
> That way find_idlest_group_cpu can still find an
> idle core like it does today.
>
> Does that seem like a reasonable thing?

That's worth testing

>
> I can run tests with that :)
>
> --
> All Rights Reversed.
Rik van Riel Oct. 7, 2019, 6:06 p.m. UTC | #3
On Mon, 2019-10-07 at 17:27 +0200, Vincent Guittot wrote:
> On Mon, 7 Oct 2019 at 17:14, Rik van Riel <riel@surriel.com> wrote:
> > On Thu, 2019-09-19 at 09:33 +0200, Vincent Guittot wrote:
> > > runnable load has been introduced to take into account the case
> > > where
> > > blocked load biases the wake up path which may end to select an
> > > overloaded
> > > CPU with a large number of runnable tasks instead of an
> > > underutilized
> > > CPU with a huge blocked load.
> > >
> > > Tha wake up path now starts to looks for idle CPUs before
> > > comparing
> > > runnable load and it's worth aligning the wake up path with the
> > > load_balance.
> > >
> > > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> >
> > On a single socket system, patches 9 & 10 have the
> > result of driving a woken up task (when wake_wide is
> > true) to the CPU core with the lowest blocked load,
> > even when there is an idle core the task could run on
> > right now.
> >
> > With the whole series applied, I see a 1-2% regression
> > in CPU use due to that issue.
> >
> > With only patches 1-8 applied, I see a 1% improvement in
> > CPU use for that same workload.
>
> Thanks for testing.
> patch 8-9 have just replaced runnable load by blocked load and then
> removed the duplicated metrics in find_idlest_group.
> I'm preparing an additional patch that reworks find_idlest_group()
> to
> behave similarly to find_busiest_group(). It gathers statistics what
> it already does, then classifies the groups and finally selects the
> idlest one. This should fix the problem that you mentioned above when
> it selects a group with lowest blocked load whereas there are idle
> cpus in another group with high blocked load.

That should do the trick!

> > Given that it looks like select_idle_sibling and
> > find_idlest_group_cpu do roughly the same thing, I
> > wonder if it is enough to simply add an additional
> > test to find_idlest_group to have it return the
> > LLC sg, if it is called on the LLC sd on a single
> > socket system.
>
> That make sense to me
>
> > That way find_idlest_group_cpu can still find an
> > idle core like it does today.
> >
> > Does that seem like a reasonable thing?
>
> That's worth testing

I'll give it a try.

Doing the full find_idlest_group heuristic
inside an LLC seems like it would be overkill,
anyway.

-- 
All Rights Reversed.

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index acca869..39a37ae 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5485,7 +5485,7 @@  wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
 	s64 this_eff_load, prev_eff_load;
 	unsigned long task_load;
 
-	this_eff_load = cpu_runnable_load(cpu_rq(this_cpu));
+	this_eff_load = cpu_load(cpu_rq(this_cpu));
 
 	if (sync) {
 		unsigned long current_load = task_h_load(current);
@@ -5503,7 +5503,7 @@  wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
 		this_eff_load *= 100;
 	this_eff_load *= capacity_of(prev_cpu);
 
-	prev_eff_load = cpu_runnable_load(cpu_rq(prev_cpu));
+	prev_eff_load = cpu_load(cpu_rq(prev_cpu));
 	prev_eff_load -= task_load;
 	if (sched_feat(WA_BIAS))
 		prev_eff_load *= 100 + (sd->imbalance_pct - 100) / 2;
@@ -5591,7 +5591,7 @@  find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 		max_spare_cap = 0;
 
 		for_each_cpu(i, sched_group_span(group)) {
-			load = cpu_runnable_load(cpu_rq(i));
+			load = cpu_load(cpu_rq(i));
 			runnable_load += load;
 
 			avg_load += cfs_rq_load_avg(&cpu_rq(i)->cfs);
@@ -5732,7 +5732,7 @@  find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
 				continue;
 			}
 
-			load = cpu_runnable_load(cpu_rq(i));
+			load = cpu_load(cpu_rq(i));
 			if (load < min_load) {
 				min_load = load;
 				least_loaded_cpu = i;
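
For reference, the net effect of the wake_affine_weight() hunks above
is roughly the cross-multiplied comparison below (simplified sketch;
the WA_BIAS scaling and the sync-wakeup adjustment of the real function
are omitted):

/*
 * Both sides now use cpu_load() (load including blocked contributions)
 * instead of cpu_runnable_load(), each scaled by the capacity of the
 * other CPU.
 */
static inline int prefer_this_cpu(unsigned long this_load,
				  unsigned long prev_load,
				  unsigned long this_capacity,
				  unsigned long prev_capacity,
				  unsigned long task_load)
{
	unsigned long this_eff_load = (this_load + task_load) * prev_capacity;
	unsigned long prev_eff_load =
		(prev_load > task_load ? prev_load - task_load : 0) * this_capacity;

	/* The lower effective load per unit of capacity wins. */
	return this_eff_load < prev_eff_load;
}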