[2/2,v2] sched: use load_avg for selecting idlest group

Message ID 20161205092735.GA9161@linaro.org
State New
Headers show

Commit Message

Vincent Guittot Dec. 5, 2016, 9:27 a.m.
Le Saturday 03 Dec 2016 à 21:47:07 (+0000), Matt Fleming a écrit :
> On Fri, 02 Dec, at 07:31:04PM, Brendan Gregg wrote:

> > 

> > For background, is this from the "A decade of wasted cores" paper's

> > patches?

> 

> No, this patch fixes an issue I originally reported here,

> 

>   https://lkml.kernel.org/r/20160923115808.2330-1-matt@codeblueprint.co.uk

> 

> Essentially, if you have an idle or partially-idle system and a

> workload that consists of fork()'ing a bunch of tasks, where each of

> those tasks immediately sleeps waiting for some wakeup, then those

> tasks aren't spread across all idle CPUs very well.

> 

> We saw this issue when running hackbench with a small loop count, such

> that the actual benchmark setup (fork()'ing) is where the majority of

> the runtime is spent.

> 

> In that scenario, there's a large potential/blocked load, but

> essentially no runnable load, and the balance on fork scheduler code

> only cares about runnable load without Vincent's patch applied.

> 

> The closest thing I can find in the "A decade of wasted cores" paper

> is "The Overload-on-Wakeup bug", but I don't think that's the issue

> here since,

> 

>   a) We're balancing on fork, not wakeup

>   b) The fork on balance code balances across nodes OK

> 

> > What's the expected typical gain? Thanks,

> 

> The results are still coming back from the SUSE performance test grid

> but they do show that this patch is mainly a win for multi-socket

> machines with more than 8 cores or thereabouts.

> 

>  [ Vincent, I'll follow up to your PATCH 1/2 with the results that are

>    specifically for that patch ]

> 

> Assuming a fork-intensive or fork-dominated workload, and a

> multi-socket machine, such as this 2 socket, NUMA, with 12 cores and

> HT enabled (48 cpus), we saw a very clear win between +10% and +15%

> for processes communicating via pipes,

> 

>   (1) tip-sched = tip/sched/core branch

>   (2) fix-fig-for-fork = (1) + PATCH 1/2

>   (3) fix-sig = (1) + (2) + PATCH 2/2

> 

> hackbench-process-pipes

>                          4.9.0-rc6             4.9.0-rc6             4.9.0-rc6

>                          tip-sched      fix-fig-for-fork               fix-sig

> Amean    1        0.0717 (  0.00%)      0.0696 (  2.99%)      0.0730 ( -1.79%)

> Amean    4        0.1244 (  0.00%)      0.1200 (  3.56%)      0.1190 (  4.36%)

> Amean    7        0.1891 (  0.00%)      0.1937 ( -2.42%)      0.1831 (  3.17%)

> Amean    12       0.2964 (  0.00%)      0.3116 ( -5.11%)      0.2784 (  6.07%)

> Amean    21       0.4011 (  0.00%)      0.4090 ( -1.96%)      0.3574 ( 10.90%)

> Amean    30       0.4944 (  0.00%)      0.4654 (  5.87%)      0.4171 ( 15.63%)

> Amean    48       0.6113 (  0.00%)      0.6309 ( -3.20%)      0.5331 ( 12.78%)

> Amean    79       0.8616 (  0.00%)      0.8706 ( -1.04%)      0.7710 ( 10.51%)

> Amean    110      1.1304 (  0.00%)      1.2211 ( -8.02%)      1.0163 ( 10.10%)

> Amean    141      1.3754 (  0.00%)      1.4279 ( -3.81%)      1.2803 (  6.92%)

> Amean    172      1.6217 (  0.00%)      1.7367 ( -7.09%)      1.5363 (  5.27%)

> Amean    192      1.7809 (  0.00%)      2.0199 (-13.42%)      1.7129 (  3.82%)

> 

> Things look even better when using threads and pipes, with wins

> between 11% and 29% when looking at results outside of the noise,

> 

> hackbench-thread-pipes

>                          4.9.0-rc6             4.9.0-rc6             4.9.0-rc6

>                          tip-sched      fix-fig-for-fork               fix-sig

> Amean    1        0.0736 (  0.00%)      0.0794 ( -7.96%)      0.0779 ( -5.83%)

> Amean    4        0.1709 (  0.00%)      0.1690 (  1.09%)      0.1663 (  2.68%)

> Amean    7        0.2836 (  0.00%)      0.3080 ( -8.61%)      0.2640 (  6.90%)

> Amean    12       0.4393 (  0.00%)      0.4843 (-10.24%)      0.4090 (  6.89%)

> Amean    21       0.5821 (  0.00%)      0.6369 ( -9.40%)      0.5126 ( 11.95%)

> Amean    30       0.6557 (  0.00%)      0.6459 (  1.50%)      0.5711 ( 12.90%)

> Amean    48       0.7924 (  0.00%)      0.7760 (  2.07%)      0.6286 ( 20.68%)

> Amean    79       1.0534 (  0.00%)      1.0551 ( -0.16%)      0.8481 ( 19.49%)

> Amean    110      1.5286 (  0.00%)      1.4504 (  5.11%)      1.1121 ( 27.24%)

> Amean    141      1.9507 (  0.00%)      1.7790 (  8.80%)      1.3804 ( 29.23%)

> Amean    172      2.2261 (  0.00%)      2.3330 ( -4.80%)      1.6336 ( 26.62%)

> Amean    192      2.3753 (  0.00%)      2.3307 (  1.88%)      1.8246 ( 23.19%)

> 

> Somewhat surprisingly, I can see improvements for UMA machines with

> fewer cores when the workload heavily saturates the machine and the

> workload isn't dominated by fork. Such heavy saturation isn't super

> realistic, but still interesting. I haven't dug into why these results

> occurred, but I am happy things didn't instead fall off a cliff.

> 

> Here's a 4-cpu UMA box showing some improvement at the higher end,

> 

> hackbench-process-pipes

>                         4.9.0-rc6             4.9.0-rc6             4.9.0-rc6

>                         tip-sched      fix-fig-for-fork               fix-sig

> Amean    1       3.5060 (  0.00%)      3.5747 ( -1.96%)      3.5117 ( -0.16%)

> Amean    3       7.7113 (  0.00%)      7.8160 ( -1.36%)      7.7747 ( -0.82%)

> Amean    5      11.4453 (  0.00%)     11.5710 ( -1.10%)     11.3870 (  0.51%)

> Amean    7      15.3147 (  0.00%)     15.9420 ( -4.10%)     15.8450 ( -3.46%)

> Amean    12     25.5110 (  0.00%)     24.3410 (  4.59%)     22.6717 ( 11.13%)

> Amean    16     32.3010 (  0.00%)     28.5897 ( 11.49%)     25.7473 ( 20.29%)


Hi Matt,

Thanks for the results.

During the review, it has been pointed out by Morten that the test condition
(100*this_avg_load < imbalance_scale*min_avg_load) makes more sense than
(100*min_avg_load > imbalance_scale*this_avg_load). But i see lower
performances with this change. Coud you run tests with the change below on
top of the patchset ?

---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

-- 
2.7.4

Comments

Matt Fleming Dec. 5, 2016, 1:35 p.m. | #1
On Mon, 05 Dec, at 10:27:36AM, Vincent Guittot wrote:
> 

> Hi Matt,

> 

> Thanks for the results.

> 

> During the review, it has been pointed out by Morten that the test condition

> (100*this_avg_load < imbalance_scale*min_avg_load) makes more sense than

> (100*min_avg_load > imbalance_scale*this_avg_load). But i see lower

> performances with this change. Coud you run tests with the change below on

> top of the patchset ?

> 

> ---

>  kernel/sched/fair.c | 2 +-

>  1 file changed, 1 insertion(+), 1 deletion(-)

> 

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c

> index e8d1ae7..0129fbb 100644

> --- a/kernel/sched/fair.c

> +++ b/kernel/sched/fair.c

> @@ -5514,7 +5514,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,

>  	if (!idlest ||

>  	    (min_runnable_load > (this_runnable_load + imbalance)) ||

>  	    ((this_runnable_load < (min_runnable_load + imbalance)) &&

> -			(100*min_avg_load > imbalance_scale*this_avg_load)))

> +			(100*this_avg_load < imbalance_scale*min_avg_load)))

>  		return NULL;

>  	return idlest;

>  }


Queued for testing.
Matt Fleming Dec. 8, 2016, 2:09 p.m. | #2
On Mon, 05 Dec, at 01:35:46PM, Matt Fleming wrote:
> On Mon, 05 Dec, at 10:27:36AM, Vincent Guittot wrote:

> > 

> > Hi Matt,

> > 

> > Thanks for the results.

> > 

> > During the review, it has been pointed out by Morten that the test condition

> > (100*this_avg_load < imbalance_scale*min_avg_load) makes more sense than

> > (100*min_avg_load > imbalance_scale*this_avg_load). But i see lower

> > performances with this change. Coud you run tests with the change below on

> > top of the patchset ?

> > 

> > ---

> >  kernel/sched/fair.c | 2 +-

> >  1 file changed, 1 insertion(+), 1 deletion(-)

> > 

> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c

> > index e8d1ae7..0129fbb 100644

> > --- a/kernel/sched/fair.c

> > +++ b/kernel/sched/fair.c

> > @@ -5514,7 +5514,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,

> >  	if (!idlest ||

> >  	    (min_runnable_load > (this_runnable_load + imbalance)) ||

> >  	    ((this_runnable_load < (min_runnable_load + imbalance)) &&

> > -			(100*min_avg_load > imbalance_scale*this_avg_load)))

> > +			(100*this_avg_load < imbalance_scale*min_avg_load)))

> >  		return NULL;

> >  	return idlest;

> >  }

> 

> Queued for testing.


Most of the machines didn't notice the difference with this new patch.
However, I did come across one test that showed a negative change,


hackbench-thread-pipes
                        4.9.0-rc6             4.9.0-rc6             4.9.0-rc6             4.9.0-rc6
                        tip-sched      fix-fig-for-fork               fix-sig   fix-fig-for-fork-v2
Amean    1       0.1266 (  0.00%)      0.1269 ( -0.23%)      0.1287 ( -1.69%)      0.1357 ( -7.22%)
Amean    4       0.4989 (  0.00%)      0.5174 ( -3.72%)      0.5251 ( -5.27%)      0.5117 ( -2.58%)
Amean    7       0.8510 (  0.00%)      0.8517 ( -0.08%)      0.8964 ( -5.34%)      0.8801 ( -3.42%)
Amean    12      1.0699 (  0.00%)      1.0484 (  2.00%)      1.0147 (  5.15%)      1.0759 ( -0.56%)
Amean    21      1.2816 (  0.00%)      1.2140 (  5.27%)      1.1879 (  7.31%)      1.2414 (  3.13%)
Amean    30      1.4440 (  0.00%)      1.4003 (  3.03%)      1.3969 (  3.26%)      1.4057 (  2.65%)
Amean    48      1.5773 (  0.00%)      1.5983 ( -1.33%)      1.3984 ( 11.34%)      1.5624 (  0.94%)
Amean    79      2.2343 (  0.00%)      2.3066 ( -3.24%)      2.0053 ( 10.25%)      2.2630 ( -1.29%)
Amean    96      2.6736 (  0.00%)      2.4313 (  9.06%)      2.4181 (  9.55%)      2.4717 (  7.55%)

           4.9.0-rc6   4.9.0-rc6   4.9.0-rc6   4.9.0-rc6
           tip-schedfix-fig-for-fork     fix-sigfix-fig-for-fork-v2
User          129.53      128.64      127.70      131.00
System       1784.54     1744.21     1654.08     1744.00
Elapsed        92.07       90.44       86.95       91.00

Looking at the 48 and 79 groups rows for mean there's a noticeable
drop off in performance of ~10%, which should be outside of the noise
for this test. This is a 2 socket, 4 NUMA node (yes, really), 24 cpus
AMD opteron circa 2010.

Given the age of this machine, I don't think it's worth holding up the
patch.
Vincent Guittot Dec. 8, 2016, 2:33 p.m. | #3
On 8 December 2016 at 15:09, Matt Fleming <matt@codeblueprint.co.uk> wrote:
> On Mon, 05 Dec, at 01:35:46PM, Matt Fleming wrote:

>> On Mon, 05 Dec, at 10:27:36AM, Vincent Guittot wrote:

>> >

>> > Hi Matt,

>> >

>> > Thanks for the results.

>> >

>> > During the review, it has been pointed out by Morten that the test condition

>> > (100*this_avg_load < imbalance_scale*min_avg_load) makes more sense than

>> > (100*min_avg_load > imbalance_scale*this_avg_load). But i see lower

>> > performances with this change. Coud you run tests with the change below on

>> > top of the patchset ?

>> >

>> > ---

>> >  kernel/sched/fair.c | 2 +-

>> >  1 file changed, 1 insertion(+), 1 deletion(-)

>> >

>> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c

>> > index e8d1ae7..0129fbb 100644

>> > --- a/kernel/sched/fair.c

>> > +++ b/kernel/sched/fair.c

>> > @@ -5514,7 +5514,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,

>> >     if (!idlest ||

>> >         (min_runnable_load > (this_runnable_load + imbalance)) ||

>> >         ((this_runnable_load < (min_runnable_load + imbalance)) &&

>> > -                   (100*min_avg_load > imbalance_scale*this_avg_load)))

>> > +                   (100*this_avg_load < imbalance_scale*min_avg_load)))

>> >             return NULL;

>> >     return idlest;

>> >  }

>>

>> Queued for testing.

>

> Most of the machines didn't notice the difference with this new patch.

> However, I did come across one test that showed a negative change,


OK so you don't see perf impact between the 2 test condition except
for the machine below
So IMHO, we should use the new test condition which tries to keep task
local if there is no other CPU that is obviously less loaded.

I'm going to prepare a new version of the patchset and apply all comments

Thanks
Vincent

>

>

> hackbench-thread-pipes

>                         4.9.0-rc6             4.9.0-rc6             4.9.0-rc6             4.9.0-rc6

>                         tip-sched      fix-fig-for-fork               fix-sig   fix-fig-for-fork-v2

> Amean    1       0.1266 (  0.00%)      0.1269 ( -0.23%)      0.1287 ( -1.69%)      0.1357 ( -7.22%)

> Amean    4       0.4989 (  0.00%)      0.5174 ( -3.72%)      0.5251 ( -5.27%)      0.5117 ( -2.58%)

> Amean    7       0.8510 (  0.00%)      0.8517 ( -0.08%)      0.8964 ( -5.34%)      0.8801 ( -3.42%)

> Amean    12      1.0699 (  0.00%)      1.0484 (  2.00%)      1.0147 (  5.15%)      1.0759 ( -0.56%)

> Amean    21      1.2816 (  0.00%)      1.2140 (  5.27%)      1.1879 (  7.31%)      1.2414 (  3.13%)

> Amean    30      1.4440 (  0.00%)      1.4003 (  3.03%)      1.3969 (  3.26%)      1.4057 (  2.65%)

> Amean    48      1.5773 (  0.00%)      1.5983 ( -1.33%)      1.3984 ( 11.34%)      1.5624 (  0.94%)

> Amean    79      2.2343 (  0.00%)      2.3066 ( -3.24%)      2.0053 ( 10.25%)      2.2630 ( -1.29%)

> Amean    96      2.6736 (  0.00%)      2.4313 (  9.06%)      2.4181 (  9.55%)      2.4717 (  7.55%)

>

>            4.9.0-rc6   4.9.0-rc6   4.9.0-rc6   4.9.0-rc6

>            tip-schedfix-fig-for-fork     fix-sigfix-fig-for-fork-v2

> User          129.53      128.64      127.70      131.00

> System       1784.54     1744.21     1654.08     1744.00

> Elapsed        92.07       90.44       86.95       91.00

>

> Looking at the 48 and 79 groups rows for mean there's a noticeable

> drop off in performance of ~10%, which should be outside of the noise

> for this test. This is a 2 socket, 4 NUMA node (yes, really), 24 cpus

> AMD opteron circa 2010.

>

> Given the age of this machine, I don't think it's worth holding up the

> patch.

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e8d1ae7..0129fbb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5514,7 +5514,7 @@  find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 	if (!idlest ||
 	    (min_runnable_load > (this_runnable_load + imbalance)) ||
 	    ((this_runnable_load < (min_runnable_load + imbalance)) &&
-			(100*min_avg_load > imbalance_scale*this_avg_load)))
+			(100*this_avg_load < imbalance_scale*min_avg_load)))
 		return NULL;
 	return idlest;
 }