[2/2,v2] sched: use load_avg for selecting idlest group

On Saturday 03 Dec 2016 at 21:47:07 (+0000), Matt Fleming wrote:
> On Fri, 02 Dec, at 07:31:04PM, Brendan Gregg wrote:
> >
> > For background, is this from the "A decade of wasted cores" paper's
> > patches?
>
> No, this patch fixes an issue I originally reported here,
>
>   https://lkml.kernel.org/r/20160923115808.2330-1-matt@codeblueprint.co.uk
>
> Essentially, if you have an idle or partially-idle system and a
> workload that consists of fork()'ing a bunch of tasks, where each of
> those tasks immediately sleeps waiting for some wakeup, then those
> tasks aren't spread across all idle CPUs very well.
>
> We saw this issue when running hackbench with a small loop count, such
> that the actual benchmark setup (fork()'ing) is where the majority of
> the runtime is spent.
>
> In that scenario, there's a large potential/blocked load, but
> essentially no runnable load, and the balance on fork scheduler code
> only cares about runnable load without Vincent's patch applied.
>
> The closest thing I can find in the "A decade of wasted cores" paper
> is "The Overload-on-Wakeup bug", but I don't think that's the issue
> here since,
>
>   a) We're balancing on fork, not wakeup
>   b) The balance on fork code balances across nodes OK
>
> > What's the expected typical gain? Thanks,
>
> The results are still coming back from the SUSE performance test grid
> but they do show that this patch is mainly a win for multi-socket
> machines with more than 8 cores or thereabouts.
>
>  [ Vincent, I'll follow up to your PATCH 1/2 with the results that are
>    specifically for that patch ]
>
> Assuming a fork-intensive or fork-dominated workload, and a
> multi-socket machine, such as this 2 socket, NUMA, with 12 cores and
> HT enabled (48 cpus), we saw a very clear win between +10% and +15%
> for processes communicating via pipes,
>
>   (1) tip-sched = tip/sched/core branch
>   (2) fix-fig-for-fork = (1) + PATCH 1/2
>   (3) fix-sig = (1) + (2) + PATCH 2/2
>
> hackbench-process-pipes
>                          4.9.0-rc6             4.9.0-rc6             4.9.0-rc6
>                          tip-sched      fix-fig-for-fork               fix-sig
> Amean    1        0.0717 (  0.00%)      0.0696 (  2.99%)      0.0730 ( -1.79%)
> Amean    4        0.1244 (  0.00%)      0.1200 (  3.56%)      0.1190 (  4.36%)
> Amean    7        0.1891 (  0.00%)      0.1937 ( -2.42%)      0.1831 (  3.17%)
> Amean    12       0.2964 (  0.00%)      0.3116 ( -5.11%)      0.2784 (  6.07%)
> Amean    21       0.4011 (  0.00%)      0.4090 ( -1.96%)      0.3574 ( 10.90%)
> Amean    30       0.4944 (  0.00%)      0.4654 (  5.87%)      0.4171 ( 15.63%)
> Amean    48       0.6113 (  0.00%)      0.6309 ( -3.20%)      0.5331 ( 12.78%)
> Amean    79       0.8616 (  0.00%)      0.8706 ( -1.04%)      0.7710 ( 10.51%)
> Amean    110      1.1304 (  0.00%)      1.2211 ( -8.02%)      1.0163 ( 10.10%)
> Amean    141      1.3754 (  0.00%)      1.4279 ( -3.81%)      1.2803 (  6.92%)
> Amean    172      1.6217 (  0.00%)      1.7367 ( -7.09%)      1.5363 (  5.27%)
> Amean    192      1.7809 (  0.00%)      2.0199 (-13.42%)      1.7129 (  3.82%)
>
> Things look even better when using threads and pipes, with wins
> between 11% and 29% when looking at results outside of the noise,
>
> hackbench-thread-pipes
>                          4.9.0-rc6             4.9.0-rc6             4.9.0-rc6
>                          tip-sched      fix-fig-for-fork               fix-sig
> Amean    1        0.0736 (  0.00%)      0.0794 ( -7.96%)      0.0779 ( -5.83%)
> Amean    4        0.1709 (  0.00%)      0.1690 (  1.09%)      0.1663 (  2.68%)
> Amean    7        0.2836 (  0.00%)      0.3080 ( -8.61%)      0.2640 (  6.90%)
> Amean    12       0.4393 (  0.00%)      0.4843 (-10.24%)      0.4090 (  6.89%)
> Amean    21       0.5821 (  0.00%)      0.6369 ( -9.40%)      0.5126 ( 11.95%)
> Amean    30       0.6557 (  0.00%)      0.6459 (  1.50%)      0.5711 ( 12.90%)
> Amean    48       0.7924 (  0.00%)      0.7760 (  2.07%)      0.6286 ( 20.68%)
> Amean    79       1.0534 (  0.00%)      1.0551 ( -0.16%)      0.8481 ( 19.49%)
> Amean    110      1.5286 (  0.00%)      1.4504 (  5.11%)      1.1121 ( 27.24%)
> Amean    141      1.9507 (  0.00%)      1.7790 (  8.80%)      1.3804 ( 29.23%)
> Amean    172      2.2261 (  0.00%)      2.3330 ( -4.80%)      1.6336 ( 26.62%)
> Amean    192      2.3753 (  0.00%)      2.3307 (  1.88%)      1.8246 ( 23.19%)
>
> Somewhat surprisingly, I can see improvements for UMA machines with
> fewer cores when the workload heavily saturates the machine and the
> workload isn't dominated by fork. Such heavy saturation isn't super
> realistic, but still interesting. I haven't dug into why these results
> occurred, but I am happy things didn't instead fall off a cliff.
>
> Here's a 4-cpu UMA box showing some improvement at the higher end,
>
> hackbench-process-pipes
>                         4.9.0-rc6             4.9.0-rc6             4.9.0-rc6
>                         tip-sched      fix-fig-for-fork               fix-sig
> Amean    1       3.5060 (  0.00%)      3.5747 ( -1.96%)      3.5117 ( -0.16%)
> Amean    3       7.7113 (  0.00%)      7.8160 ( -1.36%)      7.7747 ( -0.82%)
> Amean    5      11.4453 (  0.00%)     11.5710 ( -1.10%)     11.3870 (  0.51%)
> Amean    7      15.3147 (  0.00%)     15.9420 ( -4.10%)     15.8450 ( -3.46%)
> Amean    12     25.5110 (  0.00%)     24.3410 (  4.59%)     22.6717 ( 11.13%)
> Amean    16     32.3010 (  0.00%)     28.5897 ( 11.49%)     25.7473 ( 20.29%)

Hi Matt,

Thanks for the results.

During the review, Morten pointed out that the test condition
(100*this_avg_load < imbalance_scale*min_avg_load) makes more sense than
(100*min_avg_load > imbalance_scale*this_avg_load). But I see lower
performance with this change. Could you run tests with the change below on
top of the patchset?
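
To make the difference between the two forms concrete, here is a minimal
standalone sketch of the two predicates. It is not the code from
kernel/sched/fair.c: the function names are hypothetical and exist only for
this illustration, and the variables are plain parameters rather than the
scheduler's own state.

```c
#include <stdbool.h>

/*
 * Form suggested by Morten (hypothetical helper name):
 * true when the local group's load is below the smallest remote
 * load scaled up by imbalance_scale percent.
 */
static bool cond_this_lt_min(unsigned long this_avg_load,
			     unsigned long min_avg_load,
			     unsigned long imbalance_scale)
{
	return 100 * this_avg_load < imbalance_scale * min_avg_load;
}

/*
 * Form it would replace (hypothetical helper name):
 * true when the smallest remote load exceeds the local group's
 * load scaled up by imbalance_scale percent.
 */
static bool cond_min_gt_this(unsigned long this_avg_load,
			     unsigned long min_avg_load,
			     unsigned long imbalance_scale)
{
	return 100 * min_avg_load > imbalance_scale * this_avg_load;
}
```

Taking min_avg_load = 1000 and an imbalance_scale of 112 (an assumed value,
chosen only for illustration), the first form holds for any this_avg_load
below 1120, while the second holds only for this_avg_load below about 893.
The two therefore disagree for local loads in between, for example 1050.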

---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

-- 
2.7.4
