[1/2,v2] sched: fix find_idlest_group for fork

Message ID 1480088073-11642-2-git-send-email-vincent.guittot@linaro.org
State Superseded

Commit Message

Vincent Guittot Nov. 25, 2016, 3:34 p.m.
During fork, the utilization of a task is initialized only once the rq has
been selected, because the current utilization level of the rq is used to
set the utilization of the forked task. As the task's utilization is still
null at this step of the fork sequence, it doesn't make sense to look for
spare capacity that can fit the task's utilization.
Furthermore, I can see perf regressions for the test "hackbench -P -g 1"
because the least-loaded policy is always bypassed and tasks are not
spread during fork.

With this patch and the fix below, we are back to the same performance as
v4.8. The fix below is only a temporary one, used for the test until a
smarter solution is found, because we can't simply remove the test, which
is useful for other benchmarks.

@@ -5708,13 +5708,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 
 	avg_cost = this_sd->avg_scan_cost;
 
-	/*
-	 * Due to large variance we need a large fuzz factor; hackbench in
-	 * particularly is sensitive here.
-	 */
-	if ((avg_idle / 512) < avg_cost)
-		return -1;
-
 	time = local_clock();
 
 	for_each_cpu_wrap(cpu, sched_domain_span(sd), target, wrap) {

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

---
 kernel/sched/fair.c | 6 ++++++
 1 file changed, 6 insertions(+)

-- 
2.7.4

Comments

Matt Fleming Nov. 28, 2016, 5:01 p.m. | #1
On Fri, 25 Nov, at 04:34:32PM, Vincent Guittot wrote:
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index aa47589..820a787 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5463,13 +5463,19 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
>  	 * utilized systems if we require spare_capacity > task_util(p),
>  	 * so we allow for some task stuffing by using
>  	 * spare_capacity > task_util(p)/2.
> +	 * spare capacity can't be used for fork because the utilization has
> +	 * not been set yet as it need to get a rq to init the utilization
>  	 */
> +	if (sd_flag & SD_BALANCE_FORK)
> +		goto no_spare;
> +
>  	if (this_spare > task_util(p) / 2 &&
>  	    imbalance*this_spare > 100*most_spare)
>  		return NULL;
>  	else if (most_spare > task_util(p) / 2)
>  		return most_spare_sg;
>  
> +no_spare:
>  	if (!idlest || 100*this_load < imbalance*min_load)
>  		return NULL;
>  	return idlest;

It's only a minor comment, but would you be opposed to calling this
label 'skip_spare' to indicate that spare capacity may exist, but
we're not going to make use of it?
Vincent Guittot Nov. 28, 2016, 5:20 p.m. | #2
On 28 November 2016 at 18:01, Matt Fleming <matt@codeblueprint.co.uk> wrote:
> On Fri, 25 Nov, at 04:34:32PM, Vincent Guittot wrote:
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index aa47589..820a787 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -5463,13 +5463,19 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
>>        * utilized systems if we require spare_capacity > task_util(p),
>>        * so we allow for some task stuffing by using
>>        * spare_capacity > task_util(p)/2.
>> +      * spare capacity can't be used for fork because the utilization has
>> +      * not been set yet as it need to get a rq to init the utilization
>>        */
>> +     if (sd_flag & SD_BALANCE_FORK)
>> +             goto no_spare;
>> +
>>       if (this_spare > task_util(p) / 2 &&
>>           imbalance*this_spare > 100*most_spare)
>>               return NULL;
>>       else if (most_spare > task_util(p) / 2)
>>               return most_spare_sg;
>>
>> +no_spare:
>>       if (!idlest || 100*this_load < imbalance*min_load)
>>               return NULL;
>>       return idlest;
>
> It's only a minor comment, but would you be opposed to calling this
> label 'skip_spare' to indicate that spare capacity may exist, but
> we're not going to make use of it?

You're right, 'skip_spare' makes more sense.
Morten Rasmussen Nov. 29, 2016, 10:57 a.m. | #3
On Fri, Nov 25, 2016 at 04:34:32PM +0100, Vincent Guittot wrote:
> During fork, the utilization of a task is init once the rq has been
> selected because the current utilization level of the rq is used to set
> the utilization of the fork task. As the task's utilization is still
> null at this step of the fork sequence, it doesn't make sense to look for
> some spare capacity that can fit the task's utilization.
> Furthermore, I can see perf regressions for the test "hackbench -P -g 1"
> because the least loaded policy is always bypassed and tasks are not
> spread during fork.

Agreed, the late initialization of util_avg doesn't work very well with
the spare capacity checking.

> With this patch and the fix below, we are back to same performances as
> for v4.8. The fix below is only a temporary one used for the test until a
> smarter solution is found because we can't simply remove the test which is
> useful for others benchmarks
> 
> @@ -5708,13 +5708,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>  
>  	avg_cost = this_sd->avg_scan_cost;
>  
> -	/*
> -	 * Due to large variance we need a large fuzz factor; hackbench in
> -	 * particularly is sensitive here.
> -	 */
> -	if ((avg_idle / 512) < avg_cost)
> -		return -1;
> -
>  	time = local_clock();
>  
>  	for_each_cpu_wrap(cpu, sched_domain_span(sd), target, wrap) {

I don't quite get this fix, but it is very likely because I haven't paid
enough attention.

Are you saying that removing the avg_cost check is improving hackbench
performance? I thought it was supposed to help hackbench? I'm confused
:-(

> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>  kernel/sched/fair.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index aa47589..820a787 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5463,13 +5463,19 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
>  	 * utilized systems if we require spare_capacity > task_util(p),
>  	 * so we allow for some task stuffing by using
>  	 * spare_capacity > task_util(p)/2.
> +	 * spare capacity can't be used for fork because the utilization has
> +	 * not been set yet as it need to get a rq to init the utilization
>  	 */
> +	if (sd_flag & SD_BALANCE_FORK)
> +		goto no_spare;
> +
>  	if (this_spare > task_util(p) / 2 &&
>  	    imbalance*this_spare > 100*most_spare)
>  		return NULL;
>  	else if (most_spare > task_util(p) / 2)
>  		return most_spare_sg;
>  
> +no_spare:
>  	if (!idlest || 100*this_load < imbalance*min_load)
>  		return NULL;
>  	return idlest;

Looks okay to me. We are returning to use load, which is initialized,
for fork decisions.

Should we do the same for SD_BALANCE_EXEC?

An alternative fix would be to move the utilization initialization
before we pick the cpu, but that opens the whole discussion about what
we should initialize it to again. So I'm fine with not going there now.

Morten
Peter Zijlstra Nov. 29, 2016, 11:42 a.m. | #4
On Tue, Nov 29, 2016 at 10:57:59AM +0000, Morten Rasmussen wrote:
> > @@ -5708,13 +5708,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> >  
> >  	avg_cost = this_sd->avg_scan_cost;
> >  
> > -	/*
> > -	 * Due to large variance we need a large fuzz factor; hackbench in
> > -	 * particularly is sensitive here.
> > -	 */
> > -	if ((avg_idle / 512) < avg_cost)
> > -		return -1;
> > -
> >  	time = local_clock();
> >  
> >  	for_each_cpu_wrap(cpu, sched_domain_span(sd), target, wrap) {
> 
> I don't quite get this fix, but it is very likely because I haven't paid
> enough attention.
> 
> Are you saying that removing the avg_cost check is improving hackbench
> performance? I thought it was supposed to help hackbench? I'm confused
> :-(

IIRC, and my pounding head really doesn't remember much, the comment
reads like we need the large fudge factor because of hackbench. That is,
hackbench would like this test to go away, but other benchmarks will
tank.

Now, if only I would've written down which benchmarks those were... ah well.
Matt Fleming Nov. 29, 2016, 11:44 a.m. | #5
On Tue, 29 Nov, at 12:42:43PM, Peter Zijlstra wrote:
> 
> IIRC, and my pounding head really doesn't remember much, the comment
> reads like we need the large fudge factor because hackbench. That is,
> hackbench would like this test to go away, but others benchmarks will
> tank.
> 
> Now, if only I would've written down which benchmarks that were.. awell.

Going out on a limb: Chris' schbench?
Peter Zijlstra Nov. 29, 2016, 12:30 p.m. | #6
On Tue, Nov 29, 2016 at 11:44:19AM +0000, Matt Fleming wrote:
> On Tue, 29 Nov, at 12:42:43PM, Peter Zijlstra wrote:
> > 
> > IIRC, and my pounding head really doesn't remember much, the comment
> > reads like we need the large fudge factor because hackbench. That is,
> > hackbench would like this test to go away, but others benchmarks will
> > tank.
> > 
> > Now, if only I would've written down which benchmarks that were.. awell.
> 
> Going out on a limb: Chris' schbench?

No, that actually wants that test taken out as well. It might have been
things like sysbench or so.
Vincent Guittot Nov. 29, 2016, 1:04 p.m. | #7
On 29 November 2016 at 11:57, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> On Fri, Nov 25, 2016 at 04:34:32PM +0100, Vincent Guittot wrote:
>> During fork, the utilization of a task is init once the rq has been
>> selected because the current utilization level of the rq is used to set
>> the utilization of the fork task. As the task's utilization is still
>> null at this step of the fork sequence, it doesn't make sense to look for
>> some spare capacity that can fit the task's utilization.
>> Furthermore, I can see perf regressions for the test "hackbench -P -g 1"
>> because the least loaded policy is always bypassed and tasks are not
>> spread during fork.
>
> Agreed, the late initialization of util_avg doesn't work very well with
> the spare capacity checking.
>
>> With this patch and the fix below, we are back to same performances as
>> for v4.8. The fix below is only a temporary one used for the test until a
>> smarter solution is found because we can't simply remove the test which is
>> useful for others benchmarks
>>
>> @@ -5708,13 +5708,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>>
>>       avg_cost = this_sd->avg_scan_cost;
>>
>> -     /*
>> -      * Due to large variance we need a large fuzz factor; hackbench in
>> -      * particularly is sensitive here.
>> -      */
>> -     if ((avg_idle / 512) < avg_cost)
>> -             return -1;
>> -
>>       time = local_clock();
>>
>>       for_each_cpu_wrap(cpu, sched_domain_span(sd), target, wrap) {
>
> I don't quite get this fix, but it is very likely because I haven't paid
> enough attention.
>
> Are you saying that removing the avg_cost check is improving hackbench
> performance? I thought it was supposed to help hackbench? I'm confused
> :-(

Yes, the avg_cost check prevents some task migrations at the end of the
tests, when some threads have already finished their loop, leaving some
CPUs idle while other threads are still competing on the same CPUs.

>
>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>> ---
>>  kernel/sched/fair.c | 6 ++++++
>>  1 file changed, 6 insertions(+)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index aa47589..820a787 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -5463,13 +5463,19 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
>>        * utilized systems if we require spare_capacity > task_util(p),
>>        * so we allow for some task stuffing by using
>>        * spare_capacity > task_util(p)/2.
>> +      * spare capacity can't be used for fork because the utilization has
>> +      * not been set yet as it need to get a rq to init the utilization
>>        */
>> +     if (sd_flag & SD_BALANCE_FORK)
>> +             goto no_spare;
>> +
>>       if (this_spare > task_util(p) / 2 &&
>>           imbalance*this_spare > 100*most_spare)
>>               return NULL;
>>       else if (most_spare > task_util(p) / 2)
>>               return most_spare_sg;
>>
>> +no_spare:
>>       if (!idlest || 100*this_load < imbalance*min_load)
>>               return NULL;
>>       return idlest;
>
> Looks okay to me. We are returning to use load, which is initialized,
> for fork decisions.
>
> Should we do the same for SD_BALANCE_EXEC?

I asked myself whether I should add SD_BALANCE_EXEC, but decided to keep
only SD_BALANCE_FORK for now, as no regression has been reported.

>
> An alternative fix would be to move the utilization initialization
> before we pick the cpu, but that opens the whole discussion about what
> we should initialize it to again. So I'm fine with not going there now.
>
> Morten
Morten Rasmussen Nov. 29, 2016, 2:46 p.m. | #8
On Tue, Nov 29, 2016 at 12:42:43PM +0100, Peter Zijlstra wrote:
> On Tue, Nov 29, 2016 at 10:57:59AM +0000, Morten Rasmussen wrote:
> > > @@ -5708,13 +5708,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> > >  
> > >  	avg_cost = this_sd->avg_scan_cost;
> > >  
> > > -	/*
> > > -	 * Due to large variance we need a large fuzz factor; hackbench in
> > > -	 * particularly is sensitive here.
> > > -	 */
> > > -	if ((avg_idle / 512) < avg_cost)
> > > -		return -1;
> > > -
> > >  	time = local_clock();
> > >  
> > >  	for_each_cpu_wrap(cpu, sched_domain_span(sd), target, wrap) {
> > 
> > I don't quite get this fix, but it is very likely because I haven't paid
> > enough attention.
> > 
> > Are you saying that removing the avg_cost check is improving hackbench
> > performance? I thought it was supposed to help hackbench? I'm confused
> > :-(
> 
> IIRC, and my pounding head really doesn't remember much, the comment
> reads like we need the large fudge factor because hackbench. That is,
> hackbench would like this test to go away, but others benchmarks will
> tank.

Thanks, that seems in line with Vincent's reply.

The last bit that isn't clear to me is whether /512 is a 'large' fuzz
factor. I guess it is, as we can have many wake-ups, i.e. many times
avg_cost, over the period where avg_idle is calculated. No?
Morten Rasmussen Nov. 29, 2016, 2:50 p.m. | #9
On Tue, Nov 29, 2016 at 02:04:27PM +0100, Vincent Guittot wrote:
> On 29 November 2016 at 11:57, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> > On Fri, Nov 25, 2016 at 04:34:32PM +0100, Vincent Guittot wrote:
> >> @@ -5708,13 +5708,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> >>
> >>       avg_cost = this_sd->avg_scan_cost;
> >>
> >> -     /*
> >> -      * Due to large variance we need a large fuzz factor; hackbench in
> >> -      * particularly is sensitive here.
> >> -      */
> >> -     if ((avg_idle / 512) < avg_cost)
> >> -             return -1;
> >> -
> >>       time = local_clock();
> >>
> >>       for_each_cpu_wrap(cpu, sched_domain_span(sd), target, wrap) {
> >
> > I don't quite get this fix, but it is very likely because I haven't paid
> > enough attention.
> >
> > Are you saying that removing the avg_cost check is improving hackbench
> > performance? I thought it was supposed to help hackbench? I'm confused
> > :-(
> 
> Yes, avg_cost check prevents some tasks migration at the end of the
> tests when some threads have already finished their loop letting some
> CPUs idle whereas others threads are still competing on the same CPUS

Okay, thanks.

> > Should we do the same for SD_BALANCE_EXEC?
> 
> I asked myself if i should add SD_BALANCE_EXEC but decided to only
> keep SD_BALANCE_FORK for now as no regression has been raised for now.

Fair enough.

FWIW, with the label renaming suggested by mfleming, you can add my
reviewed/acked-by if you like.

Morten
Vincent Guittot Nov. 29, 2016, 2:57 p.m. | #10
On 29 November 2016 at 15:50, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> On Tue, Nov 29, 2016 at 02:04:27PM +0100, Vincent Guittot wrote:
>> On 29 November 2016 at 11:57, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
>> > On Fri, Nov 25, 2016 at 04:34:32PM +0100, Vincent Guittot wrote:
>> >> @@ -5708,13 +5708,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>> >>
>> >>       avg_cost = this_sd->avg_scan_cost;
>> >>
>> >> -     /*
>> >> -      * Due to large variance we need a large fuzz factor; hackbench in
>> >> -      * particularly is sensitive here.
>> >> -      */
>> >> -     if ((avg_idle / 512) < avg_cost)
>> >> -             return -1;
>> >> -
>> >>       time = local_clock();
>> >>
>> >>       for_each_cpu_wrap(cpu, sched_domain_span(sd), target, wrap) {
>> >
>> > I don't quite get this fix, but it is very likely because I haven't paid
>> > enough attention.
>> >
>> > Are you saying that removing the avg_cost check is improving hackbench
>> > performance? I thought it was supposed to help hackbench? I'm confused
>> > :-(
>>
>> Yes, avg_cost check prevents some tasks migration at the end of the
>> tests when some threads have already finished their loop letting some
>> CPUs idle whereas others threads are still competing on the same CPUS
>
> Okay, thanks.
>
>> > Should we do the same for SD_BALANCE_EXEC?
>>
>> I asked myself if i should add SD_BALANCE_EXEC but decided to only
>> keep SD_BALANCE_FORK for now as no regression has been raised for now.
>
> Fair enough.
>
> FWIW, with the label renaming suggested by mfleming, you can add my
> reviewed/acked-by if you like.

Thanks

>
> Morten
Matt Fleming Dec. 3, 2016, 11:25 p.m. | #11
On Fri, 25 Nov, at 04:34:32PM, Vincent Guittot wrote:
> During fork, the utilization of a task is init once the rq has been
> selected because the current utilization level of the rq is used to set
> the utilization of the fork task. As the task's utilization is still
> null at this step of the fork sequence, it doesn't make sense to look for
> some spare capacity that can fit the task's utilization.
> Furthermore, I can see perf regressions for the test "hackbench -P -g 1"
> because the least loaded policy is always bypassed and tasks are not
> spread during fork.
> 
> With this patch and the fix below, we are back to same performances as
> for v4.8. The fix below is only a temporary one used for the test until a
> smarter solution is found because we can't simply remove the test which is
> useful for others benchmarks
> 
> @@ -5708,13 +5708,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>  
>  	avg_cost = this_sd->avg_scan_cost;
>  
> -	/*
> -	 * Due to large variance we need a large fuzz factor; hackbench in
> -	 * particularly is sensitive here.
> -	 */
> -	if ((avg_idle / 512) < avg_cost)
> -		return -1;
> -
>  	time = local_clock();
>  
>  	for_each_cpu_wrap(cpu, sched_domain_span(sd), target, wrap) {
> 

OK, I need to point out that I didn't apply the above hunk when
testing this patch series. But I wouldn't have expected that to impact
our fork-intensive workloads so much. Let me know if you'd like me to
re-run with it applied.

I don't see much of a difference, positive or negative, for the
majority of the test machines, it's mainly a wash.

However, the following 4-cpu Xeon E5504 machine does show a nice win,
with thread counts in the mid-range (note, the second column is number
of hackbench groups, where each group has 40 tasks),

hackbench-process-pipes
                        4.9.0-rc6             4.9.0-rc6             4.9.0-rc6
                        tip-sched      fix-fig-for-fork               fix-sig
Amean    1       0.2193 (  0.00%)      0.2014 (  8.14%)      0.1746 ( 20.39%)
Amean    3       0.4489 (  0.00%)      0.3544 ( 21.04%)      0.3284 ( 26.83%)
Amean    5       0.6173 (  0.00%)      0.4690 ( 24.02%)      0.4977 ( 19.37%)
Amean    7       0.7323 (  0.00%)      0.6367 ( 13.05%)      0.6267 ( 14.42%)
Amean    12      0.9716 (  0.00%)      1.0187 ( -4.85%)      0.9351 (  3.75%)
Amean    16      1.2866 (  0.00%)      1.2664 (  1.57%)      1.2131 (  5.71%)
Peter Zijlstra Dec. 5, 2016, 8:48 a.m. | #12
On Tue, Nov 29, 2016 at 02:46:10PM +0000, Morten Rasmussen wrote:
> On Tue, Nov 29, 2016 at 12:42:43PM +0100, Peter Zijlstra wrote:
> > On Tue, Nov 29, 2016 at 10:57:59AM +0000, Morten Rasmussen wrote:
> > > > @@ -5708,13 +5708,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> > > >  
> > > >  	avg_cost = this_sd->avg_scan_cost;
> > > >  
> > > > -	/*
> > > > -	 * Due to large variance we need a large fuzz factor; hackbench in
> > > > -	 * particularly is sensitive here.
> > > > -	 */
> > > > -	if ((avg_idle / 512) < avg_cost)
> > > > -		return -1;
> > > > -
> 
> The last bit that isn't clear to me is whether /512 is a 'large' fuzz
> factor. I guess it is, as we can have many wake-ups, i.e. many times
> avg_cost, over the period where avg_idle is calculated. No?

So the idea was to not spend more time looking for work than we're
actually going to be idle for, since then we're wasting time we could've
done work.

So avg_idle and avg_cost are equal measure in that the immediate
inequality would be: if (avg_idle < avg_cost) stop().

Of course, both being averages with unknown distribution makes that a
tricky proposition, and that also makes the 512 hard to quantify. Still
a factor of 512 of the total average, where our variable cannot go
negative (negative time intervals are nonsensical), is fairly large.
Vincent Guittot Dec. 5, 2016, 9:17 a.m. | #13
On 4 December 2016 at 00:25, Matt Fleming <matt@codeblueprint.co.uk> wrote:
> On Fri, 25 Nov, at 04:34:32PM, Vincent Guittot wrote:
>> During fork, the utilization of a task is init once the rq has been
>> selected because the current utilization level of the rq is used to set
>> the utilization of the fork task. As the task's utilization is still
>> null at this step of the fork sequence, it doesn't make sense to look for
>> some spare capacity that can fit the task's utilization.
>> Furthermore, I can see perf regressions for the test "hackbench -P -g 1"
>> because the least loaded policy is always bypassed and tasks are not
>> spread during fork.
>>
>> With this patch and the fix below, we are back to same performances as
>> for v4.8. The fix below is only a temporary one used for the test until a
>> smarter solution is found because we can't simply remove the test which is
>> useful for others benchmarks
>>
>> @@ -5708,13 +5708,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>>
>>       avg_cost = this_sd->avg_scan_cost;
>>
>> -     /*
>> -      * Due to large variance we need a large fuzz factor; hackbench in
>> -      * particularly is sensitive here.
>> -      */
>> -     if ((avg_idle / 512) < avg_cost)
>> -             return -1;
>> -
>>       time = local_clock();
>>
>>       for_each_cpu_wrap(cpu, sched_domain_span(sd), target, wrap) {
>>
>
> OK, I need to point out that I didn't apply the above hunk when
> testing this patch series. But I wouldn't have expected that to impact
> our fork-intensive workloads so much. Let me know if you'd like me to
> re-run with it applied.

At least on my target (hikey board: dual quad-core Cortex-A53 platform),
I can see additional perf improvements for the fork-intensive test
"hackbench -P -g 1".
The patch above was there to explain any difference in perf results
with v4.8, but you don't need to re-run with it.

>
> I don't see much of a difference, positive or negative, for the
> majority of the test machines, it's mainly a wash.
>
> However, the following 4-cpu Xeon E5504 machine does show a nice win,
> with thread counts in the mid-range (note, the second column is number
> of hackbench groups, where each group has 40 tasks),
>
> hackbench-process-pipes
>                         4.9.0-rc6             4.9.0-rc6             4.9.0-rc6
>                         tip-sched      fix-fig-for-fork               fix-sig
> Amean    1       0.2193 (  0.00%)      0.2014 (  8.14%)      0.1746 ( 20.39%)
> Amean    3       0.4489 (  0.00%)      0.3544 ( 21.04%)      0.3284 ( 26.83%)
> Amean    5       0.6173 (  0.00%)      0.4690 ( 24.02%)      0.4977 ( 19.37%)
> Amean    7       0.7323 (  0.00%)      0.6367 ( 13.05%)      0.6267 ( 14.42%)
> Amean    12      0.9716 (  0.00%)      1.0187 ( -4.85%)      0.9351 (  3.75%)
> Amean    16      1.2866 (  0.00%)      1.2664 (  1.57%)      1.2131 (  5.71%)

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index aa47589..820a787 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5463,13 +5463,19 @@  find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 	 * utilized systems if we require spare_capacity > task_util(p),
 	 * so we allow for some task stuffing by using
 	 * spare_capacity > task_util(p)/2.
+	 * spare capacity can't be used for fork because the utilization has
+	 * not been set yet as it need to get a rq to init the utilization
 	 */
+	if (sd_flag & SD_BALANCE_FORK)
+		goto no_spare;
+
 	if (this_spare > task_util(p) / 2 &&
 	    imbalance*this_spare > 100*most_spare)
 		return NULL;
 	else if (most_spare > task_util(p) / 2)
 		return most_spare_sg;
 
+no_spare:
 	if (!idlest || 100*this_load < imbalance*min_load)
 		return NULL;
 	return idlest;