[v4.8-rc1,Regression] sched/fair: Apply more PELT fixes

Message ID 20161019132957.GA7509@e105550-lin.cambridge.arm.com
State New

Commit Message

Morten Rasmussen Oct. 19, 2016, 1:30 p.m. UTC
On Tue, Oct 18, 2016 at 01:56:51PM +0200, Vincent Guittot wrote:
> On Tuesday 18 Oct 2016 at 12:34:12 (+0200), Peter Zijlstra wrote:
> > On Tue, Oct 18, 2016 at 11:45:48AM +0200, Vincent Guittot wrote:
> > > On 18 October 2016 at 11:07, Peter Zijlstra <peterz@infradead.org> wrote:
> > > > So aside from funny BIOSes, this should also show up when creating
> > > > cgroups when you have offlined a few CPUs, which is far more common I'd
> > > > think.
> > >
> > > The problem is also that the load of the tg->se[cpu] that represents
> > > the tg->cfs_rq[cpu] is initialized to 1024 in:
> > > alloc_fair_sched_group
> > >      for_each_possible_cpu(i) {
> > >          init_entity_runnable_average(se);
> > >             sa->load_avg = scale_load_down(se->load.weight);
> > >
> > > Initializing sa->load_avg to 1024 for a newly created task makes
> > > sense as we don't know yet what will be its real load, but I'm not sure
> > > that we have to do the same for an se that represents a task group. This
> > > load should be initialized to 0, and it will increase when tasks are
> > > moved/attached to the task group.
> >
> > Yes, I think that makes sense, not sure how horrible that is with the
>
> That should not be that bad, because this initial value is only useful for
> the few dozen ms that follow the creation of the task group.


IMHO, it doesn't make much sense to initialize empty containers, which is
what group sched_entities really are, to 1024. The value is meant to
represent what is in the container, and at creation it is empty, so in my
opinion initializing it to zero makes sense.
 
> > current state of things, but after your propagate patch, that
> > reinstates the interactivity hack that should work for sure.


It actually works on mainline/tip as well.

As I see it, the fundamental problem is keeping group entities up to
date. Because the load_weight, and hence se->avg.load_avg, of each per-cpu
group sched_entity depends on the group cfs_rq->tg_load_avg_contrib for
all cpus (tg->load_avg), including those that might be empty and
therefore not enqueued, we must ensure that they are updated some other
way, most naturally as part of update_blocked_averages().
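
For reference, this is roughly where the dependency shows up. The
following is only a simplified sketch along the lines of
calc_cfs_shares()/calc_tg_weight(); details differ from the actual
mainline code:

-----
/* Sketch only: simplified share calculation for one per-cpu group se */
static long calc_cfs_shares_sketch(struct cfs_rq *cfs_rq, struct task_group *tg)
{
	long tg_weight, load, shares;

	/* tg->load_avg sums tg_load_avg_contrib over all cpus, stale or not */
	tg_weight = atomic_long_read(&tg->load_avg);
	tg_weight -= cfs_rq->tg_load_avg_contrib;
	load = cfs_rq_load_avg(cfs_rq);
	tg_weight += load;

	shares = tg->shares * load;
	if (tg_weight)
		shares /= tg_weight;	/* stale contribs inflate the divisor */

	if (shares < MIN_SHARES)
		shares = MIN_SHARES;
	if (shares > tg->shares)
		shares = tg->shares;

	return shares;
}
-----

A stale tg_load_avg_contrib from one cpu inflates tg->load_avg, and
thereby deflates the shares, and hence the load_weight, of the group's
sched_entities on every other cpu.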

To guarantee that, it basically boils down to making sure:
Any cfs_rq with a non-zero tg_load_avg_contrib must be on the
leaf_cfs_rq_list.

We can do that in different ways: 1) Add all cfs_rqs to the
leaf_cfs_rq_list at task group creation, or 2) initialize group
sched_entity contributions to zero and make sure that they are added to
leaf_cfs_rq_list as soon as a sched_entity (task or group) is enqueued
on it.
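
For option 2), the list maintenance itself already exists in the enqueue
path; roughly, a simplified sketch of the relevant part of
enqueue_entity() (not the complete function):

-----
	if (se != cfs_rq->curr)
		__enqueue_entity(cfs_rq, se);
	se->on_rq = 1;

	if (cfs_rq->nr_running == 1) {
		/* first entity on this cfs_rq: put it on the leaf list */
		list_add_leaf_cfs_rq(cfs_rq);
		check_enqueue_throttle(cfs_rq);
	}
-----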

Vincent's patch below gives us the second option.

>  kernel/sched/fair.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8b03fb5..89776ac 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -690,7 +690,14 @@ void init_entity_runnable_average(struct sched_entity *se)
>  	 * will definitely be update (after enqueue).
>  	 */
>  	sa->period_contrib = 1023;
> -	sa->load_avg = scale_load_down(se->load.weight);
> +	/*
> +	 * Tasks are initialized with full load to be seen as heavy tasks until
> +	 * they get a chance to stabilize to their real load level.
> +	 * Group entities are initialized with zero load to reflect the fact that
> +	 * nothing has been attached yet to the task group.
> +	 */
> +	if (entity_is_task(se))
> +		sa->load_avg = scale_load_down(se->load.weight);
>  	sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
>  	/*
>  	 * At this point, util_avg won't be used in select_task_rq_fair anyway


I would suggest adding a comment somewhere stating that we need to keep
group cfs_rqs up to date:

-----
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index abb3763dff69..2b820d489be0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6641,6 +6641,11 @@ static void update_blocked_averages(int cpu)
 		if (throttled_hierarchy(cfs_rq))
 			continue;
 
+		/*
+		 * Note that _any_ leaf cfs_rq with a non-zero tg_load_avg_contrib
+		 * _must_ be on the leaf_cfs_rq_list to ensure that group shares
+		 * are updated correctly.
+		 */
 		if (update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq, true))
 			update_tg_load_avg(cfs_rq, 0);
 	}
-----

I did a couple of simple tests on tip/sched/core to test whether
Vincent's fix works even without reflecting group load/util in the group
hierarchy:

Juno (2xA57+4xA53)

tip:
	grouped hog(1) alone: 2841
	non-grouped hogs(6) alone: 40830
	grouped hog(1): 218
	non-grouped hogs(6): 40580

tip+vg:
	grouped hog alone: 2849
	non-grouped hogs(6) alone: 40831
	grouped hog: 2363
	non-grouped hogs: 38418

See script below for details, but we basically see that the grouped task
is not getting its 'fair' share on tip, while it does with Vincent's
patch.

To summarize, I think Vincent's patch makes sense and works :-) More
testing is needed of course to see if there are other problems.

-----

# Create 100 task groups:
for i in `seq 1 100`;
do
        cgcreate -g cpu:/root/test$i
done

NCPUS=$(grep -c ^processor /proc/cpuinfo)

# Run single cpu hog inside task group on first cpu _alone_:
cgexec -g cpu:/root/test100 taskset 0x01 sysbench --test=cpu \
--num-threads=1 --max-time=5 --max-requests=1000000 run | \
awk '{if ($4=="events:") {print "grouped hog(1) alone: " $5}}'

# Run cpu hogs outside task group _alone_:
sysbench --test=cpu --num-threads=$NCPUS --max-time=10 \
--max-requests=1000000 run | awk '{if ($4=="events:") \
{print "non-grouped hogs('$NCPUS') alone: " $5}}'

# Run cpu hogs outside task group:
sysbench --test=cpu --num-threads=$NCPUS --max-time=10 \
--max-requests=1000000 run | awk '{if ($4=="events:") \
{print "non-grouped hogs('$NCPUS'): " $5}}' &

# Run single cpu hog inside task group on first cpu:
cgexec -g cpu:/root/test100 taskset 0x01 sysbench \
--test=cpu --num-threads=1 --max-time=5 \
--max-requests=1000000 run | awk '{if ($4=="events:") \
{print "grouped hog(1): " $5}}'

wait

# Delete task groups:
for i in `seq 1 100`;
do
        cgdelete -g cpu:/root/test$i
done

Comments

Vincent Guittot Oct. 19, 2016, 5:41 p.m. UTC | #1
On 19 October 2016 at 15:30, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> On Tue, Oct 18, 2016 at 01:56:51PM +0200, Vincent Guittot wrote:
>> On Tuesday 18 Oct 2016 at 12:34:12 (+0200), Peter Zijlstra wrote:
>> > On Tue, Oct 18, 2016 at 11:45:48AM +0200, Vincent Guittot wrote:
>> > > On 18 October 2016 at 11:07, Peter Zijlstra <peterz@infradead.org> wrote:
>> > > > So aside from funny BIOSes, this should also show up when creating
>> > > > cgroups when you have offlined a few CPUs, which is far more common I'd
>> > > > think.
>> > >
>> > > The problem is also that the load of the tg->se[cpu] that represents
>> > > the tg->cfs_rq[cpu] is initialized to 1024 in:
>> > > alloc_fair_sched_group
>> > >      for_each_possible_cpu(i) {
>> > >          init_entity_runnable_average(se);
>> > >             sa->load_avg = scale_load_down(se->load.weight);
>> > >
>> > > Initializing sa->load_avg to 1024 for a newly created task makes
>> > > sense as we don't know yet what will be its real load, but I'm not sure
>> > > that we have to do the same for an se that represents a task group. This
>> > > load should be initialized to 0, and it will increase when tasks are
>> > > moved/attached to the task group.
>> >
>> > Yes, I think that makes sense, not sure how horrible that is with the
>>
>> That should not be that bad, because this initial value is only useful for
>> the few dozen ms that follow the creation of the task group.
>
> IMHO, it doesn't make much sense to initialize empty containers, which is
> what group sched_entities really are, to 1024. The value is meant to
> represent what is in the container, and at creation it is empty, so in my
> opinion initializing it to zero makes sense.
>
>> > current state of things, but after your propagate patch, that
>> > reinstates the interactivity hack that should work for sure.
>
> It actually works on mainline/tip as well.
>
> As I see it, the fundamental problem is keeping group entities up to
> date. Because the load_weight, and hence se->avg.load_avg, of each per-cpu
> group sched_entity depends on the group cfs_rq->tg_load_avg_contrib for
> all cpus (tg->load_avg), including those that might be empty and
> therefore not enqueued, we must ensure that they are updated some other
> way, most naturally as part of update_blocked_averages().
>
> To guarantee that, it basically boils down to making sure:
> Any cfs_rq with a non-zero tg_load_avg_contrib must be on the
> leaf_cfs_rq_list.
>
> We can do that in different ways: 1) Add all cfs_rqs to the
> leaf_cfs_rq_list at task group creation, or 2) initialize group
> sched_entity contributions to zero and make sure that they are added to
> leaf_cfs_rq_list as soon as a sched_entity (task or group) is enqueued
> on it.
>
> Vincent's patch below gives us the second option.
>
>>  kernel/sched/fair.c | 9 ++++++++-
>>  1 file changed, 8 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 8b03fb5..89776ac 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -690,7 +690,14 @@ void init_entity_runnable_average(struct sched_entity *se)
>>        * will definitely be update (after enqueue).
>>        */
>>       sa->period_contrib = 1023;
>> -     sa->load_avg = scale_load_down(se->load.weight);
>> +     /*
>> +      * Tasks are initialized with full load to be seen as heavy tasks until
>> +      * they get a chance to stabilize to their real load level.
>> +      * Group entities are initialized with zero load to reflect the fact that
>> +      * nothing has been attached yet to the task group.
>> +      */
>> +     if (entity_is_task(se))
>> +             sa->load_avg = scale_load_down(se->load.weight);
>>       sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
>>       /*
>>        * At this point, util_avg won't be used in select_task_rq_fair anyway
>
> I would suggest adding a comment somewhere stating that we need to keep
> group cfs_rqs up to date:
>
> -----
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index abb3763dff69..2b820d489be0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6641,6 +6641,11 @@ static void update_blocked_averages(int cpu)
>                 if (throttled_hierarchy(cfs_rq))
>                         continue;
>
> +               /*
> +                * Note that _any_ leaf cfs_rq with a non-zero tg_load_avg_contrib
> +                * _must_ be on the leaf_cfs_rq_list to ensure that group shares
> +                * are updated correctly.
> +                */

As discussed on IRC, the point is that even if the leaf cfs_rq is
added to the leaf_cfs_rq_list, it doesn't ensure that it will be
updated correctly for unplugged CPUs.

>                 if (update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq, true))
>                         update_tg_load_avg(cfs_rq, 0);
>         }
> -----
>
> I did a couple of simple tests on tip/sched/core to test whether
> Vincent's fix works even without reflecting group load/util in the group
> hierarchy:
>
> Juno (2xA57+4xA53)
>
> tip:
>         grouped hog(1) alone: 2841
>         non-grouped hogs(6) alone: 40830
>         grouped hog(1): 218
>         non-grouped hogs(6): 40580
>
> tip+vg:
>         grouped hog alone: 2849
>         non-grouped hogs(6) alone: 40831
>         grouped hog: 2363
>         non-grouped hogs: 38418
>
> See script below for details, but we basically see that the grouped task
> is not getting its 'fair' share on tip, while it does with Vincent's
> patch.
>
> To summarize, I think Vincent's patch makes sense and works :-) More
> testing is needed of course to see if there are other problems.
>
> -----
>
> # Create 100 task groups:
> for i in `seq 1 100`;
> do
>         cgcreate -g cpu:/root/test$i
> done
>
> NCPUS=$(grep -c ^processor /proc/cpuinfo)
>
> # Run single cpu hog inside task group on first cpu _alone_:
> cgexec -g cpu:/root/test100 taskset 0x01 sysbench --test=cpu \
> --num-threads=1 --max-time=5 --max-requests=1000000 run | \
> awk '{if ($4=="events:") {print "grouped hog(1) alone: " $5}}'
>
> # Run cpu hogs outside task group _alone_:
> sysbench --test=cpu --num-threads=$NCPUS --max-time=10 \
> --max-requests=1000000 run | awk '{if ($4=="events:") \
> {print "non-grouped hogs('$NCPUS') alone: " $5}}'
>
> # Run cpu hogs outside task group:
> sysbench --test=cpu --num-threads=$NCPUS --max-time=10 \
> --max-requests=1000000 run | awk '{if ($4=="events:") \
> {print "non-grouped hogs('$NCPUS'): " $5}}' &
>
> # Run single cpu hog inside task group on first cpu:
> cgexec -g cpu:/root/test100 taskset 0x01 sysbench \
> --test=cpu --num-threads=1 --max-time=5 \
> --max-requests=1000000 run | awk '{if ($4=="events:") \
> {print "grouped hog(1): " $5}}'
>
> wait
>
> # Delete task groups:
> for i in `seq 1 100`;
> do
>         cgdelete -g cpu:/root/test$i
> done

Morten Rasmussen Oct. 20, 2016, 7:56 a.m. UTC | #2
On Wed, Oct 19, 2016 at 07:41:36PM +0200, Vincent Guittot wrote:
> On 19 October 2016 at 15:30, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> > On Tue, Oct 18, 2016 at 01:56:51PM +0200, Vincent Guittot wrote:
> >> On Tuesday 18 Oct 2016 at 12:34:12 (+0200), Peter Zijlstra wrote:
> >> > On Tue, Oct 18, 2016 at 11:45:48AM +0200, Vincent Guittot wrote:
> >> > > On 18 October 2016 at 11:07, Peter Zijlstra <peterz@infradead.org> wrote:
> >> > > > So aside from funny BIOSes, this should also show up when creating
> >> > > > cgroups when you have offlined a few CPUs, which is far more common I'd
> >> > > > think.
> >> > >
> >> > > The problem is also that the load of the tg->se[cpu] that represents
> >> > > the tg->cfs_rq[cpu] is initialized to 1024 in:
> >> > > alloc_fair_sched_group
> >> > >      for_each_possible_cpu(i) {
> >> > >          init_entity_runnable_average(se);
> >> > >             sa->load_avg = scale_load_down(se->load.weight);
> >> > >
> >> > > Initializing sa->load_avg to 1024 for a newly created task makes
> >> > > sense as we don't know yet what will be its real load, but I'm not sure
> >> > > that we have to do the same for an se that represents a task group. This
> >> > > load should be initialized to 0, and it will increase when tasks are
> >> > > moved/attached to the task group.
> >> >
> >> > Yes, I think that makes sense, not sure how horrible that is with the
> >>
> >> That should not be that bad, because this initial value is only useful for
> >> the few dozen ms that follow the creation of the task group.
> >
> > IMHO, it doesn't make much sense to initialize empty containers, which is
> > what group sched_entities really are, to 1024. The value is meant to
> > represent what is in the container, and at creation it is empty, so in my
> > opinion initializing it to zero makes sense.
> >
> >> > current state of things, but after your propagate patch, that
> >> > reinstates the interactivity hack that should work for sure.
> >
> > It actually works on mainline/tip as well.
> >
> > As I see it, the fundamental problem is keeping group entities up to
> > date. Because the load_weight, and hence se->avg.load_avg, of each per-cpu
> > group sched_entity depends on the group cfs_rq->tg_load_avg_contrib for
> > all cpus (tg->load_avg), including those that might be empty and
> > therefore not enqueued, we must ensure that they are updated some other
> > way, most naturally as part of update_blocked_averages().
> >
> > To guarantee that, it basically boils down to making sure:
> > Any cfs_rq with a non-zero tg_load_avg_contrib must be on the
> > leaf_cfs_rq_list.
> >
> > We can do that in different ways: 1) Add all cfs_rqs to the
> > leaf_cfs_rq_list at task group creation, or 2) initialize group
> > sched_entity contributions to zero and make sure that they are added to
> > leaf_cfs_rq_list as soon as a sched_entity (task or group) is enqueued
> > on it.
> >
> > Vincent's patch below gives us the second option.
> >
> >>  kernel/sched/fair.c | 9 ++++++++-
> >>  1 file changed, 8 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index 8b03fb5..89776ac 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -690,7 +690,14 @@ void init_entity_runnable_average(struct sched_entity *se)
> >>        * will definitely be update (after enqueue).
> >>        */
> >>       sa->period_contrib = 1023;
> >> -     sa->load_avg = scale_load_down(se->load.weight);
> >> +     /*
> >> +      * Tasks are initialized with full load to be seen as heavy tasks until
> >> +      * they get a chance to stabilize to their real load level.
> >> +      * Group entities are initialized with zero load to reflect the fact that
> >> +      * nothing has been attached yet to the task group.
> >> +      */
> >> +     if (entity_is_task(se))
> >> +             sa->load_avg = scale_load_down(se->load.weight);
> >>       sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
> >>       /*
> >>        * At this point, util_avg won't be used in select_task_rq_fair anyway
> >
> > I would suggest adding a comment somewhere stating that we need to keep
> > group cfs_rqs up to date:
> >
> > -----
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index abb3763dff69..2b820d489be0 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6641,6 +6641,11 @@ static void update_blocked_averages(int cpu)
> >                 if (throttled_hierarchy(cfs_rq))
> >                         continue;
> >
> > +               /*
> > +                * Note that _any_ leaf cfs_rq with a non-zero tg_load_avg_contrib
> > +                * _must_ be on the leaf_cfs_rq_list to ensure that group shares
> > +                * are updated correctly.
> > +                */
>
> As discussed on IRC, the point is that even if the leaf cfs_rq is
> added to the leaf_cfs_rq_list, it doesn't ensure that it will be
> updated correctly for unplugged CPUs.

Agreed. We have to ensure that tg_load_avg_contrib is zeroed for leaf
cfs_rqs belonging to unplugged cpus. And if we modify the above comment to
say the leaf_cfs_rq_list of an online cpu, then we should be covered, I
think.
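
Something along these lines could express the zeroing; this is only a
hypothetical sketch, the helper below does not exist in mainline and its
name is made up:

-----
/*
 * Hypothetical sketch: fold an unplugged cpu's stale contribution back
 * out of tg->load_avg so it no longer distorts group shares.
 */
static void clear_tg_load_avg_contrib(struct cfs_rq *cfs_rq)
{
	atomic_long_sub(cfs_rq->tg_load_avg_contrib,
			&cfs_rq->tg->load_avg);
	cfs_rq->tg_load_avg_contrib = 0;
}
-----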

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index abb3763dff69..2b820d489be0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6641,6 +6641,11 @@ static void update_blocked_averages(int cpu)
 		if (throttled_hierarchy(cfs_rq))
 			continue;
 
+		/*
+		 * Note that _any_ leaf cfs_rq with a non-zero tg_load_avg_contrib
+		 * _must_ be on the leaf_cfs_rq_list to ensure that group shares
+		 * are updated correctly.
+		 */
 		if (update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq, true))
 			update_tg_load_avg(cfs_rq, 0);
 	}