
[v4.8-rc1,Regression] sched/fair: Apply more PELT fixes

Message ID 20161018115651.GA20956@linaro.org
State New

Commit Message

Vincent Guittot Oct. 18, 2016, 11:56 a.m. UTC
On Tuesday 18 Oct 2016 at 12:34:12 (+0200), Peter Zijlstra wrote:
> On Tue, Oct 18, 2016 at 11:45:48AM +0200, Vincent Guittot wrote:
> > On 18 October 2016 at 11:07, Peter Zijlstra <peterz@infradead.org> wrote:
> > > So aside from funny BIOSes, this should also show up when creating
> > > cgroups when you have offlined a few CPUs, which is far more common I'd
> > > think.
> >
> > The problem is also that the load of the tg->se[cpu] that represents
> > the tg->cfs_rq[cpu] is initialized to 1024 in:
> > alloc_fair_sched_group
> >      for_each_possible_cpu(i) {
> >          init_entity_runnable_average(se);
> >             sa->load_avg = scale_load_down(se->load.weight);
> >
> > Initializing sa->load_avg to 1024 for a newly created task makes
> > sense, as we don't know yet what its real load will be, but I'm not sure
> > that we have to do the same for an se that represents a task group. This
> > load should be initialized to 0, and it will increase as tasks are
> > moved/attached into the task group.
>
> Yes, I think that makes sense, not sure how horrible that is with the

That should not be that bad, because this initial value is only useful for
the few dozens of ms that follow the creation of the task group.

> current state of things, but after your propagate patch, that
> reinstates the interactivity hack that should work for sure.

The patch below fixes the issue on my platform:

Dietmar, Omer, can you confirm that this fixes the problem on your platforms too?

---
 kernel/sched/fair.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)
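
To see roughly why the initial value only matters briefly on a CPU where the
group actually runs, here is a standalone sketch (plain C, not kernel code),
assuming PELT's usual ~32ms half-life. Note that on possible-but-unused CPUs
the group se is never updated, so its initial 1024 never decays at all; that
stale contribution is exactly what the patch below removes.

#include <stdio.h>
#include <math.h>

/*
 * Rough illustration: once an se is being updated, PELT decays its load
 * contribution geometrically with a ~32ms half-life, so an initial
 * load_avg of 1024 fades to almost nothing within a few hundred ms.
 */
int main(void)
{
	for (int ms = 0; ms <= 320; ms += 32)
		printf("t = %3d ms  ->  residual load_avg ~= %6.1f\n",
		       ms, 1024.0 * pow(0.5, ms / 32.0));
	return 0;
}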

Comments

Joonwoo Park Oct. 18, 2016, 9:58 p.m. UTC | #1
On 10/18/2016 04:56 AM, Vincent Guittot wrote:
> On Tuesday 18 Oct 2016 at 12:34:12 (+0200), Peter Zijlstra wrote:
>> On Tue, Oct 18, 2016 at 11:45:48AM +0200, Vincent Guittot wrote:
>>> On 18 October 2016 at 11:07, Peter Zijlstra <peterz@infradead.org> wrote:
>>>> So aside from funny BIOSes, this should also show up when creating
>>>> cgroups when you have offlined a few CPUs, which is far more common I'd
>>>> think.
>>>
>>> The problem is also that the load of the tg->se[cpu] that represents
>>> the tg->cfs_rq[cpu] is initialized to 1024 in:
>>> alloc_fair_sched_group
>>>      for_each_possible_cpu(i) {
>>>          init_entity_runnable_average(se);
>>>             sa->load_avg = scale_load_down(se->load.weight);
>>>
>>> Initializing sa->load_avg to 1024 for a newly created task makes
>>> sense, as we don't know yet what its real load will be, but I'm not sure
>>> that we have to do the same for an se that represents a task group. This
>>> load should be initialized to 0, and it will increase as tasks are
>>> moved/attached into the task group.
>>
>> Yes, I think that makes sense, not sure how horrible that is with the
>
> That should not be that bad, because this initial value is only useful for
> the few dozens of ms that follow the creation of the task group.
>
>> current state of things, but after your propagate patch, that
>> reinstates the interactivity hack that should work for sure.
>
> The patch below fixes the issue on my platform:
>
> Dietmar, Omer, can you confirm that this fixes the problem on your platforms too?


I just noticed this thread after posting
https://lkml.org/lkml/2016/10/18/719...
I noticed this bug a while ago and have had the patch above for at least a
week, but unfortunately didn't have time to post it...
I think Omer had the same problem I was trying to fix, and I believe the
patch I posted should address it.

Vincent, your version fixes my test case as well.
This is the sched_stat output from the same test case I had in my changelog.
Note that dd-2030, which is in the root cgroup, had the same runtime as
dd-2033, which is in a child cgroup.

dd (2030, #threads: 1)
-------------------------------------------------------------------
se.exec_start                                :        275700.024137
se.vruntime                                  :         10589.114654
se.sum_exec_runtime                          :          1576.837993
se.nr_migrations                             :                    0
nr_switches                                  :                  159
nr_voluntary_switches                        :                    0
nr_involuntary_switches                      :                  159
se.load.weight                               :              1048576
se.avg.load_sum                              :             48840575
se.avg.util_sum                              :             19741820
se.avg.load_avg                              :                 1022
se.avg.util_avg                              :                  413
se.avg.last_update_time                      :         275700024137
policy                                       :                    0
prio                                         :                  120
clock-delta                                  :                   34
dd (2033, #threads: 1)
-------------------------------------------------------------------
se.exec_start                                :        275710.037178
se.vruntime                                  :          2383.802868
se.sum_exec_runtime                          :          1576.547591
se.nr_migrations                             :                    0
nr_switches                                  :                  162
nr_voluntary_switches                        :                    0
nr_involuntary_switches                      :                  162
se.load.weight                               :              1048576
se.avg.load_sum                              :             48316646
se.avg.util_sum                              :             21235249
se.avg.load_avg                              :                 1011
se.avg.util_avg                              :                  444
se.avg.last_update_time                      :         275710037178
policy                                       :                    0
prio                                         :                  120
clock-delta                                  :                   36
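
(A quick arithmetic check on the dumps above, as a standalone sketch with the
values copied from the two se.sum_exec_runtime lines: the runtimes differ by
only ~0.3 ms over ~1.58 s, i.e. both tasks received an equal CPU share.)

#include <stdio.h>

/* Values taken from the sched_stat dumps above. */
int main(void)
{
	double dd_root  = 1576.837993;	/* se.sum_exec_runtime of dd-2030 (ms) */
	double dd_child = 1576.547591;	/* se.sum_exec_runtime of dd-2033 (ms) */

	printf("delta = %.6f ms (%.4f%%)\n",
	       dd_root - dd_child,
	       100.0 * (dd_root - dd_child) / dd_root);	/* ~0.29 ms, ~0.02% */
	return 0;
}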

Thanks,
Joonwoo

> ---
>  kernel/sched/fair.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8b03fb5..89776ac 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -690,7 +690,14 @@ void init_entity_runnable_average(struct sched_entity *se)
>  	 * will definitely be update (after enqueue).
>  	 */
>  	sa->period_contrib = 1023;
> -	sa->load_avg = scale_load_down(se->load.weight);
> +	/*
> +	 * Tasks are initialized with full load to be seen as heavy tasks until
> +	 * they get a chance to stabilize to their real load level.
> +	 * Group entities are initialized with null load to reflect the fact that
> +	 * nothing has been attached yet to the task group.
> +	 */
> +	if (entity_is_task(se))
> +		sa->load_avg = scale_load_down(se->load.weight);
>  	sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
>  	/*
>  	 * At this point, util_avg won't be used in select_task_rq_fair anyway
Vincent Guittot Oct. 19, 2016, 6:42 a.m. UTC | #2
On 18 October 2016 at 23:58, Joonwoo Park <joonwoop@codeaurora.org> wrote:
> On 10/18/2016 04:56 AM, Vincent Guittot wrote:
>> On Tuesday 18 Oct 2016 at 12:34:12 (+0200), Peter Zijlstra wrote:
>>> On Tue, Oct 18, 2016 at 11:45:48AM +0200, Vincent Guittot wrote:
>>>> On 18 October 2016 at 11:07, Peter Zijlstra <peterz@infradead.org> wrote:
>>>>> So aside from funny BIOSes, this should also show up when creating
>>>>> cgroups when you have offlined a few CPUs, which is far more common I'd
>>>>> think.
>>>>
>>>> The problem is also that the load of the tg->se[cpu] that represents
>>>> the tg->cfs_rq[cpu] is initialized to 1024 in:
>>>> alloc_fair_sched_group
>>>>      for_each_possible_cpu(i) {
>>>>          init_entity_runnable_average(se);
>>>>             sa->load_avg = scale_load_down(se->load.weight);
>>>>
>>>> Initializing sa->load_avg to 1024 for a newly created task makes
>>>> sense, as we don't know yet what its real load will be, but I'm not sure
>>>> that we have to do the same for an se that represents a task group. This
>>>> load should be initialized to 0, and it will increase as tasks are
>>>> moved/attached into the task group.
>>>
>>> Yes, I think that makes sense, not sure how horrible that is with the
>>
>> That should not be that bad, because this initial value is only useful for
>> the few dozens of ms that follow the creation of the task group.
>>
>>> current state of things, but after your propagate patch, that
>>> reinstates the interactivity hack that should work for sure.
>>
>> The patch below fixes the issue on my platform:
>>
>> Dietmar, Omer, can you confirm that this fixes the problem on your platforms too?
>
> I just noticed this thread after posting
> https://lkml.org/lkml/2016/10/18/719...
> I noticed this bug a while ago and have had the patch above for at least a
> week, but unfortunately didn't have time to post it...
> I think Omer had the same problem I was trying to fix, and I believe the
> patch I posted should address it.
>
> Vincent, your version fixes my test case as well.

Thanks for testing.
Can I consider this as a Tested-by?

> This is the sched_stat output from the same test case I had in my changelog.
> Note that dd-2030, which is in the root cgroup, had the same runtime as
> dd-2033, which is in a child cgroup.
>
> dd (2030, #threads: 1)
> -------------------------------------------------------------------
> se.exec_start                                :        275700.024137
> se.vruntime                                  :         10589.114654
> se.sum_exec_runtime                          :          1576.837993
> se.nr_migrations                             :                    0
> nr_switches                                  :                  159
> nr_voluntary_switches                        :                    0
> nr_involuntary_switches                      :                  159
> se.load.weight                               :              1048576
> se.avg.load_sum                              :             48840575
> se.avg.util_sum                              :             19741820
> se.avg.load_avg                              :                 1022
> se.avg.util_avg                              :                  413
> se.avg.last_update_time                      :         275700024137
> policy                                       :                    0
> prio                                         :                  120
> clock-delta                                  :                   34
> dd (2033, #threads: 1)
> -------------------------------------------------------------------
> se.exec_start                                :        275710.037178
> se.vruntime                                  :          2383.802868
> se.sum_exec_runtime                          :          1576.547591
> se.nr_migrations                             :                    0
> nr_switches                                  :                  162
> nr_voluntary_switches                        :                    0
> nr_involuntary_switches                      :                  162
> se.load.weight                               :              1048576
> se.avg.load_sum                              :             48316646
> se.avg.util_sum                              :             21235249
> se.avg.load_avg                              :                 1011
> se.avg.util_avg                              :                  444
> se.avg.last_update_time                      :         275710037178
> policy                                       :                    0
> prio                                         :                  120
> clock-delta                                  :                   36
>
> Thanks,
> Joonwoo
>
>> ---
>>  kernel/sched/fair.c | 9 ++++++++-
>>  1 file changed, 8 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 8b03fb5..89776ac 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -690,7 +690,14 @@ void init_entity_runnable_average(struct sched_entity *se)
>>          * will definitely be update (after enqueue).
>>          */
>>         sa->period_contrib = 1023;
>> -       sa->load_avg = scale_load_down(se->load.weight);
>> +       /*
>> +        * Tasks are initialized with full load to be seen as heavy tasks until
>> +        * they get a chance to stabilize to their real load level.
>> +        * Group entities are initialized with null load to reflect the fact that
>> +        * nothing has been attached yet to the task group.
>> +        */
>> +       if (entity_is_task(se))
>> +               sa->load_avg = scale_load_down(se->load.weight);
>>         sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
>>         /*
>>          * At this point, util_avg won't be used in select_task_rq_fair anyway
Dietmar Eggemann Oct. 19, 2016, 9:46 a.m. UTC | #3
On 18/10/16 12:56, Vincent Guittot wrote:
> On Tuesday 18 Oct 2016 at 12:34:12 (+0200), Peter Zijlstra wrote:
>> On Tue, Oct 18, 2016 at 11:45:48AM +0200, Vincent Guittot wrote:
>>> On 18 October 2016 at 11:07, Peter Zijlstra <peterz@infradead.org> wrote:

[...]

> The patch below fixes the issue on my platform:
>
> Dietmar, Omer, can you confirm that this fixes the problem on your platforms too?

It fixes this broken BIOS issue on my T430 (cpu_possible_mask >
cpu_online_mask). I ran the original test with the cpu hogs (stress -c
4). Launch time of applications becomes normal again.
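
(For anyone reproducing this without the stress tool: the assumption here is
that `stress -c 4` simply forks four pure busy-loop workers, which a minimal
C stand-in can mimic.)

#include <unistd.h>

/* Minimal stand-in for `stress -c 4`: fork 4 busy-loop cpu hogs. */
int main(void)
{
	for (int i = 0; i < 4; i++) {
		if (fork() == 0)
			for (;;)
				;	/* child burns cpu forever */
	}
	pause();	/* parent waits; kill the process group to stop */
	return 0;
}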

Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>


But this test only makes sure that we don't see any ghost contribution
(from non-existent cpus) any more.

We should study the tg->se[i]->avg.load_avg for the hierarchy of tg's
(with the highest tg having a task enqueued) a little bit more, with and
without your v5 'sched: reflect sched_entity move into task_group's load'.

> ---
>  kernel/sched/fair.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8b03fb5..89776ac 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -690,7 +690,14 @@ void init_entity_runnable_average(struct sched_entity *se)
>  	 * will definitely be update (after enqueue).
>  	 */
>  	sa->period_contrib = 1023;
> -	sa->load_avg = scale_load_down(se->load.weight);
> +	/*
> +	 * Tasks are initialized with full load to be seen as heavy tasks until
> +	 * they get a chance to stabilize to their real load level.
> +	 * Group entities are initialized with null load to reflect the fact that
> +	 * nothing has been attached yet to the task group.
> +	 */
> +	if (entity_is_task(se))
> +		sa->load_avg = scale_load_down(se->load.weight);
>  	sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
>  	/*
>  	 * At this point, util_avg won't be used in select_task_rq_fair anyway
Vincent Guittot Oct. 19, 2016, 11:25 a.m. UTC | #4
On 19 October 2016 at 11:46, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> On 18/10/16 12:56, Vincent Guittot wrote:
>> On Tuesday 18 Oct 2016 at 12:34:12 (+0200), Peter Zijlstra wrote:
>>> On Tue, Oct 18, 2016 at 11:45:48AM +0200, Vincent Guittot wrote:
>>>> On 18 October 2016 at 11:07, Peter Zijlstra <peterz@infradead.org> wrote:
>
> [...]
>
>> The patch below fixes the issue on my platform:
>>
>> Dietmar, Omer, can you confirm that this fixes the problem on your platforms too?
>
> It fixes this broken BIOS issue on my T430 (cpu_possible_mask >
> cpu_online_mask). I ran the original test with the cpu hogs (stress -c
> 4). Launch time of applications becomes normal again.
>
> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>

Thanks

> But this test only makes sure that we don't see any ghost contribution
> (from non-existent cpus) any more.
>
> We should study the tg->se[i]->avg.load_avg for the hierarchy of tg's
> (with the highest tg having a task enqueued) a little bit more, with and
> without your v5 'sched: reflect sched_entity move into task_group's load'.

Can you elaborate?

Vincent

>
>> ---
>>  kernel/sched/fair.c | 9 ++++++++-
>>  1 file changed, 8 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 8b03fb5..89776ac 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -690,7 +690,14 @@ void init_entity_runnable_average(struct sched_entity *se)
>>        * will definitely be update (after enqueue).
>>        */
>>       sa->period_contrib = 1023;
>> -     sa->load_avg = scale_load_down(se->load.weight);
>> +     /*
>> +      * Tasks are initialized with full load to be seen as heavy tasks until
>> +      * they get a chance to stabilize to their real load level.
>> +      * Group entities are initialized with null load to reflect the fact that
>> +      * nothing has been attached yet to the task group.
>> +      */
>> +     if (entity_is_task(se))
>> +             sa->load_avg = scale_load_down(se->load.weight);
>>       sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
>>       /*
>>        * At this point, util_avg won't be used in select_task_rq_fair anyway
Peter Zijlstra Oct. 19, 2016, 11:33 a.m. UTC | #5
On Tue, Oct 18, 2016 at 01:56:51PM +0200, Vincent Guittot wrote:
> ---
>  kernel/sched/fair.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8b03fb5..89776ac 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -690,7 +690,14 @@ void init_entity_runnable_average(struct sched_entity *se)
>  	 * will definitely be update (after enqueue).
>  	 */
>  	sa->period_contrib = 1023;
> -	sa->load_avg = scale_load_down(se->load.weight);
> +	/*
> +	 * Tasks are initialized with full load to be seen as heavy tasks until
> +	 * they get a chance to stabilize to their real load level.
> +	 * Group entities are initialized with null load to reflect the fact that
> +	 * nothing has been attached yet to the task group.
> +	 */
> +	if (entity_is_task(se))
> +		sa->load_avg = scale_load_down(se->load.weight);
>  	sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
>  	/*
>  	 * At this point, util_avg won't be used in select_task_rq_fair anyway

Vince, could you post a proper version of this patch with changelogs and
tags so that we can get that merged into Linus' tree and stable for 4.8?
Vincent Guittot Oct. 19, 2016, 11:50 a.m. UTC | #6
On 19 October 2016 at 13:33, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Oct 18, 2016 at 01:56:51PM +0200, Vincent Guittot wrote:
>> ---
>>  kernel/sched/fair.c | 9 ++++++++-
>>  1 file changed, 8 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 8b03fb5..89776ac 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -690,7 +690,14 @@ void init_entity_runnable_average(struct sched_entity *se)
>>        * will definitely be update (after enqueue).
>>        */
>>       sa->period_contrib = 1023;
>> -     sa->load_avg = scale_load_down(se->load.weight);
>> +     /*
>> +      * Tasks are initialized with full load to be seen as heavy tasks until
>> +      * they get a chance to stabilize to their real load level.
>> +      * Group entities are initialized with null load to reflect the fact that
>> +      * nothing has been attached yet to the task group.
>> +      */
>> +     if (entity_is_task(se))
>> +             sa->load_avg = scale_load_down(se->load.weight);
>>       sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
>>       /*
>>        * At this point, util_avg won't be used in select_task_rq_fair anyway
>
> Vince, could you post a proper version of this patch with changelogs and
> tags so that we can get that merged into Linus' tree and stable for 4.8?

Yes, I just have to finish the changelog and then I'll send it.
Joseph Salisbury Oct. 19, 2016, 2:49 p.m. UTC | #7
On 10/18/2016 07:56 AM, Vincent Guittot wrote:
> On Tuesday 18 Oct 2016 at 12:34:12 (+0200), Peter Zijlstra wrote:
>> On Tue, Oct 18, 2016 at 11:45:48AM +0200, Vincent Guittot wrote:
>>> On 18 October 2016 at 11:07, Peter Zijlstra <peterz@infradead.org> wrote:
>>>> So aside from funny BIOSes, this should also show up when creating
>>>> cgroups when you have offlined a few CPUs, which is far more common I'd
>>>> think.
>>> The problem is also that the load of the tg->se[cpu] that represents
>>> the tg->cfs_rq[cpu] is initialized to 1024 in:
>>> alloc_fair_sched_group
>>>      for_each_possible_cpu(i) {
>>>          init_entity_runnable_average(se);
>>>             sa->load_avg = scale_load_down(se->load.weight);
>>>
>>> Initializing sa->load_avg to 1024 for a newly created task makes
>>> sense, as we don't know yet what its real load will be, but I'm not sure
>>> that we have to do the same for an se that represents a task group. This
>>> load should be initialized to 0, and it will increase as tasks are
>>> moved/attached into the task group.
>> Yes, I think that makes sense, not sure how horrible that is with the
> That should not be that bad, because this initial value is only useful for
> the few dozens of ms that follow the creation of the task group.
>
>> current state of things, but after your propagate patch, that
>> reinstates the interactivity hack that should work for sure.
> The patch below fixes the issue on my platform:
>
> Dietmar, Omer, can you confirm that this fixes the problem on your platforms too?
>
> ---
>  kernel/sched/fair.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8b03fb5..89776ac 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -690,7 +690,14 @@ void init_entity_runnable_average(struct sched_entity *se)
>  	 * will definitely be update (after enqueue).
>  	 */
>  	sa->period_contrib = 1023;
> -	sa->load_avg = scale_load_down(se->load.weight);
> +	/*
> +	 * Tasks are initialized with full load to be seen as heavy tasks until
> +	 * they get a chance to stabilize to their real load level.
> +	 * Group entities are initialized with null load to reflect the fact that
> +	 * nothing has been attached yet to the task group.
> +	 */
> +	if (entity_is_task(se))
> +		sa->load_avg = scale_load_down(se->load.weight);
>  	sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
>  	/*
>  	 * At this point, util_avg won't be used in select_task_rq_fair anyway

Omer also reports that this patch fixes the bug for him as well. Thanks
for the great work, Vincent!
Vincent Guittot Oct. 19, 2016, 2:53 p.m. UTC | #8
On 19 October 2016 at 16:49, Joseph Salisbury
<joseph.salisbury@canonical.com> wrote:
> On 10/18/2016 07:56 AM, Vincent Guittot wrote:
>> On Tuesday 18 Oct 2016 at 12:34:12 (+0200), Peter Zijlstra wrote:
>>> On Tue, Oct 18, 2016 at 11:45:48AM +0200, Vincent Guittot wrote:
>>>> On 18 October 2016 at 11:07, Peter Zijlstra <peterz@infradead.org> wrote:
>>>>> So aside from funny BIOSes, this should also show up when creating
>>>>> cgroups when you have offlined a few CPUs, which is far more common I'd
>>>>> think.
>>>> The problem is also that the load of the tg->se[cpu] that represents
>>>> the tg->cfs_rq[cpu] is initialized to 1024 in:
>>>> alloc_fair_sched_group
>>>>      for_each_possible_cpu(i) {
>>>>          init_entity_runnable_average(se);
>>>>             sa->load_avg = scale_load_down(se->load.weight);
>>>>
>>>> Initializing sa->load_avg to 1024 for a newly created task makes
>>>> sense, as we don't know yet what its real load will be, but I'm not sure
>>>> that we have to do the same for an se that represents a task group. This
>>>> load should be initialized to 0, and it will increase as tasks are
>>>> moved/attached into the task group.
>>> Yes, I think that makes sense, not sure how horrible that is with the
>> That should not be that bad, because this initial value is only useful for
>> the few dozens of ms that follow the creation of the task group.
>>
>>> current state of things, but after your propagate patch, that
>>> reinstates the interactivity hack that should work for sure.
>> The patch below fixes the issue on my platform:
>>
>> Dietmar, Omer, can you confirm that this fixes the problem on your platforms too?
>>
>> ---
>>  kernel/sched/fair.c | 9 ++++++++-
>>  1 file changed, 8 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 8b03fb5..89776ac 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -690,7 +690,14 @@ void init_entity_runnable_average(struct sched_entity *se)
>>        * will definitely be update (after enqueue).
>>        */
>>       sa->period_contrib = 1023;
>> -     sa->load_avg = scale_load_down(se->load.weight);
>> +     /*
>> +      * Tasks are initialized with full load to be seen as heavy tasks until
>> +      * they get a chance to stabilize to their real load level.
>> +      * Group entities are initialized with null load to reflect the fact that
>> +      * nothing has been attached yet to the task group.
>> +      */
>> +     if (entity_is_task(se))
>> +             sa->load_avg = scale_load_down(se->load.weight);
>>       sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
>>       /*
>>        * At this point, util_avg won't be used in select_task_rq_fair anyway
>
> Omer also reports that this patch fixes the bug for him as well. Thanks
> for the great work, Vincent!

Thanks
Dietmar Eggemann Oct. 19, 2016, 3:33 p.m. UTC | #9
On 19/10/16 12:25, Vincent Guittot wrote:
> On 19 October 2016 at 11:46, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>> On 18/10/16 12:56, Vincent Guittot wrote:
>>> On Tuesday 18 Oct 2016 at 12:34:12 (+0200), Peter Zijlstra wrote:
>>>> On Tue, Oct 18, 2016 at 11:45:48AM +0200, Vincent Guittot wrote:
>>>>> On 18 October 2016 at 11:07, Peter Zijlstra <peterz@infradead.org> wrote:

[...]

>> But this test only makes sure that we don't see any ghost contribution
>> (from non-existent cpus) any more.
>>
>> We should study the tg->se[i]->avg.load_avg for the hierarchy of tg's
>> (with the highest tg having a task enqueued) a little bit more, with and
>> without your v5 'sched: reflect sched_entity move into task_group's load'.
>
> Can you elaborate?

I try :-)

I thought I would see some different behaviour because of the fact that
the tg se's are initialized differently [1024 versus 0].

But I can't spot any difference. The test case is running a sysbench
thread affine to cpu1 in tg_root/tg_1/tg_11/tg_111 on tip/sched/core on
an ARM64 Juno (6 logical cpus).
The moment the sysbench task is put into tg_111,
tg_111->se[1]->avg.load_avg gets updated to 0 anyway because of the
huge time difference between creating this tg and attaching a task to
it. So the tg->se[2]->avg.load_avg signals for tg_111, tg_11 and tg_1
look exactly the same w/o and w/ your patch.

But your patch helps in this (very synthetic) test case as well. W/o
your patch I see remaining tg->load_avg for tg_1 and tg_11 after the
test case has finished, because the tg's were exclusively used on cpu1.

# cat /proc/sched_debug

 cfs_rq[1]:/tg_1
   .tg_load_avg_contrib           : 0
   .tg_load_avg                   : 5120 (5 (unused cpus) * 1024 * 1)
 cfs_rq[1]:/tg_1/tg_11/tg_111
   .tg_load_avg_contrib           : 0
   .tg_load_avg                   : 0
 cfs_rq[1]:/tg_1/tg_11
   .tg_load_avg_contrib           : 0
   .tg_load_avg                   : 5120

With your patch applied all the .tg_load_avg are 0.
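
(The 5120 above is easy to model: tg_load_avg sums the per-cpu contribs, and a
contrib is only corrected when that cpu's group se gets updated, which never
happens on cpus the group never ran on. A toy model in plain C, not kernel
code, assuming 6 possible cpus of which only cpu1 ever runs the group:)

#include <stdio.h>

#define NR_CPUS 6

int main(void)
{
	long contrib[NR_CPUS], tg_load_avg = 0;

	/* without the fix: every possible cpu's group se starts at 1024 */
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		contrib[cpu] = 1024;
		tg_load_avg += contrib[cpu];
	}

	/* only cpu1 ever runs the group, so only its contrib decays to 0 */
	tg_load_avg -= contrib[1];
	contrib[1] = 0;

	printf("stale tg_load_avg = %ld\n", tg_load_avg);	/* 5120 */
	return 0;
}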
Joonwoo Park Oct. 19, 2016, 5:33 p.m. UTC | #10
On Wed, Oct 19, 2016 at 04:33:03PM +0100, Dietmar Eggemann wrote:
> On 19/10/16 12:25, Vincent Guittot wrote:
>> On 19 October 2016 at 11:46, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>> On 18/10/16 12:56, Vincent Guittot wrote:
>>>> On Tuesday 18 Oct 2016 at 12:34:12 (+0200), Peter Zijlstra wrote:
>>>>> On Tue, Oct 18, 2016 at 11:45:48AM +0200, Vincent Guittot wrote:
>>>>>> On 18 October 2016 at 11:07, Peter Zijlstra <peterz@infradead.org> wrote:
>
> [...]
>
>>> But this test only makes sure that we don't see any ghost contribution
>>> (from non-existent cpus) any more.
>>>
>>> We should study the tg->se[i]->avg.load_avg for the hierarchy of tg's
>>> (with the highest tg having a task enqueued) a little bit more, with and
>>> without your v5 'sched: reflect sched_entity move into task_group's load'.
>>
>> Can you elaborate?
>
> I try :-)
>
> I thought I would see some different behaviour because of the fact that
> the tg se's are initialized differently [1024 versus 0].

This is the exact thing I was also worried about, and it's the reason I
tried to fix this in a different way.
However, I didn't find any behaviour difference once any task was attached
to the child cfs_rq, which is the point we really care about.

I found this bug while making the patch at https://lkml.org/lkml/2016/10/18/841,
which will fail with a wrong task_group load_avg.
I tested Vincent's patch and the above together, and confirmed it's still good.

Though I know Ingo already sent out the pull request.  Anyway:

Tested-by: Joonwoo Park <joonwoop@codeaurora.org>


Thanks,
Joonwoo

> But I can't spot any difference. The test case is running a sysbench
> thread affine to cpu1 in tg_root/tg_1/tg_11/tg_111 on tip/sched/core on
> an ARM64 Juno (6 logical cpus).
> The moment the sysbench task is put into tg_111,
> tg_111->se[1]->avg.load_avg gets updated to 0 anyway because of the
> huge time difference between creating this tg and attaching a task to
> it. So the tg->se[2]->avg.load_avg signals for tg_111, tg_11 and tg_1
> look exactly the same w/o and w/ your patch.
>
> But your patch helps in this (very synthetic) test case as well. W/o
> your patch I see remaining tg->load_avg for tg_1 and tg_11 after the
> test case has finished, because the tg's were exclusively used on cpu1.
>
> # cat /proc/sched_debug
>
>  cfs_rq[1]:/tg_1
>    .tg_load_avg_contrib           : 0
>    .tg_load_avg                   : 5120 (5 (unused cpus) * 1024 * 1)
>  cfs_rq[1]:/tg_1/tg_11/tg_111
>    .tg_load_avg_contrib           : 0
>    .tg_load_avg                   : 0
>  cfs_rq[1]:/tg_1/tg_11
>    .tg_load_avg_contrib           : 0
>    .tg_load_avg                   : 5120
>
> With your patch applied all the .tg_load_avg are 0.
Vincent Guittot Oct. 19, 2016, 5:50 p.m. UTC | #11
On 19 October 2016 at 17:33, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> On 19/10/16 12:25, Vincent Guittot wrote:
>> On 19 October 2016 at 11:46, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>> On 18/10/16 12:56, Vincent Guittot wrote:
>>>> On Tuesday 18 Oct 2016 at 12:34:12 (+0200), Peter Zijlstra wrote:
>>>>> On Tue, Oct 18, 2016 at 11:45:48AM +0200, Vincent Guittot wrote:
>>>>>> On 18 October 2016 at 11:07, Peter Zijlstra <peterz@infradead.org> wrote:
>
> [...]
>
>>> But this test only makes sure that we don't see any ghost contribution
>>> (from non-existent cpus) any more.
>>>
>>> We should study the tg->se[i]->avg.load_avg for the hierarchy of tg's
>>> (with the highest tg having a task enqueued) a little bit more, with and
>>> without your v5 'sched: reflect sched_entity move into task_group's load'.
>>
>> Can you elaborate?
>
> I try :-)
>
> I thought I would see some different behaviour because of the fact that
> the tg se's are initialized differently [1024 versus 0].

This difference should be noticeable (if noticeable at all) only during the
few hundreds of ms that follow the creation of the task group, until the
load_avg has reached its real value.

> But I can't spot any difference. The test case is running a sysbench
> thread affine to cpu1 in tg_root/tg_1/tg_11/tg_111 on tip/sched/core on
> an ARM64 Juno (6 logical cpus).
> The moment the sysbench task is put into tg_111,
> tg_111->se[1]->avg.load_avg gets updated to 0 anyway because of the
> huge time difference between creating this tg and attaching a task to
> it. So the tg->se[2]->avg.load_avg signals for tg_111, tg_11 and tg_1
> look exactly the same w/o and w/ your patch.
>
> But your patch helps in this (very synthetic) test case as well. W/o
> your patch I see remaining tg->load_avg for tg_1 and tg_11 after the
> test case has finished, because the tg's were exclusively used on cpu1.
>
> # cat /proc/sched_debug
>
>  cfs_rq[1]:/tg_1
>    .tg_load_avg_contrib           : 0
>    .tg_load_avg                   : 5120 (5 (unused cpus) * 1024 * 1)
>  cfs_rq[1]:/tg_1/tg_11/tg_111
>    .tg_load_avg_contrib           : 0
>    .tg_load_avg                   : 0
>  cfs_rq[1]:/tg_1/tg_11
>    .tg_load_avg_contrib           : 0
>    .tg_load_avg                   : 5120
>
> With your patch applied all the .tg_load_avg are 0.

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8b03fb5..89776ac 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -690,7 +690,14 @@ void init_entity_runnable_average(struct sched_entity *se)
 	 * will definitely be update (after enqueue).
 	 */
 	sa->period_contrib = 1023;
-	sa->load_avg = scale_load_down(se->load.weight);
+	/*
+	 * Tasks are initialized with full load to be seen as heavy tasks until
+	 * they get a chance to stabilize to their real load level.
+	 * Group entities are initialized with null load to reflect the fact that
+	 * nothing has been attached yet to the task group.
+	 */
+	if (entity_is_task(se))
+		sa->load_avg = scale_load_down(se->load.weight);
 	sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
 	/*
 	 * At this point, util_avg won't be used in select_task_rq_fair anyway
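
For context on the entity_is_task() test the patch relies on: in fair.c a
sched entity is a task iff it does not own a group runqueue (if memory
serves, the macro is (!se->my_q) under CONFIG_FAIR_GROUP_SCHED; verify
against the tree). A standalone toy model of the patched initialization:

#include <stdio.h>

struct cfs_rq { int dummy; };

struct sched_entity {
	struct cfs_rq *my_q;	/* NULL for tasks, set for group entities */
	unsigned long load_avg;
};

/* mirrors the kernel's entity_is_task(): no owned runqueue means task */
#define entity_is_task(se)	(!(se)->my_q)

static void init_load(struct sched_entity *se)
{
	/* the patched logic: only task entities start with full weight */
	se->load_avg = entity_is_task(se) ? 1024 : 0;
}

int main(void)
{
	struct cfs_rq grq = { 0 };
	struct sched_entity task = { NULL, 0 }, group_se = { &grq, 0 };

	init_load(&task);
	init_load(&group_se);
	printf("task se: %lu, group se: %lu\n",
	       task.load_avg, group_se.load_avg);	/* 1024, 0 */
	return 0;
}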