
[v4.8-rc1,Regression] sched/fair: Apply more PELT fixes

Message ID 4e15ad55-beeb-e860-0420-8f439d076758@arm.com
State New

Commit Message

Dietmar Eggemann Oct. 17, 2016, 11:49 a.m. UTC
Hi Vincent,

On 17/10/16 10:09, Vincent Guittot wrote:
> Le Friday 14 Oct 2016 à 12:04:02 (-0400), Joseph Salisbury a écrit :

>> On 10/14/2016 11:18 AM, Vincent Guittot wrote:

>>> Le Friday 14 Oct 2016 à 14:10:07 (+0100), Dietmar Eggemann a écrit :

>>>> On 14/10/16 09:24, Vincent Guittot wrote:


[...]

> Could you try the patch below on top of the faulty kernel ?
> 
> ---
>  kernel/sched/fair.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8b03fb5..8926685 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2902,7 +2902,8 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>   */
>  static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
>  {
> -	long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
> +	unsigned long load_avg = READ_ONCE(cfs_rq->avg.load_avg);
> +	long delta = load_avg - cfs_rq->tg_load_avg_contrib;
>  
>  	/*
>  	 * No need to update load_avg for root_task_group as it is not used.
> @@ -2912,7 +2913,7 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
>  
>  	if (force || abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
>  		atomic_long_add(delta, &cfs_rq->tg->load_avg);
> -		cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
> +		cfs_rq->tg_load_avg_contrib = load_avg;
>  	}
>  }


Tested it on an Ubuntu 16.10 Server (on top of the default 4.8.0-22-generic
kernel) on a Lenovo T430 and it didn't help.

What seems to cure it is to get rid of this snippet (part of the commit
mentioned earlier in this thread: 3d30544f0212); the snippet is the diff
shown in the Patch section at the bottom of this page.


BTW, I guess we can reach a .tg_load_avg of up to ~300000-400000 on such a system
initially, because systemd will create all ~100 services (and therefore the
corresponding 2nd-level tg's) at once. In my previous example there was 500ms
between the creation of two tg's, so a lot of decaying was going on in between.

Comments

Vincent Guittot Oct. 17, 2016, 1:54 p.m. UTC | #1
On 17 October 2016 at 15:19, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, Oct 17, 2016 at 12:49:55PM +0100, Dietmar Eggemann wrote:

>

>> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c

>> > index 8b03fb5..8926685 100644

>> > --- a/kernel/sched/fair.c

>> > +++ b/kernel/sched/fair.c

>> > @@ -2902,7 +2902,8 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,

>> >   */

>> >  static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)

>> >  {

>> > -   long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;

>> > +   unsigned long load_avg = READ_ONCE(cfs_rq->avg.load_avg);

>> > +   long delta = load_avg - cfs_rq->tg_load_avg_contrib;

>> >

>> >     /*

>> >      * No need to update load_avg for root_task_group as it is not used.

>> > @@ -2912,7 +2913,7 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)

>> >

>> >     if (force || abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {

>> >             atomic_long_add(delta, &cfs_rq->tg->load_avg);

>> > -           cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;

>> > +           cfs_rq->tg_load_avg_contrib = load_avg;

>> >     }

>> >  }

>>

>> Tested it on an Ubuntu 16.10 Server (on top of the default 4.8.0-22-generic

>> kernel) on a Lenovo T430 and it didn't help.

>

> Right, I don't think that race exists, we update cfs_rq->avg.load_avg

> with rq->lock held and have it held here, so it cannot change under us.


Yes, I agree. It was just to make sure that an unexpected race condition
doesn't happen.

>

> This might do with a few lockdep_assert_held() instances to clarify this

> though.

>

>> What seems to cure it is to get rid of this snippet (part of the commit

>> mentioned earlier in this thread: 3d30544f0212):

>>

>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c

>> index 039de34f1521..16c692049fbf 100644

>> --- a/kernel/sched/fair.c

>> +++ b/kernel/sched/fair.c

>> @@ -726,7 +726,6 @@ void post_init_entity_util_avg(struct sched_entity *se)

>>         struct sched_avg *sa = &se->avg;

>>         long cap = (long)(SCHED_CAPACITY_SCALE - cfs_rq->avg.util_avg) / 2;

>>         u64 now = cfs_rq_clock_task(cfs_rq);

>> -       int tg_update;

>>

>>         if (cap > 0) {

>>                 if (cfs_rq->avg.util_avg != 0) {

>> @@ -759,10 +758,8 @@ void post_init_entity_util_avg(struct sched_entity *se)

>>                 }

>>         }

>>

>> -       tg_update = update_cfs_rq_load_avg(now, cfs_rq, false);

>> +       update_cfs_rq_load_avg(now, cfs_rq, false);

>>         attach_entity_load_avg(cfs_rq, se);

>> -       if (tg_update)

>> -               update_tg_load_avg(cfs_rq, false);

>>  }

>>

>>  #else /* !CONFIG_SMP */

>>

>> BTW, I guess we can reach .tg_load_avg up to ~300000-400000 on such a system

>> initially because systemd will create all ~100 services (and therefore the

>> corresponding 2. level tg's) at once. In my previous example, there was 500ms

>> between the creation of 2 tg's so there was a lot of decaying going on in between.

>

> Cute... on current kernels that translates to simply removing the call

> to update_tg_load_avg(), lets see if we can figure out what goes

> sideways first though, because it _should_ decay back out. And if that


Yes. Reaching ~300000-400000 is not an issue in itself; the problem is
that load_avg has decayed but this has not been reflected in
tg->load_avg in the buggy case.

> can fail here, I'm not seeing why that wouldn't fail elsewhere either.

>

> I'll see if I can reproduce this with a script creating heaps of cgroups

> in a hurry, I have a total lack of system-disease on all my machines.

>

>

> /me goes prod..
Dietmar Eggemann Oct. 17, 2016, 10:52 p.m. UTC | #2
On 10/17/2016 02:54 PM, Vincent Guittot wrote:
> On 17 October 2016 at 15:19, Peter Zijlstra <peterz@infradead.org> wrote:

>> On Mon, Oct 17, 2016 at 12:49:55PM +0100, Dietmar Eggemann wrote:


[...]

>>> BTW, I guess we can reach .tg_load_avg up to ~300000-400000 on such a system

>>> initially because systemd will create all ~100 services (and therefore the

>>> corresponding 2. level tg's) at once. In my previous example, there was 500ms

>>> between the creation of 2 tg's so there was a lot of decaying going on in between.

>>

>> Cute... on current kernels that translates to simply removing the call

>> to update_tg_load_avg(), lets see if we can figure out what goes

>> sideways first though, because it _should_ decay back out. And if that

> 

> yes, Reaching ~300000-400000 is not an issue in itself, the problem is

> that load_avg has decayed but it has not been reflected in

> tg->load_avg in the buggy case

> 

>> can fail here, I'm not seeing why that wouldn't fail elsewhere either.

>>

>> I'll see if I can reproduce this with a script creating heaps of cgroups

>> in a hurry, I have a total lack of system-disease on all my machines.



Something weird seems to be going on with the use of for_each_possible_cpu(i) in
online_fair_sched_group() on my i5-3320M CPU (4 logical cpus).

If I print out the cpu id and the cpu masks inside the for_each_possible_cpu(i)
loop, I get:

[ 5.462368]  cpu=0  cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3
[ 5.462370]  cpu=1  cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3
[ 5.462370]  cpu=2  cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3
[ 5.462371]  cpu=3  cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3
[ 5.462372] *cpu=4* cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3
[ 5.462373] *cpu=5* cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3
[ 5.462374] *cpu=6* cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3
[ 5.462375] *cpu=7* cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3

T430:/sys/fs/cgroup/cpu,cpuacct/system.slice# ls -l | grep '^d' | wc -l
80

/proc/sched_debug:

cfs_rq[0]:/system.slice
  ...
  .tg_load_avg                   : 323584
  ...

80 * 1024 * 4 (non-existent cpu4-cpu7) = 327680 (with a little bit of decay,
this could account for the extra load on the system.slice tg)

Using for_each_online_cpu(i) instead of for_each_possible_cpu(i) in
online_fair_sched_group() works on this machine, i.e. the .tg_load_avg
of system.slice tg is 0 after startup.
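
For reference, the change tested above would look roughly like this (a sketch
against the v4.8-era online_fair_sched_group(); the hunk context is approximate,
and this is only a diagnostic experiment rather than a proposed fix, since it
would not cover CPUs that come online later):

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ (line numbers omitted) @@ void online_fair_sched_group(struct task_group *tg)
 	struct sched_entity *se;
 	struct rq *rq;
 	int i;
 
-	for_each_possible_cpu(i) {
+	for_each_online_cpu(i) {
 		rq = cpu_rq(i);
 		se = tg->se[i];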
Vincent Guittot Oct. 18, 2016, 8:43 a.m. UTC | #3
On 18 October 2016 at 00:52, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> On 10/17/2016 02:54 PM, Vincent Guittot wrote:

>> On 17 October 2016 at 15:19, Peter Zijlstra <peterz@infradead.org> wrote:

>>> On Mon, Oct 17, 2016 at 12:49:55PM +0100, Dietmar Eggemann wrote:

>

> [...]

>

>>>> BTW, I guess we can reach .tg_load_avg up to ~300000-400000 on such a system

>>>> initially because systemd will create all ~100 services (and therefore the

>>>> corresponding 2. level tg's) at once. In my previous example, there was 500ms

>>>> between the creation of 2 tg's so there was a lot of decaying going on in between.

>>>

>>> Cute... on current kernels that translates to simply removing the call

>>> to update_tg_load_avg(), lets see if we can figure out what goes

>>> sideways first though, because it _should_ decay back out. And if that

>>

>> yes, Reaching ~300000-400000 is not an issue in itself, the problem is

>> that load_avg has decayed but it has not been reflected in

>> tg->load_avg in the buggy case

>>

>>> can fail here, I'm not seeing why that wouldn't fail elsewhere either.

>>>

>>> I'll see if I can reproduce this with a script creating heaps of cgroups

>>> in a hurry, I have a total lack of system-disease on all my machines.

>


Hi Dietmar,

>

> Something looks weird related to the use of for_each_possible_cpu(i) in

> online_fair_sched_group() on my i5-3320M CPU (4 logical cpus).

>

> In case I print out cpu id and the cpu masks inside the for_each_possible_cpu(i)

> I get:

>

> [ 5.462368]  cpu=0  cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3

> [ 5.462370]  cpu=1  cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3

> [ 5.462370]  cpu=2  cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3

> [ 5.462371]  cpu=3  cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3

> [ 5.462372] *cpu=4* cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3

> [ 5.462373] *cpu=5* cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3

> [ 5.462374] *cpu=6* cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3

> [ 5.462375] *cpu=7* cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3

>


Thanks to your description above, I have been able to reproduce the
issue on my ARM platform.
The key point is to have cpu_possible_mask different from
cpu_present_mask in order to reproduce the problem; when
cpu_present_mask equals cpu_possible_mask, I can't reproduce it.
I created a 1st-level task group, tg-l1. Then, each time I create a
new task group inside tg-l1, tg-l1.tg_load_avg increases by 1024 *
(number of cpus that are possible but not present), as you described
below.

Thanks
Vincent

> T430:/sys/fs/cgroup/cpu,cpuacct/system.slice# ls -l | grep '^d' | wc -l

> 80

>

> /proc/sched_debug:

>

> cfs_rq[0]:/system.slice

>   ...

>   .tg_load_avg                   : 323584

>   ...

>

> 80 * 1024 * 4 (not existent cpu4-cpu7) = 327680 (with a little bit of decay,

> this could be this extra load on the systen.slice tg)

>

> Using for_each_online_cpu(i) instead of for_each_possible_cpu(i) in

> online_fair_sched_group() works on this machine, i.e. the .tg_load_avg

> of system.slice tg is 0 after startup.
Peter Zijlstra Oct. 18, 2016, 9:07 a.m. UTC | #4
On Mon, Oct 17, 2016 at 11:52:39PM +0100, Dietmar Eggemann wrote:
> 

> Something looks weird related to the use of for_each_possible_cpu(i) in

> online_fair_sched_group() on my i5-3320M CPU (4 logical cpus).

> 

> In case I print out cpu id and the cpu masks inside the for_each_possible_cpu(i)

> I get:

> 

> [ 5.462368]  cpu=0  cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3


OK, you have a buggy BIOS :-) It enumerates too many CPU slots. There is
no reason to have 4 empty CPU slots on a machine that cannot do physical
hotplug.

This also explains why it doesn't show up on many machines: most machines
will not have this, and possible_mask == present_mask == online_mask in
almost all cases.

x86 folk, can we detect the lack of physical hotplug capability, FW_WARN
on this, and lower possible_mask to present_mask?

> [ 5.462370]  cpu=1  cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3

> [ 5.462370]  cpu=2  cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3

> [ 5.462371]  cpu=3  cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3

> [ 5.462372] *cpu=4* cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3

> [ 5.462373] *cpu=5* cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3

> [ 5.462374] *cpu=6* cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3

> [ 5.462375] *cpu=7* cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3

> 

> T430:/sys/fs/cgroup/cpu,cpuacct/system.slice# ls -l | grep '^d' | wc -l

> 80

> 

> /proc/sched_debug:

> 

> cfs_rq[0]:/system.slice

>   ...

>   .tg_load_avg                   : 323584

>   ...

> 

> 80 * 1024 * 4 (not existent cpu4-cpu7) = 327680 (with a little bit of decay,

> this could be this extra load on the systen.slice tg)

> 

> Using for_each_online_cpu(i) instead of for_each_possible_cpu(i) in

> online_fair_sched_group() works on this machine, i.e. the .tg_load_avg

> of system.slice tg is 0 after startup.


Right, so the reason for using present_mask is that it avoids having to
deal with hotplug, also all the per-cpu memory is allocated and present
for !online CPUs anyway, so might as well set it up properly anyway.

(You might want to start booting your laptop with "possible_cpus=4" to
save some memory FWIW.)

But yes, we have a bug here too... /me ponders

So aside from funny BIOSes, this should also show up when creating
cgroups when you have offlined a few CPUs, which is far more common I'd
think.

On IRC you mentioned that adding list_add_leaf_cfs_rq() to
online_fair_sched_group() cures this, this would actually match with
unregister_fair_sched_group() doing list_del_leaf_cfs_rq() and avoid
a few instructions on the enqueue path, so that's all good.

I'm just not immediately seeing how that cures things. The only relevant
user of the leaf_cfs_rq list seems to be update_blocked_averages() which
is called from the balance code (idle_balance() and
rebalance_domains()). But neither should call that for offline (or
!present) CPUs.


Humm..
Vincent Guittot Oct. 18, 2016, 9:45 a.m. UTC | #5
On 18 October 2016 at 11:07, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, Oct 17, 2016 at 11:52:39PM +0100, Dietmar Eggemann wrote:

>>

>> Something looks weird related to the use of for_each_possible_cpu(i) in

>> online_fair_sched_group() on my i5-3320M CPU (4 logical cpus).

>>

>> In case I print out cpu id and the cpu masks inside the for_each_possible_cpu(i)

>> I get:

>>

>> [ 5.462368]  cpu=0  cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3

>

> OK, you have a buggy BIOS :-) It enumerates too many CPU slots. There is

> no reason to have 4 empty CPU slots on a machine that cannot do physical

> hotplug.

>

> This also explains why it doesn't show on many machines, most machines

> will not have this and possible_mask == present_mask == online_mask for

> most all cases.

>

> x86 folk, can we detect the lack of physical hotplug capability and FW_WARN

> on this and lowering possible_mask to present_mask?

>

>> [ 5.462370]  cpu=1  cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3

>> [ 5.462370]  cpu=2  cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3

>> [ 5.462371]  cpu=3  cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3

>> [ 5.462372] *cpu=4* cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3

>> [ 5.462373] *cpu=5* cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3

>> [ 5.462374] *cpu=6* cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3

>> [ 5.462375] *cpu=7* cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3

>>

>> T430:/sys/fs/cgroup/cpu,cpuacct/system.slice# ls -l | grep '^d' | wc -l

>> 80

>>

>> /proc/sched_debug:

>>

>> cfs_rq[0]:/system.slice

>>   ...

>>   .tg_load_avg                   : 323584

>>   ...

>>

>> 80 * 1024 * 4 (not existent cpu4-cpu7) = 327680 (with a little bit of decay,

>> this could be this extra load on the systen.slice tg)

>>

>> Using for_each_online_cpu(i) instead of for_each_possible_cpu(i) in

>> online_fair_sched_group() works on this machine, i.e. the .tg_load_avg

>> of system.slice tg is 0 after startup.

>

> Right, so the reason for using present_mask is that it avoids having to

> deal with hotplug, also all the per-cpu memory is allocated and present

> for !online CPUs anyway, so might as well set it up properly anyway.

>

> (You might want to start booting your laptop with "possible_cpus=4" to

> save some memory FWIW.)

>

> But yes, we have a bug here too... /me ponders

>

> So aside from funny BIOSes, this should also show up when creating

> cgroups when you have offlined a few CPUs, which is far more common I'd

> think.


The problem is also that the load of the tg->se[cpu] that represents
the tg->cfs_rq[cpu] is initialized to 1024 in:

alloc_fair_sched_group()
     for_each_possible_cpu(i) {
         init_entity_runnable_average(se);
            sa->load_avg = scale_load_down(se->load.weight);

Initializing sa->load_avg to 1024 for a newly created task makes
sense, as we don't know yet what its real load will be, but I'm not sure
that we have to do the same for a se that represents a task group. This
load should be initialized to 0, and it will increase when tasks are
moved/attached into the task group.
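
A sketch of what that could look like in init_entity_runnable_average()
(illustrative only, not a tested patch; the hunk context is approximate):

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ (line numbers omitted) @@ void init_entity_runnable_average(struct sched_entity *se)
 	sa->period_contrib = 1023;
-	sa->load_avg = scale_load_down(se->load.weight);
+	/*
+	 * Sketch: only task entities start with their full weight; group
+	 * entities start at 0 and gain load once tasks are attached to them.
+	 */
+	if (entity_is_task(se))
+		sa->load_avg = scale_load_down(se->load.weight);
 	sa->load_sum = sa->load_avg * LOAD_AVG_MAX;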

>

> On IRC you mentioned that adding list_add_leaf_cfs_rq() to

> online_fair_sched_group() cures this, this would actually match with

> unregister_fair_sched_group() doing list_del_leaf_cfs_rq() and avoid

> a few instructions on the enqueue path, so that's all good.

>

> I'm just not immediately seeing how that cures things. The only relevant

> user of the leaf_cfs_rq list seems to be update_blocked_averages() which

> is called from the balance code (idle_balance() and

> rebalance_domains()). But neither should call that for offline (or

> !present) CPUs.

>

>

> Humm..
Peter Zijlstra Oct. 18, 2016, 10:34 a.m. UTC | #6
On Tue, Oct 18, 2016 at 11:45:48AM +0200, Vincent Guittot wrote:
> On 18 October 2016 at 11:07, Peter Zijlstra <peterz@infradead.org> wrote:

> > So aside from funny BIOSes, this should also show up when creating

> > cgroups when you have offlined a few CPUs, which is far more common I'd

> > think.

> 

> The problem is also that the load of the tg->se[cpu] that represents

> the tg->cfs_rq[cpu] is initialized to 1024 in:

> alloc_fair_sched_group

>      for_each_possible_cpu(i) {

>          init_entity_runnable_average(se);

>             sa->load_avg = scale_load_down(se->load.weight);

> 

> Initializing  sa->load_avg to 1024 for a newly created task makes

> sense as we don't know yet what will be its real load but i'm not sure

> that we have to do the same for se that represents a task group. This

> load should be initialized to 0 and it will increase when task will be

> moved/attached into task group


Yes, I think that makes sense. Not sure how horrible that is with the
current state of things, but after your propagate patch, which
reinstates the interactivity hack, that should work for sure.
Dietmar Eggemann Oct. 18, 2016, 11:15 a.m. UTC | #7
On 18/10/16 10:07, Peter Zijlstra wrote:
> On Mon, Oct 17, 2016 at 11:52:39PM +0100, Dietmar Eggemann wrote:


[...]

>> Using for_each_online_cpu(i) instead of for_each_possible_cpu(i) in

>> online_fair_sched_group() works on this machine, i.e. the .tg_load_avg

>> of system.slice tg is 0 after startup.

> 

> Right, so the reason for using present_mask is that it avoids having to

> deal with hotplug, also all the per-cpu memory is allocated and present

> for !online CPUs anyway, so might as well set it up properly anyway.

> 

> (You might want to start booting your laptop with "possible_cpus=4" to

> save some memory FWIW.)


The question for me is: could this be the reason on the X1 Carbon
platform as well?

The initial pastebin from Joseph (http://paste.ubuntu.com/23312351)
showed .tg_load_avg : 381697 on a machine with 4 logical cpus. With a
few more services than the 80 here, this might be the explanation.

> 

> But yes, we have a bug here too... /me ponders

> 

> So aside from funny BIOSes, this should also show up when creating

> cgroups when you have offlined a few CPUs, which is far more common I'd

> think.


Yes.

> On IRC you mentioned that adding list_add_leaf_cfs_rq() to

> online_fair_sched_group() cures this, this would actually match with

> unregister_fair_sched_group() doing list_del_leaf_cfs_rq() and avoid

> a few instructions on the enqueue path, so that's all good.


Yes, I was able to recreate a similar problem (not related to the cpu
masks) on ARM64 (6 logical cpus). I created 100 2nd-level tg's but only
put one task (no cpu affinity, so it could run on multiple cpus) in one
of these tg's (mainly to see the related cfs_rq's in /proc/sched_debug).

I get a remaining .tg_load_avg : 49898 for cfs_rq[x]:/tg_1

> I'm just not immediately seeing how that cures things. The only relevant

> user of the leaf_cfs_rq list seems to be update_blocked_averages() which

> is called from the balance code (idle_balance() and

> rebalance_domains()). But neither should call that for offline (or

> !present) CPUs.


Assuming this is load from the 99 2nd-level tg's which never had a task
running, putting list_add_leaf_cfs_rq() into online_fair_sched_group()
for all cpus makes sure that all the 'blocked load' gets decayed.

Doing what Vincent just suggested, i.e. initializing tg se's with 0
instead of 1024, prevents this from being necessary.

[...]
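
For completeness, a sketch of the list_add_leaf_cfs_rq() variant mentioned
above (the exact placement inside online_fair_sched_group() is an assumption
and the surrounding lines are approximate):

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ (line numbers omitted) @@ void online_fair_sched_group(struct task_group *tg)
 	for_each_possible_cpu(i) {
 		rq = cpu_rq(i);
 		se = tg->se[i];
 
 		raw_spin_lock_irq(&rq->lock);
 		post_init_entity_util_avg(se);
+		/*
+		 * Keep the group cfs_rq on the leaf list so its blocked
+		 * load keeps being decayed by update_blocked_averages().
+		 */
+		list_add_leaf_cfs_rq(tg->cfs_rq[i]);
 		sync_throttle(tg, i);
 		raw_spin_unlock_irq(&rq->lock);
 	}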
Peter Zijlstra Oct. 18, 2016, 12:07 p.m. UTC | #8
On Tue, Oct 18, 2016 at 12:15:11PM +0100, Dietmar Eggemann wrote:
> On 18/10/16 10:07, Peter Zijlstra wrote:

> > On Mon, Oct 17, 2016 at 11:52:39PM +0100, Dietmar Eggemann wrote:


> > On IRC you mentioned that adding list_add_leaf_cfs_rq() to

> > online_fair_sched_group() cures this, this would actually match with

> > unregister_fair_sched_group() doing list_del_leaf_cfs_rq() and avoid

> > a few instructions on the enqueue path, so that's all good.

> 

> Yes, I was able to recreate a similar problem (not related to the cpu

> masks) on ARM64 (6 logical cpus). I created 100 2. level tg's but only

> put one task (no cpu affinity, so it could run on multiple cpus) in one

> of these tg's (mainly to see the related cfs_rq's in /proc/sched_debug).

> 

> I get a remaining .tg_load_avg : 49898 for cfs_rq[x]:/tg_1


Ah, and since all those CPUs are online, we decay all that load away. OK
makes sense now.
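
To spell that out, a simplified sketch of the update_blocked_averages() loop
being discussed (paraphrased from the v4.8-era code; locking and the throttling
check are omitted, so treat it as illustrative only):

static void update_blocked_averages(int cpu)
{
	struct rq *rq = cpu_rq(cpu);
	struct cfs_rq *cfs_rq;

	/*
	 * Only cfs_rq's on rq->leaf_cfs_rq_list are visited; the list is
	 * normally populated via list_add_leaf_cfs_rq() at enqueue time,
	 * so a group cfs_rq that never had anything enqueued is never
	 * visited and its blocked load is never decayed here.
	 */
	for_each_leaf_cfs_rq(rq, cfs_rq) {
		if (update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq, true))
			update_tg_load_avg(cfs_rq, 0);
	}
}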

> > I'm just not immediately seeing how that cures things. The only relevant

> > user of the leaf_cfs_rq list seems to be update_blocked_averages() which

> > is called from the balance code (idle_balance() and

> > rebalance_domains()). But neither should call that for offline (or

> > !present) CPUs.

> 

> Assuming this is load from the 99 2. level tg's which never had a task

> running, putting list_add_leaf_cfs_rq() into online_fair_sched_group()

> for all cpus makes sure that all the 'blocked load' get's decayed.

> 

> Doing what Vincent just suggested, not initializing tg se's w/ 1024 but

> w/ 0 instead prevents this from being necessary.


Indeed. I just worry about the cases where we do not propagate the load
up, e.g. the stuff fixed by:

  1476695653-12309-5-git-send-email-vincent.guittot@linaro.org

If we hit an intermediary cgroup with 0 load, we might get some
interactivity issues.

But it could be I got lost again :-)

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 039de34f1521..16c692049fbf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -726,7 +726,6 @@  void post_init_entity_util_avg(struct sched_entity *se)
        struct sched_avg *sa = &se->avg;
        long cap = (long)(SCHED_CAPACITY_SCALE - cfs_rq->avg.util_avg) / 2;
        u64 now = cfs_rq_clock_task(cfs_rq);
-       int tg_update;
 
        if (cap > 0) {
                if (cfs_rq->avg.util_avg != 0) {
@@ -759,10 +758,8 @@  void post_init_entity_util_avg(struct sched_entity *se)
                }
        }
 
-       tg_update = update_cfs_rq_load_avg(now, cfs_rq, false);
+       update_cfs_rq_load_avg(now, cfs_rq, false);
        attach_entity_load_avg(cfs_rq, se);
-       if (tg_update)
-               update_tg_load_avg(cfs_rq, false);
 }
 
 #else /* !CONFIG_SMP */