[RFC] sched: fix hierarchical order in rq->leaf_cfs_rq_list

Message ID 1464083710-4370-1-git-send-email-vincent.guittot@linaro.org
State New
Headers show

Commit Message

Vincent Guittot May 24, 2016, 9:55 a.m.
Fix the insertion of cfs_rq in rq->leaf_cfs_rq_list to ensure that
a child will always called before its parent.

The hierarchical order in shares update list has been introduced by
commit 67e86250f8ea ("sched: Introduce hierarchal order on shares update list")

With the current implementation a child can be still put after its parent.

Lets take the example of
       root
         \
          b
          /\
          c d*
            |
            e*

with root -> b -> c already enqueued but not d -> e so the leaf_cfs_rq_list
looks like: head -> c -> b -> root -> tail

The branch d -> e will be added the first time that they are enqueued,
starting with e then d.

When e is added, its parents is not already on the list so e is put at the
tail : head -> c -> b -> root -> e -> tail

Then, d is added at the head because its parent is already on the list:
head -> d -> c -> b -> root -> e -> tail

e is not placed at the right position and will be called the last whereas
it should be called at the beginning.

Because it follows the bottom-up enqueue sequence, we are sure that we
will finished to add either a cfs_rq without parent or a cfs_rq with a parent
that is already on the list. We can use this event to detect when we have
finished to add a new branch. For the others, whose parents are not already
added, we have to ensure that they will be added after their children that
have just been inserted the steps before, and after any potential parents that
are already in the list. The easiest way is to put the cfs_rq just after the
last inserted one and to keep track of it untl the branch is fully added.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

---

Hi,

Same version but rebased on latest sched/core

Regards,
Vincent

 kernel/sched/core.c  |  1 +
 kernel/sched/fair.c  | 24 ++++++++++++++++++++----
 kernel/sched/sched.h |  1 +
 3 files changed, 22 insertions(+), 4 deletions(-)

-- 
1.9.1

Comments

Dietmar Eggemann May 25, 2016, 5:40 p.m. | #1
Hi Vincent,

On 24/05/16 10:55, Vincent Guittot wrote:
> Fix the insertion of cfs_rq in rq->leaf_cfs_rq_list to ensure that

> a child will always called before its parent.

> 

> The hierarchical order in shares update list has been introduced by

> commit 67e86250f8ea ("sched: Introduce hierarchal order on shares update list")

> 

> With the current implementation a child can be still put after its parent.

> 

> Lets take the example of

>        root

>          \

>           b

>           /\

>           c d*

>             |

>             e*

> 

> with root -> b -> c already enqueued but not d -> e so the leaf_cfs_rq_list

> looks like: head -> c -> b -> root -> tail

> 

> The branch d -> e will be added the first time that they are enqueued,

> starting with e then d.

> 

> When e is added, its parents is not already on the list so e is put at the

> tail : head -> c -> b -> root -> e -> tail

> 

> Then, d is added at the head because its parent is already on the list:

> head -> d -> c -> b -> root -> e -> tail

> 

> e is not placed at the right position and will be called the last whereas

> it should be called at the beginning.

> 

> Because it follows the bottom-up enqueue sequence, we are sure that we

> will finished to add either a cfs_rq without parent or a cfs_rq with a parent

> that is already on the list. We can use this event to detect when we have

> finished to add a new branch. For the others, whose parents are not already

> added, we have to ensure that they will be added after their children that

> have just been inserted the steps before, and after any potential parents that

> are already in the list. The easiest way is to put the cfs_rq just after the

> last inserted one and to keep track of it untl the branch is fully added.

> 

> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>


So the use case for this would be you create a multi-level task group
hierarchy on a cpu (e.g. tg_css_id=2,4,6 on cpu=1) and let a task run in
the highest level task group (tg_css_id=6).

list_add_leaf_cfs_rq() call:

...
enqueue_task_fair: cpu=1 tg_css_id=6 cfs_rq=0xffffffc97500f700 on_list=0
enqueue_task_fair: cpu=1 tg_css_id=4 cfs_rq=0xffffffc975175900 on_list=0
enqueue_task_fair: cpu=1 tg_css_id=2 cfs_rq=0xffffffc975866d00 on_list=0
enqueue_task_fair: cpu=1 tg_css_id=1 cfs_rq=0xffffffc97fec44e8 on_list=1

...

In this case, the for_each_leaf_cfs_rq() in update_blocked_averages()
iterates in the tg_css_id=2,1,6,4 order:

...
update_blocked_averages: tg_css_id=2 cfs_rq=0xffffffc975866d00 on_list=1
update_blocked_averages: tg_css_id=1 cfs_rq=0xffffffc97fec44e8 on_list=1
update_blocked_averages: tg_css_id=6 cfs_rq=0xffffffc97500f700 on_list=1
update_blocked_averages: tg_css_id=4 cfs_rq=0xffffffc975175900 on_list=1
...

which I guess breaks the update_tg_cfs_util() call you introduced in
update_blocked_averages() in '[RFC PATCH v2] sched: reflect sched_entity
movement into task_group's utilization'. IMHO, otherwise
update_blocked_averages() can deal with the list not being ordered
tg_css_id=6,4,2,1.

https://lkml.org/lkml/2016/5/24/200

The use of for_each_leaf_cfs_rq() in update_shares() is gone. Do the
remaining call sites (update_runtime_enabled(),
unthrottle_offline_cfs_rqs() require any ordering?

[...]
Vincent Guittot May 26, 2016, 9:55 a.m. | #2
On 25 May 2016 at 19:40, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> Hi Vincent,

>

> On 24/05/16 10:55, Vincent Guittot wrote:

>> Fix the insertion of cfs_rq in rq->leaf_cfs_rq_list to ensure that

>> a child will always called before its parent.

>>

>> The hierarchical order in shares update list has been introduced by

>> commit 67e86250f8ea ("sched: Introduce hierarchal order on shares update list")

>>

>> With the current implementation a child can be still put after its parent.

>>

>> Lets take the example of

>>        root

>>          \

>>           b

>>           /\

>>           c d*

>>             |

>>             e*

>>

>> with root -> b -> c already enqueued but not d -> e so the leaf_cfs_rq_list

>> looks like: head -> c -> b -> root -> tail

>>

>> The branch d -> e will be added the first time that they are enqueued,

>> starting with e then d.

>>

>> When e is added, its parents is not already on the list so e is put at the

>> tail : head -> c -> b -> root -> e -> tail

>>

>> Then, d is added at the head because its parent is already on the list:

>> head -> d -> c -> b -> root -> e -> tail

>>

>> e is not placed at the right position and will be called the last whereas

>> it should be called at the beginning.

>>

>> Because it follows the bottom-up enqueue sequence, we are sure that we

>> will finished to add either a cfs_rq without parent or a cfs_rq with a parent

>> that is already on the list. We can use this event to detect when we have

>> finished to add a new branch. For the others, whose parents are not already

>> added, we have to ensure that they will be added after their children that

>> have just been inserted the steps before, and after any potential parents that

>> are already in the list. The easiest way is to put the cfs_rq just after the

>> last inserted one and to keep track of it untl the branch is fully added.

>>

>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

>

> So the use case for this would be you create a multi-level task group

> hierarchy on a cpu (e.g. tg_css_id=2,4,6 on cpu=1) and let a task run in

> the highest level task group (tg_css_id=6).

>

> list_add_leaf_cfs_rq() call:

>

> ...

> enqueue_task_fair: cpu=1 tg_css_id=6 cfs_rq=0xffffffc97500f700 on_list=0

> enqueue_task_fair: cpu=1 tg_css_id=4 cfs_rq=0xffffffc975175900 on_list=0

> enqueue_task_fair: cpu=1 tg_css_id=2 cfs_rq=0xffffffc975866d00 on_list=0

> enqueue_task_fair: cpu=1 tg_css_id=1 cfs_rq=0xffffffc97fec44e8 on_list=1

>

> ...

>

> In this case, the for_each_leaf_cfs_rq() in update_blocked_averages()

> iterates in the tg_css_id=2,1,6,4 order:

>

> ...

> update_blocked_averages: tg_css_id=2 cfs_rq=0xffffffc975866d00 on_list=1

> update_blocked_averages: tg_css_id=1 cfs_rq=0xffffffc97fec44e8 on_list=1

> update_blocked_averages: tg_css_id=6 cfs_rq=0xffffffc97500f700 on_list=1

> update_blocked_averages: tg_css_id=4 cfs_rq=0xffffffc975175900 on_list=1

> ...

>

> which I guess breaks the update_tg_cfs_util() call you introduced in

> update_blocked_averages() in '[RFC PATCH v2] sched: reflect sched_entity

> movement into task_group's utilization'. IMHO, otherwise

> update_blocked_averages() can deal with the list not being ordered

> tg_css_id=6,4,2,1.

>

> https://lkml.org/lkml/2016/5/24/200

>

> The use of for_each_leaf_cfs_rq() in update_shares() is gone. Do the

> remaining call sites (update_runtime_enabled(),

> unthrottle_offline_cfs_rqs() require any ordering?


I can see one potential issue with unthrottle_offline_cfs_rqs() which
goes to the hierarchy when unthrottling.
Otherwise i don't see any other dependency with ordering the cfs_rq
than the patch i have sent for task group utilization
The other point is this patch align the code with the comment

>

> [...]
Vincent Guittot June 2, 2016, 7:42 a.m. | #3
On 1 June 2016 at 19:42,  <bsegall@google.com> wrote:
> Peter Zijlstra <peterz@infradead.org> writes:

>

>> On Tue, May 24, 2016 at 11:55:10AM +0200, Vincent Guittot wrote:

>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c

>>> index 218f8e8..6d3fbf2 100644

>>> --- a/kernel/sched/fair.c

>>> +++ b/kernel/sched/fair.c

>>> @@ -290,15 +290,31 @@ static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)

>>>               * Ensure we either appear before our parent (if already

>>>               * enqueued) or force our parent to appear after us when it is

>>>               * enqueued.  The fact that we always enqueue bottom-up

>>> +             * reduces this to two cases and a specila case for the root

>>

>> 'special'

>>

>>> +             * cfs_rq.

>>>               */

>>>              if (cfs_rq->tg->parent &&

>>>                  cfs_rq->tg->parent->cfs_rq[cpu_of(rq_of(cfs_rq))]->on_list) {

>>> +                    /* Add the child just before its parent */

>>> +                    list_add_tail_rcu(&cfs_rq->leaf_cfs_rq_list,

>>> +                            &(cfs_rq->tg->parent->cfs_rq[cpu_of(rq_of(cfs_rq))]->leaf_cfs_rq_list));

>>> +                    rq_of(cfs_rq)->leaf_alone = &rq_of(cfs_rq)->leaf_cfs_rq_list;

>>> +            } else if (!cfs_rq->tg->parent) {

>>> +                    /*

>>> +                     * cfs_rq without parent should be

>>> +                     * at the end of the list

>>> +                     */

>>>                      list_add_tail_rcu(&cfs_rq->leaf_cfs_rq_list,

>>>                              &rq_of(cfs_rq)->leaf_cfs_rq_list);

>>> +                    rq_of(cfs_rq)->leaf_alone = &rq_of(cfs_rq)->leaf_cfs_rq_list;

>>> +            } else {

>>> +                    /*

>>> +                     * Our parent has not already been added so make sure

>>> +                     * that it will be put after us

>>> +                     */

>>> +                    list_add_rcu(&cfs_rq->leaf_cfs_rq_list,

>>> +                            rq_of(cfs_rq)->leaf_alone);

>>> +                    rq_of(cfs_rq)->leaf_alone = &cfs_rq->leaf_cfs_rq_list;

>>>              }

>>>

>>>              cfs_rq->on_list = 1;

>>

>> Paul, Ben ?

>>

>> This used to be critical for update_shares() (now

>> update_blocked_averages), but IIRC is not critical for that since

>> PELT.

>

> Yeah, given that we no longer update_cfs_shares in that path, it

> shouldn't be as important (unless new code is being added that it will

> be useful for). That said, I honestly don't remember why we don't

> update_cfs_shares, as it could affect the load.weight being used in a

> parent's computation. Paul, do you remember? Was it just too expensive

> and less necessary given the other improvements?

>

>>

>> I find its more readable with like so..

>>

>>

>> Also; I feel the comments can use more love; my head hurts ;-)

>

> Yeah

>

>

> leaf_alone here is basically a temporary for the duration of an

> enqueue_task_fair call, yes? A name suggesting that might be useful, or


yes it is

> else a comment mentioning that one of the first two cases will always

> clear the otherwise unsafe reference before it can be a problem.


ok, i will add a comment

>

> I think this also only barely works with throttling: even if the tg as a

> whole is out of runtime, an individual cfs_rq can't be throttled until

> just one line after list_add_cfs_rq, and we never list_del until cgroup

> destruction. A throttled !on_list cfs_rq would cause us to never reset

> leaf_alone, but I don't think that can quite happen.


yes, i don't see how this can happen too

>

>>

>> ---

>> --- a/kernel/sched/fair.c

>> +++ b/kernel/sched/fair.c

>> @@ -286,35 +286,38 @@ static inline struct cfs_rq *group_cfs_r

>>  static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)

>>  {

>>       if (!cfs_rq->on_list) {

>> +             struct rq *rq = rq_of(cfs_rq);

>> +             int cpu = cpu_of(rq);

>> +

>>               /*

>>                * Ensure we either appear before our parent (if already

>>                * enqueued) or force our parent to appear after us when it is

>>                * enqueued.  The fact that we always enqueue bottom-up

>> -              * reduces this to two cases and a specila case for the root

>> +              * reduces this to two cases and a special case for the root

>>                * cfs_rq.

>>                */

>>               if (cfs_rq->tg->parent &&

>> -                 cfs_rq->tg->parent->cfs_rq[cpu_of(rq_of(cfs_rq))]->on_list) {

>> +                 cfs_rq->tg->parent->cfs_rq[cpu]->on_list) {

>>                       /* Add the child just before its parent */

>>                       list_add_tail_rcu(&cfs_rq->leaf_cfs_rq_list,

>> -                             &(cfs_rq->tg->parent->cfs_rq[cpu_of(rq_of(cfs_rq))]->leaf_cfs_rq_list));

>> -                     rq_of(cfs_rq)->leaf_alone = &rq_of(cfs_rq)->leaf_cfs_rq_list;

>> +                             &(cfs_rq->tg->parent->cfs_rq[cpu]->leaf_cfs_rq_list));

>> +                     rq->leaf_alone = &rq->leaf_cfs_rq_list;

>>               } else if (!cfs_rq->tg->parent) {

>>                       /*

>>                        * cfs_rq without parent should be

>>                        * at the end of the list

>>                        */

>>                       list_add_tail_rcu(&cfs_rq->leaf_cfs_rq_list,

>> -                             &rq_of(cfs_rq)->leaf_cfs_rq_list);

>> -                     rq_of(cfs_rq)->leaf_alone = &rq_of(cfs_rq)->leaf_cfs_rq_list;

>> +                                       &rq->leaf_cfs_rq_list);

>> +                     rq->leaf_alone = &rq->leaf_cfs_rq_list;

>>               } else {

>>                       /*

>>                        * Our parent has not already been added so make sure

>>                        * that it will be put after us

>>                        */

>>                       list_add_rcu(&cfs_rq->leaf_cfs_rq_list,

>> -                             rq_of(cfs_rq)->leaf_alone);

>> -                     rq_of(cfs_rq)->leaf_alone = &cfs_rq->leaf_cfs_rq_list;

>> +                                  rq->leaf_alone);

>> +                     rq->leaf_alone = &cfs_rq->leaf_cfs_rq_list;

>>               }

>>

>>               cfs_rq->on_list = 1;

Patch

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 404c078..7b0fbdc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7386,6 +7386,7 @@  void __init sched_init(void)
 #ifdef CONFIG_FAIR_GROUP_SCHED
 		root_task_group.shares = ROOT_TASK_GROUP_LOAD;
 		INIT_LIST_HEAD(&rq->leaf_cfs_rq_list);
+		rq->leaf_alone = &rq->leaf_cfs_rq_list;
 		/*
 		 * How much cpu bandwidth does root_task_group get?
 		 *
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 218f8e8..6d3fbf2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -290,15 +290,31 @@  static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
 		 * Ensure we either appear before our parent (if already
 		 * enqueued) or force our parent to appear after us when it is
 		 * enqueued.  The fact that we always enqueue bottom-up
-		 * reduces this to two cases.
+		 * reduces this to two cases and a specila case for the root
+		 * cfs_rq.
 		 */
 		if (cfs_rq->tg->parent &&
 		    cfs_rq->tg->parent->cfs_rq[cpu_of(rq_of(cfs_rq))]->on_list) {
-			list_add_rcu(&cfs_rq->leaf_cfs_rq_list,
-				&rq_of(cfs_rq)->leaf_cfs_rq_list);
-		} else {
+			/* Add the child just before its parent */
+			list_add_tail_rcu(&cfs_rq->leaf_cfs_rq_list,
+				&(cfs_rq->tg->parent->cfs_rq[cpu_of(rq_of(cfs_rq))]->leaf_cfs_rq_list));
+			rq_of(cfs_rq)->leaf_alone = &rq_of(cfs_rq)->leaf_cfs_rq_list;
+		} else if (!cfs_rq->tg->parent) {
+			/*
+			 * cfs_rq without parent should be
+			 * at the end of the list
+			 */
 			list_add_tail_rcu(&cfs_rq->leaf_cfs_rq_list,
 				&rq_of(cfs_rq)->leaf_cfs_rq_list);
+			rq_of(cfs_rq)->leaf_alone = &rq_of(cfs_rq)->leaf_cfs_rq_list;
+		} else {
+			/*
+			 * Our parent has not already been added so make sure
+			 * that it will be put after us
+			 */
+			list_add_rcu(&cfs_rq->leaf_cfs_rq_list,
+				rq_of(cfs_rq)->leaf_alone);
+			rq_of(cfs_rq)->leaf_alone = &cfs_rq->leaf_cfs_rq_list;
 		}
 
 		cfs_rq->on_list = 1;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e51145e..30750dc 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -614,6 +614,7 @@  struct rq {
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	/* list of leaf cfs_rq on this cpu: */
 	struct list_head leaf_cfs_rq_list;
+	struct list_head *leaf_alone;
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
 	/*