[2/7,v3] sched: fix hierarchical order in rq->leaf_cfs_rq_list

Message ID b8d0fe09-82c4-b263-ce09-6c49ac9fae43@arm.com
State New

Commit Message

Dietmar Eggemann Sept. 21, 2016, 5:25 p.m. UTC
On 21/09/16 13:34, Vincent Guittot wrote:
> Hi Dietmar,
>
> On 21 September 2016 at 12:14, Dietmar Eggemann
> <dietmar.eggemann@arm.com> wrote:
>> Hi Vincent,
>>
>> On 12/09/16 08:47, Vincent Guittot wrote:

[...]

>> I guess in the meantime we lost the functionality to remove a cfs_rq from the
>> leaf_cfs_rq list once there are no se's enqueued on it anymore. If e.g. t migrates
>> away from Cpu1, all the cfs_rq's of the task hierarchy (for tg_css_id=2,4,6)
>> owned by Cpu1 stay in the leaf_cfs_rq list.
>>
>> Shouldn't we reintegrate it? The following patch goes on top of this patch:
>
> I see one problem: once a cfs_rq has been removed, it will not be
> periodically updated anymore until the next enqueue. A sleeping task
> is only attached but not enqueued when it moves into a cfs_rq, so its
> load is added to the cfs_rq's load, which has to be periodically
> updated/decayed. So adding a cfs_rq to the list only during an enqueue
> is not enough in this case.


There was more in the original patch (commit 82958366cfea): the removal of the
cfs_rq from the list was only done once se->avg.runnable_avg_sum had
decayed to 0. Couldn't we use something similar with se->avg.load_avg instead,
to make sure that these blocked contributions have decayed before we do the
removal?



> Then, only the highest child level of task group will be removed
> because cfs_rq->nr_running will be > 0 as soon as a child task group
> is created and enqueued into a task group. Or you should use
> cfs.h_nr_running instead; I don't know all the implications.


In my tests, all cfs_rq's (tg_id=6,4,2) on the source CPU (CPU1) do get removed. Am I missing something?

    migration/1-16 [001] 67.235292: sched_migrate_task: comm=task0 pid=2210 prio=120 orig_cpu=1 dest_cpu=2
    migration/1-16 [001] 67.235295: bprint:             enqueue_task_fair: list_add_leaf_cfs_rq: cpu=2 cfs_rq=0xffff80097550d700 tg_id=6 on_list=0
    migration/1-16 [001] 67.235298: bprint:             enqueue_task_fair: list_add_leaf_cfs_rq: cpu=2 cfs_rq=0xffff800975903700 tg_id=4 on_list=0
    migration/1-16 [001] 67.235300: bprint:             enqueue_task_fair: list_add_leaf_cfs_rq: cpu=2 cfs_rq=0xffff800974e0fe00 tg_id=2 on_list=0
    migration/1-16 [001] 67.235302: bprint:             enqueue_task_fair: list_add_leaf_cfs_rq: cpu=2 cfs_rq=0xffff80097fecb6e0 tg_id=1 on_list=1
    migration/1-16 [001] 67.235309: bprint:             update_blocked_averages: update_blocked_averages: cpu=1 cfs_rq=0xffff80097550d800 tg_id=6 on_list=1
 -> migration/1-16 [001] 67.235312: bprint:             update_blocked_averages: list_del_leaf_cfs_rq: cpu=1 cfs_rq=0xffff80097550d800 tg_id=6 on_list=1
    migration/1-16 [001] 67.235314: bprint:             update_blocked_averages: update_blocked_averages: cpu=1 cfs_rq=0xffff800975903600 tg_id=4 on_list=1
 -> migration/1-16 [001] 67.235315: bprint:             update_blocked_averages: list_del_leaf_cfs_rq: cpu=1 cfs_rq=0xffff800975903600 tg_id=4 on_list=1
    migration/1-16 [001] 67.235316: bprint:             update_blocked_averages: update_blocked_averages: cpu=1 cfs_rq=0xffff800974e0f600 tg_id=2 on_list=1
 -> migration/1-16 [001] 67.235318: bprint:             update_blocked_averages: list_del_leaf_cfs_rq: cpu=1 cfs_rq=0xffff800974e0f600 tg_id=2 on_list=1
    migration/1-16 [001] 67.235319: bprint:             update_blocked_averages: update_blocked_averages: cpu=1 cfs_rq=0xffff80097feb56e0 tg_id=1 on_list=1

If we don't remove these cfs_rq's, IMHO they stay in the list as long as the tg exists.
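
Without a removal in update_blocked_averages(), the only remaining delete happens on task-group destruction; roughly (a paraphrased sketch based on unregister_fair_sched_group(), not a verbatim quote):

	for_each_possible_cpu(cpu) {
		if (!tg->cfs_rq[cpu]->on_list)
			continue;
		/* the only other place a cfs_rq ever leaves the leaf list */
		rq = cpu_rq(cpu);
		raw_spin_lock_irqsave(&rq->lock, flags);
		list_del_leaf_cfs_rq(tg->cfs_rq[cpu]);
		raw_spin_unlock_irqrestore(&rq->lock, flags);
	}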

[...]

Comments

Vincent Guittot Sept. 21, 2016, 6:02 p.m. UTC | #1
On 21 September 2016 at 19:25, Dietmar Eggemann
<dietmar.eggemann@arm.com> wrote:
> On 21/09/16 13:34, Vincent Guittot wrote:
>> Hi Dietmar,
>>
>> On 21 September 2016 at 12:14, Dietmar Eggemann
>> <dietmar.eggemann@arm.com> wrote:
>>> Hi Vincent,
>>>
>>> On 12/09/16 08:47, Vincent Guittot wrote:
>
> [...]
>
>>> I guess in the meantime we lost the functionality to remove a cfs_rq from the
>>> leaf_cfs_rq list once there are no se's enqueued on it anymore. If e.g. t migrates
>>> away from Cpu1, all the cfs_rq's of the task hierarchy (for tg_css_id=2,4,6)
>>> owned by Cpu1 stay in the leaf_cfs_rq list.
>>>
>>> Shouldn't we reintegrate it? The following patch goes on top of this patch:
>>
>> I see one problem: once a cfs_rq has been removed, it will not be
>> periodically updated anymore until the next enqueue. A sleeping task
>> is only attached but not enqueued when it moves into a cfs_rq, so its
>> load is added to the cfs_rq's load, which has to be periodically
>> updated/decayed. So adding a cfs_rq to the list only during an enqueue
>> is not enough in this case.
>
> There was more in the original patch (commit 82958366cfea): the removal of the
> cfs_rq from the list was only done once se->avg.runnable_avg_sum had
> decayed to 0. Couldn't we use something similar with se->avg.load_avg instead,
> to make sure that these blocked contributions have decayed before we do the
> removal?


Yes, we can use se->avg.load_avg, but that's not enough. Once
se->avg.load_avg becomes zero and the group cfs_rq->nr_running == 0, we
remove the cfs_rq from the list and this cfs_rq will not be updated
until a new sched_entity is enqueued. But when you move a sleeping
task between two task groups, the load of the task is attached to the
destination cfs_rq but not enqueued, so cfs_rq->avg.load_avg is no
longer zero, yet it will not be updated by update_blocked_averages()
because the cfs_rq is no longer in the list. So waiting for the enqueue
of a sched_entity to add the cfs_rq back to the list is no longer
enough; you will have to do that during attach too.
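
For illustration only (the exact hook is an assumption, and the real attach_task_cfs_rq() also handles se->depth and vruntime, omitted here), the attach side would need something along these lines so the freshly attached blocked load keeps being decayed:

	static void attach_task_cfs_rq(struct task_struct *p)
	{
		struct sched_entity *se = &p->se;
		struct cfs_rq *cfs_rq = cfs_rq_of(se);

		/* Synchronize task with its cfs_rq */
		attach_entity_load_avg(cfs_rq, se);

		/*
		 * Sketch/assumption: the cfs_rq (and possibly its parents)
		 * may have been dropped from the leaf list while idle, so
		 * put it back here; otherwise update_blocked_averages()
		 * never sees the load we just attached.
		 */
		list_add_leaf_cfs_rq(cfs_rq);
	}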

>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 4ac1718560e2..3595b0623b37 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6566,6 +6566,8 @@ static void update_blocked_averages(int cpu)
>          * list_add_leaf_cfs_rq() for details.
>          */
>         for_each_leaf_cfs_rq(rq, cfs_rq) {
> +               struct sched_entity *se = cfs_rq->tg->se[cpu];
> +
>                 /* throttled entities do not contribute to load */
>                 if (throttled_hierarchy(cfs_rq))
>                         continue;
> @@ -6574,8 +6576,12 @@ static void update_blocked_averages(int cpu)
>                         update_tg_load_avg(cfs_rq, 0);
>
>                 /* Propagate pending load changes to the parent */
> -               if (cfs_rq->tg->se[cpu])
> -                       update_load_avg(cfs_rq->tg->se[cpu], 0, 0);
> +               if (se) {
> +                       update_load_avg(se, 0, 0);
> +
> +                       if (!se->avg.load_avg && !cfs_rq->nr_running)
> +                               list_del_leaf_cfs_rq(cfs_rq);
> +               }
>         }
>         raw_spin_unlock_irqrestore(&rq->lock, flags);
>  }
>
>
>> Then, only the highest child level of task group will be removed
>> because cfs_rq->nr_running will be > 0 as soon as a child task group
>> is created and enqueued into a task group. Or you should use
>> cfs.h_nr_running instead; I don't know all the implications.
>
> In my tests, all cfs_rq's (tg_id=6,4,2) on the source CPU (CPU1) do get removed. Am I missing something?


No, you don't miss anything. I replied a bit too quickly on that
point: the sched_entity of a task group is dequeued when its group
cfs_rq becomes idle, so the parent cfs_rq->nr_running can be zero.
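
(For context, this is the walk in dequeue_task_fair() that makes this happen; heavily simplified sketch, not a quote of the actual code:)

	for_each_sched_entity(se) {
		cfs_rq = cfs_rq_of(se);
		dequeue_entity(cfs_rq, se, flags);

		/*
		 * If the group cfs_rq still has other entities enqueued,
		 * stop here; otherwise keep walking up and dequeue the
		 * group's own sched_entity from its parent, so the parent
		 * cfs_rq->nr_running can drop to 0 as well.
		 */
		if (cfs_rq->load.weight)
			break;
	}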

>
>     migration/1-16 [001] 67.235292: sched_migrate_task: comm=task0 pid=2210 prio=120 orig_cpu=1 dest_cpu=2
>     migration/1-16 [001] 67.235295: bprint:             enqueue_task_fair: list_add_leaf_cfs_rq: cpu=2 cfs_rq=0xffff80097550d700 tg_id=6 on_list=0
>     migration/1-16 [001] 67.235298: bprint:             enqueue_task_fair: list_add_leaf_cfs_rq: cpu=2 cfs_rq=0xffff800975903700 tg_id=4 on_list=0
>     migration/1-16 [001] 67.235300: bprint:             enqueue_task_fair: list_add_leaf_cfs_rq: cpu=2 cfs_rq=0xffff800974e0fe00 tg_id=2 on_list=0
>     migration/1-16 [001] 67.235302: bprint:             enqueue_task_fair: list_add_leaf_cfs_rq: cpu=2 cfs_rq=0xffff80097fecb6e0 tg_id=1 on_list=1
>     migration/1-16 [001] 67.235309: bprint:             update_blocked_averages: update_blocked_averages: cpu=1 cfs_rq=0xffff80097550d800 tg_id=6 on_list=1
>  -> migration/1-16 [001] 67.235312: bprint:             update_blocked_averages: list_del_leaf_cfs_rq: cpu=1 cfs_rq=0xffff80097550d800 tg_id=6 on_list=1
>     migration/1-16 [001] 67.235314: bprint:             update_blocked_averages: update_blocked_averages: cpu=1 cfs_rq=0xffff800975903600 tg_id=4 on_list=1
>  -> migration/1-16 [001] 67.235315: bprint:             update_blocked_averages: list_del_leaf_cfs_rq: cpu=1 cfs_rq=0xffff800975903600 tg_id=4 on_list=1
>     migration/1-16 [001] 67.235316: bprint:             update_blocked_averages: update_blocked_averages: cpu=1 cfs_rq=0xffff800974e0f600 tg_id=2 on_list=1
>  -> migration/1-16 [001] 67.235318: bprint:             update_blocked_averages: list_del_leaf_cfs_rq: cpu=1 cfs_rq=0xffff800974e0f600 tg_id=2 on_list=1
>     migration/1-16 [001] 67.235319: bprint:             update_blocked_averages: update_blocked_averages: cpu=1 cfs_rq=0xffff80097feb56e0 tg_id=1 on_list=1
>
> If we don't remove these cfs_rq's, IMHO they stay in the list as long as the tg exists.


Yes, I agree. The main drawback is probably that
update_blocked_averages() continues to loop over these cfs_rq's during
periodic load balance. We have to compare the cost of adding a branch
of cfs_rq's to the list during the enqueue/attach of a sched_entity
with the cost of the useless loops over these load-free cfs_rq's during
periodic load balance.

>
> [...]

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4ac1718560e2..3595b0623b37 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6566,6 +6566,8 @@ static void update_blocked_averages(int cpu)
         * list_add_leaf_cfs_rq() for details.
         */
        for_each_leaf_cfs_rq(rq, cfs_rq) {
+               struct sched_entity *se = cfs_rq->tg->se[cpu];
+
                /* throttled entities do not contribute to load */
                if (throttled_hierarchy(cfs_rq))
                        continue;
@@ -6574,8 +6576,12 @@ static void update_blocked_averages(int cpu)
                        update_tg_load_avg(cfs_rq, 0);
 
                /* Propagate pending load changes to the parent */
-               if (cfs_rq->tg->se[cpu])
-                       update_load_avg(cfs_rq->tg->se[cpu], 0, 0);
+               if (se) {
+                       update_load_avg(se, 0, 0);
+
+                       if (!se->avg.load_avg && !cfs_rq->nr_running)
+                               list_del_leaf_cfs_rq(cfs_rq);
+               }
        }
        raw_spin_unlock_irqrestore(&rq->lock, flags);
 }