sched/pelt: fix false running accounting in PELT

Message ID 1498831118-22672-1-git-send-email-vincent.guittot@linaro.org
State New
Headers show

Commit Message

Vincent Guittot June 30, 2017, 1:58 p.m.
The running state is a subset of runnable state which means that running
can't be set if runnable (weight) is cleared. There are corner cases
where the current sched_entity has been already dequeued but cfs_rq->curr
has not been updated yet and still points to the dequeued sched_entity.
If ___update_load_avg is called at that time, weight will be 0 and running
will be set which is not possible.

This case happens during pick_next_task_fair() when a cfs_rq becomes idles.
The current sched_entity has been dequeued so se->on_rq is cleared and
cfs_rq->weight is null. But cfs_rq->curr still points to se (it will be
cleared when picking the idle thread). Because the cfs_rq becomes idle,
idle_balance() is called and ends up to call update_blocked_averages()
with these wrong running and runnable states.

Add a test in ___update_load_avg to correct the running state in this case.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

---
 kernel/sched/fair.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

-- 
2.7.4

Comments

Vincent Guittot July 1, 2017, 5:09 a.m. | #1
On 30 June 2017 at 15:58, Vincent Guittot <vincent.guittot@linaro.org> wrote:
> The running state is a subset of runnable state which means that running

> can't be set if runnable (weight) is cleared. There are corner cases

> where the current sched_entity has been already dequeued but cfs_rq->curr

> has not been updated yet and still points to the dequeued sched_entity.

> If ___update_load_avg is called at that time, weight will be 0 and running

> will be set which is not possible.

>

> This case happens during pick_next_task_fair() when a cfs_rq becomes idles.

> The current sched_entity has been dequeued so se->on_rq is cleared and

> cfs_rq->weight is null. But cfs_rq->curr still points to se (it will be

> cleared when picking the idle thread). Because the cfs_rq becomes idle,

> idle_balance() is called and ends up to call update_blocked_averages()

> with these wrong running and runnable states.

>

> Add a test in ___update_load_avg to correct the running state in this case.*

>

> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>


The v1 is just wrong. I have sent a v2 with correct patch

sorry for the disturb
Vincent

> ---

>  kernel/sched/fair.c | 18 ++++++++++++++++++

>  1 file changed, 18 insertions(+)

>

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c

> index 008c514..5fdcb42 100644

> --- a/kernel/sched/fair.c

> +++ b/kernel/sched/fair.c

> @@ -2968,6 +2968,24 @@ ___update_load_avg(u64 now, int cpu, struct sched_avg *sa,

>         sa->last_update_time += delta << 10;

>

>         /*

> +        * running is a subset of runnable (weight) so running can't be set if

> +        * runnable is clear. But there are some corner cases where the current

> +        * se has been already dequeued but cfs_rq->curr still points to it.

> +        * This means that weight will be 0 but not running for a sched_entity

> +        * but also for a cfs_rq if the latter becomes idle. As an example,

> +        * this happens during idle_balance() which calls

> +        * update_blocked_averages()

> +        */

> +       if (weight)

> +               running = 1;

> +

> +       /*

> +        * Scale time to reflect the amount a computation effectively done

> +        * during the time slot at current capacity

> +        */

> +       delta = scale_time(delta, cpu, sa, weight, running);

> +

> +       /*

>          * Now we know we crossed measurement unit boundaries. The *_avg

>          * accrues by two steps:

>          *

> --

> 2.7.4

>
kbuild test robot July 1, 2017, 4:57 p.m. | #2
Hi Vincent,

[auto build test ERROR on tip/sched/core]
[also build test ERROR on v4.12-rc7 next-20170630]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Vincent-Guittot/sched-pelt-fix-false-running-accounting-in-PELT/20170702-002907
config: i386-randconfig-x078-07012120 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All errors (new ones prefixed by >>):

   kernel/sched/fair.c: In function '___update_load_avg':
>> kernel/sched/fair.c:2986:10: error: implicit declaration of function 'scale_time' [-Werror=implicit-function-declaration]

     delta = scale_time(delta, cpu, sa, weight, running);
             ^~~~~~~~~~
   cc1: some warnings being treated as errors

vim +/scale_time +2986 kernel/sched/fair.c

  2980			running = 1;
  2981	
  2982		/*
  2983		 * Scale time to reflect the amount a computation effectively done
  2984		 * during the time slot at current capacity
  2985		 */
> 2986		delta = scale_time(delta, cpu, sa, weight, running);

  2987	
  2988		/*
  2989		 * Now we know we crossed measurement unit boundaries. The *_avg

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 008c514..5fdcb42 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2968,6 +2968,24 @@  ___update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 	sa->last_update_time += delta << 10;
 
 	/*
+	 * running is a subset of runnable (weight) so running can't be set if
+	 * runnable is clear. But there are some corner cases where the current
+	 * se has been already dequeued but cfs_rq->curr still points to it.
+	 * This means that weight will be 0 but not running for a sched_entity
+	 * but also for a cfs_rq if the latter becomes idle. As an example,
+	 * this happens during idle_balance() which calls
+	 * update_blocked_averages()
+	 */
+	if (weight)
+		running = 1;
+
+	/*
+	 * Scale time to reflect the amount a computation effectively done
+	 * during the time slot at current capacity
+	 */
+	delta = scale_time(delta, cpu, sa, weight, running);
+
+	/*
 	 * Now we know we crossed measurement unit boundaries. The *_avg
 	 * accrues by two steps:
 	 *