[v5,07/10] sched/irq: add irq utilization tracking

Message ID 1527253951-22709-8-git-send-email-vincent.guittot@linaro.org
State New
Headers show
Series
  • track CPU utilization
Related show

Commit Message

Vincent Guittot May 25, 2018, 1:12 p.m.
interrupt and steal time are the only remaining activities tracked by
rt_avg. Like for sched classes, we can use PELT to track their average
utilization of the CPU. But unlike sched class, we don't track when
entering/leaving interrupt; Instead, we take into account the time spent
under interrupt context when we update rqs' clock (rq_clock_task).
This also means that we have to decay the normal context time and account
for interrupt time during the update.

That's also important to note that because
  rq_clock == rq_clock_task + interrupt time
and rq_clock_task is used by a sched class to compute its utilization, the
util_avg of a sched class only reflects the utilization of the time spent
in normal context and not of the whole time of the CPU. The utilization of
interrupt gives an more accurate level of utilization of CPU.
The CPU utilization is :
  avg_irq + (1 - avg_irq / max capacity) * /Sum avg_rq

Most of the time, avg_irq is small and neglictible so the use of the
approximation CPU utilization = /Sum avg_rq was enough

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

---
 kernel/sched/core.c  |  4 +++-
 kernel/sched/fair.c  | 26 +++++++-------------------
 kernel/sched/pelt.c  | 38 ++++++++++++++++++++++++++++++++++++++
 kernel/sched/pelt.h  |  7 +++++++
 kernel/sched/sched.h |  1 +
 5 files changed, 56 insertions(+), 20 deletions(-)

-- 
2.7.4

Comments

Dietmar Eggemann May 30, 2018, 3:55 p.m. | #1
On 05/25/2018 03:12 PM, Vincent Guittot wrote:
> interrupt and steal time are the only remaining activities tracked by

> rt_avg. Like for sched classes, we can use PELT to track their average

> utilization of the CPU. But unlike sched class, we don't track when

> entering/leaving interrupt; Instead, we take into account the time spent

> under interrupt context when we update rqs' clock (rq_clock_task).

> This also means that we have to decay the normal context time and account

> for interrupt time during the update.

> 

> That's also important to note that because

>    rq_clock == rq_clock_task + interrupt time

> and rq_clock_task is used by a sched class to compute its utilization, the

> util_avg of a sched class only reflects the utilization of the time spent

> in normal context and not of the whole time of the CPU. The utilization of

> interrupt gives an more accurate level of utilization of CPU.

> The CPU utilization is :

>    avg_irq + (1 - avg_irq / max capacity) * /Sum avg_rq

> 

> Most of the time, avg_irq is small and neglictible so the use of the

> approximation CPU utilization = /Sum avg_rq was enough


[...]

> @@ -7362,6 +7363,7 @@ static void update_blocked_averages(int cpu)

>   	}

>   	update_rt_rq_load_avg(rq_clock_task(rq), rq, 0);

>   	update_dl_rq_load_avg(rq_clock_task(rq), rq, 0);

> +	update_irq_load_avg(rq, 0);


So this one decays the signals only in case the update_rq_clock_task() 
didn't call update_irq_load_avg() because 'irq_delta + steal' is 0, right?

[...]

> diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c

> index 3d5bd3a..d2e4f21 100644

> --- a/kernel/sched/pelt.c

> +++ b/kernel/sched/pelt.c

> @@ -355,3 +355,41 @@ int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)

>   

>   	return 0;

>   }

> +

> +/*

> + * irq:

> + *

> + *   util_sum = \Sum se->avg.util_sum but se->avg.util_sum is not tracked

> + *   util_sum = cpu_scale * load_sum

> + *   runnable_load_sum = load_sum

> + *

> + */

> +

> +int update_irq_load_avg(struct rq *rq, u64 running)

> +{

> +	int ret = 0;

> +	/*

> +	 * We know the time that has been used by interrupt since last update

> +	 * but we don't when. Let be pessimistic and assume that interrupt has

> +	 * happened just before the update. This is not so far from reality

> +	 * because interrupt will most probably wake up task and trig an update

> +	 * of rq clock during which the metric si updated.

> +	 * We start to decay with normal context time and then we add the

> +	 * interrupt context time.

> +	 * We can safely remove running from rq->clock because

> +	 * rq->clock += delta with delta >= running


This is true as long update_irq_load_avg() with a 'running != 0' is 
called only after rq->clock moved forward (rq->clock += delta) (which is 
true for update_rq_clock()->update_rq_clock_task()).

> +	 */

> +	ret = ___update_load_sum(rq->clock - running, rq->cpu, &rq->avg_irq,

> +				0,

> +				0,

> +				0);

> +	ret += ___update_load_sum(rq->clock, rq->cpu, &rq->avg_irq,

> +				1,

> +				1,

> +				1);


So you decay the signal in [sa->lut, rq->clock - running] (assumed to be 
the portion of delta used by the task scheduler) and you increase it in 
[rq->clock - running, rq->clock] (irq and virt portion of delta).

That means that this signal is updated on rq->clock whereas the others 
are on rq->clock_task.

What about the ever growing clock diff between them? I see e.g ~6s after 
20min uptime and up to 1.5ms 'running'.

It should be still safe to sum the sched class and irq signal in 
sugov_aggregate_util() because they are independent, I guess.

[...]
Vincent Guittot May 30, 2018, 6:45 p.m. | #2
Hi Dietmar,

On 30 May 2018 at 17:55, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> On 05/25/2018 03:12 PM, Vincent Guittot wrote:

>>

>> interrupt and steal time are the only remaining activities tracked by

>> rt_avg. Like for sched classes, we can use PELT to track their average

>> utilization of the CPU. But unlike sched class, we don't track when

>> entering/leaving interrupt; Instead, we take into account the time spent

>> under interrupt context when we update rqs' clock (rq_clock_task).

>> This also means that we have to decay the normal context time and account

>> for interrupt time during the update.

>>

>> That's also important to note that because

>>    rq_clock == rq_clock_task + interrupt time

>> and rq_clock_task is used by a sched class to compute its utilization, the

>> util_avg of a sched class only reflects the utilization of the time spent

>> in normal context and not of the whole time of the CPU. The utilization of

>> interrupt gives an more accurate level of utilization of CPU.

>> The CPU utilization is :

>>    avg_irq + (1 - avg_irq / max capacity) * /Sum avg_rq

>>

>> Most of the time, avg_irq is small and neglictible so the use of the

>> approximation CPU utilization = /Sum avg_rq was enough

>

>

> [...]

>

>> @@ -7362,6 +7363,7 @@ static void update_blocked_averages(int cpu)

>>         }

>>         update_rt_rq_load_avg(rq_clock_task(rq), rq, 0);

>>         update_dl_rq_load_avg(rq_clock_task(rq), rq, 0);

>> +       update_irq_load_avg(rq, 0);

>

>

> So this one decays the signals only in case the update_rq_clock_task()

> didn't call update_irq_load_avg() because 'irq_delta + steal' is 0, right?


yes

>

> [...]

>

>

>> diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c

>> index 3d5bd3a..d2e4f21 100644

>> --- a/kernel/sched/pelt.c

>> +++ b/kernel/sched/pelt.c

>> @@ -355,3 +355,41 @@ int update_dl_rq_load_avg(u64 now, struct rq *rq, int

>> running)

>>         return 0;

>>   }

>> +

>> +/*

>> + * irq:

>> + *

>> + *   util_sum = \Sum se->avg.util_sum but se->avg.util_sum is not tracked

>> + *   util_sum = cpu_scale * load_sum

>> + *   runnable_load_sum = load_sum

>> + *

>> + */

>> +

>> +int update_irq_load_avg(struct rq *rq, u64 running)

>> +{

>> +       int ret = 0;

>> +       /*

>> +        * We know the time that has been used by interrupt since last

>> update

>> +        * but we don't when. Let be pessimistic and assume that interrupt

>> has

>> +        * happened just before the update. This is not so far from

>> reality

>> +        * because interrupt will most probably wake up task and trig an

>> update

>> +        * of rq clock during which the metric si updated.

>> +        * We start to decay with normal context time and then we add the

>> +        * interrupt context time.

>> +        * We can safely remove running from rq->clock because

>> +        * rq->clock += delta with delta >= running

>

>

> This is true as long update_irq_load_avg() with a 'running != 0' is called

> only after rq->clock moved forward (rq->clock += delta) (which is true for

> update_rq_clock()->update_rq_clock_task()).


yes

>

>> +        */

>> +       ret = ___update_load_sum(rq->clock - running, rq->cpu,

>> &rq->avg_irq,

>> +                               0,

>> +                               0,

>> +                               0);

>> +       ret += ___update_load_sum(rq->clock, rq->cpu, &rq->avg_irq,

>> +                               1,

>> +                               1,

>> +                               1);

>

>

> So you decay the signal in [sa->lut, rq->clock - running] (assumed to be the

> portion of delta used by the task scheduler) and you increase it in

> [rq->clock - running, rq->clock] (irq and virt portion of delta).

>

> That means that this signal is updated on rq->clock whereas the others are

> on rq->clock_task.

>

> What about the ever growing clock diff between them? I see e.g ~6s after

> 20min uptime and up to 1.5ms 'running'.

>

> It should be still safe to sum the sched class and irq signal in

> sugov_aggregate_util() because they are independent, I guess.


yes. the formula is explained in patch "cpufreq/schedutil: take into
account interrupt"


>

> [...]
Dietmar Eggemann May 31, 2018, 4:54 p.m. | #3
On 05/30/2018 08:45 PM, Vincent Guittot wrote:
> Hi Dietmar,

> 

> On 30 May 2018 at 17:55, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:

>> On 05/25/2018 03:12 PM, Vincent Guittot wrote:


[...]

>>> +        */

>>> +       ret = ___update_load_sum(rq->clock - running, rq->cpu,

>>> &rq->avg_irq,

>>> +                               0,

>>> +                               0,

>>> +                               0);

>>> +       ret += ___update_load_sum(rq->clock, rq->cpu, &rq->avg_irq,

>>> +                               1,

>>> +                               1,

>>> +                               1);


Can you not change the function parameter list to the usual
(u64 now, struct rq *rq, int running)?

Something like this (only compile and boot tested):

-- >8 --

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9894bc7af37e..26ffd585cab8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -177,8 +177,22 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
        rq->clock_task += delta;
 
 #if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
-       if ((irq_delta + steal) && sched_feat(NONTASK_CAPACITY))
-               update_irq_load_avg(rq, irq_delta + steal);
+       if ((irq_delta + steal) && sched_feat(NONTASK_CAPACITY)) {
+               /*
+                * We know the time that has been used by interrupt since last
+                * update but we don't when. Let be pessimistic and assume that
+                * interrupt has happened just before the update. This is not
+                * so far from reality because interrupt will most probably
+                * wake up task and trig an update of rq clock during which the
+                * metric si updated.
+                * We start to decay with normal context time and then we add
+                * the interrupt context time.
+                * We can safely remove running from rq->clock because
+                * rq->clock += delta with delta >= running
+                */
+               update_irq_load_avg(rq_clock(rq) - (irq_delta + steal), rq, 0);
+               update_irq_load_avg(rq_clock(rq), rq, 1);
+       }
 #endif
 }
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1bb3379c4b71..a245f853c271 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7363,7 +7363,7 @@ static void update_blocked_averages(int cpu)
        }
        update_rt_rq_load_avg(rq_clock_task(rq), rq, 0);
        update_dl_rq_load_avg(rq_clock_task(rq), rq, 0);
-       update_irq_load_avg(rq, 0);
+       update_irq_load_avg(rq_clock(rq), rq, 0);
        /* Don't need periodic decay once load/util_avg are null */
        if (others_rqs_have_blocked(rq))
                done = false;
@@ -7434,7 +7434,7 @@ static inline void update_blocked_averages(int cpu)
        update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq);
        update_rt_rq_load_avg(rq_clock_task(rq), rq, 0);
        update_dl_rq_load_avg(rq_clock_task(rq), rq, 0);
-       update_irq_load_avg(rq, 0);
+       update_irq_load_avg(rq_clock(rq), rq, 0);
 #ifdef CONFIG_NO_HZ_COMMON
        rq->last_blocked_load_update_tick = jiffies;
        if (!cfs_rq_has_blocked(cfs_rq) && !others_rqs_have_blocked(rq))
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index d2e4f2186b13..ae01bb18e28c 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -365,31 +365,15 @@ int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
  *
  */
 
-int update_irq_load_avg(struct rq *rq, u64 running)
+int update_irq_load_avg(u64 now, struct rq *rq, int running)
 {
-       int ret = 0;
-       /*
-        * We know the time that has been used by interrupt since last update
-        * but we don't when. Let be pessimistic and assume that interrupt has
-        * happened just before the update. This is not so far from reality
-        * because interrupt will most probably wake up task and trig an update
-        * of rq clock during which the metric si updated.
-        * We start to decay with normal context time and then we add the
-        * interrupt context time.
-        * We can safely remove running from rq->clock because
-        * rq->clock += delta with delta >= running
-        */
-       ret = ___update_load_sum(rq->clock - running, rq->cpu, &rq->avg_irq,
-                               0,
-                               0,
-                               0);
-       ret += ___update_load_sum(rq->clock, rq->cpu, &rq->avg_irq,
-                               1,
-                               1,
-                               1);
-
-       if (ret)
+       if (___update_load_sum(now, rq->cpu, &rq->avg_irq,
+                               running,
+                               running,
+                               running)) {
                ___update_load_avg(&rq->avg_irq, 1, 1);
+               return 1;
+       }
 
-       return ret;
+       return 0;
 }
diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
index 0ce9a5a5877a..ebc57301a9a8 100644
--- a/kernel/sched/pelt.h
+++ b/kernel/sched/pelt.h
@@ -5,7 +5,7 @@ int __update_load_avg_se(u64 now, int cpu, struct cfs_rq *cfs_rq, struct sched_e
 int __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq);
 int update_rt_rq_load_avg(u64 now, struct rq *rq, int running);
 int update_dl_rq_load_avg(u64 now, struct rq *rq, int running);
-int update_irq_load_avg(struct rq *rq, u64 running);
+int update_irq_load_avg(u64 now, struct rq *rq, int running);
 
 /*
  * When a task is dequeued, its estimated utilization should not be update if
Vincent Guittot June 6, 2018, 4:06 p.m. | #4
Hi Dietmar,

Sorry for the late answer

On 31 May 2018 at 18:54, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> On 05/30/2018 08:45 PM, Vincent Guittot wrote:

>> Hi Dietmar,

>>

>> On 30 May 2018 at 17:55, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:

>>> On 05/25/2018 03:12 PM, Vincent Guittot wrote:

>

> [...]

>

>>>> +        */

>>>> +       ret = ___update_load_sum(rq->clock - running, rq->cpu,

>>>> &rq->avg_irq,

>>>> +                               0,

>>>> +                               0,

>>>> +                               0);

>>>> +       ret += ___update_load_sum(rq->clock, rq->cpu, &rq->avg_irq,

>>>> +                               1,

>>>> +                               1,

>>>> +                               1);

>

> Can you not change the function parameter list to the usual

> (u64 now, struct rq *rq, int running)?

>

> Something like this (only compile and boot tested):


To be honest, I prefer to keep the specific sequence above in a
dedicated function instead of adding it in core code.
Furthermore,  we end up calling call twice ___update_load_avg instead
of only once. This will set an intermediate and incorrect value in
util_avg and this value can be read in the meantime

Vincent

>

> -- >8 --

>

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c

> index 9894bc7af37e..26ffd585cab8 100644

> --- a/kernel/sched/core.c

> +++ b/kernel/sched/core.c

> @@ -177,8 +177,22 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)

>         rq->clock_task += delta;

>

>  #if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)

> -       if ((irq_delta + steal) && sched_feat(NONTASK_CAPACITY))

> -               update_irq_load_avg(rq, irq_delta + steal);

> +       if ((irq_delta + steal) && sched_feat(NONTASK_CAPACITY)) {

> +               /*

> +                * We know the time that has been used by interrupt since last

> +                * update but we don't when. Let be pessimistic and assume that

> +                * interrupt has happened just before the update. This is not

> +                * so far from reality because interrupt will most probably

> +                * wake up task and trig an update of rq clock during which the

> +                * metric si updated.

> +                * We start to decay with normal context time and then we add

> +                * the interrupt context time.

> +                * We can safely remove running from rq->clock because

> +                * rq->clock += delta with delta >= running

> +                */

> +               update_irq_load_avg(rq_clock(rq) - (irq_delta + steal), rq, 0);

> +               update_irq_load_avg(rq_clock(rq), rq, 1);

> +       }

>  #endif

>  }

>

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c

> index 1bb3379c4b71..a245f853c271 100644

> --- a/kernel/sched/fair.c

> +++ b/kernel/sched/fair.c

> @@ -7363,7 +7363,7 @@ static void update_blocked_averages(int cpu)

>         }

>         update_rt_rq_load_avg(rq_clock_task(rq), rq, 0);

>         update_dl_rq_load_avg(rq_clock_task(rq), rq, 0);

> -       update_irq_load_avg(rq, 0);

> +       update_irq_load_avg(rq_clock(rq), rq, 0);

>         /* Don't need periodic decay once load/util_avg are null */

>         if (others_rqs_have_blocked(rq))

>                 done = false;

> @@ -7434,7 +7434,7 @@ static inline void update_blocked_averages(int cpu)

>         update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq);

>         update_rt_rq_load_avg(rq_clock_task(rq), rq, 0);

>         update_dl_rq_load_avg(rq_clock_task(rq), rq, 0);

> -       update_irq_load_avg(rq, 0);

> +       update_irq_load_avg(rq_clock(rq), rq, 0);

>  #ifdef CONFIG_NO_HZ_COMMON

>         rq->last_blocked_load_update_tick = jiffies;

>         if (!cfs_rq_has_blocked(cfs_rq) && !others_rqs_have_blocked(rq))

> diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c

> index d2e4f2186b13..ae01bb18e28c 100644

> --- a/kernel/sched/pelt.c

> +++ b/kernel/sched/pelt.c

> @@ -365,31 +365,15 @@ int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)

>   *

>   */

>

> -int update_irq_load_avg(struct rq *rq, u64 running)

> +int update_irq_load_avg(u64 now, struct rq *rq, int running)

>  {

> -       int ret = 0;

> -       /*

> -        * We know the time that has been used by interrupt since last update

> -        * but we don't when. Let be pessimistic and assume that interrupt has

> -        * happened just before the update. This is not so far from reality

> -        * because interrupt will most probably wake up task and trig an update

> -        * of rq clock during which the metric si updated.

> -        * We start to decay with normal context time and then we add the

> -        * interrupt context time.

> -        * We can safely remove running from rq->clock because

> -        * rq->clock += delta with delta >= running

> -        */

> -       ret = ___update_load_sum(rq->clock - running, rq->cpu, &rq->avg_irq,

> -                               0,

> -                               0,

> -                               0);

> -       ret += ___update_load_sum(rq->clock, rq->cpu, &rq->avg_irq,

> -                               1,

> -                               1,

> -                               1);

> -

> -       if (ret)

> +       if (___update_load_sum(now, rq->cpu, &rq->avg_irq,

> +                               running,

> +                               running,

> +                               running)) {

>                 ___update_load_avg(&rq->avg_irq, 1, 1);

> +               return 1;

> +       }

>

> -       return ret;

> +       return 0;

>  }

> diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h

> index 0ce9a5a5877a..ebc57301a9a8 100644

> --- a/kernel/sched/pelt.h

> +++ b/kernel/sched/pelt.h

> @@ -5,7 +5,7 @@ int __update_load_avg_se(u64 now, int cpu, struct cfs_rq *cfs_rq, struct sched_e

>  int __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq);

>  int update_rt_rq_load_avg(u64 now, struct rq *rq, int running);

>  int update_dl_rq_load_avg(u64 now, struct rq *rq, int running);

> -int update_irq_load_avg(struct rq *rq, u64 running);

> +int update_irq_load_avg(u64 now, struct rq *rq, int running);

>

>  /*

>   * When a task is dequeued, its estimated utilization should not be update if

>

>
Dietmar Eggemann June 7, 2018, 8:29 a.m. | #5
On 06/06/2018 06:06 PM, Vincent Guittot wrote:
> Hi Dietmar,

> 

> Sorry for the late answer

> 

> On 31 May 2018 at 18:54, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:

>> On 05/30/2018 08:45 PM, Vincent Guittot wrote:

>>> Hi Dietmar,

>>>

>>> On 30 May 2018 at 17:55, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:

>>>> On 05/25/2018 03:12 PM, Vincent Guittot wrote:

>>

>> [...]

>>

>>>>> +        */

>>>>> +       ret = ___update_load_sum(rq->clock - running, rq->cpu,

>>>>> &rq->avg_irq,

>>>>> +                               0,

>>>>> +                               0,

>>>>> +                               0);

>>>>> +       ret += ___update_load_sum(rq->clock, rq->cpu, &rq->avg_irq,

>>>>> +                               1,

>>>>> +                               1,

>>>>> +                               1);

>>

>> Can you not change the function parameter list to the usual

>> (u64 now, struct rq *rq, int running)?

>>

>> Something like this (only compile and boot tested):

> 

> To be honest, I prefer to keep the specific sequence above in a

> dedicated function instead of adding it in core code.


No big issue.

> Furthermore,  we end up calling call twice ___update_load_avg instead

> of only once. This will set an intermediate and incorrect value in

> util_avg and this value can be read in the meantime


Can't buy this argument though because this is true with the current 
implementation as well since the 'decay load sum' - 'accrue load sum' 
sequence is not atomic.

What about calling update_irq_load_avg(rq, 0) in update_rq_clock_task() 
if (irq_delta + steal) eq. 0 and sched_feat(NONTASK_CAPACITY) eq. true 
in this #ifdef CONFIG_XXX_TIME_ACCOUNTING block?

Maintaining a irq/steal time signal makes only sense if at least one of 
the CONFIG_XXX_TIME_ACCOUNTING is set and NONTASK_CAPACITY is true. The 
call to update_irq_load_avg() in update_blocked_averages() isn't guarded 
my them.

[...]
Vincent Guittot June 7, 2018, 8:44 a.m. | #6
On 7 June 2018 at 10:29, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> On 06/06/2018 06:06 PM, Vincent Guittot wrote:

>>

>> Hi Dietmar,

>>

>> Sorry for the late answer

>>

>> On 31 May 2018 at 18:54, Dietmar Eggemann <dietmar.eggemann@arm.com>

>> wrote:

>>>

>>> On 05/30/2018 08:45 PM, Vincent Guittot wrote:

>>>>

>>>> Hi Dietmar,

>>>>

>>>> On 30 May 2018 at 17:55, Dietmar Eggemann <dietmar.eggemann@arm.com>

>>>> wrote:

>>>>>

>>>>> On 05/25/2018 03:12 PM, Vincent Guittot wrote:

>>>

>>>

>>> [...]

>>>

>>>>>> +        */

>>>>>> +       ret = ___update_load_sum(rq->clock - running, rq->cpu,

>>>>>> &rq->avg_irq,

>>>>>> +                               0,

>>>>>> +                               0,

>>>>>> +                               0);

>>>>>> +       ret += ___update_load_sum(rq->clock, rq->cpu, &rq->avg_irq,

>>>>>> +                               1,

>>>>>> +                               1,

>>>>>> +                               1);

>>>

>>>

>>> Can you not change the function parameter list to the usual

>>> (u64 now, struct rq *rq, int running)?

>>>

>>> Something like this (only compile and boot tested):

>>

>>

>> To be honest, I prefer to keep the specific sequence above in a

>> dedicated function instead of adding it in core code.

>

>

> No big issue.

>

>> Furthermore,  we end up calling call twice ___update_load_avg instead

>> of only once. This will set an intermediate and incorrect value in

>> util_avg and this value can be read in the meantime

>

>

> Can't buy this argument though because this is true with the current

> implementation as well since the 'decay load sum' - 'accrue load sum'

> sequence is not atomic.


it's not a problem that the _sum variable are updated in different
step because there are internal variable
Only util_avg is used "outside" and the latter is updated after both
idle and running steps have been applied

>

> What about calling update_irq_load_avg(rq, 0) in update_rq_clock_task() if

> (irq_delta + steal) eq. 0 and sched_feat(NONTASK_CAPACITY) eq. true in this

> #ifdef CONFIG_XXX_TIME_ACCOUNTING block?


update_irq_load_avg(rq, 0) is called in update_blocked_averages to
decay smoothly like other blocked signals and replace the need to call
update_irq_load_avg(rq, 0) for every call to update_rq_clock_task()
which can be significant

>

> Maintaining a irq/steal time signal makes only sense if at least one of the

> CONFIG_XXX_TIME_ACCOUNTING is set and NONTASK_CAPACITY is true. The call to

> update_irq_load_avg() in update_blocked_averages() isn't guarded my them.


good point

>

> [...]
Dietmar Eggemann June 7, 2018, 9:06 a.m. | #7
On 06/07/2018 10:44 AM, Vincent Guittot wrote:
> On 7 June 2018 at 10:29, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:

>> On 06/06/2018 06:06 PM, Vincent Guittot wrote:

>>>

>>> Hi Dietmar,

>>>

>>> Sorry for the late answer

>>>

>>> On 31 May 2018 at 18:54, Dietmar Eggemann <dietmar.eggemann@arm.com>

>>> wrote:

>>>>

>>>> On 05/30/2018 08:45 PM, Vincent Guittot wrote:

>>>>>

>>>>> Hi Dietmar,

>>>>>

>>>>> On 30 May 2018 at 17:55, Dietmar Eggemann <dietmar.eggemann@arm.com>

>>>>> wrote:

>>>>>>

>>>>>> On 05/25/2018 03:12 PM, Vincent Guittot wrote:


[...]

>> Can't buy this argument though because this is true with the current

>> implementation as well since the 'decay load sum' - 'accrue load sum'

>> sequence is not atomic.

> 

> it's not a problem that the _sum variable are updated in different

> step because there are internal variable

> Only util_avg is used "outside" and the latter is updated after both

> idle and running steps have been applied


You're right here!

>> What about calling update_irq_load_avg(rq, 0) in update_rq_clock_task() if

>> (irq_delta + steal) eq. 0 and sched_feat(NONTASK_CAPACITY) eq. true in this

>> #ifdef CONFIG_XXX_TIME_ACCOUNTING block?

> 

> update_irq_load_avg(rq, 0) is called in update_blocked_averages to

> decay smoothly like other blocked signals and replace the need to call

> update_irq_load_avg(rq, 0) for every call to update_rq_clock_task()

> which can be significant


OK.

>> Maintaining a irq/steal time signal makes only sense if at least one of the

>> CONFIG_XXX_TIME_ACCOUNTING is set and NONTASK_CAPACITY is true. The call to

>> update_irq_load_avg() in update_blocked_averages() isn't guarded my them.

> 

> good point


[...]

Patch

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d155518..ab58288 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -16,6 +16,8 @@ 
 #include "../workqueue_internal.h"
 #include "../smpboot.h"
 
+#include "pelt.h"
+
 #define CREATE_TRACE_POINTS
 #include <trace/events/sched.h>
 
@@ -184,7 +186,7 @@  static void update_rq_clock_task(struct rq *rq, s64 delta)
 
 #if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
 	if ((irq_delta + steal) && sched_feat(NONTASK_CAPACITY))
-		sched_rt_avg_update(rq, irq_delta + steal);
+		update_irq_load_avg(rq, irq_delta + steal);
 #endif
 }
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index da75eda..1bb3379 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5323,8 +5323,6 @@  static void cpu_load_update(struct rq *this_rq, unsigned long this_load,
 
 		this_rq->cpu_load[i] = (old_load * (scale - 1) + new_load) >> i;
 	}
-
-	sched_avg_update(this_rq);
 }
 
 /* Used instead of source_load when we know the type == 0 */
@@ -7298,6 +7296,9 @@  static inline bool others_rqs_have_blocked(struct rq *rq)
 	if (rq->avg_dl.util_avg)
 		return true;
 
+	if (rq->avg_irq.util_avg)
+		return true;
+
 	return false;
 }
 
@@ -7362,6 +7363,7 @@  static void update_blocked_averages(int cpu)
 	}
 	update_rt_rq_load_avg(rq_clock_task(rq), rq, 0);
 	update_dl_rq_load_avg(rq_clock_task(rq), rq, 0);
+	update_irq_load_avg(rq, 0);
 	/* Don't need periodic decay once load/util_avg are null */
 	if (others_rqs_have_blocked(rq))
 		done = false;
@@ -7432,6 +7434,7 @@  static inline void update_blocked_averages(int cpu)
 	update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq);
 	update_rt_rq_load_avg(rq_clock_task(rq), rq, 0);
 	update_dl_rq_load_avg(rq_clock_task(rq), rq, 0);
+	update_irq_load_avg(rq, 0);
 #ifdef CONFIG_NO_HZ_COMMON
 	rq->last_blocked_load_update_tick = jiffies;
 	if (!cfs_rq_has_blocked(cfs_rq) && !others_rqs_have_blocked(rq))
@@ -7544,24 +7547,9 @@  static inline int get_sd_load_idx(struct sched_domain *sd,
 static unsigned long scale_rt_capacity(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
-	u64 total, used, age_stamp, avg;
-	s64 delta;
-
-	/*
-	 * Since we're reading these variables without serialization make sure
-	 * we read them once before doing sanity checks on them.
-	 */
-	age_stamp = READ_ONCE(rq->age_stamp);
-	avg = READ_ONCE(rq->rt_avg);
-	delta = __rq_clock_broken(rq) - age_stamp;
-
-	if (unlikely(delta < 0))
-		delta = 0;
-
-	total = sched_avg_period() + delta;
-
-	used = div_u64(avg, total);
+	unsigned long used;
 
+	used = READ_ONCE(rq->avg_irq.util_avg);
 	used += READ_ONCE(rq->avg_rt.util_avg);
 	used += READ_ONCE(rq->avg_dl.util_avg);
 	if (likely(used < SCHED_CAPACITY_SCALE))
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index 3d5bd3a..d2e4f21 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -355,3 +355,41 @@  int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
 
 	return 0;
 }
+
+/*
+ * irq:
+ *
+ *   util_sum = \Sum se->avg.util_sum but se->avg.util_sum is not tracked
+ *   util_sum = cpu_scale * load_sum
+ *   runnable_load_sum = load_sum
+ *
+ */
+
+int update_irq_load_avg(struct rq *rq, u64 running)
+{
+	int ret = 0;
+	/*
+	 * We know the time that has been used by interrupt since last update
+	 * but we don't when. Let be pessimistic and assume that interrupt has
+	 * happened just before the update. This is not so far from reality
+	 * because interrupt will most probably wake up task and trig an update
+	 * of rq clock during which the metric si updated.
+	 * We start to decay with normal context time and then we add the
+	 * interrupt context time.
+	 * We can safely remove running from rq->clock because
+	 * rq->clock += delta with delta >= running
+	 */
+	ret = ___update_load_sum(rq->clock - running, rq->cpu, &rq->avg_irq,
+				0,
+				0,
+				0);
+	ret += ___update_load_sum(rq->clock, rq->cpu, &rq->avg_irq,
+				1,
+				1,
+				1);
+
+	if (ret)
+		___update_load_avg(&rq->avg_irq, 1, 1);
+
+	return ret;
+}
diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
index 0e4f912..0ce9a5a 100644
--- a/kernel/sched/pelt.h
+++ b/kernel/sched/pelt.h
@@ -5,6 +5,7 @@  int __update_load_avg_se(u64 now, int cpu, struct cfs_rq *cfs_rq, struct sched_e
 int __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq);
 int update_rt_rq_load_avg(u64 now, struct rq *rq, int running);
 int update_dl_rq_load_avg(u64 now, struct rq *rq, int running);
+int update_irq_load_avg(struct rq *rq, u64 running);
 
 /*
  * When a task is dequeued, its estimated utilization should not be update if
@@ -51,6 +52,12 @@  update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
 {
 	return 0;
 }
+
+static inline int
+update_irq_load_avg(struct rq *rq, u64 running)
+{
+	return 0;
+}
 #endif
 
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0eb07a8..f7e8d5b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -850,6 +850,7 @@  struct rq {
 	u64			age_stamp;
 	struct sched_avg	avg_rt;
 	struct sched_avg	avg_dl;
+	struct sched_avg	avg_irq;
 	u64			idle_stamp;
 	u64			avg_idle;