[v7,2/2] sched/fair: update scale invariance of PELT

Message ID	1542711308-25256-3-git-send-email-vincent.guittot@linaro.org
State	Superseded
Headers	show Delivered-To: patch@linaro.org Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; From: Vincent Guittot <vincent.guittot@linaro.org> To: peterz@infradead.org, mingo@kernel.org, linux-kernel@vger.kernel.org Cc: rjw@rjwysocki.net, dietmar.eggemann@arm.com, Morten.Rasmussen@arm.com, patrick.bellasi@arm.com, pjt@google.com, bsegall@google.com, thara.gopinath@linaro.org, pkondeti@codeaurora.org, quentin.perret@arm.com, srinivas.pandruvada@linux.intel.com, Vincent Guittot <vincent.guittot@linaro.org> Subject: [PATCH v7 2/2] sched/fair: update scale invariance of PELT Date: Tue, 20 Nov 2018 11:55:08 +0100 Message-Id: <1542711308-25256-3-git-send-email-vincent.guittot@linaro.org> In-Reply-To: <1542711308-25256-1-git-send-email-vincent.guittot@linaro.org> References: <1542711308-25256-1-git-send-email-vincent.guittot@linaro.org> Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk
Series	sched/fair: update scale invariance of PELT \| expand [v7,0/2] sched/fair: update scale invariance of PELT [v7,1/2] sched/fair: move rq_of helper function [v7,2/2] sched/fair: update scale invariance of PELT

Vincent Guittot Nov. 20, 2018, 10:55 a.m. UTC

The current implementation of load tracking invariance scales the
contribution with current frequency and uarch performance (only for
utilization) of the CPU. One main result of this formula is that the
figures are capped by current capacity of CPU. Another one is that the
load_avg is not invariant because not scaled with uarch.

The util_avg of a periodic task that runs r time slots every p time slots
varies in the range :

    U * (1-y^r)/(1-y^p) * y^i < Utilization < U * (1-y^r)/(1-y^p)

with U is the max util_avg value = SCHED_CAPACITY_SCALE

At a lower capacity, the range becomes:

    U * C * (1-y^r')/(1-y^p) * y^i' < Utilization <  U * C * (1-y^r')/(1-y^p)

with C reflecting the compute capacity ratio between current capacity and
max capacity.

so C tries to compensate changes in (1-y^r') but it can't be accurate.

Instead of scaling the contribution value of PELT algo, we should scale the
running time. The PELT signal aims to track the amount of computation of
tasks and/or rq so it seems more correct to scale the running time to
reflect the effective amount of computation done since the last update.

In order to be fully invariant, we need to apply the same amount of
running time and idle time whatever the current capacity. Because running
at lower capacity implies that the task will run longer, we have to ensure
that the same amount of idle time will be applied when system becomes idle
and no idle time has been "stolen". But reaching the maximum utilization
value (SCHED_CAPACITY_SCALE) means that the task is seen as an
always-running task whatever the capacity of the CPU (even at max compute
capacity). In this case, we can discard this "stolen" idle times which
becomes meaningless.

In order to achieve this time scaling, a new clock_pelt is created per rq.
The increase of this clock scales with current capacity when something
is running on rq and synchronizes with clock_task when rq is idle. With
this mechanism, we ensure the same running and idle time whatever the
current capacity. This also enables to simplify the pelt algorithm by
removing all references of uarch and frequency and applying the same
contribution to utilization and loads. Furthermore, the scaling is done
only once per update of clock (update_rq_clock_task()) instead of during
each update of sched_entities and cfs/rt/dl_rq of the rq like the current
implementation. This is interesting when cgroup are involved as shown in
the results below:

On a hikey (octo Arm64 platform).
Performance cpufreq governor and only shallowest c-state to remove variance
generated by those power features so we only track the impact of pelt algo.

each test runs 16 times

./perf bench sched pipe
(higher is better)
kernel	tip/sched/core     + patch
        ops/seconds        ops/seconds         diff
cgroup
root    59652(+/- 0.18%)   59876(+/- 0.24%)    +0.38%
level1  55608(+/- 0.27%)   55923(+/- 0.24%)    +0.57%
level2  52115(+/- 0.29%)   52564(+/- 0.22%)    +0.86%

hackbench -l 1000
(lower is better)
kernel	tip/sched/core     + patch
        duration(sec)      duration(sec)        diff
cgroup
root    4.453(+/- 2.37%)   4.383(+/- 2.88%)     -1.57%
level1  4.859(+/- 8.50%)   4.830(+/- 7.07%)     -0.60%
level2  5.063(+/- 9.83%)   4.928(+/- 9.66%)     -2.66%

Then, the responsiveness of PELT is improved when CPU is not running at max
capacity with this new algorithm. I have put below some examples of
duration to reach some typical load values according to the capacity of the
CPU with current implementation and with this patch. These values has been
computed based on the geometric series and the half period value:

Util (%)     max capacity  half capacity(mainline)  half capacity(w/ patch)
972 (95%)    138ms         not reachable            276ms
486 (47.5%)  30ms          138ms                     60ms
256 (25%)    13ms           32ms                     26ms

On my hikey (octo Arm64 platform) with schedutil governor, the time to
reach max OPP when starting from a null utilization, decreases from 223ms
with current scale invariance down to 121ms with the new algorithm.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

---
 include/linux/sched.h   |  23 +++-------
 kernel/sched/core.c     |   1 +
 kernel/sched/deadline.c |   6 +--
 kernel/sched/fair.c     |  45 +++++++++++---------
 kernel/sched/pelt.c     |  45 +++++++++++---------
 kernel/sched/pelt.h     | 111 ++++++++++++++++++++++++++++++++++++++++++++++--
 kernel/sched/rt.c       |   6 +--
 kernel/sched/sched.h    |   5 ++-
 8 files changed, 174 insertions(+), 68 deletions(-)

-- 
2.7.4

Vincent Guittot Nov. 28, 2018, 9:54 a.m. UTC | #1

Hi,

On Tue, 20 Nov 2018 at 11:55, Vincent Guittot
<vincent.guittot@linaro.org> wrote:
>

> The current implementation of load tracking invariance scales the

> contribution with current frequency and uarch performance (only for

> utilization) of the CPU. One main result of this formula is that the

> figures are capped by current capacity of CPU. Another one is that the

> load_avg is not invariant because not scaled with uarch.

>

> The util_avg of a periodic task that runs r time slots every p time slots

> varies in the range :

>

>     U * (1-y^r)/(1-y^p) * y^i < Utilization < U * (1-y^r)/(1-y^p)

>

> with U is the max util_avg value = SCHED_CAPACITY_SCALE

>

> At a lower capacity, the range becomes:

>

>     U * C * (1-y^r')/(1-y^p) * y^i' < Utilization <  U * C * (1-y^r')/(1-y^p)

>

> with C reflecting the compute capacity ratio between current capacity and

> max capacity.

>

> so C tries to compensate changes in (1-y^r') but it can't be accurate.

>

> Instead of scaling the contribution value of PELT algo, we should scale the

> running time. The PELT signal aims to track the amount of computation of

> tasks and/or rq so it seems more correct to scale the running time to

> reflect the effective amount of computation done since the last update.

>

> In order to be fully invariant, we need to apply the same amount of

> running time and idle time whatever the current capacity. Because running

> at lower capacity implies that the task will run longer, we have to ensure

> that the same amount of idle time will be applied when system becomes idle

> and no idle time has been "stolen". But reaching the maximum utilization

> value (SCHED_CAPACITY_SCALE) means that the task is seen as an

> always-running task whatever the capacity of the CPU (even at max compute

> capacity). In this case, we can discard this "stolen" idle times which

> becomes meaningless.

>

> In order to achieve this time scaling, a new clock_pelt is created per rq.

> The increase of this clock scales with current capacity when something

> is running on rq and synchronizes with clock_task when rq is idle. With

> this mechanism, we ensure the same running and idle time whatever the

> current capacity. This also enables to simplify the pelt algorithm by

> removing all references of uarch and frequency and applying the same

> contribution to utilization and loads. Furthermore, the scaling is done

> only once per update of clock (update_rq_clock_task()) instead of during

> each update of sched_entities and cfs/rt/dl_rq of the rq like the current

> implementation. This is interesting when cgroup are involved as shown in

> the results below:

>

> On a hikey (octo Arm64 platform).

> Performance cpufreq governor and only shallowest c-state to remove variance

> generated by those power features so we only track the impact of pelt algo.

>

> each test runs 16 times

>

> ./perf bench sched pipe

> (higher is better)

> kernel  tip/sched/core     + patch

>         ops/seconds        ops/seconds         diff

> cgroup

> root    59652(+/- 0.18%)   59876(+/- 0.24%)    +0.38%

> level1  55608(+/- 0.27%)   55923(+/- 0.24%)    +0.57%

> level2  52115(+/- 0.29%)   52564(+/- 0.22%)    +0.86%

>

> hackbench -l 1000

> (lower is better)

> kernel  tip/sched/core     + patch

>         duration(sec)      duration(sec)        diff

> cgroup

> root    4.453(+/- 2.37%)   4.383(+/- 2.88%)     -1.57%

> level1  4.859(+/- 8.50%)   4.830(+/- 7.07%)     -0.60%

> level2  5.063(+/- 9.83%)   4.928(+/- 9.66%)     -2.66%

>

> Then, the responsiveness of PELT is improved when CPU is not running at max

> capacity with this new algorithm. I have put below some examples of

> duration to reach some typical load values according to the capacity of the

> CPU with current implementation and with this patch. These values has been

> computed based on the geometric series and the half period value:

>

> Util (%)     max capacity  half capacity(mainline)  half capacity(w/ patch)

> 972 (95%)    138ms         not reachable            276ms

> 486 (47.5%)  30ms          138ms                     60ms

> 256 (25%)    13ms           32ms                     26ms

>

> On my hikey (octo Arm64 platform) with schedutil governor, the time to

> reach max OPP when starting from a null utilization, decreases from 223ms

> with current scale invariance down to 121ms with the new algorithm.

>

> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>


Is there anything else that I should do for these patches ?

Regards,
Vincent

Peter Zijlstra Nov. 28, 2018, 10:02 a.m. UTC | #2

On Wed, Nov 28, 2018 at 10:54:13AM +0100, Vincent Guittot wrote:

> Is there anything else that I should do for these patches ?


IIRC, Morten mention they break util_est; Patrick was going to explain.

But yes, I like them.

Patrick Bellasi Nov. 28, 2018, 11:53 a.m. UTC | #3

On 28-Nov 11:02, Peter Zijlstra wrote:
> On Wed, Nov 28, 2018 at 10:54:13AM +0100, Vincent Guittot wrote:

> 

> > Is there anything else that I should do for these patches ?

> 

> IIRC, Morten mention they break util_est; Patrick was going to explain.

I guess the problem is that, once we cross the current capacity,
strictly speaking util_avg does not represent anymore a utilization.

With the new signal this could happen and we end up storing estimated
utilization samples which will overestimate the task requirements.

We will have a spike in estimated utilization at next wakeup, since we
use MAX(util_avg@dequeue_time, ewma). Potentially we also inflate the EWMA in
case we collect multiple samples above the current capacity.

So, a possible fix could be to avoid storing util_est samples if we
end up with a utilization above the current capacity.

Something like:

----8<---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ac855b2f4774..93e0cf5d8a76 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3661,6 +3661,10 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)
 	if (!task_sleep)
 		return;

+	/* Skip samples which do not represent an actual utilization */
+	if (unlikely(task_util(p) > capacity_of(task_cpu(p))))
+		return;
+
 	/*
 	 * If the PELT values haven't changed since enqueue time,
 	 * skip the util_est update.
---8<---

Could that work ?

Maybe using a new utility function to wrap the new check.

-- 
#include <best/regards.h>

Patrick Bellasi

Vincent Guittot Nov. 28, 2018, 1:33 p.m. UTC | #4

On Wed, 28 Nov 2018 at 12:53, Patrick Bellasi <patrick.bellasi@arm.com> wrote:
>

> On 28-Nov 11:02, Peter Zijlstra wrote:

> > On Wed, Nov 28, 2018 at 10:54:13AM +0100, Vincent Guittot wrote:

> >

> > > Is there anything else that I should do for these patches ?

> >

> > IIRC, Morten mention they break util_est; Patrick was going to explain.

>

> I guess the problem is that, once we cross the current capacity,

> strictly speaking util_avg does not represent anymore a utilization.

>

> With the new signal this could happen and we end up storing estimated

> utilization samples which will overestimate the task requirements.

>

> We will have a spike in estimated utilization at next wakeup, since we

> use MAX(util_avg@dequeue_time, ewma). Potentially we also inflate the EWMA in

> case we collect multiple samples above the current capacity.


TBH I don't see how it's different from current implementation with a
task that was scheduled on big core and now wakes up on little core.
The util_est is overestimated as well.

But I'm fine with adding your proposal on to on the patchset

>

> So, a possible fix could be to avoid storing util_est samples if we

> end up with a utilization above the current capacity.

>

> Something like:

>

> ----8<---

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c

> index ac855b2f4774..93e0cf5d8a76 100644

> --- a/kernel/sched/fair.c

> +++ b/kernel/sched/fair.c

> @@ -3661,6 +3661,10 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)

>         if (!task_sleep)

>                 return;

>

> +       /* Skip samples which do not represent an actual utilization */

> +       if (unlikely(task_util(p) > capacity_of(task_cpu(p))))

> +               return;

> +

>         /*

>          * If the PELT values haven't changed since enqueue time,

>          * skip the util_est update.

> ---8<---

>

> Could that work ?

>

> Maybe using a new utility function to wrap the new check.

>

> --

> #include <best/regards.h>

>

> Patrick Bellasi

Vincent Guittot Nov. 28, 2018, 1:35 p.m. UTC | #5

On Wed, 28 Nov 2018 at 14:33, Vincent Guittot
<vincent.guittot@linaro.org> wrote:
>

> On Wed, 28 Nov 2018 at 12:53, Patrick Bellasi <patrick.bellasi@arm.com> wrote:

> >

> > On 28-Nov 11:02, Peter Zijlstra wrote:

> > > On Wed, Nov 28, 2018 at 10:54:13AM +0100, Vincent Guittot wrote:

> > >

> > > > Is there anything else that I should do for these patches ?

> > >

> > > IIRC, Morten mention they break util_est; Patrick was going to explain.

> >

> > I guess the problem is that, once we cross the current capacity,

> > strictly speaking util_avg does not represent anymore a utilization.

> >

> > With the new signal this could happen and we end up storing estimated

> > utilization samples which will overestimate the task requirements.

> >

> > We will have a spike in estimated utilization at next wakeup, since we

> > use MAX(util_avg@dequeue_time, ewma). Potentially we also inflate the EWMA in

> > case we collect multiple samples above the current capacity.

>

> TBH I don't see how it's different from current implementation with a

> task that was scheduled on big core and now wakes up on little core.

> The util_est is overestimated as well.

>

> But I'm fine with adding your proposal on to on the patchset

s/on to on/on top of/

>

> >

> > So, a possible fix could be to avoid storing util_est samples if we

> > end up with a utilization above the current capacity.

> >

> > Something like:

> >

> > ----8<---

> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c

> > index ac855b2f4774..93e0cf5d8a76 100644

> > --- a/kernel/sched/fair.c

> > +++ b/kernel/sched/fair.c

> > @@ -3661,6 +3661,10 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)

> >         if (!task_sleep)

> >                 return;

> >

> > +       /* Skip samples which do not represent an actual utilization */

> > +       if (unlikely(task_util(p) > capacity_of(task_cpu(p))))

> > +               return;

> > +

> >         /*

> >          * If the PELT values haven't changed since enqueue time,

> >          * skip the util_est update.

> > ---8<---

> >

> > Could that work ?

> >

> > Maybe using a new utility function to wrap the new check.

> >

> > --

> > #include <best/regards.h>

> >

> > Patrick Bellasi

Patrick Bellasi Nov. 28, 2018, 2:40 p.m. UTC | #6

On 28-Nov 14:33, Vincent Guittot wrote:
> On Wed, 28 Nov 2018 at 12:53, Patrick Bellasi <patrick.bellasi@arm.com> wrote:

> >

> > On 28-Nov 11:02, Peter Zijlstra wrote:

> > > On Wed, Nov 28, 2018 at 10:54:13AM +0100, Vincent Guittot wrote:

> > >

> > > > Is there anything else that I should do for these patches ?

> > >

> > > IIRC, Morten mention they break util_est; Patrick was going to explain.

> >

> > I guess the problem is that, once we cross the current capacity,

> > strictly speaking util_avg does not represent anymore a utilization.

> >

> > With the new signal this could happen and we end up storing estimated

> > utilization samples which will overestimate the task requirements.

> >

> > We will have a spike in estimated utilization at next wakeup, since we

> > use MAX(util_avg@dequeue_time, ewma). Potentially we also inflate the EWMA in

> > case we collect multiple samples above the current capacity.

> 

> TBH I don't see how it's different from current implementation with a

> task that was scheduled on big core and now wakes up on little core.

> The util_est is overestimated as well.

While running below the capacity of a CPU, either big or LITTLE, we
can still measure the actual used bandwidth as long as we have idle
time. If the task is then moved into a lower capacity core, I think
it's still safe to assume that, likely, it would need more capacity.

Why do you say it's the same ?

With your new signal instead, once we cross the current capacity,
utilization is just not anymore utilization. Thus, IMHO it make sense
avoid to accumulate a sample for what we call "estimated utilization".

I would also say that, with the current implementation which caps
utilization to the current capacity, we get better estimation in
general. At least we can say with absolute precision:

   "the task needs _at least_ that amount of capacity".

Potentially we can also flag the task as being under-provisioned, in
case there was not idle time, and _let a policy_ decide what to do
with it and the granted information we have.

While, with your new signal, once we are over the current capacity,
the "utilization" is just a sort of "random" number at best useful to
drive some conclusions about how long the task has been delayed.

IOW, I fear that we are embedding a policy within a signal which is
currently representing something very well defined: how much cpu
bandwidth a task used. While, latency/under-provisioning policies
perhaps should be better placed somewhere else.

Perhaps I've missed it in some of the previous discussions:
have we have considered/discussed this signal-vs-policy aspect ?

-- 
#include <best/regards.h>

Patrick Bellasi

Vincent Guittot Nov. 28, 2018, 2:55 p.m. UTC | #7

On Wed, 28 Nov 2018 at 15:40, Patrick Bellasi <patrick.bellasi@arm.com> wrote:
>

> On 28-Nov 14:33, Vincent Guittot wrote:

> > On Wed, 28 Nov 2018 at 12:53, Patrick Bellasi <patrick.bellasi@arm.com> wrote:

> > >

> > > On 28-Nov 11:02, Peter Zijlstra wrote:

> > > > On Wed, Nov 28, 2018 at 10:54:13AM +0100, Vincent Guittot wrote:

> > > >

> > > > > Is there anything else that I should do for these patches ?

> > > >

> > > > IIRC, Morten mention they break util_est; Patrick was going to explain.

> > >

> > > I guess the problem is that, once we cross the current capacity,

> > > strictly speaking util_avg does not represent anymore a utilization.

> > >

> > > With the new signal this could happen and we end up storing estimated

> > > utilization samples which will overestimate the task requirements.

> > >

> > > We will have a spike in estimated utilization at next wakeup, since we

> > > use MAX(util_avg@dequeue_time, ewma). Potentially we also inflate the EWMA in

> > > case we collect multiple samples above the current capacity.

> >

> > TBH I don't see how it's different from current implementation with a

> > task that was scheduled on big core and now wakes up on little core.

> > The util_est is overestimated as well.

>

> While running below the capacity of a CPU, either big or LITTLE, we

> can still measure the actual used bandwidth as long as we have idle

> time. If the task is then moved into a lower capacity core, I think

> it's still safe to assume that, likely, it would need more capacity.

>

> Why do you say it's the same ?


In the example of a task that runs 39ms in period of 80ms that we used
during previous version,
the utilization on the big core will reach 709 so will util_est too
When the task migrates on little core (512), util_est is higher than
current cpu capacity

>

> With your new signal instead, once we cross the current capacity,

> utilization is just not anymore utilization. Thus, IMHO it make sense

> avoid to accumulate a sample for what we call "estimated utilization".

>

> I would also say that, with the current implementation which caps

> utilization to the current capacity, we get better estimation in

> general. At least we can say with absolute precision:

>

>    "the task needs _at least_ that amount of capacity".

>

> Potentially we can also flag the task as being under-provisioned, in

> case there was not idle time, and _let a policy_ decide what to do

> with it and the granted information we have.

>

> While, with your new signal, once we are over the current capacity,

> the "utilization" is just a sort of "random" number at best useful to

> drive some conclusions about how long the task has been delayed.

>

> IOW, I fear that we are embedding a policy within a signal which is

> currently representing something very well defined: how much cpu

> bandwidth a task used. While, latency/under-provisioning policies

> perhaps should be better placed somewhere else.

>

> Perhaps I've missed it in some of the previous discussions:

> have we have considered/discussed this signal-vs-policy aspect ?

>

> --

> #include <best/regards.h>

>

> Patrick Bellasi

Patrick Bellasi Nov. 28, 2018, 3:21 p.m. UTC | #8

On 28-Nov 15:55, Vincent Guittot wrote:
> On Wed, 28 Nov 2018 at 15:40, Patrick Bellasi <patrick.bellasi@arm.com> wrote:

> >

> > On 28-Nov 14:33, Vincent Guittot wrote:

> > > On Wed, 28 Nov 2018 at 12:53, Patrick Bellasi <patrick.bellasi@arm.com> wrote:

> > > >

> > > > On 28-Nov 11:02, Peter Zijlstra wrote:

> > > > > On Wed, Nov 28, 2018 at 10:54:13AM +0100, Vincent Guittot wrote:

> > > > >

> > > > > > Is there anything else that I should do for these patches ?

> > > > >

> > > > > IIRC, Morten mention they break util_est; Patrick was going to explain.

> > > >

> > > > I guess the problem is that, once we cross the current capacity,

> > > > strictly speaking util_avg does not represent anymore a utilization.

> > > >

> > > > With the new signal this could happen and we end up storing estimated

> > > > utilization samples which will overestimate the task requirements.

> > > >

> > > > We will have a spike in estimated utilization at next wakeup, since we

> > > > use MAX(util_avg@dequeue_time, ewma). Potentially we also inflate the EWMA in

> > > > case we collect multiple samples above the current capacity.

> > >

> > > TBH I don't see how it's different from current implementation with a

> > > task that was scheduled on big core and now wakes up on little core.

> > > The util_est is overestimated as well.

> >

> > While running below the capacity of a CPU, either big or LITTLE, we

> > can still measure the actual used bandwidth as long as we have idle

> > time. If the task is then moved into a lower capacity core, I think

> > it's still safe to assume that, likely, it would need more capacity.

> >

> > Why do you say it's the same ?

> 

> In the example of a task that runs 39ms in period of 80ms that we used

> during previous version,

> the utilization on the big core will reach 709 so will util_est too

> When the task migrates on little core (512), util_est is higher than

> current cpu capacity


Right, and what's the problem ?

1) We know that PELT is calibrated to 32ms period task and in your
   example, since the runtime is higher then the half-life, it's
   correct to estimate a utilization higher then 50%.

   PELT utilization is defined _based on the half-life_: thus
   your task having a 50% duty cycle does not mean we are not correct
   if report a utilization != 50%.
   It would be as broken as reporting 10% utilization for a task
   running 100ms every 1s.

2) If it was a 70% task on a previous activation, once it's moved into
   a lower capacity CPU it's still correct to assume that it's likely
   going to require the same bandwidth and thus will be
   under-provisioned.

I still don't see where we are wrong in this case :/

To me it looks different then the problem I described.

> > With your new signal instead, once we cross the current capacity,

> > utilization is just not anymore utilization. Thus, IMHO it make sense

> > avoid to accumulate a sample for what we call "estimated utilization".

> >

> > I would also say that, with the current implementation which caps

> > utilization to the current capacity, we get better estimation in

> > general. At least we can say with absolute precision:

> >

> >    "the task needs _at least_ that amount of capacity".

> >

> > Potentially we can also flag the task as being under-provisioned, in

> > case there was not idle time, and _let a policy_ decide what to do

> > with it and the granted information we have.

> >

> > While, with your new signal, once we are over the current capacity,

> > the "utilization" is just a sort of "random" number at best useful to

> > drive some conclusions about how long the task has been delayed.

> >

> > IOW, I fear that we are embedding a policy within a signal which is

> > currently representing something very well defined: how much cpu

> > bandwidth a task used. While, latency/under-provisioning policies

> > perhaps should be better placed somewhere else.

> >

> > Perhaps I've missed it in some of the previous discussions:

> > have we have considered/discussed this signal-vs-policy aspect ?


What's your opinion on the above instead ?

-- 
#include <best/regards.h>

Patrick Bellasi

Vincent Guittot Nov. 28, 2018, 3:42 p.m. UTC | #9

On Wed, 28 Nov 2018 at 16:21, Patrick Bellasi <patrick.bellasi@arm.com> wrote:
>

> On 28-Nov 15:55, Vincent Guittot wrote:

> > On Wed, 28 Nov 2018 at 15:40, Patrick Bellasi <patrick.bellasi@arm.com> wrote:

> > >

> > > On 28-Nov 14:33, Vincent Guittot wrote:

> > > > On Wed, 28 Nov 2018 at 12:53, Patrick Bellasi <patrick.bellasi@arm.com> wrote:

> > > > >

> > > > > On 28-Nov 11:02, Peter Zijlstra wrote:

> > > > > > On Wed, Nov 28, 2018 at 10:54:13AM +0100, Vincent Guittot wrote:

> > > > > >

> > > > > > > Is there anything else that I should do for these patches ?

> > > > > >

> > > > > > IIRC, Morten mention they break util_est; Patrick was going to explain.

> > > > >

> > > > > I guess the problem is that, once we cross the current capacity,

> > > > > strictly speaking util_avg does not represent anymore a utilization.

> > > > >

> > > > > With the new signal this could happen and we end up storing estimated

> > > > > utilization samples which will overestimate the task requirements.

> > > > >

> > > > > We will have a spike in estimated utilization at next wakeup, since we

> > > > > use MAX(util_avg@dequeue_time, ewma). Potentially we also inflate the EWMA in

> > > > > case we collect multiple samples above the current capacity.

> > > >

> > > > TBH I don't see how it's different from current implementation with a

> > > > task that was scheduled on big core and now wakes up on little core.

> > > > The util_est is overestimated as well.

> > >

> > > While running below the capacity of a CPU, either big or LITTLE, we

> > > can still measure the actual used bandwidth as long as we have idle

> > > time. If the task is then moved into a lower capacity core, I think

> > > it's still safe to assume that, likely, it would need more capacity.

> > >

> > > Why do you say it's the same ?

> >

> > In the example of a task that runs 39ms in period of 80ms that we used

> > during previous version,

> > the utilization on the big core will reach 709 so will util_est too

> > When the task migrates on little core (512), util_est is higher than

> > current cpu capacity

>

> Right, and what's the problem ?


you worry about an util_est being higher than capacity which is the case there

>

> 1) We know that PELT is calibrated to 32ms period task and in your

>    example, since the runtime is higher then the half-life, it's

>    correct to estimate a utilization higher then 50%.

>

>    PELT utilization is defined _based on the half-life_: thus

>    your task having a 50% duty cycle does not mean we are not correct

>    if report a utilization != 50%.

>    It would be as broken as reporting 10% utilization for a task

>    running 100ms every 1s.

>

> 2) If it was a 70% task on a previous activation, once it's moved into

>    a lower capacity CPU it's still correct to assume that it's likely

>    going to require the same bandwidth and thus will be

>    under-provisioned.

>

> I still don't see where we are wrong in this case :/

>

> To me it looks different then the problem I described.

>

> > > With your new signal instead, once we cross the current capacity,

> > > utilization is just not anymore utilization. Thus, IMHO it make sense

> > > avoid to accumulate a sample for what we call "estimated utilization".


This is not true. With the example above, the util_est will be exactly the same
 on big and little cores with the new signal

> > >

> > > I would also say that, with the current implementation which caps

> > > utilization to the current capacity, we get better estimation in

> > > general. At least we can say with absolute precision:

> > >

> > >    "the task needs _at least_ that amount of capacity".

> > >

> > > Potentially we can also flag the task as being under-provisioned, in

> > > case there was not idle time, and _let a policy_ decide what to do

> > > with it and the granted information we have.

> > >

> > > While, with your new signal, once we are over the current capacity,

> > > the "utilization" is just a sort of "random" number at best useful to

> > > drive some conclusions about how long the task has been delayed.


see my comment above

> > >

> > > IOW, I fear that we are embedding a policy within a signal which is

> > > currently representing something very well defined: how much cpu

> > > bandwidth a task used. While, latency/under-provisioning policies

> > > perhaps should be better placed somewhere else.

> > >

> > > Perhaps I've missed it in some of the previous discussions:

> > > have we have considered/discussed this signal-vs-policy aspect ?

>

> What's your opinion on the above instead ?


It's not a policy but it gives better knowledge about the amount a work done
I have put below discussion on the  subject on previous version

> >

> > With contribution scaling the PELT utilization of a task is a _minimum_

> > utilization. Regardless of where the task is currently/was running (and

> > provided that it doesn't change behaviour) its PELT utilization will

> > approximate its _minimum_ utilization on an idle 1024 capacity CPU.

>

> The main drawback is that the _minimum_ utilization depends on the CPU

> capacity on which the task runs. The two 25% tasks on a 256 capacity

> CPU will have an utilization of 128 as an example

>

> >

> > With time scaling the PELT utilization doesn't really have a meaning on

> > its own. It has to be compared to the capacity of the CPU where it

> > is/was running to know what the its current PELT utilization means. When

>

> I would have said the opposite. The utilization of the task will

> always reflect the same amount of work that has been already done

> whatever the CPU capacity.

> In fact, the new scaling mechanism uses the real amount of work that

> has been already done to compute the utilization signal which is not

> the case currently. This gives more information about the real amount

> of worked that has been computed in the over utilization case.

>

> > the utilization over-shoots the capacity its value is no longer

> > represents utilization, it just means that it has a higher compute

> > demand than is offered on its current CPU and a high value means that it

> > has been suffering longer. It can't be used to predict the actual

> > utilization on an idle 1024 capacity any better than contribution scaled

> > PELT utilization.

>

> I think that it provides earlier detection of over utilization and

> more accurate signal for a longer time duration which can help the

> load balance

> Coming back to 50% task example . I will use a 50ms running time

> during a 100ms period for the example below to make it easier

>

> Starting from 0, the evolution of the utilization is:

>

> With contribution scaling:

>          time  0ms  50ms  100ms  150ms  200ms

> capacity

> 1024           0    666

> 512            0    333   453

> When the CPU start to be over utilized (@100ms), the utilization is

> already too low (453 instead of 666) and scheduler doesn't detect yet

> that we are over utilized

> 256            0    169   226    246    252

> That's even worse with this lower capacity

>

> With time scaling,

>          time  0ms  50ms  100ms  150ms  200ms

> capacity

> 1024           0    666

> 512            0    428   677

> We know that the current capacity is not enough and the utilization

> reflect the correct utilization level compare to 1024 capacity (the

> 666 vs 677 difference comes from the 1024us window so the last window

> is not full in the case of max capacity)

> 256            0    234   468    564    677

> At 100ms, we know that there is not enough capacity. (In fact we know

> that at 56ms). And even at time 200ms, the amount of work is exactly

> what would have been executed on a CPU 4x faster

>

> >

> > This change might not be a showstopper, but it is something to be aware

> > off and take into account wherever PELT utilization is used.

>

> The point above is clearly a big difference between the 2 approaches

> of the no spare cycle case but I think it will help by giving more

> information in the over utilization case

>

> Vincent

> >

> > Morten


>

> --

> #include <best/regards.h>

>

> Patrick Bellasi

Patrick Bellasi Nov. 28, 2018, 4:35 p.m. UTC | #10

On 28-Nov 16:42, Vincent Guittot wrote:
> On Wed, 28 Nov 2018 at 16:21, Patrick Bellasi <patrick.bellasi@arm.com> wrote:

> >

> > On 28-Nov 15:55, Vincent Guittot wrote:

> > > On Wed, 28 Nov 2018 at 15:40, Patrick Bellasi <patrick.bellasi@arm.com> wrote:

> > > >

> > > > On 28-Nov 14:33, Vincent Guittot wrote:

> > > > > On Wed, 28 Nov 2018 at 12:53, Patrick Bellasi <patrick.bellasi@arm.com> wrote:

> > > > > >

> > > > > > On 28-Nov 11:02, Peter Zijlstra wrote:

> > > > > > > On Wed, Nov 28, 2018 at 10:54:13AM +0100, Vincent Guittot wrote:

> > > > > > >

> > > > > > > > Is there anything else that I should do for these patches ?

> > > > > > >

> > > > > > > IIRC, Morten mention they break util_est; Patrick was going to explain.

> > > > > >

> > > > > > I guess the problem is that, once we cross the current capacity,

> > > > > > strictly speaking util_avg does not represent anymore a utilization.

> > > > > >

> > > > > > With the new signal this could happen and we end up storing estimated

> > > > > > utilization samples which will overestimate the task requirements.

> > > > > >

> > > > > > We will have a spike in estimated utilization at next wakeup, since we

> > > > > > use MAX(util_avg@dequeue_time, ewma). Potentially we also inflate the EWMA in

> > > > > > case we collect multiple samples above the current capacity.

> > > > >

> > > > > TBH I don't see how it's different from current implementation with a

> > > > > task that was scheduled on big core and now wakes up on little core.

> > > > > The util_est is overestimated as well.

> > > >

> > > > While running below the capacity of a CPU, either big or LITTLE, we

> > > > can still measure the actual used bandwidth as long as we have idle

> > > > time. If the task is then moved into a lower capacity core, I think

> > > > it's still safe to assume that, likely, it would need more capacity.

> > > >

> > > > Why do you say it's the same ?

> > >

> > > In the example of a task that runs 39ms in period of 80ms that we used

> > > during previous version,

> > > the utilization on the big core will reach 709 so will util_est too

> > > When the task migrates on little core (512), util_est is higher than

> > > current cpu capacity

> >

> > Right, and what's the problem ?

> 

> you worry about an util_est being higher than capacity which is the case there


I worry about util_est being higher then the capacity the task WAS
running... not the capacity the task IS running... if that value does
not correspond to what the task really need... (more on that at the
end).

> > 1) We know that PELT is calibrated to 32ms period task and in your

> >    example, since the runtime is higher then the half-life, it's

> >    correct to estimate a utilization higher then 50%.

> >

> >    PELT utilization is defined _based on the half-life_: thus

> >    your task having a 50% duty cycle does not mean we are not correct

> >    if report a utilization != 50%.

> >    It would be as broken as reporting 10% utilization for a task

> >    running 100ms every 1s.

> >

> > 2) If it was a 70% task on a previous activation, once it's moved into

> >    a lower capacity CPU it's still correct to assume that it's likely

> >    going to require the same bandwidth and thus will be

> >    under-provisioned.

> >

> > I still don't see where we are wrong in this case :/

> >

> > To me it looks different then the problem I described.

> >

> > > > With your new signal instead, once we cross the current capacity,

> > > > utilization is just not anymore utilization. Thus, IMHO it make sense

> > > > avoid to accumulate a sample for what we call "estimated utilization".

> 

> This is not true. With the example above, the util_est will be exactly the same

>  on big and little cores with the new signal


... AFAIU only if we have idle time...

> > > > I would also say that, with the current implementation which caps

> > > > utilization to the current capacity, we get better estimation in

> > > > general. At least we can say with absolute precision:

> > > >

> > > >    "the task needs _at least_ that amount of capacity".

> > > >

> > > > Potentially we can also flag the task as being under-provisioned, in

> > > > case there was not idle time, and _let a policy_ decide what to do

> > > > with it and the granted information we have.

> > > >

> > > > While, with your new signal, once we are over the current capacity,

> > > > the "utilization" is just a sort of "random" number at best useful to

> > > > drive some conclusions about how long the task has been delayed.

> 

> see my comment above

> 

> > > >

> > > > IOW, I fear that we are embedding a policy within a signal which is

> > > > currently representing something very well defined: how much cpu

> > > > bandwidth a task used. While, latency/under-provisioning policies

> > > > perhaps should be better placed somewhere else.

> > > >

> > > > Perhaps I've missed it in some of the previous discussions:

> > > > have we have considered/discussed this signal-vs-policy aspect ?

> >

> > What's your opinion on the above instead ?

> 

> It's not a policy but it gives better knowledge about the amount a work done

> I have put below discussion on the  subject on previous version


Thanks, I think I've skimmed through it, but it's sill useful...

> > > With contribution scaling the PELT utilization of a task is a _minimum_

> > > utilization. Regardless of where the task is currently/was running (and

> > > provided that it doesn't change behaviour) its PELT utilization will

> > > approximate its _minimum_ utilization on an idle 1024 capacity CPU.

> >

> > The main drawback is that the _minimum_ utilization depends on the CPU

> > capacity on which the task runs. The two 25% tasks on a 256 capacity

> > CPU will have an utilization of 128 as an example

> >

> > >

> > > With time scaling the PELT utilization doesn't really have a meaning on

> > > its own. It has to be compared to the capacity of the CPU where it

> > > is/was running to know what the its current PELT utilization means. When

> >

> > I would have said the opposite. The utilization of the task will

> > always reflect the same amount of work that has been already done

> > whatever the CPU capacity.

> > In fact, the new scaling mechanism uses the real amount of work that

> > has been already done to compute the utilization signal which is not

> > the case currently. This gives more information about the real amount

> > of worked that has been computed in the over utilization case.

> >

> > > the utilization over-shoots the capacity its value is no longer

> > > represents utilization, it just means that it has a higher compute

> > > demand than is offered on its current CPU and a high value means that it

> > > has been suffering longer. It can't be used to predict the actual

> > > utilization on an idle 1024 capacity any better than contribution scaled

> > > PELT utilization.

> >

> > I think that it provides earlier detection of over utilization and

> > more accurate signal for a longer time duration which can help the

> > load balance

> > Coming back to 50% task example . I will use a 50ms running time

> > during a 100ms period for the example below to make it easier

> >

> > Starting from 0, the evolution of the utilization is:

> >

> > With contribution scaling:

> >          time  0ms  50ms  100ms  150ms  200ms

> > capacity

> > 1024           0    666

> > 512            0    333   453

> > When the CPU start to be over utilized (@100ms), the utilization is

> > already too low (453 instead of 666) and scheduler doesn't detect yet

> > that we are over utilized

> > 256            0    169   226    246    252

> > That's even worse with this lower capacity

> >

> > With time scaling,

> >          time  0ms  50ms  100ms  150ms  200ms

> > capacity

> > 1024           0    666

> > 512            0    428   677

> > We know that the current capacity is not enough and the utilization

> > reflect the correct utilization level compare to 1024 capacity (the

> > 666 vs 677 difference comes from the 1024us window so the last window

> > is not full in the case of max capacity)

> > 256            0    234   468    564    677

> > At 100ms, we know that there is not enough capacity. (In fact we know

> > that at 56ms). And even at time 200ms, the amount of work is exactly

> > what would have been executed on a CPU 4x faster

> >

> > >

> > > This change might not be a showstopper, but it is something to be aware

> > > off and take into account wherever PELT utilization is used.

> >

> > The point above is clearly a big difference between the 2 approaches

> > of the no spare cycle case but I think it will help by giving more

> > information in the over utilization case


I like the idea that we ramp up faster and always get to the same
value. I like also the idea that we always reach the same value on
both LITTLE and big.

As long as there is idle time this is working fine, in these cases we
should probably also collect util_est samples.

But what happens when we don't have idle time ?

Let say we have 2 15% tasks, co-scheduled on a cpu with <300 capacity.

Are not these two tasks being reported as 50% tasks (after a while) ?

If that's the case, these are samples we should not store...

-- 
#include <best/regards.h>

Patrick Bellasi

Vincent Guittot Nov. 29, 2018, 10:43 a.m. UTC | #11

On Wed, 28 Nov 2018 at 17:35, Patrick Bellasi <patrick.bellasi@arm.com> wrote:
>

> On 28-Nov 16:42, Vincent Guittot wrote:

> > On Wed, 28 Nov 2018 at 16:21, Patrick Bellasi <patrick.bellasi@arm.com> wrote:

> > >

> > > On 28-Nov 15:55, Vincent Guittot wrote:

> > > > On Wed, 28 Nov 2018 at 15:40, Patrick Bellasi <patrick.bellasi@arm.com> wrote:

> > > > >

> > > > > On 28-Nov 14:33, Vincent Guittot wrote:

> > > > > > On Wed, 28 Nov 2018 at 12:53, Patrick Bellasi <patrick.bellasi@arm.com> wrote:

> > > > > > >

> > > > > > > On 28-Nov 11:02, Peter Zijlstra wrote:

> > > > > > > > On Wed, Nov 28, 2018 at 10:54:13AM +0100, Vincent Guittot wrote:

> > > > > > > >

> > > > > > > > > Is there anything else that I should do for these patches ?

> > > > > > > >

> > > > > > > > IIRC, Morten mention they break util_est; Patrick was going to explain.

> > > > > > >

> > > > > > > I guess the problem is that, once we cross the current capacity,

> > > > > > > strictly speaking util_avg does not represent anymore a utilization.

> > > > > > >

> > > > > > > With the new signal this could happen and we end up storing estimated

> > > > > > > utilization samples which will overestimate the task requirements.

> > > > > > >

> > > > > > > We will have a spike in estimated utilization at next wakeup, since we

> > > > > > > use MAX(util_avg@dequeue_time, ewma). Potentially we also inflate the EWMA in

> > > > > > > case we collect multiple samples above the current capacity.

> > > > > >

> > > > > > TBH I don't see how it's different from current implementation with a

> > > > > > task that was scheduled on big core and now wakes up on little core.

> > > > > > The util_est is overestimated as well.

> > > > >

> > > > > While running below the capacity of a CPU, either big or LITTLE, we

> > > > > can still measure the actual used bandwidth as long as we have idle

> > > > > time. If the task is then moved into a lower capacity core, I think

> > > > > it's still safe to assume that, likely, it would need more capacity.

> > > > >

> > > > > Why do you say it's the same ?

> > > >

> > > > In the example of a task that runs 39ms in period of 80ms that we used

> > > > during previous version,

> > > > the utilization on the big core will reach 709 so will util_est too

> > > > When the task migrates on little core (512), util_est is higher than

> > > > current cpu capacity

> > >

> > > Right, and what's the problem ?

> >

> > you worry about an util_est being higher than capacity which is the case there

>

> I worry about util_est being higher then the capacity the task WAS

> running... not the capacity the task IS running... if that value does

> not correspond to what the task really need... (more on that at the

> end).

>

> > > 1) We know that PELT is calibrated to 32ms period task and in your

> > >    example, since the runtime is higher then the half-life, it's

> > >    correct to estimate a utilization higher then 50%.

> > >

> > >    PELT utilization is defined _based on the half-life_: thus

> > >    your task having a 50% duty cycle does not mean we are not correct

> > >    if report a utilization != 50%.

> > >    It would be as broken as reporting 10% utilization for a task

> > >    running 100ms every 1s.

> > >

> > > 2) If it was a 70% task on a previous activation, once it's moved into

> > >    a lower capacity CPU it's still correct to assume that it's likely

> > >    going to require the same bandwidth and thus will be

> > >    under-provisioned.

> > >

> > > I still don't see where we are wrong in this case :/

> > >

> > > To me it looks different then the problem I described.

> > >

> > > > > With your new signal instead, once we cross the current capacity,

> > > > > utilization is just not anymore utilization. Thus, IMHO it make sense

> > > > > avoid to accumulate a sample for what we call "estimated utilization".

> >

> > This is not true. With the example above, the util_est will be exactly the same

> >  on big and little cores with the new signal

>

> ... AFAIU only if we have idle time...

>

> > > > > I would also say that, with the current implementation which caps

> > > > > utilization to the current capacity, we get better estimation in

> > > > > general. At least we can say with absolute precision:

> > > > >

> > > > >    "the task needs _at least_ that amount of capacity".

> > > > >

> > > > > Potentially we can also flag the task as being under-provisioned, in

> > > > > case there was not idle time, and _let a policy_ decide what to do

> > > > > with it and the granted information we have.

> > > > >

> > > > > While, with your new signal, once we are over the current capacity,

> > > > > the "utilization" is just a sort of "random" number at best useful to

> > > > > drive some conclusions about how long the task has been delayed.

> >

> > see my comment above

> >

> > > > >

> > > > > IOW, I fear that we are embedding a policy within a signal which is

> > > > > currently representing something very well defined: how much cpu

> > > > > bandwidth a task used. While, latency/under-provisioning policies

> > > > > perhaps should be better placed somewhere else.

> > > > >

> > > > > Perhaps I've missed it in some of the previous discussions:

> > > > > have we have considered/discussed this signal-vs-policy aspect ?

> > >

> > > What's your opinion on the above instead ?

> >

> > It's not a policy but it gives better knowledge about the amount a work done

> > I have put below discussion on the  subject on previous version

>

> Thanks, I think I've skimmed through it, but it's sill useful...

>

> > > > With contribution scaling the PELT utilization of a task is a _minimum_

> > > > utilization. Regardless of where the task is currently/was running (and

> > > > provided that it doesn't change behaviour) its PELT utilization will

> > > > approximate its _minimum_ utilization on an idle 1024 capacity CPU.

> > >

> > > The main drawback is that the _minimum_ utilization depends on the CPU

> > > capacity on which the task runs. The two 25% tasks on a 256 capacity

> > > CPU will have an utilization of 128 as an example

> > >

> > > >

> > > > With time scaling the PELT utilization doesn't really have a meaning on

> > > > its own. It has to be compared to the capacity of the CPU where it

> > > > is/was running to know what the its current PELT utilization means. When

> > >

> > > I would have said the opposite. The utilization of the task will

> > > always reflect the same amount of work that has been already done

> > > whatever the CPU capacity.

> > > In fact, the new scaling mechanism uses the real amount of work that

> > > has been already done to compute the utilization signal which is not

> > > the case currently. This gives more information about the real amount

> > > of worked that has been computed in the over utilization case.

> > >

> > > > the utilization over-shoots the capacity its value is no longer

> > > > represents utilization, it just means that it has a higher compute

> > > > demand than is offered on its current CPU and a high value means that it

> > > > has been suffering longer. It can't be used to predict the actual

> > > > utilization on an idle 1024 capacity any better than contribution scaled

> > > > PELT utilization.

> > >

> > > I think that it provides earlier detection of over utilization and

> > > more accurate signal for a longer time duration which can help the

> > > load balance

> > > Coming back to 50% task example . I will use a 50ms running time

> > > during a 100ms period for the example below to make it easier

> > >

> > > Starting from 0, the evolution of the utilization is:

> > >

> > > With contribution scaling:

> > >          time  0ms  50ms  100ms  150ms  200ms

> > > capacity

> > > 1024           0    666

> > > 512            0    333   453

> > > When the CPU start to be over utilized (@100ms), the utilization is

> > > already too low (453 instead of 666) and scheduler doesn't detect yet

> > > that we are over utilized

> > > 256            0    169   226    246    252

> > > That's even worse with this lower capacity

> > >

> > > With time scaling,

> > >          time  0ms  50ms  100ms  150ms  200ms

> > > capacity

> > > 1024           0    666

> > > 512            0    428   677

> > > We know that the current capacity is not enough and the utilization

> > > reflect the correct utilization level compare to 1024 capacity (the

> > > 666 vs 677 difference comes from the 1024us window so the last window

> > > is not full in the case of max capacity)

> > > 256            0    234   468    564    677

> > > At 100ms, we know that there is not enough capacity. (In fact we know

> > > that at 56ms). And even at time 200ms, the amount of work is exactly

> > > what would have been executed on a CPU 4x faster

> > >

> > > >

> > > > This change might not be a showstopper, but it is something to be aware

> > > > off and take into account wherever PELT utilization is used.

> > >

> > > The point above is clearly a big difference between the 2 approaches

> > > of the no spare cycle case but I think it will help by giving more

> > > information in the over utilization case

>

> I like the idea that we ramp up faster and always get to the same

> value. I like also the idea that we always reach the same value on

> both LITTLE and big.

>

> As long as there is idle time this is working fine, in these cases we

> should probably also collect util_est samples.

>

> But what happens when we don't have idle time ?


As shown above, the utilization stays correct for a longer time frame
even after the over utilization point and provides better over
utilization detection

>

> Let say we have 2 15% tasks, co-scheduled on a cpu with <300 capacity.

>

> Are not these two tasks being reported as 50% tasks (after a while) ?


Yes they will but similarly to above they will stay correct for longer
time even when they become higher than current cpu capacity


>

> If that's the case, these are samples we should not store...

>

> --

> #include <best/regards.h>

>

> Patrick Bellasi

Peter Zijlstra Nov. 29, 2018, 12:53 p.m. UTC | #12

On Wed, Nov 28, 2018 at 11:53:36AM +0000, Patrick Bellasi wrote:

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c

> index ac855b2f4774..93e0cf5d8a76 100644

> --- a/kernel/sched/fair.c

> +++ b/kernel/sched/fair.c

> @@ -3661,6 +3661,10 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)

>  	if (!task_sleep)

>  		return;

>  

> +	/* Skip samples which do not represent an actual utilization */

> +	if (unlikely(task_util(p) > capacity_of(task_cpu(p))))

> +		return;

> +

>  	/*

>  	 * If the PELT values haven't changed since enqueue time,

>  	 * skip the util_est update.

Would you not want something like:

	min(task_util(p), capacity_of(task_cpu(p)))

And is this the only place where we need this?

OTOH, if the task is always running, it will be always running
irrespective of where it runs.

Not storing these samples seems weird though; this is the exact
condition you want to record -- the task is very active, if we skip
these, we'll come back at a low frequency on the next wakeup.

Patrick Bellasi Nov. 29, 2018, 3 p.m. UTC | #13

On 29-Nov 11:43, Vincent Guittot wrote:
> On Wed, 28 Nov 2018 at 17:35, Patrick Bellasi <patrick.bellasi@arm.com> wrote:

> > On 28-Nov 16:42, Vincent Guittot wrote:

> > > On Wed, 28 Nov 2018 at 16:21, Patrick Bellasi <patrick.bellasi@arm.com> wrote:

> > > > On 28-Nov 15:55, Vincent Guittot wrote:

> > > > > On Wed, 28 Nov 2018 at 15:40, Patrick Bellasi <patrick.bellasi@arm.com> wrote:

> > > > > > On 28-Nov 14:33, Vincent Guittot wrote:

> > > > > > > On Wed, 28 Nov 2018 at 12:53, Patrick Bellasi <patrick.bellasi@arm.com> wrote:

> > > > > > > > On 28-Nov 11:02, Peter Zijlstra wrote:

> > > > > > > > > On Wed, Nov 28, 2018 at 10:54:13AM +0100, Vincent Guittot wrote:

> > > > > > > > >

> > > > > > > > > > Is there anything else that I should do for these patches ?

> > > > > > > > >

> > > > > > > > > IIRC, Morten mention they break util_est; Patrick was going to explain.

> > > > > > > >

> > > > > > > > I guess the problem is that, once we cross the current capacity,

> > > > > > > > strictly speaking util_avg does not represent anymore a utilization.

> > > > > > > >

> > > > > > > > With the new signal this could happen and we end up storing estimated

> > > > > > > > utilization samples which will overestimate the task requirements.

> > > > > > > >

> > > > > > > > We will have a spike in estimated utilization at next wakeup, since we

> > > > > > > > use MAX(util_avg@dequeue_time, ewma). Potentially we also inflate the EWMA in

> > > > > > > > case we collect multiple samples above the current capacity.

> > > > > > >

> > > > > > > TBH I don't see how it's different from current implementation with a

> > > > > > > task that was scheduled on big core and now wakes up on little core.

> > > > > > > The util_est is overestimated as well.

> > > > > >

> > > > > > While running below the capacity of a CPU, either big or LITTLE, we

> > > > > > can still measure the actual used bandwidth as long as we have idle

> > > > > > time. If the task is then moved into a lower capacity core, I think

> > > > > > it's still safe to assume that, likely, it would need more capacity.

> > > > > >

> > > > > > Why do you say it's the same ?

> > > > >

> > > > > In the example of a task that runs 39ms in period of 80ms that we used

> > > > > during previous version,

> > > > > the utilization on the big core will reach 709 so will util_est too

> > > > > When the task migrates on little core (512), util_est is higher than

> > > > > current cpu capacity

> > > >

> > > > Right, and what's the problem ?

> > >

> > > you worry about an util_est being higher than capacity which is the case there

> >

> > I worry about util_est being higher then the capacity the task WAS

> > running... not the capacity the task IS running... if that value does

> > not correspond to what the task really need... (more on that at the

> > end).

> >

> > > > 1) We know that PELT is calibrated to 32ms period task and in your

> > > >    example, since the runtime is higher then the half-life, it's

> > > >    correct to estimate a utilization higher then 50%.

> > > >

> > > >    PELT utilization is defined _based on the half-life_: thus

> > > >    your task having a 50% duty cycle does not mean we are not correct

> > > >    if report a utilization != 50%.

> > > >    It would be as broken as reporting 10% utilization for a task

> > > >    running 100ms every 1s.

> > > >

> > > > 2) If it was a 70% task on a previous activation, once it's moved into

> > > >    a lower capacity CPU it's still correct to assume that it's likely

> > > >    going to require the same bandwidth and thus will be

> > > >    under-provisioned.

> > > >

> > > > I still don't see where we are wrong in this case :/

> > > >

> > > > To me it looks different then the problem I described.

> > > >

> > > > > > With your new signal instead, once we cross the current capacity,

> > > > > > utilization is just not anymore utilization. Thus, IMHO it make sense

> > > > > > avoid to accumulate a sample for what we call "estimated utilization".

> > >

> > > This is not true. With the example above, the util_est will be exactly the same

> > >  on big and little cores with the new signal

> >

> > ... AFAIU only if we have idle time...

> >

> > > > > > I would also say that, with the current implementation which caps

> > > > > > utilization to the current capacity, we get better estimation in

> > > > > > general. At least we can say with absolute precision:

> > > > > >

> > > > > >    "the task needs _at least_ that amount of capacity".

> > > > > >

> > > > > > Potentially we can also flag the task as being under-provisioned, in

> > > > > > case there was not idle time, and _let a policy_ decide what to do

> > > > > > with it and the granted information we have.

> > > > > >

> > > > > > While, with your new signal, once we are over the current capacity,

> > > > > > the "utilization" is just a sort of "random" number at best useful to

> > > > > > drive some conclusions about how long the task has been delayed.

> > >

> > > see my comment above

> > >

> > > > > >

> > > > > > IOW, I fear that we are embedding a policy within a signal which is

> > > > > > currently representing something very well defined: how much cpu

> > > > > > bandwidth a task used. While, latency/under-provisioning policies

> > > > > > perhaps should be better placed somewhere else.

> > > > > >

> > > > > > Perhaps I've missed it in some of the previous discussions:

> > > > > > have we have considered/discussed this signal-vs-policy aspect ?

> > > >

> > > > What's your opinion on the above instead ?

> > >

> > > It's not a policy but it gives better knowledge about the amount a work done

> > > I have put below discussion on the  subject on previous version

> >

> > Thanks, I think I've skimmed through it, but it's sill useful...

> >

> > > > > With contribution scaling the PELT utilization of a task is a _minimum_

> > > > > utilization. Regardless of where the task is currently/was running (and

> > > > > provided that it doesn't change behaviour) its PELT utilization will

> > > > > approximate its _minimum_ utilization on an idle 1024 capacity CPU.

> > > >

> > > > The main drawback is that the _minimum_ utilization depends on the CPU

> > > > capacity on which the task runs. The two 25% tasks on a 256 capacity

> > > > CPU will have an utilization of 128 as an example

> > > >

> > > > >

> > > > > With time scaling the PELT utilization doesn't really have a meaning on

> > > > > its own. It has to be compared to the capacity of the CPU where it

> > > > > is/was running to know what the its current PELT utilization means. When

> > > >

> > > > I would have said the opposite. The utilization of the task will

> > > > always reflect the same amount of work that has been already done

> > > > whatever the CPU capacity.

> > > > In fact, the new scaling mechanism uses the real amount of work that

> > > > has been already done to compute the utilization signal which is not

> > > > the case currently. This gives more information about the real amount

> > > > of worked that has been computed in the over utilization case.

> > > >

> > > > > the utilization over-shoots the capacity its value is no longer

> > > > > represents utilization, it just means that it has a higher compute

> > > > > demand than is offered on its current CPU and a high value means that it

> > > > > has been suffering longer. It can't be used to predict the actual

> > > > > utilization on an idle 1024 capacity any better than contribution scaled

> > > > > PELT utilization.

> > > >

> > > > I think that it provides earlier detection of over utilization and

> > > > more accurate signal for a longer time duration which can help the

> > > > load balance

> > > > Coming back to 50% task example . I will use a 50ms running time

> > > > during a 100ms period for the example below to make it easier

> > > >

> > > > Starting from 0, the evolution of the utilization is:

> > > >

> > > > With contribution scaling:

> > > >          time  0ms  50ms  100ms  150ms  200ms

> > > > capacity

> > > > 1024           0    666

> > > > 512            0    333   453

> > > > When the CPU start to be over utilized (@100ms), the utilization is

> > > > already too low (453 instead of 666) and scheduler doesn't detect yet

> > > > that we are over utilized

> > > > 256            0    169   226    246    252

> > > > That's even worse with this lower capacity

> > > >





> > > > With time scaling,

> > > >          time  0ms  50ms  100ms  150ms  200ms

> > > > capacity

> > > > 1024           0    666

> > > > 512            0    428   677

> > > > 256            0    234   468    564    677


[...]

> > I like the idea that we ramp up faster and always get to the same

> > value. I like also the idea that we always reach the same value on

> > both LITTLE and big.

> >

> > As long as there is idle time this is working fine, in these cases we

> > should probably also collect util_est samples.

> >

> > But what happens when we don't have idle time ?

> 

> As shown above, the utilization stays correct for a longer time frame

> even after the over utilization point and provides better over

> utilization detection

> 

> >

> > Let say we have 2 15% tasks, co-scheduled on a cpu with <300 capacity.

> >

> > Are not these two tasks being reported as 50% tasks (after a while) ?

> 

> Yes they will but similarly to above they will stay correct for longer

> time even when they become higher than current cpu capacity


Seems we agree that, when there is no idle time:
- the two 15% tasks will be overestimated
- their utilization will reach 50% after a while

If I'm not wrong, we will have:
- 30% CPU util in  ~16ms @1024 capacity
                   ~64ms  @256 capacity

Thus, the tasks will be certainly over-estimated after ~64ms.
Is that correct ?

> > If that's the case, these are samples we should not store...


Now, we can argue that 64ms is a pretty long time and thus it's quite
unlucky we will have no idle for such a long time.

Still, I'm wondering if we should keep collecting those samples or
better find a way to detect that and skip the sampling.

-- 
#include <best/regards.h>

Patrick Bellasi

Patrick Bellasi Nov. 29, 2018, 3:13 p.m. UTC | #14

On 29-Nov 13:53, Peter Zijlstra wrote:
> On Wed, Nov 28, 2018 at 11:53:36AM +0000, Patrick Bellasi wrote:

> 

> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c

> > index ac855b2f4774..93e0cf5d8a76 100644

> > --- a/kernel/sched/fair.c

> > +++ b/kernel/sched/fair.c

> > @@ -3661,6 +3661,10 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)

> >  	if (!task_sleep)

> >  		return;

> > 

> > +	/* Skip samples which do not represent an actual utilization */

> > +	if (unlikely(task_util(p) > capacity_of(task_cpu(p))))

> > +		return;

> > +

> >  	/*

> >  	 * If the PELT values haven't changed since enqueue time,

> >  	 * skip the util_est update.

> 

> Would you not want something like:

> 

> 	min(task_util(p), capacity_of(task_cpu(p)))

> 

> And is this the only place where we need this?


Mmm... even this could be an over-estimation:

I've just posted an example in my last reply to Vincent, end of:

   Message-ID: <20181129150020.GF23094@e110439-lin>
   https://lore.kernel.org/lkml/20181129150020.GF23094@e110439-lin/

> OTOH, if the task is always running, it will be always running

> irrespective of where it runs.


That's not what I'm concerned about. I'm concerned about small tasks
which are running on limited capacity (e.g. due to thermal capping)
without idle time. In this case, the new "utilization" signal could
overestimate the real task needs.

> Not storing these samples seems weird though; this is the exact

> condition you want to record -- the task is very active, if we skip

> these, we'll come back at a low frequency on the next wakeup.


When there is not idle time, we don't know if the reported
utilization, above the cpu capacity, is due to the task being bigger...
or just the new utilization signal converging towards:

    100% / RUNNABLE_TASKS_COUNT

-- 
#include <best/regards.h>

Patrick Bellasi

Vincent Guittot Nov. 29, 2018, 4:19 p.m. UTC | #15

On Thu, 29 Nov 2018 at 16:00, Patrick Bellasi <patrick.bellasi@arm.com> wrote:
>

> On 29-Nov 11:43, Vincent Guittot wrote:

> > On Wed, 28 Nov 2018 at 17:35, Patrick Bellasi <patrick.bellasi@arm.com> wrote:

> > > On 28-Nov 16:42, Vincent Guittot wrote:

> > > > On Wed, 28 Nov 2018 at 16:21, Patrick Bellasi <patrick.bellasi@arm.com> wrote:

> > > > > On 28-Nov 15:55, Vincent Guittot wrote:

> > > > > > On Wed, 28 Nov 2018 at 15:40, Patrick Bellasi <patrick.bellasi@arm.com> wrote:

> > > > > > > On 28-Nov 14:33, Vincent Guittot wrote:

> > > > > > > > On Wed, 28 Nov 2018 at 12:53, Patrick Bellasi <patrick.bellasi@arm.com> wrote:

> > > > > > > > > On 28-Nov 11:02, Peter Zijlstra wrote:

> > > > > > > > > > On Wed, Nov 28, 2018 at 10:54:13AM +0100, Vincent Guittot wrote:

> > > > > > > > > >

> > > > > > > > > > > Is there anything else that I should do for these patches ?

> > > > > > > > > >

> > > > > > > > > > IIRC, Morten mention they break util_est; Patrick was going to explain.

> > > > > > > > >

> > > > > > > > > I guess the problem is that, once we cross the current capacity,

> > > > > > > > > strictly speaking util_avg does not represent anymore a utilization.

> > > > > > > > >

> > > > > > > > > With the new signal this could happen and we end up storing estimated

> > > > > > > > > utilization samples which will overestimate the task requirements.

> > > > > > > > >

> > > > > > > > > We will have a spike in estimated utilization at next wakeup, since we

> > > > > > > > > use MAX(util_avg@dequeue_time, ewma). Potentially we also inflate the EWMA in

> > > > > > > > > case we collect multiple samples above the current capacity.

> > > > > > > >

> > > > > > > > TBH I don't see how it's different from current implementation with a

> > > > > > > > task that was scheduled on big core and now wakes up on little core.

> > > > > > > > The util_est is overestimated as well.

> > > > > > >

> > > > > > > While running below the capacity of a CPU, either big or LITTLE, we

> > > > > > > can still measure the actual used bandwidth as long as we have idle

> > > > > > > time. If the task is then moved into a lower capacity core, I think

> > > > > > > it's still safe to assume that, likely, it would need more capacity.

> > > > > > >

> > > > > > > Why do you say it's the same ?

> > > > > >

> > > > > > In the example of a task that runs 39ms in period of 80ms that we used

> > > > > > during previous version,

> > > > > > the utilization on the big core will reach 709 so will util_est too

> > > > > > When the task migrates on little core (512), util_est is higher than

> > > > > > current cpu capacity

> > > > >

> > > > > Right, and what's the problem ?

> > > >

> > > > you worry about an util_est being higher than capacity which is the case there

> > >

> > > I worry about util_est being higher then the capacity the task WAS

> > > running... not the capacity the task IS running... if that value does

> > > not correspond to what the task really need... (more on that at the

> > > end).

> > >

> > > > > 1) We know that PELT is calibrated to 32ms period task and in your

> > > > >    example, since the runtime is higher then the half-life, it's

> > > > >    correct to estimate a utilization higher then 50%.

> > > > >

> > > > >    PELT utilization is defined _based on the half-life_: thus

> > > > >    your task having a 50% duty cycle does not mean we are not correct

> > > > >    if report a utilization != 50%.

> > > > >    It would be as broken as reporting 10% utilization for a task

> > > > >    running 100ms every 1s.

> > > > >

> > > > > 2) If it was a 70% task on a previous activation, once it's moved into

> > > > >    a lower capacity CPU it's still correct to assume that it's likely

> > > > >    going to require the same bandwidth and thus will be

> > > > >    under-provisioned.

> > > > >

> > > > > I still don't see where we are wrong in this case :/

> > > > >

> > > > > To me it looks different then the problem I described.

> > > > >

> > > > > > > With your new signal instead, once we cross the current capacity,

> > > > > > > utilization is just not anymore utilization. Thus, IMHO it make sense

> > > > > > > avoid to accumulate a sample for what we call "estimated utilization".

> > > >

> > > > This is not true. With the example above, the util_est will be exactly the same

> > > >  on big and little cores with the new signal

> > >

> > > ... AFAIU only if we have idle time...

> > >

> > > > > > > I would also say that, with the current implementation which caps

> > > > > > > utilization to the current capacity, we get better estimation in

> > > > > > > general. At least we can say with absolute precision:

> > > > > > >

> > > > > > >    "the task needs _at least_ that amount of capacity".

> > > > > > >

> > > > > > > Potentially we can also flag the task as being under-provisioned, in

> > > > > > > case there was not idle time, and _let a policy_ decide what to do

> > > > > > > with it and the granted information we have.

> > > > > > >

> > > > > > > While, with your new signal, once we are over the current capacity,

> > > > > > > the "utilization" is just a sort of "random" number at best useful to

> > > > > > > drive some conclusions about how long the task has been delayed.

> > > >

> > > > see my comment above

> > > >

> > > > > > >

> > > > > > > IOW, I fear that we are embedding a policy within a signal which is

> > > > > > > currently representing something very well defined: how much cpu

> > > > > > > bandwidth a task used. While, latency/under-provisioning policies

> > > > > > > perhaps should be better placed somewhere else.

> > > > > > >

> > > > > > > Perhaps I've missed it in some of the previous discussions:

> > > > > > > have we have considered/discussed this signal-vs-policy aspect ?

> > > > >

> > > > > What's your opinion on the above instead ?

> > > >

> > > > It's not a policy but it gives better knowledge about the amount a work done

> > > > I have put below discussion on the  subject on previous version

> > >

> > > Thanks, I think I've skimmed through it, but it's sill useful...

> > >

> > > > > > With contribution scaling the PELT utilization of a task is a _minimum_

> > > > > > utilization. Regardless of where the task is currently/was running (and

> > > > > > provided that it doesn't change behaviour) its PELT utilization will

> > > > > > approximate its _minimum_ utilization on an idle 1024 capacity CPU.

> > > > >

> > > > > The main drawback is that the _minimum_ utilization depends on the CPU

> > > > > capacity on which the task runs. The two 25% tasks on a 256 capacity

> > > > > CPU will have an utilization of 128 as an example

> > > > >

> > > > > >

> > > > > > With time scaling the PELT utilization doesn't really have a meaning on

> > > > > > its own. It has to be compared to the capacity of the CPU where it

> > > > > > is/was running to know what the its current PELT utilization means. When

> > > > >

> > > > > I would have said the opposite. The utilization of the task will

> > > > > always reflect the same amount of work that has been already done

> > > > > whatever the CPU capacity.

> > > > > In fact, the new scaling mechanism uses the real amount of work that

> > > > > has been already done to compute the utilization signal which is not

> > > > > the case currently. This gives more information about the real amount

> > > > > of worked that has been computed in the over utilization case.

> > > > >

> > > > > > the utilization over-shoots the capacity its value is no longer

> > > > > > represents utilization, it just means that it has a higher compute

> > > > > > demand than is offered on its current CPU and a high value means that it

> > > > > > has been suffering longer. It can't be used to predict the actual

> > > > > > utilization on an idle 1024 capacity any better than contribution scaled

> > > > > > PELT utilization.

> > > > >

> > > > > I think that it provides earlier detection of over utilization and

> > > > > more accurate signal for a longer time duration which can help the

> > > > > load balance

> > > > > Coming back to 50% task example . I will use a 50ms running time

> > > > > during a 100ms period for the example below to make it easier

> > > > >

> > > > > Starting from 0, the evolution of the utilization is:

> > > > >

> > > > > With contribution scaling:

> > > > >          time  0ms  50ms  100ms  150ms  200ms

> > > > > capacity

> > > > > 1024           0    666

> > > > > 512            0    333   453

> > > > > When the CPU start to be over utilized (@100ms), the utilization is

> > > > > already too low (453 instead of 666) and scheduler doesn't detect yet

> > > > > that we are over utilized

> > > > > 256            0    169   226    246    252

> > > > > That's even worse with this lower capacity

> > > > >

>

>

>

>

> > > > > With time scaling,

> > > > >          time  0ms  50ms  100ms  150ms  200ms

> > > > > capacity

> > > > > 1024           0    666

> > > > > 512            0    428   677

> > > > > 256            0    234   468    564    677

>

> [...]

>

> > > I like the idea that we ramp up faster and always get to the same

> > > value. I like also the idea that we always reach the same value on

> > > both LITTLE and big.

> > >

> > > As long as there is idle time this is working fine, in these cases we

> > > should probably also collect util_est samples.

> > >

> > > But what happens when we don't have idle time ?

> >

> > As shown above, the utilization stays correct for a longer time frame

> > even after the over utilization point and provides better over

> > utilization detection

> >

> > >

> > > Let say we have 2 15% tasks, co-scheduled on a cpu with <300 capacity.

> > >

> > > Are not these two tasks being reported as 50% tasks (after a while) ?

> >

> > Yes they will but similarly to above they will stay correct for longer

> > time even when they become higher than current cpu capacity

>

> Seems we agree that, when there is no idle time:

> - the two 15% tasks will be overestimated

> - their utilization will reach 50% after a while

>

> If I'm not wrong, we will have:

> - 30% CPU util in  ~16ms @1024 capacity

>                    ~64ms  @256 capacity

>

> Thus, the tasks will be certainly over-estimated after ~64ms.

> Is that correct ?


From a pure util_avg pov it's correct
But i'd like to weight that a bit with the example below

>

> > > If that's the case, these are samples we should not store...

>

> Now, we can argue that 64ms is a pretty long time and thus it's quite

> unlucky we will have no idle for such a long time.

>

> Still, I'm wondering if we should keep collecting those samples or

> better find a way to detect that and skip the sampling.


The problem is that you can have util_avg above capacity even with idle time
In the 1st example of this thread, the 39ms/80ms task will reach 709
which is the value saved by util_est on a big core
But on core with half capacity, there is still idle time so 709 is a
correct value although above 512

In fact, max will be always above the linear ratio because it's based
on geometric series

And this is true even with 15.6ms/32ms (same ratio as above) task
although the impact is smaller (max value, which should be saved by
util est, becomes  587 in this case).






>

> --

> #include <best/regards.h>

>

> Patrick Bellasi

Patrick Bellasi Jan. 10, 2019, 3:30 p.m. UTC | #16

On 29-Nov 17:19, Vincent Guittot wrote:
> On Thu, 29 Nov 2018 at 16:00, Patrick Bellasi <patrick.bellasi@arm.com> wrote:

> > On 29-Nov 11:43, Vincent Guittot wrote:


[...]

> > Seems we agree that, when there is no idle time:

> > - the two 15% tasks will be overestimated

> > - their utilization will reach 50% after a while

> >

> > If I'm not wrong, we will have:

> > - 30% CPU util in  ~16ms @1024 capacity

> >                    ~64ms  @256 capacity

> >

> > Thus, the tasks will be certainly over-estimated after ~64ms.

> > Is that correct ?

> 

> From a pure util_avg pov it's correct

> But i'd like to weight that a bit with the example below

> 

> > Now, we can argue that 64ms is a pretty long time and thus it's quite

> > unlucky we will have no idle for such a long time.

> >

> > Still, I'm wondering if we should keep collecting those samples or

> > better find a way to detect that and skip the sampling.

> 

> The problem is that you can have util_avg above capacity even with idle time

> In the 1st example of this thread, the 39ms/80ms task will reach 709

> which is the value saved by util_est on a big core

> But on core with half capacity, there is still idle time so 709 is a

> correct value although above 512


Right, I see your point and (in principle) I like the idea of
collecting samples for tasks which happen to run at a lower capacity
then required and the utilization value makes sense...

> In fact, max will be always above the linear ratio because it's based

> on geometric series

> 

> And this is true even with 15.6ms/32ms (same ratio as above) task

> although the impact is smaller (max value, which should be saved by

> util est, becomes  587 in this case).


However that's not always the case... as per my example above.

Moreover, we should also consider that util_est is mainly meant to be
a lower-bound for tasks utilization.
That's why task_util_est() already returns the actual util_avg when
it's higher than the estimated utilization.

With your new signal and without any special check on samples
collection, if a task is limited because of thermal capping for
example, we could end up overestimating its utilization and thus
perhaps generating an unwanted frequency spike when the capping is
relaxed... and (even worst) it will take some more activations for the
estimated utilization to converge back to the actual utilization.

Since we cannot easily know if there is idle time in a CPU when a task
completes an activation with a utilization higher then the CPU
capacity, I would better prefer to just skip the sampling with
something like:

---8<---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9332863d122a..485053026533 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3639,6 +3639,7 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)
 {
 	long last_ewma_diff;
 	struct util_est ue;
+	int cpu;
 
 	if (!sched_feat(UTIL_EST))
 		return;
@@ -3672,6 +3673,14 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)
 	if (within_margin(last_ewma_diff, (SCHED_CAPACITY_SCALE / 100)))
 		return;
 
+	/*
+	 * To avoid overestimation of actual task utilization, skip updates if
+	 * we cannot grant there is idle time in this CPU.
+	 */
+	cpu = cpu_of(rq_of(cfs_rq));
+	if (task_util(p) > cpu_capacity(cpu))
+		return;
+
 	/*
 	 * Update Task's estimated utilization
 	 *
---8<---

At least this will ensure that util_est always provides an actual
measured lower bound for a task utilization.

If you think this makes sense, feel free to add such a patch on
top of your series.

Cheers Patrick

-- 
#include <best/regards.h>

Patrick Bellasi

Vincent Guittot Jan. 11, 2019, 2:29 p.m. UTC | #17

On Thu, 10 Jan 2019 at 16:30, Patrick Bellasi <patrick.bellasi@arm.com> wrote:
>

> On 29-Nov 17:19, Vincent Guittot wrote:

> > On Thu, 29 Nov 2018 at 16:00, Patrick Bellasi <patrick.bellasi@arm.com> wrote:

> > > On 29-Nov 11:43, Vincent Guittot wrote:

>

> [...]

>

> > > Seems we agree that, when there is no idle time:

> > > - the two 15% tasks will be overestimated

> > > - their utilization will reach 50% after a while

> > >

> > > If I'm not wrong, we will have:

> > > - 30% CPU util in  ~16ms @1024 capacity

> > >                    ~64ms  @256 capacity

> > >

> > > Thus, the tasks will be certainly over-estimated after ~64ms.

> > > Is that correct ?

> >

> > From a pure util_avg pov it's correct

> > But i'd like to weight that a bit with the example below

> >

> > > Now, we can argue that 64ms is a pretty long time and thus it's quite

> > > unlucky we will have no idle for such a long time.

> > >

> > > Still, I'm wondering if we should keep collecting those samples or

> > > better find a way to detect that and skip the sampling.

> >

> > The problem is that you can have util_avg above capacity even with idle time

> > In the 1st example of this thread, the 39ms/80ms task will reach 709

> > which is the value saved by util_est on a big core

> > But on core with half capacity, there is still idle time so 709 is a

> > correct value although above 512

>

> Right, I see your point and (in principle) I like the idea of

> collecting samples for tasks which happen to run at a lower capacity

> then required and the utilization value makes sense...

>

> > In fact, max will be always above the linear ratio because it's based

> > on geometric series

> >

> > And this is true even with 15.6ms/32ms (same ratio as above) task

> > although the impact is smaller (max value, which should be saved by

> > util est, becomes  587 in this case).

>

> However that's not always the case... as per my example above.

>

> Moreover, we should also consider that util_est is mainly meant to be

> a lower-bound for tasks utilization.

> That's why task_util_est() already returns the actual util_avg when

> it's higher than the estimated utilization.


I can imagine that the fact that we use max(util_avg, util_est) helps
to keep using correct utilization in the scheduler when util_avg goes
above cpu capacity whereas there is still idle time

>

> With your new signal and without any special check on samples

> collection, if a task is limited because of thermal capping for

> example, we could end up overestimating its utilization and thus

> perhaps generating an unwanted frequency spike when the capping is

> relaxed... and (even worst) it will take some more activations for the

> estimated utilization to converge back to the actual utilization.

>

> Since we cannot easily know if there is idle time in a CPU when a task

> completes an activation with a utilization higher then the CPU

> capacity, I would better prefer to just skip the sampling with

> something like:

>

> ---8<---

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c

> index 9332863d122a..485053026533 100644

> --- a/kernel/sched/fair.c

> +++ b/kernel/sched/fair.c

> @@ -3639,6 +3639,7 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)

>  {

>         long last_ewma_diff;

>         struct util_est ue;

> +       int cpu;

>

>         if (!sched_feat(UTIL_EST))

>                 return;

> @@ -3672,6 +3673,14 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)

>         if (within_margin(last_ewma_diff, (SCHED_CAPACITY_SCALE / 100)))

>                 return;

>

> +       /*

> +        * To avoid overestimation of actual task utilization, skip updates if

> +        * we cannot grant there is idle time in this CPU.

> +        */

> +       cpu = cpu_of(rq_of(cfs_rq));

> +       if (task_util(p) > cpu_capacity(cpu))

> +               return;

> +

>         /*

>          * Update Task's estimated utilization

>          *

> ---8<---

>

> At least this will ensure that util_est always provides an actual

> measured lower bound for a task utilization.

>

> If you think this makes sense, feel free to add such a patch on

> top of your series.


ok. I'm going to add it when rebasing the series

Thanks
Vincent
>

> Cheers Patrick

>

> --

> #include <best/regards.h>

>

> Patrick Bellasi

Peter Zijlstra Jan. 24, 2019, 9:07 a.m. UTC | #18

Sorry; trying to get back to this and re-reading the old conversations.

On Thu, Nov 29, 2018 at 03:13:16PM +0000, Patrick Bellasi wrote:
> On 29-Nov 13:53, Peter Zijlstra wrote:

> > On Wed, Nov 28, 2018 at 11:53:36AM +0000, Patrick Bellasi wrote:

> > 

> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c

> > > index ac855b2f4774..93e0cf5d8a76 100644

> > > --- a/kernel/sched/fair.c

> > > +++ b/kernel/sched/fair.c

> > > @@ -3661,6 +3661,10 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)

> > >  	if (!task_sleep)

> > >  		return;

> > > 

> > > +	/* Skip samples which do not represent an actual utilization */

> > > +	if (unlikely(task_util(p) > capacity_of(task_cpu(p))))

> > > +		return;

> > > +

> > >  	/*

> > >  	 * If the PELT values haven't changed since enqueue time,

> > >  	 * skip the util_est update.

> > 

> > Would you not want something like:

> > 

> > 	min(task_util(p), capacity_of(task_cpu(p)))

> > 

> > And is this the only place where we need this?

> 

> Mmm... even this could be an over-estimation:

> 

> I've just posted an example in my last reply to Vincent, end of:

> 

>    Message-ID: <20181129150020.GF23094@e110439-lin>

>    https://lore.kernel.org/lkml/20181129150020.GF23094@e110439-lin/


In particular this bit:

 | Seems we agree that, when there is no idle time:
 | - the two 15% tasks will be overestimated
 | - their utilization will reach 50% after a while

Right?

> > OTOH, if the task is always running, it will be always running

> > irrespective of where it runs.

> 

> That's not what I'm concerned about. I'm concerned about small tasks

> which are running on limited capacity (e.g. due to thermal capping)

> without idle time. In this case, the new "utilization" signal could

> overestimate the real task needs.

> 

> > Not storing these samples seems weird though; this is the exact

> > condition you want to record -- the task is very active, if we skip

> > these, we'll come back at a low frequency on the next wakeup.

> 

> When there is not idle time, we don't know if the reported

> utilization, above the cpu capacity, is due to the task being bigger...

> or just the new utilization signal converging towards:

> 

>     100% / RUNNABLE_TASKS_COUNT


So if I'm not mistaken we then have 3 cases:

 1) runnable == util <= capacity

    no contention, idle

 2) runnable == util > capacity

    no contention, no idle

 3) runnable > util

    contention, no idle

For 1) we can use: 'util'
For 2) we can use: 'capacity'
For 3) we can use: 'util * capacity >> 10'

(note that 2 is a special case of 3 when u=1)

This should work right?

Now, instead of doing complicated things like that, you instead figure
that when there's no idle there's also no dequeue happening and we can
simply short-cut by skipping the entire thing, forgetting everything
about 2,3.

Did I get that right?

Patrick Bellasi Jan. 24, 2019, 2:04 p.m. UTC | #19

On 24-Jan 10:07, Peter Zijlstra wrote:
> 

> Sorry; trying to get back to this and re-reading the old conversations.

> 

> On Thu, Nov 29, 2018 at 03:13:16PM +0000, Patrick Bellasi wrote:

> > On 29-Nov 13:53, Peter Zijlstra wrote:

> > > On Wed, Nov 28, 2018 at 11:53:36AM +0000, Patrick Bellasi wrote:

> > > 

> > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c

> > > > index ac855b2f4774..93e0cf5d8a76 100644

> > > > --- a/kernel/sched/fair.c

> > > > +++ b/kernel/sched/fair.c

> > > > @@ -3661,6 +3661,10 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)

> > > >  	if (!task_sleep)

> > > >  		return;

> > > > 

> > > > +	/* Skip samples which do not represent an actual utilization */

> > > > +	if (unlikely(task_util(p) > capacity_of(task_cpu(p))))

> > > > +		return;

> > > > +

> > > >  	/*

> > > >  	 * If the PELT values haven't changed since enqueue time,

> > > >  	 * skip the util_est update.

> > > 

> > > Would you not want something like:

> > > 

> > > 	min(task_util(p), capacity_of(task_cpu(p)))

> > > 

> > > And is this the only place where we need this?

> > 

> > Mmm... even this could be an over-estimation:

> > 

> > I've just posted an example in my last reply to Vincent, end of:

> > 

> >    Message-ID: <20181129150020.GF23094@e110439-lin>

> >    https://lore.kernel.org/lkml/20181129150020.GF23094@e110439-lin/

> 

> In particular this bit:

> 

>  | Seems we agree that, when there is no idle time:

>  | - the two 15% tasks will be overestimated

>  | - their utilization will reach 50% after a while

> 

> Right?

> 

> > > OTOH, if the task is always running, it will be always running

> > > irrespective of where it runs.

> > 

> > That's not what I'm concerned about. I'm concerned about small tasks

> > which are running on limited capacity (e.g. due to thermal capping)

> > without idle time. In this case, the new "utilization" signal could

> > overestimate the real task needs.

> > 

> > > Not storing these samples seems weird though; this is the exact

> > > condition you want to record -- the task is very active, if we skip

> > > these, we'll come back at a low frequency on the next wakeup.

> > 

> > When there is not idle time, we don't know if the reported

> > utilization, above the cpu capacity, is due to the task being bigger...

> > or just the new utilization signal converging towards:

> > 

> >     100% / RUNNABLE_TASKS_COUNT

> 

> So if I'm not mistaken we then have 3 cases:

> 

>  1) runnable == util <= capacity

> 

>     no contention, idle

> 

>  2) runnable == util > capacity

> 

>     no contention, no idle

> 

>  3) runnable > util

> 

>     contention, no idle

> 

> For 1) we can use: 'util'

> For 2) we can use: 'capacity'

> For 3) we can use: 'util * capacity >> 10'

> 

> (note that 2 is a special case of 3 when u=1)

> 

> This should work right?


I think there is a case, similar to 2, in which the new 'util' could
potentially be used. That's the case for example of a 20% (estimated)
utilization task running alone on a 15% capacity CPU, for a single
activation. In that case such a task will complete and be dequeued
with:

   runnable == util > capacity

The problem is that we need to be sure there was not contention... and
that seems to be difficult to detect.

> Now, instead of doing complicated things like that, you instead figure

> that when there's no idle there's also no dequeue happening and we can

> simply short-cut by skipping the entire thing, forgetting everything

> about 2,3.

> 

> Did I get that right?


More or less... just saying that 1 is the only easy to detect scenario
in which we are granted the utilization represents an actual bandwidth
request and thus the only safe values to sample for estimated
utilization. For the other cases, since anyway:

   util_est := max(max(ewma, last_util), util_avg)

util_est will just keep representing a safe and actually measured
lower-bound for the expected utilization of a task, without
side-affecting the EWMA which has a "slow" update dynamic.

-- 
#include <best/regards.h>

Patrick Bellasi

Peter Zijlstra Jan. 29, 2019, 7:42 p.m. UTC | #20

On Thu, Jan 24, 2019 at 02:04:32PM +0000, Patrick Bellasi wrote:

> > So if I'm not mistaken we then have 3 cases:

> > 

> >  1) runnable == util <= capacity

> > 

> >     no contention, idle

> > 

> >  2) runnable == util > capacity

> > 

> >     no contention, no idle

> > 

> >  3) runnable > util

> > 

> >     contention, no idle

> > 

> > For 1) we can use: 'util'

> > For 2) we can use: 'capacity'

> > For 3) we can use: 'util * capacity >> 10'

> > 

> > (note that 2 is a special case of 3 when u=1)

> > 

> > This should work right?

> 

> I think there is a case, similar to 2, in which the new 'util' could

> potentially be used. That's the case for example of a 20% (estimated)

> utilization task running alone on a 15% capacity CPU, for a single

> activation. In that case such a task will complete and be dequeued

> with:

> 

>    runnable == util > capacity

> 

> The problem is that we need to be sure there was not contention... and

> that seems to be difficult to detect.


When there is contention runnable and util should diverge. Given this is
all discrete stuff, there's a few funnies, but who cares about those :-)

> > Now, instead of doing complicated things like that, you instead figure

> > that when there's no idle there's also no dequeue happening and we can

> > simply short-cut by skipping the entire thing, forgetting everything

> > about 2,3.

> > 

> > Did I get that right?

> 

> More or less... just saying that 1 is the only easy to detect scenario

> in which we are granted the utilization represents an actual bandwidth

> request and thus the only safe values to sample for estimated

> utilization. For the other cases, since anyway:

> 

>    util_est := max(max(ewma, last_util), util_avg)

> 

> util_est will just keep representing a safe and actually measured

> lower-bound for the expected utilization of a task, without

> side-affecting the EWMA which has a "slow" update dynamic.


Right, so maybe we should expound the comment here a bit; but otherwise
I'm inclined to merge that v9 Vincent posted.

[v7,2/2] sched/fair: update scale invariance of PELT

Commit Message

Comments

Patch