
[v2] sched/fair: update scale invariance of PELT

Message ID 1491815909-13345-1-git-send-email-vincent.guittot@linaro.org

Commit Message

Vincent Guittot April 10, 2017, 9:18 a.m. UTC
The current implementation of load tracking invariance scales the
contribution with the current frequency and uarch performance (only for
utilization) of the CPU. One main result of this formula is that the
figures are capped by the current capacity of the CPU. Another is that
load_avg is not invariant because it is not scaled with uarch.

The util_avg of a periodic task that runs r time slots every p time slots
varies in the range:

    U * (1-y^r)/(1-y^p) * y^i < Utilization < U * (1-y^r)/(1-y^p)

where U is the max util_avg value = SCHED_CAPACITY_SCALE.

At a lower capacity, the range becomes:

    U * C * (1-y^r')/(1-y^p) * y^i' < Utilization < U * C * (1-y^r')/(1-y^p)

where C reflects the compute capacity ratio between the current capacity
and the max capacity.

So C tries to compensate for changes in (1-y^r'), but it can't be accurate.
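
A worked illustration of the upper bound (my numbers, using the PELT
constant y^32 = 0.5, i.e. y ~= 0.97857): a task running r = 16 windows
every p = 32 windows at max capacity peaks at
U * (1-y^16)/(1-y^32) = 1024 * 0.293/0.5 ~= 600.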

Instead of scaling the contribution value of the PELT algorithm, we should
scale the running time. The PELT signal aims to track the amount of
computation of tasks and/or rqs, so it seems more correct to scale the
running time to reflect the effective amount of computation done since the
last update.
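
Concretely (illustrative numbers, not taken from the patch): 10ms of
wall-clock running at 50% frequency on a max-capacity CPU is accounted as
5ms of computation, and the 5ms difference is remembered as "stolen" idle
time to be applied once the task becomes idle.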

In order to be fully invariant, we need to apply the same amounts of
running time and idle time whatever the current capacity. Because running
at lower capacity implies that the task will run longer, we have to track
the amount of "stolen" idle time and to apply it when the task becomes idle.

But once we have reached the maximum utilization value
(SCHED_CAPACITY_SCALE), the task is seen as an always-running task whatever
the capacity of the CPU (even at max compute capacity). In this case, we can
discard the "stolen" idle time, which becomes meaningless. In order to cope
with the rounding effect of the PELT algorithm, we take a margin and consider
a task with a utilization greater than 1000 (vs the 1024 max) as an
always-running task.

Then, we can use the same algorithm for both utilization and load, and
simplify __update_load_avg() now that the load of a task doesn't have to be
capped by the CPU uarch.

The responsiveness of PELT is improved when the CPU is not running at max
capacity with this new algorithm. Below are some examples of the time needed
to reach some typical load values according to the capacity of the CPU, with
the current implementation and with this patch:

Util (%)     max capacity   half capacity (mainline)   half capacity (w/ patch)
972 (95%)    138ms          not reachable              276ms
486 (47.5%)   30ms          138ms                       60ms
256 (25%)     13ms           32ms                       26ms

On my hikey (octo-core ARM platform) with the schedutil governor, the time
to reach the max OPP when starting from a null utilization decreases from
223ms with the current scale invariance down to 121ms with the new
algorithm. For this test, I have enabled arch_scale_freq for arm64.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

---

Update since v1:
- rebase on latest tip/sched/core which includes
  "Optimize __update_sched_avg()" + patch : https://lkml.org/lkml/2017/3/31/308


 include/linux/sched.h |  1 +
 kernel/sched/fair.c   | 53 ++++++++++++++++++++++++++++++++++++++++++++-------
 2 files changed, 47 insertions(+), 7 deletions(-)

-- 
2.7.4

Comments

Peter Zijlstra April 10, 2017, 5:38 p.m. UTC | #1
Thanks for the rebase.

On Mon, Apr 10, 2017 at 11:18:29AM +0200, Vincent Guittot wrote:

Ok, so let me try and paraphrase what this patch does.

So consider a task that runs 16 out of our 32ms window:

   running   idle
  |---------|---------|


You're saying that when we scale running with the frequency, suppose we
were at 50% freq, we'll end up with:

   run  idle
  |----|---------|


Which is obviously a shorter total than before; so what you do is add
back the lost idle time like:

   run  lost idle
  |----|----|---------|


to arrive at the same total time. Which seems to make sense.
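
(With the example numbers: the 16ms of wall-clock running at 50% counts
as 8ms of work plus 8ms of lost idle time, restoring the 32ms total.)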

Now I have vague memories of Morten having issues with your previous
patches, so I'll wait for him to chime in as well.


On to the implementation:

>  /*

> + * Scale the time to reflect the effective amount of computation done during

> + * this delta time.

> + */

> +static __always_inline u64

> +scale_time(u64 delta, int cpu, struct sched_avg *sa,

> +		unsigned long weight, int running)

> +{

> +	if (running) {

> +		sa->stolen_idle_time += delta;

> +		/*

> +		 * scale the elapsed time to reflect the real amount of

> +		 * computation

> +		 */

> +		delta = cap_scale(delta, arch_scale_freq_capacity(NULL, cpu));

> +		delta = cap_scale(delta, arch_scale_cpu_capacity(NULL, cpu));

> +

> +		/*

> +		 * Track the amount of stolen idle time due to running at

> +		 * lower capacity

> +		 */

> +		sa->stolen_idle_time -= delta;


OK so far so good, this tracks, in stolen_idle_time, the 'lost' bit from
above.

> +	} else if (!weight) {

> +		if (sa->util_sum < (LOAD_AVG_MAX * 1000)) {


But here I'm completely lost. WTF just happened ;-)

Firstly, I think we want a comment on why we care about the !weight
case. Why isn't !running sufficient?

Secondly, what's up with the util_sum < LOAD_AVG_MAX * 1000 thing?

Is that to deal with cpu_capacity?


> +			/*

> +			 * Add the idle time stolen by running at lower compute

> +			 * capacity

> +			 */

> +			delta += sa->stolen_idle_time;

> +		}

> +		sa->stolen_idle_time = 0;

> +	}

> +

> +	return delta;

> +}
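
To see the control flow in one piece, here is a self-contained sketch of the
helper quoted above (the standalone types, the stubbed capacity functions and
the demo main() are my assumptions for illustration; the patch's version takes
a cpu argument and uses the kernel's arch_scale_*() helpers):

    #include <stdint.h>
    #include <stdio.h>

    #define SCHED_CAPACITY_SHIFT  10
    #define LOAD_AVG_MAX          47742  /* max PELT geometric series sum */

    struct sched_avg {
            uint64_t stolen_idle_time;
            uint32_t util_sum;
    };

    /* stubs: pretend we run at 50% frequency on a max-capacity CPU */
    static unsigned long arch_scale_freq_capacity(void) { return 512; }
    static unsigned long arch_scale_cpu_capacity(void) { return 1024; }

    static uint64_t cap_scale(uint64_t delta, unsigned long cap)
    {
            return (delta * cap) >> SCHED_CAPACITY_SHIFT;
    }

    static uint64_t scale_time(uint64_t delta, struct sched_avg *sa,
                               unsigned long weight, int running)
    {
            if (running) {
                    /* remember the wall-clock delta ... */
                    sa->stolen_idle_time += delta;
                    /* ... scale it to the computation actually done ... */
                    delta = cap_scale(delta, arch_scale_freq_capacity());
                    delta = cap_scale(delta, arch_scale_cpu_capacity());
                    /* ... and keep the difference as "lost" idle time */
                    sa->stolen_idle_time -= delta;
            } else if (!weight) {
                    /* task went back to sleep: give the lost idle time
                     * back, unless it looks like an always-running task */
                    if (sa->util_sum < (LOAD_AVG_MAX * 1000))
                            delta += sa->stolen_idle_time;
                    sa->stolen_idle_time = 0;
            }
            return delta;
    }

    int main(void)
    {
            struct sched_avg sa = { 0, 0 };
            /* 10000us of running at 50% freq -> 5000us of work */
            uint64_t work = scale_time(10000, &sa, 1, 1);
            /* task sleeps: the 5000us comes back as extra idle delta */
            uint64_t idle = scale_time(0, &sa, 0, 0);
            printf("work=%llu stolen_back=%llu\n",
                   (unsigned long long)work, (unsigned long long)idle);
            return 0;
    }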



Thirdly, I'm thinking this isn't quite right. Imagine a task that's
running across a decay window, then we'll only add back the
stolen_idle_time in the next window, even though it should've been in this
one, right?
Vincent Guittot April 11, 2017, 7:52 a.m. UTC | #2
Le Monday 10 Apr 2017 à 19:38:02 (+0200), Peter Zijlstra a écrit :
> 

> Thanks for the rebase.

> 

> On Mon, Apr 10, 2017 at 11:18:29AM +0200, Vincent Guittot wrote:

> 

> Ok, so let me try and paraphrase what this patch does.

> 

> So consider a task that runs 16 out of our 32ms window:

> 

>    running   idle

>   |---------|---------|

> 

> 

> You're saying that when we scale running with the frequency, suppose we

> were at 50% freq, we'll end up with:

> 

>    run  idle

>   |----|---------|

> 

> 

> Which is obviously a shorter total then before; so what you do is add

> back the lost idle time like:

> 

>    run  lost idle

>   |----|----|---------|

> 

> 

> to arrive at the same total time. Which seems to make sense.


Yes

> 

> Now I have vague memories of Morten having issues with your previous

> patches, so I'll wait for him to chime in as well.


IIRC, Morten's concerns were about the lost idle time, which was not taken into account in the previous version.

> 

> 

> On to the implementation:

> 

> >  /*

> > + * Scale the time to reflect the effective amount of computation done during

> > + * this delta time.

> > + */

> > +static __always_inline u64

> > +scale_time(u64 delta, int cpu, struct sched_avg *sa,

> > +		unsigned long weight, int running)

> > +{

> > +	if (running) {

> > +		sa->stolen_idle_time += delta;

> > +		/*

> > +		 * scale the elapsed time to reflect the real amount of

> > +		 * computation

> > +		 */

> > +		delta = cap_scale(delta, arch_scale_freq_capacity(NULL, cpu));

> > +		delta = cap_scale(delta, arch_scale_cpu_capacity(NULL, cpu));

> > +

> > +		/*

> > +		 * Track the amount of stolen idle time due to running at

> > +		 * lower capacity

> > +		 */

> > +		sa->stolen_idle_time -= delta;

> 

> OK so far so good, this tracks, in stolen_idle_time, the 'lost' bit from

> above.

> 

> > +	} else if (!weight) {

> > +		if (sa->util_sum < (LOAD_AVG_MAX * 1000)) {

> 

> But here I'm completely lost. WTF just happened ;-)

> 

> Firstly, I think we want a comment on why we care about the !weight

> case. Why isn't !running sufficient?


We track the time when the task is "really" idle, but not the time that the task spends
waiting to run on the CPU. Running is used to detect when the task is really
running and how much idle time has been lost, while weight is used to detect when the
task is back to the sleep state, at which point we have to account the lost idle time.

>

> 

> Secondly, what's up with the util_sum < LOAD_AVG_MAX * 1000 thing?


The lost idle time makes sense only if the task can also be "idle" when running at max
capacity. When util_sum reaches the LOAD_AVG_MAX*SCHED_CAPACITY_SCALE value, all tasks
are considered to be the same, as we can't make any difference between a task running
400ms and a task running 400sec. It means that these tasks are "always running" tasks
even at max capacity. In this case, there is no lost idle time as they always run, and
tracking and adding back the lost idle time because we run at lower capacity doesn't
make sense anymore, so we discard it. Also, an always-running task can have a util_sum
that is less than the max value because of the rounding (util_avg varies between
[1006..1023]), so I use LOAD_AVG_MAX*1000 instead of LOAD_AVG_MAX*1024.
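
(Put differently: since util_avg is util_sum/LOAD_AVG_MAX, the
util_sum < LOAD_AVG_MAX*1000 check corresponds to util_avg < ~1000,
safely below the [1006..1023] floor quoted above; the exact margin is a
judgment call.)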

> 

> Is that to deal with cpu_capacity?

> 

> 

> > +			/*

> > +			 * Add the idle time stolen by running at lower compute

> > +			 * capacity

> > +			 */

> > +			delta += sa->stolen_idle_time;

> > +		}

> > +		sa->stolen_idle_time = 0;

> > +	}

> > +

> > +	return delta;

> > +}

> 

> 

> Thirdly, I'm thinking this isn't quite right. Imagine a task that's

> running across a decay window, then we'll only add back the stolen_idle

> time in the next window, even though it should've been in this one,

> right?


I don't think so, because PELT should not see more decay windows at half capacity
than at max capacity.
In the example below, we can see that we cross the absolute-time decay window when
running at half capacity, but once we scale the running delta time we don't cross
it anymore, and the update of PELT is done in the same manner in both cases:

decay window |-------|-------|-------|--
max capacity ---xxxx------------xxxx----
update         *   *           *   *

half capacity---xxxxxxxx--------xxxxxxxx
accounted    ---xxxx____--------xxxx____
update         *   *           *   *

x running
- sleep
_ lost idle time
Peter Zijlstra April 11, 2017, 8:53 a.m. UTC | #3
On Tue, Apr 11, 2017 at 09:52:21AM +0200, Vincent Guittot wrote:
> Le Monday 10 Apr 2017 à 19:38:02 (+0200), Peter Zijlstra a écrit :

> > 

> > Thanks for the rebase.

> > 

> > On Mon, Apr 10, 2017 at 11:18:29AM +0200, Vincent Guittot wrote:

> > 

> > Ok, so let me try and paraphrase what this patch does.

> > 

> > So consider a task that runs 16 out of our 32ms window:

> > 

> >    running   idle

> >   |---------|---------|

> > 

> > 

> > You're saying that when we scale running with the frequency, suppose we

> > were at 50% freq, we'll end up with:

> > 

> >    run  idle

> >   |----|---------|

> > 

> > 

> > Which is obviously a shorter total then before; so what you do is add

> > back the lost idle time like:

> > 

> >    run  lost idle

> >   |----|----|---------|

> > 

> > 

> > to arrive at the same total time. Which seems to make sense.

> 

> Yes


OK, bear with me.


So we have:


  util_sum' = utilsum * y^p +

                                 p-1
              d1 * y^p + 1024 * \Sum y^n + d3 * y^0
	                         n=1

For the unscaled version, right?

Now for the scaled version, instead of adding full 'd1,d2,d3' running
segments, we want to add partially running segments, where r=f*d/f_max,
and lost segments l=d-r to fill out the idle time.

But afaict we then end up with (F=f/f_max):


  util_sum' = utilsum * y^p +

                                         p-1
              F * d1 * y^p + F * 1024 * \Sum y^n + F * d3 * y^0
	                                 n=1

And we can collect the common term F:

  util_sum' = utilsum * y^p +

                                      p-1
              F * (d1 * y^p + 1024 * \Sum y^n + d3 * y^0)
	                              n=1


Which is exactly what we already did.

So now I'm confused. Where did I go wrong?


Because by scaling the contribution we get the exact result of doing the
smaller 'running' + 'lost' segments.
Peter Zijlstra April 11, 2017, 9:12 a.m. UTC | #4
On Tue, Apr 11, 2017 at 09:52:21AM +0200, Vincent Guittot wrote:

> > > +	} else if (!weight) {

> > > +		if (sa->util_sum < (LOAD_AVG_MAX * 1000)) {

> > 

> > But here I'm completely lost. WTF just happened ;-)

> > 

> > Firstly, I think we want a comment on why we care about the !weight

> > case. Why isn't !running sufficient?

> 

> We track the time when the task is "really" idle but not the time that

> the task spent to wait for running on the CPU. Running is used to

> detect when the task is really running and how much idle time has been

> lost while weight is used to detect when the task is back to sleep

> state and when we have account the lost idle time.


Huh? You're redefining what 'idle' means wrt. util_sum.

util used to consider anything !running as idle. So this is the main
trickery? I feel that deserves a comment of exceptional clarity.
Vincent Guittot April 11, 2017, 9:40 a.m. UTC | #5
Le Tuesday 11 Apr 2017 à 10:53:05 (+0200), Peter Zijlstra a écrit :
> On Tue, Apr 11, 2017 at 09:52:21AM +0200, Vincent Guittot wrote:

> > Le Monday 10 Apr 2017 à 19:38:02 (+0200), Peter Zijlstra a écrit :

> > > 

> > > Thanks for the rebase.

> > > 

> > > On Mon, Apr 10, 2017 at 11:18:29AM +0200, Vincent Guittot wrote:

> > > 

> > > Ok, so let me try and paraphrase what this patch does.

> > > 

> > > So consider a task that runs 16 out of our 32ms window:

> > > 

> > >    running   idle

> > >   |---------|---------|

> > > 

> > > 

> > > You're saying that when we scale running with the frequency, suppose we

> > > were at 50% freq, we'll end up with:

> > > 

> > >    run  idle

> > >   |----|---------|

> > > 

> > > 

> > > Which is obviously a shorter total then before; so what you do is add

> > > back the lost idle time like:

> > > 

> > >    run  lost idle

> > >   |----|----|---------|

> > > 

> > > 

> > > to arrive at the same total time. Which seems to make sense.

> > 

> > Yes

> 

> OK, bear with me.

> 

> 

> So we have:

> 

> 

>   util_sum' = utilsum * y^p +

> 

>                                  p-1

>               d1 * y^p + 1024 * \Sum y^n + d3 * y^0

> 	                         n=1

> 

> For the unscaled version, right?


Yes, for the running state.

For the sleeping state, it's just util_sum' = utilsum * y^p.

>

>

> 

> Now for the scaled version, instead of adding a full 'd1,d2,d3' running

> segments, we want to add partially running segments, where r=f*d/f_max,

> and lost segments l=d-r to fill out the idle time.

> 

> But afaict we then end up with (F=f/f_max):

> 

> 

>   util_sum' = utilsum * y^p +

> 

>                                          p-1

>               F * d1 * y^p + F * 1024 * \Sum y^n + F * d3 * y^0

> 	                                 n=1


You also have a longer running time as the task runs slower. We make the assumption that
d1+d2+d3 = (d1'+d2'+d3') * F.

If we consider that we cross a decay window, we still have d1 to complete
the past one, but then p'*F = p, and d'3 will be the remaining part of the
current window and most probably not equal to d3.

so we have with the current invariance:

   util_sum' = utilsum * y^(p/F) +
                                              (p/F - 1)
               F * d1 * y^(p/F) + F * 1024 * \Sum y^n + F * d'3 * y^0
                                              n=1

with the new invariance we have:

   util_sum' = utilsum * y^(F*p/F) +
                                        (F*p/F - 1)
               d1 * y^(F*p/F) + 1024 * \Sum y^n + d3 * y^0
                                        n=1

For a sleeping time of d at max capacity, we have
a sleeping time d' = d - l, with l the lost time of the previous running phase.
With the current implementation:
  util_sum' = utilsum * y^(p')
  util_sum' = utilsum * y^(p-l)

With the new invariance, we have:
  util_sum' = utilsum * y^(p'+l)
  util_sum' = utilsum * y^(p-l+l)
  util_sum' = utilsum * y^p

> 

> And we can collect the common term F:

> 

>   util_sum' = utilsum * y^p +

> 

>                                       p-1

>               F * (d1 * y^p + 1024 * \Sum y^n + d3 * y^0)

> 	                              n=1

> 

> 

> Which is exactly what we already did.


In the new scale invariance, F is applied to p, not to the contribution
value.

>

> 

> So now I'm confused. Where did I go wrong?

> 

> 

> Because by scaling the contribution we get the exact result of doing the

> smaller 'running' + 'lost' segments.
Vincent Guittot April 11, 2017, 9:46 a.m. UTC | #6
On 11 April 2017 at 11:12, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Apr 11, 2017 at 09:52:21AM +0200, Vincent Guittot wrote:

>

>> > > + } else if (!weight) {

>> > > +         if (sa->util_sum < (LOAD_AVG_MAX * 1000)) {

>> >

>> > But here I'm completely lost. WTF just happened ;-)

>> >

>> > Firstly, I think we want a comment on why we care about the !weight

>> > case. Why isn't !running sufficient?

>>

>> We track the time when the task is "really" idle but not the time that

>> the task spent to wait for running on the CPU. Running is used to

>> detect when the task is really running and how much idle time has been

>> lost while weight is used to detect when the task is back to sleep

>> state and when we have account the lost idle time.

>

> Huh? You're redefining what 'idle' means wrt. util_sum.

>

> util used to consider anything !running as idle. So this is the main

> trickery? I feel that deserves a comment of exceptional clarity.


So we still decay during the runnable-but-!running state, but we don't add
the lost idle time of the previous running step yet.
We wait for the task to go back to sleep before applying the lost idle
time, in order not to apply the decay of the lost idle time in the
middle of a run phase that has been preempted by an RT task, for example.

I will try to write a comment that explains these details.
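
One possible shape for such a comment (my draft wording, distilled from
this thread rather than taken from the patch):

    /*
     * A runnable-but-!running task keeps decaying, but the idle time it
     * "lost" by running at lower capacity is only added back once the
     * task is dequeued (!weight, which implies !running). Applying it
     * while the task is merely preempted, e.g. by an RT task, would
     * decay the signal in the middle of one logical running phase.
     */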
Peter Zijlstra April 11, 2017, 10:41 a.m. UTC | #7
On Tue, Apr 11, 2017 at 11:40:21AM +0200, Vincent Guittot wrote:
> Le Tuesday 11 Apr 2017 à 10:53:05 (+0200), Peter Zijlstra a écrit :

> > On Tue, Apr 11, 2017 at 09:52:21AM +0200, Vincent Guittot wrote:

> > > Le Monday 10 Apr 2017 à 19:38:02 (+0200), Peter Zijlstra a écrit :

> > > > 

> > > > Thanks for the rebase.

> > > > 

> > > > On Mon, Apr 10, 2017 at 11:18:29AM +0200, Vincent Guittot wrote:

> > > > 

> > > > Ok, so let me try and paraphrase what this patch does.

> > > > 

> > > > So consider a task that runs 16 out of our 32ms window:

> > > > 

> > > >    running   idle

> > > >   |---------|---------|


(A)

> > > > 

> > > > 

> > > > You're saying that when we scale running with the frequency, suppose we

> > > > were at 50% freq, we'll end up with:

> > > > 

> > > >    run  idle

> > > >   |----|---------|


(B)

> > > > 

> > > > 

> > > > Which is obviously a shorter total then before; so what you do is add

> > > > back the lost idle time like:

> > > > 

> > > >    run  lost idle

> > > >   |----|----|---------|


(C)

> > > > 

> > > > 

> > > > to arrive at the same total time. Which seems to make sense.

> > > 

> > > Yes

> > 

> > OK, bear with me.

> > 

> > 

> > So we have:

> > 

> > 

> >   util_sum' = utilsum * y^p +

> > 

> >                                  p-1

> >               d1 * y^p + 1024 * \Sum y^n + d3 * y^0

> > 	                         n=1

> > 

> > For the unscaled version, right?

> 

> Yes for the running state.

> 

> For sleeping state, it's just util_sum' = utilsum * y^p


Sure, and from this it follows that for idle time we add 0, while we do
decay. Let's call this (1).

> >

> >

> > 

> > Now for the scaled version, instead of adding a full 'd1,d2,d3' running

> > segments, we want to add partially running segments, where r=f*d/f_max,

> > and lost segments l=d-r to fill out the idle time.

> > 

> > But afaict we then end up with (F=f/f_max):

> > 

> > 

> >   util_sum' = utilsum * y^p +

> > 

> >                                          p-1

> >               F * d1 * y^p + F * 1024 * \Sum y^n + F * d3 * y^0

> > 	                                 n=1

> 

> you also have a longer running time as it runs slower. We make the assumption that

> d1+d2+d3 = (d1'+d2'+d3') * F


No, the d's stay the same length, the r's are the scaled d, and the l's the
remainder, or lost idle time.

That is, r + l = d; that way the time scale stays invariant, as in (A)
& (C) above.

So if we run slower, we scale back r and l becomes !0.

> If we consider that we cross a decay window, we still have the d1 to

> complete the past one but then p'*F= p and d'3 will be the remaining

> part of the current window and most probably not equal to d3


So by doing r=Fd we get some (lost) idle time for every bit of runtime,
equally distributed, as if the CPU inserted NOP cycles to lower the
effective frequency.

You want to explicitly place the idle time at the end? That would bias
the sum downwards. To what point?

> so we have with current invariance:

> 

>    util_sum' = utilsum * y^(p/F) + 

>                                               (p/F - 1)

>                F * d1 * y^(p/F) + F * 1024 * \Sum y^n + F * d'3 * y^0

>  	                                          n=1


No, we don't have p/F. p is very much _NOT_ scaled.

Look at accumulate_sum(), we compute p from d, not r.

> > 

> > And we can collect the common term F:

> > 

> >   util_sum' = utilsum * y^p +

> > 

> >                                       p-1

> >               F * (d1 * y^p + 1024 * \Sum y^n + d3 * y^0)

> > 	                              n=1

> > 

> > 

> > Which is exactly what we already did.

> 

> In the new invariance scale, the F is applied on p not on the contribution

> value


That's wrong... That would result in (B) not (C).
Peter Zijlstra April 11, 2017, 10:49 a.m. UTC | #8
Let's go back to the unscaled version:

    running   idle
   |*********|---------|

With the current code, that would effectively end up like (again
assuming 50%):

    running   idle
   |*_*_*_*_*|---------|


Time stays the same, but we add extra idle cycles.
Vincent Guittot April 11, 2017, 12:08 p.m. UTC | #9
Le Tuesday 11 Apr 2017 à 12:41:36 (+0200), Peter Zijlstra a écrit :
> On Tue, Apr 11, 2017 at 11:40:21AM +0200, Vincent Guittot wrote:

> > Le Tuesday 11 Apr 2017 à 10:53:05 (+0200), Peter Zijlstra a écrit :

> > > On Tue, Apr 11, 2017 at 09:52:21AM +0200, Vincent Guittot wrote:

> > > > Le Monday 10 Apr 2017 à 19:38:02 (+0200), Peter Zijlstra a écrit :

> > > > > 

> > > > > Thanks for the rebase.

> > > > > 

> > > > > On Mon, Apr 10, 2017 at 11:18:29AM +0200, Vincent Guittot wrote:

> > > > > 

> > > > > Ok, so let me try and paraphrase what this patch does.

> > > > > 

> > > > > So consider a task that runs 16 out of our 32ms window:

> > > > > 

> > > > >    running   idle

> > > > >   |---------|---------|

> 

> (A)

> 

> > > > > 

> > > > > 

> > > > > You're saying that when we scale running with the frequency, suppose we

> > > > > were at 50% freq, we'll end up with:

> > > > > 

> > > > >    run  idle

> > > > >   |----|---------|

> 

> (B)

> 

> > > > > 

> > > > > 

> > > > > Which is obviously a shorter total then before; so what you do is add

> > > > > back the lost idle time like:

> > > > > 

> > > > >    run  lost idle

> > > > >   |----|----|---------|

> 

> (C)

> 

> > > > > 

> > > > > 

> > > > > to arrive at the same total time. Which seems to make sense.

> > > > 

> > > > Yes

> > > 

> > > OK, bear with me.

> > > 

> > > 

> > > So we have:

> > > 

> > > 

> > >   util_sum' = utilsum * y^p +

> > > 

> > >                                  p-1

> > >               d1 * y^p + 1024 * \Sum y^n + d3 * y^0

> > > 	                         n=1

> > > 

> > > For the unscaled version, right?

> > 

> > Yes for the running state.

> > 

> > For sleeping state, it's just util_sum' = utilsum * y^p

> 

> Sure, and from this is follows that for idle time we add 0, while we do

> decay. Lets call this (1).

> 

> > >

> > >

> > > 

> > > Now for the scaled version, instead of adding a full 'd1,d2,d3' running

> > > segments, we want to add partially running segments, where r=f*d/f_max,

> > > and lost segments l=d-r to fill out the idle time.

> > > 

> > > But afaict we then end up with (F=f/f_max):

> > > 

> > > 

> > >   util_sum' = utilsum * y^p +

> > > 

> > >                                          p-1

> > >               F * d1 * y^p + F * 1024 * \Sum y^n + F * d3 * y^0

> > > 	                                 n=1

> > 

> > you also have a longer running time as it runs slower. We make the assumption that

> > d1+d2+d3 = (d1'+d2'+d3') * F

> 

> No, d's stay the same length, r's are the scaled d, and l's the

> remainder, or lost idle time.

> 

> That is; r + l = d, that way the time scale stays invariant as above (A)

> & (C).

> 

> So if we run slower, we scale back r and l becomes !0.

> 

> > If we consider that we cross a decay window, we still have the d1 to

> > complete the past one but then p'*F= p and d'3 will be the remaining

> > part of the current window and most probably not equal to d3

> 

> So by doing r=Fd we get some (lost) idle time for every bit of runtime,

> equally distributed, as if the CPU inserted NOP cycles to lower the

> effective frequency.

> 

> You want to explicitly place the idle time at the end? That would bias

> the sum downwards. To what point?

> 

> > so we have with current invariance:

> > 

> >    util_sum' = utilsum * y^(p/F) + 

> >                                               (p/F - 1)

> >                F * d1 * y^(p/F) + F * 1024 * \Sum y^n + F * d'3 * y^0

> >  	                                          n=1

> 

> No, we don't have p/F. p is very much _NOT_ scaled.


OK, so the confusion may come from the fact that we don't have the same meaning of p, and I have skipped important intermediate formulas.

I'm going to continue on the new thread.

> 

> Look at accumulate_sum(), we compute p from d, not r.

> 

> > > 

> > > And we can collect the common term F:

> > > 

> > >   util_sum' = utilsum * y^p +

> > > 

> > >                                       p-1

> > >               F * (d1 * y^p + 1024 * \Sum y^n + d3 * y^0)

> > > 	                              n=1

> > > 

> > > 

> > > Which is exactly what we already did.

> > 

> > In the new invariance scale, the F is applied on p not on the contribution

> > value

> 

> That's wrong... That would result in (B) not (C).
Vincent Guittot April 11, 2017, 1:09 p.m. UTC | #10
Le Tuesday 11 Apr 2017 à 12:49:49 (+0200), Peter Zijlstra a écrit :
> 

> Lets go back to the unscaled version:

> 

>     running   idle

>    |*********|---------|

> 

> With the current code, that would effectively end up like (again

> assuming 50%):

> 

>     running   idle

>    |*_*_*_*_*|---------|

> 

> 

> Time stays the same, but we add extra idle cycles.


In fact it's not really like that, because this doesn't reflect the impact of
the decay window, which is not linear with time.

For a task that runs at max freq:

(1) |-------|-------|-------|----
       ****            ****
    ---****------------****------

The same task running at half frequency will run twice as long:
    |-------|-------|-------|----
       ********        ********
    ---********--------********--

With the current implementation, we are not inserting idle cycles but
dividing the contribution by half in order to compensate for the fact that
the task will run twice as long:

(2) |-------|-------|-------|----

    ---********--------********--  

But the final result is neither equal to

    |-------|-------|-------|----
       * * * *         * * * *                
    ---*_*_*_*---------*_*_*_*---

nor to (1), as described below.

Let's assume all durations are aligned with the decay window. This will simplify the
formulas for the explanation and will not change the demonstration. We can come back
to the full equation later on.

For (1), we have the equation that you wrote previously:

     util_sum' = utilsum * y^p +
                                  p-1
               d1 * y^p + 1024 * \Sum y^n + d3 * y^0
                                  n=1

which becomes the following when we are aligned to the decay window (d1 and d3 are null):

(A)  util_sum' = utilsum * y^p +
                        p-1
                1024 * \Sum y^n
                        n=1

The running time at max frequency is d, and p is the number of decay window periods.
In this case p = d/1024us and l = 0.

For (2), the task runs at half frequency so the duration d' is twice as long
as d and p' = 2*p. In the current implementation, we compensate for the
increase of running duration (and lost idle time) by dividing the
contribution by 2:

     util_sum' = utilsum * y^p' +
                        p'-1
                 512 * \Sum y^n
                        n=1

(B)  util_sum' = utilsum * y^(2*p) +
                        2*p-1
                 512 * \Sum y^n
                        n=1

With the new scale invariance, we have the following pattern before scaling
time:

    |-------|-------|-------|--
       ********        ********
    ---********--------********

The task still runs at half frequency so the duration d' is twice as long as
d and p' = 2*p, just like for (2).
But before computing util_sum', we change the temporal referential to reflect
what would have been done at max frequency: we scale the running time (divide it by 2)
in order to have something like:

    |-------|-------|-------|--
       ****            ****
    ---****____--------****____

so we now have a duration d'' that is half of d' and a number of periods p''
that is half of p', so p'' = 1/2 * p' == p:

     util_sum' = utilsum * y^p'' +
                        p''-1
                1024 * \Sum y^n
                        n=1
     util_sum' = utilsum * y^p +
                        p-1
                1024 * \Sum y^n
                        n=1

i.e. exactly (A) again.

Then l = p' - p''. The lost idle time is tracked so as to apply the same
number of decay windows when the task is sleeping.

So at the end we have a number of decay windows of p'' + l = p', and we still
have the same number of decay windows as previously.
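
To make the difference concrete, here is a toy per-window simulation of the
two schemes (my sketch, with assumed numbers: 1024us windows, y^32 = 0.5,
a task doing 8 windows of work every 32 at F = 50%; illustrative C, not
kernel code):

    #include <stdio.h>
    #include <math.h>

    #define PERIOD 32              /* windows per activation period */
    #define WORK    8              /* windows of work at max frequency */

    int main(void)
    {
            double y = pow(0.5, 1.0 / 32.0);     /* per-window decay */
            double load_avg_max = 1024.0 / (1.0 - y);
            double F = 0.5;                      /* 50% frequency */
            int wall_run = (int)(WORK / F);      /* 16 wall windows */
            double cur = 0.0, prop = 0.0;
            int p, w;

            for (p = 0; p < 100; p++) {
                    /* mainline: longer run, contribution scaled by F */
                    for (w = 0; w < wall_run; w++)
                            cur = cur * y + 1024.0 * F;
                    for (w = wall_run; w < PERIOD; w++)
                            cur = cur * y;
                    /* proposal: full contribution over scaled time; the
                     * lost idle windows only decay, once the task sleeps */
                    for (w = 0; w < WORK; w++)
                            prop = prop * y + 1024.0;
                    for (w = WORK; w < PERIOD; w++)
                            prop = prop * y;
            }
            /* both see 32 decay windows per period; the proposal peaks
             * higher during the run (~326 vs ~300 on the 0..1024 scale) */
            printf("end-of-period util, mainline: %.0f\n",
                   1024.0 * cur / load_avg_max);
            printf("end-of-period util, proposal: %.0f\n",
                   1024.0 * prop / load_avg_max);
            return 0;
    }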


>
Peter Zijlstra April 12, 2017, 11:28 a.m. UTC | #11
On Tue, Apr 11, 2017 at 03:09:20PM +0200, Vincent Guittot wrote:
> Le Tuesday 11 Apr 2017 à 12:49:49 (+0200), Peter Zijlstra a écrit :

> > 

> > Lets go back to the unscaled version:

> > 

> >     running   idle

> >    |*********|---------|

> > 

> > With the current code, that would effectively end up like (again

> > assuming 50%):

> > 

> >     running   idle

> >    |*_*_*_*_*|---------|

> > 

> > 

> > Time stays the same, but we add extra idle cycles.

> 

> In fact it's not really like that because this doesn't reflect the impact of

> the decay window which is not linear with time

> 

> For a task that run at max freq

> 

> (1) |-------|-------|-------|----

>        ****            ****

>     ---****------------****------

> 

> The same task running at half frequency, will run twice longer

>     |-------|-------|-------|----

>        ********        ********                          

>     ---********--------********--

> 

> With the current implementation, we are not inserting idle cycle but

> dividing the contribution by half in order to compensate the fact that the task

> will run twice longer:

> 

> (2) |-------|-------|-------|----

> 

>     ---********--------********--  

> 

> But the final result is neither equal to

> 

>     |-------|-------|-------|----

>        * * * *         * * * *                

>     ---*_*_*_*---------*_*_*_*---

> 

> nor to (1) as described below:


I'm not sure I get what you say; dividing the contribution in half is
exactly equal to the above picture. Because as you can see, there's half
the number of * per window.

> Let assume all durations are aligned with decay window. This will simplify the

> formula for the explanation and will not change the demonstration. We can come back

> on the full equation later on.

> 

> For (1), we have the equation that you write previously:

> 

>      util_sum' = utilsum * y^p +

>                                   p-1

>                d1 * y^p + 1024 * \Sum y^n + d3 * y^0

>  	                              n=1

> 

> Which becomes like below when we are aligned to decay window (d1 and d3 are null)


> (A)  util_sum' = utilsum * y^p +

>                        p-1

> 			   1024 * \Sum y^n

>  	                   n=1

> 

> The running time at max frequency is d and p is the number of decay window period

> In this case p = d/1024us and l = 0

> 

> For (2), task runs at half frequency so the duration d' is twice longer than

> d and p'=2*p. In current implementation, we compensate the increase of running

> duration (and lost idle time) by dividing the contribution by 2:


>      util_sum' = utilsum * y^p' +

>                        p'-1

>                 512 * \Sum y^n

>  	                   n=1

> 

> (B)  util_sum' = utilsum * y^(2*p) +

>                        2*p-1

>                 512 * \Sum y^n

>                        n=1


Hmmm.. but 2p is the actual wall-time of the window period.

> With the new scale invariance, we have the following pattern before scaling

> time:

> 

>     |-------|-------|-------|--

>        ********        ********

>     ---********--------********

> 

> Task still runs at half frequency so the duration d' is twice longer than

> d and p': p'=2*p just like for (2).


> But before computing util_sum', we change the temporal referencial to reflect

> what would have been done at max frequency: we scale the running time (divide it by 2)

> in order to have something like:

> 

>     |-------|-------|-------|--

>        ****            ****

>     ---****____--------****____


So you're moving the window edges away from wall-time.

> so we now have a duration d'' that is half d'and a number of period p'' that

> is half p' so p'' = 1/2 * p' == p


>      util_sum' = utilsum * y^p'' +

>                        p''-1

>                1024 * \Sum y^n

>  	                   n=1

>      util_sum' = utilsum * y^p +

>                        p-1

>                1024 * \Sum y^n

>  	                   n=1



  |---------|---------|          (wall-time)
  ----****------------- F=100%
  ----******----------- F= 66%
  |--------------|----|          (fudge-time)

(I explicitly did not use 50%, because then the second window would have
collapsed to 0; imagine the joy if you go lower still.)

So in fudge-time the first window has 6/15 == 4/10 for the max-freq /
wall-time combo.

> 

> Then l = p' - p''. The lost idle time is tracked to apply the same amount of decay

> window when the task is sleeping

> 

> so at the end we have a number of decay window of p''+l = p'' so we still have

> the same amount of decay window than previously.


Now, we have to stretch time back to equal window size, and while you do
that for the active windows, we have to do manual compensation for idle
windows (which is somewhat ugleh), which is where the lost-time comes from.

Also, this all feels entirely yucky, because as per the above, if we'd
run at 33%, we'd have ended up with a negative time window.

Not to mention that this only seems to work for low utilization. Once
you hit higher utilization scenarios, where there isn't much idle time
to compensate for the stretching, things go wobbly. Although both
scenarios might end up being the same.

And instead of resurrecting 0 sized windows, you throw them out, which
results in max-util at low F being reported as max-util at F=1, which I
suppose is a useful property and results in the increased ramp-up (which
is a desired property).

So not only do you have non-linear time, you also have non-continuous
time.

I still wonder about the whole !running vs !weight thing.. this all
needs more thinking.
Vincent Guittot April 12, 2017, 2:50 p.m. UTC | #12
Le Wednesday 12 Apr 2017 à 13:28:58 (+0200), Peter Zijlstra a écrit :
> On Tue, Apr 11, 2017 at 03:09:20PM +0200, Vincent Guittot wrote:

> > Le Tuesday 11 Apr 2017 à 12:49:49 (+0200), Peter Zijlstra a écrit :

> > > 

> > > Lets go back to the unscaled version:

> > > 

> > >     running   idle

> > >    |*********|---------|

> > > 

> > > With the current code, that would effectively end up like (again

> > > assuming 50%):

> > > 

> > >     running   idle

> > >    |*_*_*_*_*|---------|

> > > 

> > > 

> > > Time stays the same, but we add extra idle cycles.

> > 

> > In fact it's not really like that because this doesn't reflect the impact of

> > the decay window which is not linear with time

> > 

> > For a task that run at max freq

> > 

> > (1) |-------|-------|-------|----

> >        ****            ****

> >     ---****------------****------

> > 

> > The same task running at half frequency, will run twice longer

> >     |-------|-------|-------|----

> >        ********        ********                          

> >     ---********--------********--

> > 

> > With the current implementation, we are not inserting idle cycle but

> > dividing the contribution by half in order to compensate the fact that the task

> > will run twice longer:

> > 

> > (2) |-------|-------|-------|----

> > 

> >     ---********--------********--  

> > 

> > But the final result is neither equal to

> > 

> >     |-------|-------|-------|----

> >        * * * *         * * * *                

> >     ---*_*_*_*---------*_*_*_*---

> > 

> > nor to (1) as described below:

> 

> I'm not sure I get what you say; dividing the contribution in half is

> exactly equal to the above picture. Because as you can see, there's half

> the number of * per window.


Yes, you're right, the above is equal to (2) but not to (1).
I messed up the _, which adds some additional decay.

> 

> > Let assume all durations are aligned with decay window. This will simplify the

> > formula for the explanation and will not change the demonstration. We can come back

> > on the full equation later on.

> > 

> > For (1), we have the equation that you write previously:

> > 

> >      util_sum' = utilsum * y^p +

> >                                   p-1

> >                d1 * y^p + 1024 * \Sum y^n + d3 * y^0

> >  	                              n=1

> > 

> > Which becomes like below when we are aligned to decay window (d1 and d3 are null)

> 

> > (A)  util_sum' = utilsum * y^p +

> >                        p-1

> > 			   1024 * \Sum y^n

> >  	                   n=1

> > 

> > The running time at max frequency is d and p is the number of decay window period

> > In this case p = d/1024us and l = 0

> > 

> > For (2), task runs at half frequency so the duration d' is twice longer than

> > d and p'=2*p. In current implementation, we compensate the increase of running

> > duration (and lost idle time) by dividing the contribution by 2:

> 

> >      util_sum' = utilsum * y^p' +

> >                        p'-1

> >                 512 * \Sum y^n

> >  	                   n=1

> > 

> > (B)  util_sum' = utilsum * y^(2*p) +

> >                        2*p-1

> >                 512 * \Sum y^n

> >                        n=1

> 

> Hmmm.. but 2p is the actual wall-time of the window period.

> 

> > With the new scale invariance, we have the following pattern before scaling

> > time:

> > 

> >     |-------|-------|-------|--

> >        ********        ********

> >     ---********--------********

> > 

> > Task still runs at half frequency so the duration d' is twice longer than

> > d and p': p'=2*p just like for (2).

> 

> > But before computing util_sum', we change the temporal referencial to reflect

> > what would have been done at max frequency: we scale the running time (divide it by 2)

> > in order to have something like:

> > 

> >     |-------|-------|-------|--

> >        ****            ****

> >     ---****____--------****____

> 

> So you're moving the window edges away from wall-time.

> 

> > so we now have a duration d'' that is half d'and a number of period p'' that

> > is half p' so p'' = 1/2 * p' == p

> 

> >      util_sum' = utilsum * y^p'' +

> >                        p''-1

> >                1024 * \Sum y^n

> >  	                   n=1

> >      util_sum' = utilsum * y^p +

> >                        p-1

> >                1024 * \Sum y^n

> >  	                   n=1

> 

> 

>   |---------|---------|          (wall-time)

>   ----****------------- F=100%

>   ----******----------- F= 66%

>   |--------------|----|          (fudge-time)


It has been a bit hard for me to catch the diagram above, because you scale the
idle time to get the same ratio at 100% and 66%, whereas I don't scale idle
time but only running time.
Then, we are not reducing the next window but delaying it,
so it's something like below:

   |---------|---------|          (wall-time)
   ----****------------- F=100%
   ----******----------- F= 66%
   |--------------|---------|        (fudge-time)

> 

> (explicitly not used 50%, because then the second window would have

> collapsed to 0, imagine the joy if you go lower still)


The second window can't collapse because we are working on delta time, not
absolute wall-time, and the delta is for only one type at a time: running or idle.

>

> 

> So in fudge-time the first window has 6/15 == 4/10 for the max-freq /

> wall-time combo.

> 

> > 

> > Then l = p' - p''. The lost idle time is tracked to apply the same amount of decay

> > window when the task is sleeping

> > 

> > so at the end we have a number of decay window of p''+l = p'' so we still have

> > the same amount of decay window than previously.

> 

> Now, we have to stretch time back to equal window size, and while you do

> that for the active windows, we have to do manual compensation for idle

> windows (which is somewhat ugleh) and is where the lost-time comes from.


We can't stretch idle time because there is no relation between the idle time
and the current capacity. 

> 

> Also, this all feels entirely yucky, because as per the above, if we'd

> ran at 33%, we'd have ended up with a negative time window.


I'm not sure I catch how we can end up with a negative window. We are working
with delta time, not absolute time.

>

> 

> Not to mention that this only seems to work for low utilization. Once

> you hit higher utilization scenarios, where there isn't much idle time

> to compensate for the stretching, things go wobbly. Although both

> scenarios might end up being the same.


During the running phase, we calculate how much idle time has disappeared
because we are running at a lower frequency, and we compensate for it once
back to idle.

> 

> And instead of resurrecting 0 sized windows, you throw them out, which


I don't catch the point above.

> results in max-util at low F being reported as max-util at F=1, which I

> suppose is a useful property and results in the increased ramp-up (which

> is a desired property).

> 

> So not only do you have non-linear time, you also have non-continuous

> time.

> 

> I still wonder about the whole !running vs !weight thing.. this all

> needs more thinking.

>
Peter Zijlstra April 12, 2017, 3:44 p.m. UTC | #13
On Wed, Apr 12, 2017 at 04:50:47PM +0200, Vincent Guittot wrote:
> Le Wednesday 12 Apr 2017 à 13:28:58 (+0200), Peter Zijlstra a écrit :


> > 

> >   |---------|---------|          (wall-time)

> >   ----****------------- F=100%

> >   ----******----------- F= 66%

> >   |--------------|----|          (fudge-time)

> 

> It has been a bit hard for me to catch the diagram above because you scale the

> idle time to get same ratio at 100% and 66% wherease I don't scale idle

> time but only running time.


Ah, so below I wrote that we then scale each window back to equal size,
so the absolute size in wall-time becomes immaterial.

> > (explicitly not used 50%, because then the second window would have

> > collapsed to 0, imagine the joy if you go lower still)

> 

> The second window can't collapse because we are working on delta time not

> absolute wall-time and the delta is for only 1 type at a time: running or idle


Right, but consider what happens when F drops too low: idle goes away
from where there would've been some at F=1. At that point things become
unrecoverable, afaict.

> > So in fudge-time the first window has 6/15 == 4/10 for the max-freq /

> > wall-time combo.

> > 

> > > 

> > > Then l = p' - p''. The lost idle time is tracked to apply the same amount of decay

> > > window when the task is sleeping

> > > 

> > > so at the end we have a number of decay window of p''+l = p'' so we still have

> > > the same amount of decay window than previously.

> > 

> > Now, we have to stretch time back to equal window size, and while you do

> > that for the active windows, we have to do manual compensation for idle

> > windows (which is somewhat ugleh) and is where the lost-time comes from.

> 

> We can't stretch idle time because there is no relation between the idle time

> and the current capacity.


Brain melts..

> > Also, this all feels entirely yucky, because as per the above, if we'd

> > ran at 33%, we'd have ended up with a negative time window.

> 

> Not sure to catch how we can end up with negative window. We are working with

> delta time not absolute time.



   |---------|---------|---------|  F=100%
    --****------------------------

   |--------------|----|---------|  F= 66%
    --******----------------------

   |-------------------|---------|  F= 50%
    --********--------------------

   |-----------------------------|  F= 33%
    --************----------------


So what happens is that when the (wall) time for a window goes negative
it simply moves the next window along, until that too is compressed
etc..

So in the above figure, the right most edge of F=33% contains 2 whole
periods of idle time, both contracted to measure 0 (wall) time.

The only thing you have to recover them from is the lost idle time
measure.

> > Not to mention that this only seems to work for low utilization. Once

> > you hit higher utilization scenarios, where there isn't much idle time

> > to compensate for the stretching, things go wobbly. Although both

> > scenarios might end up being the same.

> 

> During the running phase, we calculate how much idle time has diseappeared

> because we are running at lower frequency and we compensate it once back to

> idle. 

> 

> > 

> > And instead of resurrecting 0 sized windows, you throw them out, which

> 

> I don't catch point above


It might've been slightly inaccurate. But the point remains that you
destroy time. Not all accrued lost idle time is recovered.

+               if (sa->util_sum < (LOAD_AVG_MAX * 1000)) {
+                       /*
+                        * Add the idle time stolen by running at lower compute
+                        * capacity
+                        */
+                       delta += sa->stolen_idle_time;
+               }
+               sa->stolen_idle_time = 0;

See here, stolen_idle_time is reset regardless. Time is non-continuous
at that point.


I still have to draw me more interesting cases, I'm not convinced I
fully understand things.
Vincent Guittot April 13, 2017, 9:42 a.m. UTC | #14
On 12 April 2017 at 17:44, Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, Apr 12, 2017 at 04:50:47PM +0200, Vincent Guittot wrote:

>> Le Wednesday 12 Apr 2017 à 13:28:58 (+0200), Peter Zijlstra a écrit :

>

>> >

>> >   |---------|---------|          (wall-time)

>> >   ----****------------- F=100%

>> >   ----******----------- F= 66%

>> >   |--------------|----|          (fudge-time)

>>

>> It has been a bit hard for me to catch the diagram above because you scale the

>> idle time to get same ratio at 100% and 66% wherease I don't scale idle

>> time but only running time.

>

> Ah, so below I wrote that we then scale each window back to equal size,

> so the absolute size in wall-time becomes immaterial.

>

>> > (explicitly not used 50%, because then the second window would have

>> > collapsed to 0, imagine the joy if you go lower still)

>>

>> The second window can't collapse because we are working on delta time not

>> absolute wall-time and the delta is for only 1 type at a time: running or idle

>

> Right, but consider what happens when F drops too low, idle goes away

> from where there would've been some at F=1. At that point things become

> unrecoverable afaict.


Yes, I agree that if the frequency is too low to handle the running
time in the period, there is no more idle time with which to recover the
lost idle time.
In this case, either the frequency will be increased by cpufreq and we
will finally run fast enough to recover the lost idle time,
or we stay at this frequency, the task is always running, and we
can't do anything other than assume that this is an always-running
task and in the end (once the task's util_avg reaches the max value)
discard the lost idle time.

>

>> > So in fudge-time the first window has 6/15 == 4/10 for the max-freq /

>> > wall-time combo.

>> >

>> > >

>> > > Then l = p' - p''. The lost idle time is tracked to apply the same amount of decay

>> > > window when the task is sleeping

>> > >

>> > > so at the end we have a number of decay window of p''+l = p'' so we still have

>> > > the same amount of decay window than previously.

>> >

>> > Now, we have to stretch time back to equal window size, and while you do

>> > that for the active windows, we have to do manual compensation for idle

>> > windows (which is somewhat ugleh) and is where the lost-time comes from.

>>

>> We can't stretch idle time because there is no relation between the idle time

>> and the current capacity.

>

> Brain melts..

>

>> > Also, this all feels entirely yucky, because as per the above, if we'd

>> > ran at 33%, we'd have ended up with a negative time window.

>>

>> Not sure to catch how we can end up with negative window. We are working with

>> delta time not absolute time.

>

>

>    |---------|---------|---------|  F=100%

>     --****------------------------

>

>    |--------------|----|---------|  F= 66%

>     --******----------------------

>

>    |-------------------|---------|  F= 50%

>     --********--------------------

>

>    |-----------------------------|  F= 33%

>     --************----------------

>

>

> So what happens is that when the (wall) time for a window goes negative

> it simply moves the next window along, until that too is compressed

> etc..

>

> So in the above figure, the right most edge of F=33% contains 2 whole

> periods of idle time, both contracted to measure 0 (wall) time.

>

> The only thing you have to recover them from is the lost idle time

> measure.

>

>> > Not to mention that this only seems to work for low utilization. Once

>> > you hit higher utilization scenarios, where there isn't much idle time

>> > to compensate for the stretching, things go wobbly. Although both

>> > scenarios might end up being the same.

>>

>> During the running phase, we calculate how much idle time has diseappeared

>> because we are running at lower frequency and we compensate it once back to

>> idle.

>>

>> >

>> > And instead of resurrecting 0 sized windows, you throw them out, which

>>

>> I don't catch point above

>

> It might've been slightly inaccurate. But the point remains that you

> destroy time. Not all accrued lost idle time is recovered.

>

> +               if (sa->util_sum < (LOAD_AVG_MAX * 1000)) {

> +                       /*

> +                        * Add the idle time stolen by running at lower compute

> +                        * capacity

> +                        */

> +                       delta += sa->stolen_idle_time;

> +               }

> +               sa->stolen_idle_time = 0;

>

> See here, stolen_idle_time is reset regardless. Time is non-continuous

> at that point.


In fact, we are provisioning some possible stolen time when running at a
lower frequency, but it can happen that there is no idle time even if
the task runs at max frequency. In this case, there is no lost idle
time to recover.
I consider that a task that has a utilization at the max value (minus a
threshold) is an always-running task, and in this case there is no lost
idle time to recover.

>

>

> I still have to draw me more interesting cases, I'm not convinced I

> fully understand things.
Peter Zijlstra April 13, 2017, 1:32 p.m. UTC | #15
On Wed, Apr 12, 2017 at 01:28:58PM +0200, Peter Zijlstra wrote:

> I still wonder about the whole !running vs !weight thing..


Ah, since we use this for both util _and_ load, we need !running &&
!weight, and it so happens that !weight implies !running. Is that it?
Peter Zijlstra April 13, 2017, 1:39 p.m. UTC | #16
On Tue, Apr 11, 2017 at 09:52:21AM +0200, Vincent Guittot wrote:

> > Secondly, what's up with the util_sum < LOAD_AVG_MAX * 1000 thing?

> 

> The lost idle time makes sense only if the task can also be "idle"

> when running at max capacity. When util_sum reaches the

> LOAD_AVG_MAX*SCHED_CAPACITY_SCALE value, all tasks are considered to

> be the same as we can't make any difference between a task running

> 400ms or a task running 400sec. It means that these tasks are "always

> running" tasks even at max capacity. In this case, there is no lost

> idle time as they always run and tracking and adding back the lost

> idle time because we run at lower capacity doesn't make sense anymore

> so we discard it.


Right, this is the point we reached yesterday with the too-low F. At
that point you cannot know, and we assume u=1, F<1 -> u=1, F=1, which
is a sensible assumption.

> Then an always running task can have a util_sum that is less than the

> max value because of the rounding (util_avg varies between

> [1006..1023]), so I use LOAD_AVG_MAX*1000 instead of LOAD_AVG_MAX*1024


OK, so the reason util_avg varies is because we compute it wrong. And I
think we can easily fix that once we pull out all the factors (which
would mean your patch and the pulling out of weight patch which still
needs to be finished).

But you're comparing against util_sum here, which behaves slightly
differently. I think you want 'util_sum >= 1024 * (LOAD_AVG_MAX - 1024)'
instead.
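
(A quick sanity check on that constant, from the geometric series alone:
LOAD_AVG_MAX = 1024 * (1 + y + y^2 + ...) = 1024/(1-y), hence
LOAD_AVG_MAX * y = LOAD_AVG_MAX - 1024, so 1024 * (LOAD_AVG_MAX - 1024)
is exactly the util_sum an always-running task accumulates over completed
windows only.)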
Vincent Guittot April 13, 2017, 2:59 p.m. UTC | #17
On 13 April 2017 at 15:32, Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, Apr 12, 2017 at 01:28:58PM +0200, Peter Zijlstra wrote:

>

>> I still wonder about the whole !running vs !weight thing.,

>

> Ah, since we use this for both util _and_ load, we need !running &&

> !weight, and it so happens that !weight implies !running. Is that it?


Exactly.
Sorry, I should have started with that.

>
Vincent Guittot April 13, 2017, 3:16 p.m. UTC | #18
On 13 April 2017 at 15:39, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Apr 11, 2017 at 09:52:21AM +0200, Vincent Guittot wrote:

>

>> > Secondly, what's up with the util_sum < LOAD_AVG_MAX * 1000 thing?

>>

>> The lost idle time makes sense only if the task can also be "idle"

>> when running at max capacity. When util_sum reaches the

>> LOAD_AVG_MAX*SCHED_CAPACITY_SCALE value, all tasks are considered to

>> be the same as we can't make any difference between a task running

>> 400ms or a task running 400sec. It means that these tasks are "always

>> running" tasks even at max capacity. In this case, there is no lost

>> idle time as they always run and tracking and adding back the lost

>> idle time because we run at lower capacity doesn't make sense anymore

>> so we discard it.

>

> Right, this is the point we reached yesterday with the too low F. At

> that point you cannot know and we assuming u=1, F<1 -> u=1, F=1, which

> is a sensible assumption.

>

>> Then an always running task can have a util_sum that is less than the

>> max value because of the rounding (util_avg varies between

>> [1006..1023]), so I use LOAD_AVG_MAX*1000 instead of LOAD_AVG_MAX*1024

>

> OK, so the reason util_avg varies is because we compute it wrong. And I

> think we can easily fix that once we pull out all the factors (which

> would mean your patch and the pulling out of weight patch which still

> needs to be finished).


That would be great to remove this unwanted variation.

>

> But you're comparing against util_sum here, that behaves slightly

> different. I think you want 'util_sum >= 1024 * (LOAD_AVG_MAX - 1024)'

> instead.


Yes, the variation happens on util_sum.
Peter Zijlstra April 13, 2017, 4:13 p.m. UTC | #19
On Thu, Apr 13, 2017 at 05:16:20PM +0200, Vincent Guittot wrote:
> On 13 April 2017 at 15:39, Peter Zijlstra <peterz@infradead.org> wrote:


> > OK, so the reason util_avg varies is because we compute it wrong. And I

> > think we can easily fix that once we pull out all the factors (which

> > would mean your patch and the pulling out of weight patch which still

> > needs to be finished).

> 

> That would be great to remove this unwanted variation.


So the problem with the _avg stuff is that we include the d3 segment,
that is the unfinished current window. Since we only re-compute the _avg
whenever we roll over, the intent already seems to be to only compute it
on completed windows.

But because 'complicated/expensive', it's hard not to include d3 and thus
we get the wobble.

Once we compute pure running/runnable sums, without extra contrib
factors, we can simply subtract our d3 term from sum when doing the
division and change the divider to LOAD_AVG_MAX*y, getting the stable
_avg over all completed windows.

(you could do the same with factors, but then we get to do a bunch of
extra multiplications which aren't free).
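
A minimal sketch of that division (not what the patch implements; it
assumes the factor-free util_sum built above and, for simplicity, that
the entity was running for the whole unfinished segment, whose length
is period_contrib):

static u32 stable_util_avg(struct sched_avg *sa)
{
	/*
	 * d3: contribution of the unfinished window as accumulated by
	 * accumulate_sum() while running, i.e. period_contrib scaled by
	 * SCHED_CAPACITY_SCALE.
	 */
	u32 d3_contrib = sa->period_contrib << SCHED_CAPACITY_SHIFT;

	/*
	 * Drop the unfinished window and divide by the sum over completed
	 * windows only: LOAD_AVG_MAX * y == LOAD_AVG_MAX - 1024.
	 */
	return (sa->util_sum - d3_contrib) / (LOAD_AVG_MAX - 1024);
}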

> >

> > But you're comparing against util_sum here, that behaves slightly

> > different. I think you want 'util_sum >= 1024 * (LOAD_AVG_MAX - 1024)'

> > instead.

> 

> yes, the variation happens on the util_sum


Well, for util_sum it's simple to ignore the current window, which is
what the suggested equation does (note that LOAD_AVG_MAX*y ==
LOAD_AVG_MAX-1024).
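
(The identity follows from the geometric series behind LOAD_AVG_MAX:

    LOAD_AVG_MAX = 1024 * (1 + y + y^2 + ...) = 1024 + y * LOAD_AVG_MAX

hence LOAD_AVG_MAX * y = LOAD_AVG_MAX - 1024.)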
Peter Zijlstra April 13, 2017, 6:06 p.m. UTC | #20
On Thu, Apr 13, 2017 at 04:59:15PM +0200, Vincent Guittot wrote:
> On 13 April 2017 at 15:32, Peter Zijlstra <peterz@infradead.org> wrote:

> > On Wed, Apr 12, 2017 at 01:28:58PM +0200, Peter Zijlstra wrote:

> >

> >> I still wonder about the whole !running vs !weight thing.,

> >

> > Ah, since we use this for both util _and_ load, we need !running &&

> > !weight, and it so happens that !weight implies !running. Is that it?

> 

> exactly

> sorry, I should have started with that


Damn, that just brings me around to wondering why running is the right
condition to create lost-time.

Because for runnable we want everything that has weight.
Vincent Guittot April 14, 2017, 8:47 a.m. UTC | #21
On 13 April 2017 at 20:06, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, Apr 13, 2017 at 04:59:15PM +0200, Vincent Guittot wrote:

>> On 13 April 2017 at 15:32, Peter Zijlstra <peterz@infradead.org> wrote:

>> > On Wed, Apr 12, 2017 at 01:28:58PM +0200, Peter Zijlstra wrote:

>> >

>> >> I still wonder about the whole !running vs !weight thing.,

>> >

>> > Ah, since we use this for both util _and_ load, we need !running &&

>> > !weight, and it so happens that !weight implies !running. Is that it?

>>

>> exactly

>> sorry, I should have started with that

>

> Damn, that just bring me around to wondering why running is the right

> condition to create lost-time.

>

> Because for runnable we want everything that has weight.


I have considered that the waiting time doesn't have to be scaled,
unlike the running time of a runnable task, because waiting takes the
same time whatever the current capacity.

>
Vincent Guittot April 14, 2017, 8:49 a.m. UTC | #22
On 13 April 2017 at 18:13, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, Apr 13, 2017 at 05:16:20PM +0200, Vincent Guittot wrote:

>> On 13 April 2017 at 15:39, Peter Zijlstra <peterz@infradead.org> wrote:

>

>> > OK, so the reason util_avg varies is because we compute it wrong. And I

>> > think we can easily fix that once we pull out all the factors (which

>> > would mean your patch and the pulling out of weight patch which still

>> > needs to be finished).

>>

>> That would be great to remove this unwanted variation.

>

> So the problem with the _avg stuff is that we include the d3 segment,

> that is the unfinished current window. Since we only re-compute the _avg

> whenever we roll over, the intent already seems to be to only compute it

> on completed windows.


Yes, that makes sense.

>

> But because 'complicated/expensive', its hard to not include d3 and thus

> we get the wobble.

>

> Once we compute pure running/runnable sums, without extra contrib

> factors, we can simply subtract our d3 term from sum when doing the

> division and change the divider to LOAD_AVG_MAX*y, getting the stable

> _avg over all completed windows.


I'm going to give it a try to check that it removes the variation I'm seeing.

>

> (you could do the same with factors, but then we get to do a bunch of

> extra multiplications which aren't free).

>

>> >

>> > But you're comparing against util_sum here, that behaves slightly

>> > different. I think you want 'util_sum >= 1024 * (LOAD_AVG_MAX - 1024)'

>> > instead.

>>

>> yes, the variation happens on the util_sum

>

> Well, for util_sum its simple to ignore the current window, which is

> what the suggested equation does (note that LOAD_AVG_MAX*y ==

> LOAD_AVG_MAX-1024).
Vincent Guittot April 19, 2017, 4:31 p.m. UTC | #23
On 14 April 2017 at 10:49, Vincent Guittot <vincent.guittot@linaro.org> wrote:
> On 13 April 2017 at 18:13, Peter Zijlstra <peterz@infradead.org> wrote:

>> On Thu, Apr 13, 2017 at 05:16:20PM +0200, Vincent Guittot wrote:

>>> On 13 April 2017 at 15:39, Peter Zijlstra <peterz@infradead.org> wrote:

>>

>>> > OK, so the reason util_avg varies is because we compute it wrong. And I

>>> > think we can easily fix that once we pull out all the factors (which

>>> > would mean your patch and the pulling out of weight patch which still

>>> > needs to be finished).

>>>

>>> That would be great to remove this unwanted variation.

>>

>> So the problem with the _avg stuff is that we include the d3 segment,

>> that is the unfinished current window. Since we only re-compute the _avg

>> whenever we roll over, the intent already seems to be to only compute it

>> on completed windows.

>

> yes make sense

>

>>

>> But because 'complicated/expensive', its hard to not include d3 and thus

>> we get the wobble.

>>

>> Once we compute pure running/runnable sums, without extra contrib

>> factors, we can simply subtract our d3 term from sum when doing the

>> division and change the divider to LOAD_AVG_MAX*y, getting the stable

>> _avg over all completed windows.

>

> I'm going to make it a try to check that it removes the variation i'm seeing


I have sent a patchset based on your proposal that fixes this variation issue.

>

>>

>> (you could do the same with factors, but then we get to do a bunch of

>> extra multiplications which aren't free).

>>

>>> >

>>> > But you're comparing against util_sum here, that behaves slightly

>>> > different. I think you want 'util_sum >= 1024 * (LOAD_AVG_MAX - 1024)'

>>> > instead.

>>>

>>> yes, the variation happens on the util_sum

>>

>> Well, for util_sum its simple to ignore the current window, which is

>> what the suggested equation does (note that LOAD_AVG_MAX*y ==

>> LOAD_AVG_MAX-1024).
Morten Rasmussen April 28, 2017, 3:52 p.m. UTC | #24
Hi Vincent,

Sorry for crashing the party this late. As you know, it takes a long
period of uninterrupted review time to properly review PELT stuff.

Disclaimer: I haven't read the rest of the thread yet.

On Mon, Apr 10, 2017 at 11:18:29AM +0200, Vincent Guittot wrote:
> The current implementation of load tracking invariance scales the

> contribution with current frequency and uarch performance (only for

> utilization) of the CPU. One main result of this formula is that the

> figures are capped by current capacity of CPU. Another one is that the

> load_avg is not invariant because not scaled with uarch.

> 

> The util_avg of a periodic task that runs r time slots every p time slots

> varies in the range :

> 

>     U * (1-y^r)/(1-y^p) * y^i < Utilization < U * (1-y^r)/(1-y^p)

> 

> with U is the max util_avg value = SCHED_CAPACITY_SCALE

> 

> At a lower capacity, the range becomes:

> 

>     U * C * (1-y^r')/(1-y^p) * y^i' < Utilization <  U * C * (1-y^r')/(1-y^p)

> 

> with C reflecting the compute capacity ratio between current capacity and

> max capacity.

> 

> so C tries to compensate changes in (1-y^r') but it can't be accurate.


I would say that C is trying to compensate for the difference in
running time, i.e. the difference between r and r'. But I think that is
what you are saying as well.
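
For illustration, take C = 1/2, so r' = 2r. Using
1 - y^(2r) = (1 - y^r) * (1 + y^r), the compensated term becomes:

    C * (1 - y^r') = (1 - y^r) * (1 + y^r) / 2

which matches the invariant (1 - y^r) only while y^r is close to 1,
i.e. for short running times; the longer the task runs, the more the
compensation undershoots.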

> Instead of scaling the contribution value of PELT algo, we should scale the

> running time. The PELT signal aims to track the amount of computation of

> tasks and/or rq so it seems more correct to scale the running time to

> reflect the effective amount of computation done since the last update.

> 

> In order to be fully invariant, we need to apply the same amount of

> running time and idle time whatever the current capacity. Because running

> at lower capacity implies that the task will run longer, we have to track

> the amount of "stolen" idle time and to apply it when task becomes idle.


Overall, I agree that scaling time according to compute capacity is the
right thing to do, if we want accurate scale-invariance. The current
implementation is an approximation which isn't perfect, but it is easy
to implement.

One thing that became clear to me after reading the patch and trying to
recreate what it does is that you don't seem to consider waiting time,
and the patch doesn't scale it. That is fine for utilization, but it
breaks load. For load, you have to scale waiting time as well, which
makes things rather complicated as you would need two scaled time
deltas, one for utilization and one for load.

What currently happens is that waiting tasks at low OPPs are
accumulating load at the rate of the highest OPP without accumulating
"stolen" time. So waiting tasks accumulate too much load at lower OPPs.
We have confirmed it by scheduling two periodic tasks on the same cpu
and verified that the load does ramp up faster than it should.

For the aggregate cfs_rq util/load sums, I can't convince myself whether
they break when "stolen" time isn't inserted at the same time in the
task signal as it is for the cfs_rq signal. If you have two periodic
tasks scheduled on the same cpu with aligned periods, then the first
task will inject its stolen time as soon as it finishes, while the
aggregate sum won't inject the stolen time until it has completed the
second task, and then injects the stolen time for both tasks at the same time. As
long as no tasks are migrated to/from the cpu it should work, but I'm not
convinced that it works if tasks are migrated or new tasks show up
before the stolen time has been committed to the aggregate sum. Have you
looked more into this?

Another question is what happens to blocked utilization? IIUC, the
stolen time is injected when the task wakes up again so the utilization
drops to where it should be. Does that mean that the blocked
contribution of the task is inflated the entire time while it is
blocked?

> But once we have reached the maximum utilization value (SCHED_CAPACITY_SCALE),

> it means that the task is seen as an always-running task whatever the

> capacity of the cpu (even at max compute capacity). In this case, we can

> discard the "stolen" idle times which becomes meaningless. In order to

> cope with rounding effect of PELT algo we take a margin and consider task

> with utilization greater than 1000 (vs 1024 max) as an always-running task.


I understand why the threshold is necessary, but I'm slightly concerned
that it might introduce corner cases where things could go very wrong. I
haven't proven that it is indeed a problem, but is it possible to have
aggregate sum of utilization >1000 without the cpu being fully utilized?
So the tasks will have their stolen time injected, but the cfs_rq
utilization won't?

> 

> Then, we can use the same algorithm for both utilization and load and

> simplify __update_load_avg now that the load of a task doesn't have to be

> capped by CPU uarch.


As said above, waiting time has to be handled differently for
utilization and load.

> The responsivness of PELT is improved when CPU is not running at max

> capacity with this new algorithm. I have put below some examples of

> duration to reach some typical load values according to the capacity of the

> CPU with current implementation and with this patch.

> 

> Util (%)     max capacity  half capacity(mainline)  half capacity(w/ patch)

> 972 (95%)    138ms         not reachable            276ms

> 486 (47.5%)  30ms          138ms                     60ms

> 256 (25%)    13ms           32ms                     26ms

> 

> On my hikey (octo ARM platform) with schedutil governor, the time to reach

> max OPP when starting from a null utilization, decreases from 223ms with

> current scale invariance down to 121ms with the new algorithm. For this

> test, i have enable arch_scale_freq for arm64.


Note that the time to reach a specific utilization with your
implementation depends on the OPP. The lower the OPP, the slower
utilization accumulates.
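
For example, at half capacity the scaled delta is half the wall-clock
delta, so every entry in the half-capacity column of the table above is
simply double the max-capacity one: 13ms -> 26ms, 30ms -> 60ms,
138ms -> 276ms.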

[...]

> @@ -2852,19 +2850,54 @@ accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,

>  	}

>  	sa->period_contrib = delta;

>  

> -	contrib = cap_scale(contrib, scale_freq);

>  	if (weight) {

>  		sa->load_sum += weight * contrib;

>  		if (cfs_rq)

>  			cfs_rq->runnable_load_sum += weight * contrib;

>  	}

>  	if (running)

> -		sa->util_sum += contrib * scale_cpu;

> +		sa->util_sum += contrib << SCHED_CAPACITY_SHIFT;

>  

>  	return periods;

>  }

>  

>  /*

> + * Scale the time to reflect the effective amount of computation done during

> + * this delta time.

> + */

> +static __always_inline u64

> +scale_time(u64 delta, int cpu, struct sched_avg *sa,

> +		unsigned long weight, int running)

> +{

> +	if (running) {

> +		sa->stolen_idle_time += delta;

> +		/*

> +		 * scale the elapsed time to reflect the real amount of

> +		 * computation

> +		 */

> +		delta = cap_scale(delta, arch_scale_freq_capacity(NULL, cpu));

> +		delta = cap_scale(delta, arch_scale_cpu_capacity(NULL, cpu));

> +

> +		/*

> +		 * Track the amount of stolen idle time due to running at

> +		 * lower capacity

> +		 */

> +		sa->stolen_idle_time -= delta;

> +	} else if (!weight) {

> +		if (sa->util_sum < (LOAD_AVG_MAX * 1000)) {

> +			/*

> +			 * Add the idle time stolen by running at lower compute

> +			 * capacity

> +			 */

> +			delta += sa->stolen_idle_time;

> +		}

> +		sa->stolen_idle_time = 0;

> +	}

> +

> +	return delta;


As mentioned above, waiting time, i.e. !running && weight, is not
scaled, which causes trouble for load.

Morten
Dietmar Eggemann April 28, 2017, 5:08 p.m. UTC | #25
On 28/04/17 16:52, Morten Rasmussen wrote:
> Hi Vincent,


[...]

> As mentioned above, waiting time, i.e. !running && weight, is not

> scaled, which causes trouble for load.


I ran some rt-app-based tests on a system with frequency and cpu invariance.

(1) Two periodic 20% tasks with 12ms period on a cpu (capacity=1024) at
625MHz (max 1100MHz) starting to run at the same time, so one task
(task1) is wakeup preempted. (I'm only considering the phase of the test
run where this is a stable condition, i.e. task1 is always wakeup
preempted by task2).

So the runtime of a task is 0.2*12ms*1100/625 = 4.2ms.

At the beginning of the preemption period, __update_load_avg_se(task1)
is called with running=0 and weight=0, at the end with running=0 and
weight=1024.

When task1 finally runs there are two calls with (running=1,
weight=1024) before the next wakeup preemption period for task1 starts
again with (running=0, weight=0).

Task2, which doesn't suffer from wakeup preemption, starts running
with (running=0, weight=0), then there are 2 calls with (running=1,
weight=1024) before it starts running again with (running=0, weight=0).

Task1 is runnable for 8.4ms and sleeps for 3.6ms whereas task2 is
runnable for 4.2ms and sleeps for 7.8ms.

The load signal of task1 is ~600 whereas the load of task2 is ~200.

(2) Two periodic 20% tasks with 12ms period on a cpu (capacity=1024) at
1100MHz (max 1100MHz) starting to run at the same time, so one task
(task1) is wakeup preempted.

So the runtime of one task is 0.2*12ms*1100/1100 = 2.4ms.

Task1 is runnable for 4.8ms and sleeps for 7.2ms whereas task2 is
runnable for 2.4ms and sleeps for 9.6ms.

The load signal of task1 is ~400 whereas the load of task2 is ~200.

Like Morten said, the scaling for load works differently at different
OPPs. Scaling for utilization looks fine.

IMHO, the implementation of your scale_time() function can't take
preemption into consideration.
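
Decomposing task1's runnable time per 12ms period makes the asymmetry
visible; with the patch, only the running part of the delta is scaled:

    625MHz : wait 4.2ms (unscaled) + run 4.2ms (scaled to ~2.4ms)
    1100MHz: wait 2.4ms (unscaled) + run 2.4ms (already at max)

The scaled running contribution is the same at both OPPs, but the
unscaled waiting contribution is not, which is roughly where the ~600
vs ~400 load difference for task1 comes from, while task2 (which never
waits) stays at ~200 in both cases.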

I also ran tests comparing the time-scaling implementation with tip
(contribution scaling), using two periodic tasks at 20%/16ms and at
20%/32ms, each at 625MHz and 1100MHz, and they show the same difference
between time scaling and tip.

-- Dietmar

[...]
Peter Zijlstra April 28, 2017, 10:09 p.m. UTC | #26
On Mon, Apr 10, 2017 at 11:18:29AM +0200, Vincent Guittot wrote:
> +++ b/include/linux/sched.h

> @@ -313,6 +313,7 @@ struct load_weight {

>   */

>  struct sched_avg {

>  	u64				last_update_time;

> +	u64				stolen_idle_time;

>  	u64				load_sum;

>  	u32				util_sum;

>  	u32				period_contrib;


> +		if (sa->util_sum < (LOAD_AVG_MAX * 1000)) {

> +			/*

> +			 * Add the idle time stolen by running at lower compute

> +			 * capacity

> +			 */

> +			delta += sa->stolen_idle_time;

> +		}

> +		sa->stolen_idle_time = 0;



So I was wondering if stolen_idle_time really needs to be a u64. Afaict
we'll be at LOAD_AVG_MAX after LOAD_AVG_MAX_N periods, or LOAD_AVG_MAX_N
* LOAD_AVG_PERIOD time, which ends up being 11040.

After that you'll truncate it anyway... so there shouldn't be a need
for it to be much larger than that, no?
Peter Zijlstra May 1, 2017, 9 a.m. UTC | #27
On Sat, Apr 29, 2017 at 12:09:24AM +0200, Peter Zijlstra wrote:
> On Mon, Apr 10, 2017 at 11:18:29AM +0200, Vincent Guittot wrote:

> > +++ b/include/linux/sched.h

> > @@ -313,6 +313,7 @@ struct load_weight {

> >   */

> >  struct sched_avg {

> >  	u64				last_update_time;

> > +	u64				stolen_idle_time;

> >  	u64				load_sum;

> >  	u32				util_sum;

> >  	u32				period_contrib;

> 

> > +		if (sa->util_sum < (LOAD_AVG_MAX * 1000)) {

> > +			/*

> > +			 * Add the idle time stolen by running at lower compute

> > +			 * capacity

> > +			 */

> > +			delta += sa->stolen_idle_time;

> > +		}

> > +		sa->stolen_idle_time = 0;

> 

> 

> So I was wondering if stolen_idle_time really needs to be a u64. Afaict

> we'll be at LOAD_AVG_MAX after LOAD_AVG_MAX_N periods, or LOAD_AVG_MAX_N

> * LOAD_AVG_PERIOD time, which ends up being 11040.


* 1024 of course, but it still easily fits in u32.
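
That is, roughly 11040 * 1024 = 11304960 us of stolen idle time can
accumulate before util_sum saturates and the value gets discarded
anyway, comfortably below the u32 limit of 4294967295.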
Vincent Guittot May 2, 2017, 1:38 p.m. UTC | #28
On 1 May 2017 at 11:00, Peter Zijlstra <peterz@infradead.org> wrote:
> On Sat, Apr 29, 2017 at 12:09:24AM +0200, Peter Zijlstra wrote:

>> On Mon, Apr 10, 2017 at 11:18:29AM +0200, Vincent Guittot wrote:

>> > +++ b/include/linux/sched.h

>> > @@ -313,6 +313,7 @@ struct load_weight {

>> >   */

>> >  struct sched_avg {

>> >     u64                             last_update_time;

>> > +   u64                             stolen_idle_time;

>> >     u64                             load_sum;

>> >     u32                             util_sum;

>> >     u32                             period_contrib;

>>

>> > +           if (sa->util_sum < (LOAD_AVG_MAX * 1000)) {

>> > +                   /*

>> > +                    * Add the idle time stolen by running at lower compute

>> > +                    * capacity

>> > +                    */

>> > +                   delta += sa->stolen_idle_time;

>> > +           }

>> > +           sa->stolen_idle_time = 0;

>>

>>

>> So I was wondering if stolen_idle_time really needs to be a u64. Afaict

>> we'll be at LOAD_AVG_MAX after LOAD_AVG_MAX_N periods, or LOAD_AVG_MAX_N

>> * LOAD_AVG_PERIOD time, which ends up being 11040.

>

> * 1024 or course, but still easily fits in u32.


Correct

>
Vincent Guittot May 3, 2017, 5:11 p.m. UTC | #29
On Friday 28 Apr 2017 at 16:52:59 (+0100), Morten Rasmussen wrote:
> Hi Vincent,

> 

> Sorry for crashing the party this late. As you know, it takes a long

> period of uninterrupted review time to properly review PELT stuff.

> 

> Disclaimer: I haven't read the rest of the thread yet.

> 

> On Mon, Apr 10, 2017 at 11:18:29AM +0200, Vincent Guittot wrote:

> > The current implementation of load tracking invariance scales the

> > contribution with current frequency and uarch performance (only for

> > utilization) of the CPU. One main result of this formula is that the

> > figures are capped by current capacity of CPU. Another one is that the

> > load_avg is not invariant because not scaled with uarch.

> > 

> > The util_avg of a periodic task that runs r time slots every p time slots

> > varies in the range :

> > 

> >     U * (1-y^r)/(1-y^p) * y^i < Utilization < U * (1-y^r)/(1-y^p)

> > 

> > with U is the max util_avg value = SCHED_CAPACITY_SCALE

> > 

> > At a lower capacity, the range becomes:

> > 

> >     U * C * (1-y^r')/(1-y^p) * y^i' < Utilization <  U * C * (1-y^r')/(1-y^p)

> > 

> > with C reflecting the compute capacity ratio between current capacity and

> > max capacity.

> > 

> > so C tries to compensate changes in (1-y^r') but it can't be accurate.

> 

> I would say that C is trying to compensate for difference in running

> time, which is difference between r and r'. But I think that is what you

> are saying as well.

> 

> > Instead of scaling the contribution value of PELT algo, we should scale the

> > running time. The PELT signal aims to track the amount of computation of

> > tasks and/or rq so it seems more correct to scale the running time to

> > reflect the effective amount of computation done since the last update.

> > 

> > In order to be fully invariant, we need to apply the same amount of

> > running time and idle time whatever the current capacity. Because running

> > at lower capacity implies that the task will run longer, we have to track

> > the amount of "stolen" idle time and to apply it when task becomes idle.

> 

> Overall, I agree that scaling time according to compute capacity is the

> right thing to do, if we want accurate scale-invariance. The current

> implementation is an approximation which isn't perfect, but it is easy

> to implement.

> 

> One thing that became clear to me after reading the patch and tried to

> recreate what it does, is that you don't seem to consider waiting time

> and the patch doesn't scale it. That is fine for utilization, but it

> breaks load. For load, you have to scale waiting time as well, which

> makes things rather complicated as you would need two scaled time

> deltas, one for utilization and one for load.

>


In fact I have always considered that we don't need to scale !running time
(runnable and sleep time) because waiting is the same whatever the frequency
or the micro arch.
Nevertheless, I agree that running longer at lower capacity increases the
number of times a task can be preempted by another task, thus increasing the
runnable time.

The solution would be to scale both running and runnable time and to add the
stolen time once idle. This also seems to be more accurate for utilization
when the running task is preempted by another task. In fact, it's something
that Peter raised in former comments but I didn't see the benefit of scaling
runnable time back then... I was probably too focused on utilization.

This behavior is not really noticeable with the short running times of
Dietmar's example but it becomes relevant when tasks need more time and are
preempted more often by others.

With the addons below on top of the patch, the load_avg issue raised by
Dietmar's example is fixed and the utilization remains OK.

---
 kernel/sched/fair.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

-- 

>

> What currently happens is that waiting tasks as low OPPs are

> accumulating load at the rate of the highest OPP without accumulating

> "stolen" time. So waiting tasks accumulate too much load at lower OPPs.

> We have confirmed it by scheduling two periodic tasks on the same cpu

> and verified that the load does ramp up faster than it should.

> 

> For the aggregate cfs_rq util/load sums, I can't convince myself whether

> they break when "stolen" time isn't inserted at the same time in the

> task signal as it is for the cfs_rq signal. If you have two periodic

> tasks scheduled on the same cpu with aligned periods, then the first

> task will inject its stolen time as soon as it finishes, while the


In fact, I think that the 1st task will wait for its next update before
adding the lost idle time, but that doesn't change the point.

> aggregate sum won't inject the stolen time until it has completed the

> second task and inject the stolen for both tasks at the same time. As

> long as no tasks are migrated to/from the cpu it should work, but I'm not

> convinced that it works if tasks are migrated or new tasks show up

> before the stolen time has been committed to the aggregate sum. Have you

> looked more into this?

>

>

The task and the cfs_rq are synced before any change of task state
(attach/detach or enqueue/dequeue), so I don't expect them to get out of sync.

> 

> Another question is what happens to blocked utilization? IIUC, the

> stolen time is injected when the task wakes up again so the utilization

> drops to where it should be. Does that mean that the blocked

> contribution of the task is inflated the entire time while it is

> blocked?

>


Only at the task level, but its contribution to the cfs_rq will be updated.
This is exactly what already happens: we update task metrics only when a task
wakes up.

> 

> > But once we have reached the maximum utilization value (SCHED_CAPACITY_SCALE),

> > it means that the task is seen as an always-running task whatever the

> > capacity of the cpu (even at max compute capacity). In this case, we can

> > discard the "stolen" idle times which becomes meaningless. In order to

> > cope with rounding effect of PELT algo we take a margin and consider task

> > with utilization greater than 1000 (vs 1024 max) as an always-running task.

> 

> I understand why the threshold is necessary, but I'm slightly concerned

> that it might introduce corner cases where things could go very wrong. I

> haven't proven that it is indeed a problem, but is it possible to have

> aggregate sum of utilization >1000 without the cpu being fully utilized?

> So the tasks will have their stolen time injected, but the cfs_rq

> utilization won't?

> 

> > 

> > Then, we can use the same algorithm for both utilization and load and

> > simplify __update_load_avg now that the load of a task doesn't have to be

> > capped by CPU uarch.

> 

> As said above, waiting time has to be handled differently for

> utilization and load.

> 

> > The responsivness of PELT is improved when CPU is not running at max

> > capacity with this new algorithm. I have put below some examples of

> > duration to reach some typical load values according to the capacity of the

> > CPU with current implementation and with this patch.

> > 

> > Util (%)     max capacity  half capacity(mainline)  half capacity(w/ patch)

> > 972 (95%)    138ms         not reachable            276ms

> > 486 (47.5%)  30ms          138ms                     60ms

> > 256 (25%)    13ms           32ms                     26ms

> > 

> > On my hikey (octo ARM platform) with schedutil governor, the time to reach

> > max OPP when starting from a null utilization, decreases from 223ms with

> > current scale invariance down to 121ms with the new algorithm. For this

> > test, i have enable arch_scale_freq for arm64.

> 

> Note that the time to reach a specific utilization with your

> implementation depends on the OPP. The lower the OPP, the slower

> utilization accumulates.


Yes, like the current implementation. Everything is based on: lower OPP means
proportionally longer running time.

Thanks

> 

> [...]

> 

> > @@ -2852,19 +2850,54 @@ accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,

> >  	}

> >  	sa->period_contrib = delta;

> >  

> > -	contrib = cap_scale(contrib, scale_freq);

> >  	if (weight) {

> >  		sa->load_sum += weight * contrib;

> >  		if (cfs_rq)

> >  			cfs_rq->runnable_load_sum += weight * contrib;

> >  	}

> >  	if (running)

> > -		sa->util_sum += contrib * scale_cpu;

> > +		sa->util_sum += contrib << SCHED_CAPACITY_SHIFT;

> >  

> >  	return periods;

> >  }

> >  

> >  /*

> > + * Scale the time to reflect the effective amount of computation done during

> > + * this delta time.

> > + */

> > +static __always_inline u64

> > +scale_time(u64 delta, int cpu, struct sched_avg *sa,

> > +		unsigned long weight, int running)

> > +{

> > +	if (running) {

> > +		sa->stolen_idle_time += delta;

> > +		/*

> > +		 * scale the elapsed time to reflect the real amount of

> > +		 * computation

> > +		 */

> > +		delta = cap_scale(delta, arch_scale_freq_capacity(NULL, cpu));

> > +		delta = cap_scale(delta, arch_scale_cpu_capacity(NULL, cpu));

> > +

> > +		/*

> > +		 * Track the amount of stolen idle time due to running at

> > +		 * lower capacity

> > +		 */

> > +		sa->stolen_idle_time -= delta;

> > +	} else if (!weight) {

> > +		if (sa->util_sum < (LOAD_AVG_MAX * 1000)) {

> > +			/*

> > +			 * Add the idle time stolen by running at lower compute

> > +			 * capacity

> > +			 */

> > +			delta += sa->stolen_idle_time;

> > +		}

> > +		sa->stolen_idle_time = 0;

> > +	}

> > +

> > +	return delta;

> 

> As mentioned above, waiting time, i.e. !running && weight, is not

> scaled, which causes trouble for load.

> 

> Morten

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c

index 8b036f1..3e647b9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2854,12 +2854,15 @@ static __always_inline u64
 scale_time(u64 delta, int cpu, struct sched_avg *sa,
 		unsigned long weight, int running)
 {
-	if (running) {
+	if (weight) {
 		/*
-		 * When an entity runs at a lower compute capacity, it will
-		 * need more time to do the same amount of work than at max
-		 * capacity. In order to be invariant, we scale the delta to
-		 * reflect how much work has been really done.
+		 * When an entity is runnable at a lower compute capacity,
+		 * it will need more time to do the same amount of work than
+		 * at max capacity: either because it takes more time to
+		 * compute the same amount of work or because taking more time
+		 * means sharing more often the CPU with others.
+		 * In order to be invariant, we scale the delta to reflect how
+		 * much work has been really done.
 		 * Running at lower capacity also means running longer to do
 		 * the same amount of work and this results in stealing some
 		 * idle time that will disturbed the load signal compared to
@@ -2883,7 +2886,7 @@ scale_time(u64 delta, int cpu, struct sched_avg *sa,
 		 * lower capacity
 		 */
 		sa->stolen_idle_time -= delta;
-	} else if (!weight) {
+	} else {
 		/*
 		 * Entity is sleeping so both utilization and load will decay
 		 * and we can safely add the stolen time. Reflecting some
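
For reference, a sketch of how scale_time() would read with this addon
folded into the patch (reconstructed from the two hunks above; comments
abridged):

static __always_inline u64
scale_time(u64 delta, int cpu, struct sched_avg *sa,
		unsigned long weight, int running)
{
	if (weight) {
		/*
		 * Entity is runnable (running or waiting for the CPU):
		 * scale the elapsed time and track the idle time stolen
		 * by running/waiting at lower capacity.
		 */
		sa->stolen_idle_time += delta;
		delta = cap_scale(delta, arch_scale_freq_capacity(NULL, cpu));
		delta = cap_scale(delta, arch_scale_cpu_capacity(NULL, cpu));
		sa->stolen_idle_time -= delta;
	} else {
		/*
		 * Entity is sleeping: both signals decay, so the stolen
		 * idle time can safely be added back, unless the entity
		 * looks always-running.
		 */
		if (sa->util_sum < (LOAD_AVG_MAX * 1000))
			delta += sa->stolen_idle_time;
		sa->stolen_idle_time = 0;
	}

	return delta;
}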

Patch

diff --git a/include/linux/sched.h b/include/linux/sched.h
index d67eee8..ca9d00f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -313,6 +313,7 @@  struct load_weight {
  */
 struct sched_avg {
 	u64				last_update_time;
+	u64				stolen_idle_time;
 	u64				load_sum;
 	u32				util_sum;
 	u32				period_contrib;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1e5f580..b6f4253 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -734,6 +734,7 @@  void init_entity_runnable_average(struct sched_entity *se)
 	struct sched_avg *sa = &se->avg;
 
 	sa->last_update_time = 0;
+	sa->stolen_idle_time = 0;
 	/*
 	 * sched_avg's period_contrib should be strictly less then 1024, so
 	 * we give it 1023 to make sure it is almost a period (1024us), and
@@ -2819,15 +2820,12 @@  static u32 __accumulate_pelt_segments(u64 periods, u32 d1, u32 d3)
  *                         n=1
  */
 static __always_inline u32
-accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
+accumulate_sum(u64 delta, struct sched_avg *sa,
 	       unsigned long weight, int running, struct cfs_rq *cfs_rq)
 {
-	unsigned long scale_freq, scale_cpu;
 	u32 contrib = (u32)delta; /* p == 0 -> delta < 1024 */
 	u64 periods;
 
-	scale_freq = arch_scale_freq_capacity(NULL, cpu);
-	scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
 
 	delta += sa->period_contrib;
 	periods = delta / 1024; /* A period is 1024us (~1ms) */
@@ -2852,19 +2850,54 @@  accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
 	}
 	sa->period_contrib = delta;
 
-	contrib = cap_scale(contrib, scale_freq);
 	if (weight) {
 		sa->load_sum += weight * contrib;
 		if (cfs_rq)
 			cfs_rq->runnable_load_sum += weight * contrib;
 	}
 	if (running)
-		sa->util_sum += contrib * scale_cpu;
+		sa->util_sum += contrib << SCHED_CAPACITY_SHIFT;
 
 	return periods;
 }
 
 /*
+ * Scale the time to reflect the effective amount of computation done during
+ * this delta time.
+ */
+static __always_inline u64
+scale_time(u64 delta, int cpu, struct sched_avg *sa,
+		unsigned long weight, int running)
+{
+	if (running) {
+		sa->stolen_idle_time += delta;
+		/*
+		 * scale the elapsed time to reflect the real amount of
+		 * computation
+		 */
+		delta = cap_scale(delta, arch_scale_freq_capacity(NULL, cpu));
+		delta = cap_scale(delta, arch_scale_cpu_capacity(NULL, cpu));
+
+		/*
+		 * Track the amount of stolen idle time due to running at
+		 * lower capacity
+		 */
+		sa->stolen_idle_time -= delta;
+	} else if (!weight) {
+		if (sa->util_sum < (LOAD_AVG_MAX * 1000)) {
+			/*
+			 * Add the idle time stolen by running at lower compute
+			 * capacity
+			 */
+			delta += sa->stolen_idle_time;
+		}
+		sa->stolen_idle_time = 0;
+	}
+
+	return delta;
+}
+
+/*
  * We can represent the historical contribution to runnable average as the
  * coefficients of a geometric series.  To do this we sub-divide our runnable
  * history into segments of approximately 1ms (1024us); label the segment that
@@ -2918,13 +2951,19 @@  ___update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 	sa->last_update_time = now;
 
 	/*
+	 * Scale time to reflect the amount of computation effectively done
+	 * during the time slot at current capacity
+	 */
+	delta = scale_time(delta, cpu, sa, weight, running);
+
+	/*
 	 * Now we know we crossed measurement unit boundaries. The *_avg
 	 * accrues by two steps:
 	 *
 	 * Step 1: accumulate *_sum since last_update_time. If we haven't
 	 * crossed period boundaries, finish.
 	 */
-	if (!accumulate_sum(delta, cpu, sa, weight, running, cfs_rq))
+	if (!accumulate_sum(delta, sa, weight, running, cfs_rq))
 		return 0;
 
 	/*