
[v3] sched/fair: update scale invariance of PELT

Message ID 1493389435-2525-1-git-send-email-vincent.guittot@linaro.org
State New

Commit Message

Vincent Guittot April 28, 2017, 2:23 p.m. UTC
The current implementation of load tracking invariance scales the
contribution with the current frequency and the uarch performance (only
for utilization) of the CPU. One main result of this formula is that the
figures are capped by the current capacity of the CPU. Another is that
load_avg is not invariant because it is not scaled with uarch.

The util_avg of a periodic task that runs r time slots every p time slots
varies in the range :

    U * (1-y^r)/(1-y^p) * y^i < Utilization < U * (1-y^r)/(1-y^p)

where U is the max util_avg value = SCHED_CAPACITY_SCALE

At a lower capacity, the range becomes:

    U * C * (1-y^r')/(1-y^p) * y^i' < Utilization <  U * C * (1-y^r')/(1-y^p)

with C reflecting the compute capacity ratio between current capacity and
max capacity.

So C tries to compensate for the changes in (1-y^r'), but it cannot be
accurate.

Instead of scaling the contribution value of the PELT algorithm, we should
scale the running time. The PELT signal aims to track the amount of
computation of tasks and/or rqs, so it seems more correct to scale the
running time to reflect the effective amount of computation done since the
last update.

In order to be fully invariant, we need to apply the same amounts of
running time and idle time whatever the current capacity. Because running
at lower capacity implies that the task will run longer, we have to track
the amount of "stolen" idle time and apply it when the task becomes idle.

But once we have reached the maximum utilization value (SCHED_CAPACITY_SCALE),
the task is seen as an always-running task whatever the capacity of the
CPU (even at max compute capacity). In this case, we can discard the
"stolen" idle time, which becomes meaningless. In order to cope with the
rounding effects of the PELT algorithm, we take a margin and consider a
task with utilization greater than 1000 (vs. the 1024 max) as an
always-running task.

Then we can use the same algorithm for both utilization and load, and
simplify __update_load_avg now that the load of a task no longer has to be
capped by the CPU uarch.

The responsiveness of PELT is improved when the CPU is not running at max
capacity with this new algorithm. Below are some examples of the time
needed to reach typical utilization values, depending on the capacity of
the CPU, with the current implementation and with this patch.

Util (%)     max capacity  half capacity (mainline)  half capacity (w/ patch)
972 (95%)    138ms         not reachable             276ms
486 (47.5%)   30ms         138ms                      60ms
256 (25%)     13ms          32ms                      26ms

On my Hikey (octo-core ARM platform) with the schedutil governor, the time
to reach the max OPP when starting from null utilization decreases from
223ms with the current scale invariance down to 121ms with the new
algorithm. For this test, I enabled arch_scale_freq for arm64.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

---

Changes since v2
- Add comments
- With the patch ("sched/cfs: make util/load_avg more stable"), utilization
  stays stable when reaching the max value. Removed the margin used to detect
  always-running tasks

 include/linux/sched.h |  1 +
 kernel/sched/fair.c   | 76 ++++++++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 70 insertions(+), 7 deletions(-)

-- 
2.7.4

Comments

Peter Zijlstra May 18, 2018, 9:36 a.m. UTC | #1
Replying to the latest version available; given the current interest I
figure I'd re-read some of the old threads and look at this stuff again.

On Fri, Apr 28, 2017 at 04:23:55PM +0200, Vincent Guittot wrote:

> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 0978fb7..f8dde36 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -313,6 +313,7 @@ struct load_weight {
>   */
>  struct sched_avg {
>  	u64				last_update_time;
> +	u64				stolen_idle_time;
>  	u64				load_sum;
>  	u32				util_sum;
>  	u32				period_contrib;


Right, so sadly Patrick stole that space with the util_est bits.

Also, given the comment here:

  https://marc.info/?l=linux-kernel&m=149373232422941&w=2

this should be a u32, right? Which might be slightly easier finding a
hole for.

>  /*
> + * Scale the time to reflect the effective amount of computation done during
> + * this delta time.


I would much appreciate a more extended comment here. One that includes
pictures of the moving window edges, as in:

  https://marc.info/?l=linux-kernel&m=149200866116792&w=2
  https://marc.info/?l=linux-kernel&m=149201190517985&w=2

> + */
> +static __always_inline u64
> +scale_time(u64 delta, int cpu, struct sched_avg *sa,
> +		unsigned long weight, int running)
> +{
> +	if (running) {
> +		/*
> +		 * When an entity runs at a lower compute capacity, it will
> +		 * need more time to do the same amount of work than it would
> +		 * at max capacity. In order to be invariant, we scale the
> +		 * delta to reflect how much work has really been done.
> +		 * Running at lower capacity also means running longer to do
> +		 * the same amount of work, and this results in stealing some
> +		 * idle time that will disturb the load signal compared to
> +		 * max capacity; we also track this amount of stolen time to
> +		 * reflect it when the entity goes back to sleep.
> +		 *
> +		 * stolen time = (current run time) - (effective time at max
> +		 * capacity)
> +		 */
> +		sa->stolen_idle_time += delta;
> +
> +		/*
> +		 * scale the elapsed time to reflect the real amount of
> +		 * computation
> +		 */
> +		delta = cap_scale(delta, arch_scale_freq_capacity(NULL, cpu));
> +		delta = cap_scale(delta, arch_scale_cpu_capacity(NULL, cpu));
> +
> +		/*
> +		 * Track the amount of stolen idle time due to running at
> +		 * lower capacity
> +		 */
> +		sa->stolen_idle_time -= delta;
> +	} else if (!weight) {
> +		/*
> +		 * The entity is sleeping, so both utilization and load will
> +		 * decay and we can safely add the stolen time. Reflecting
> +		 * stolen time makes sense only if this idle phase would also
> +		 * be present at max capacity. Once the utilization of an
> +		 * entity has reached the maximum value, it is considered an
> +		 * always-running entity without idle time to steal.
> +		 */
> +		if (sa->util_avg < (SCHED_CAPACITY_SCALE - 1)) {
> +			/*
> +			 * Add the idle time stolen by running at lower compute
> +			 * capacity
> +			 */
> +			delta += sa->stolen_idle_time;
> +		}
> +		sa->stolen_idle_time = 0;
> +	}


What happened to the proposed changes here:

  https://marc.info/?l=linux-kernel&m=149383148721909&w=2

to deal with the load scaling issues?

> +
> +	return delta;
> +}
Patrick Bellasi May 18, 2018, 10:17 a.m. UTC | #2
On 18-May 11:36, Peter Zijlstra wrote:
> 
> Replying to the latest version available; given the current interest I
> figure I'd re-read some of the old threads and look at this stuff again.
> 
> On Fri, Apr 28, 2017 at 04:23:55PM +0200, Vincent Guittot wrote:
> 
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 0978fb7..f8dde36 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -313,6 +313,7 @@ struct load_weight {
> >   */
> >  struct sched_avg {
> >  	u64				last_update_time;
> > +	u64				stolen_idle_time;
> >  	u64				load_sum;
> >  	u32				util_sum;
> >  	u32				period_contrib;
> 
> Right, so sadly Patrick stole that space with the util_est bits.


Sorry :(

However, I remember we already talked about the idea to
update load_avg and runnable_load_avg to use u32:

   https://marc.info/?l=linux-kernel&m=151334269426419&w=2

-- 
#include <best/regards.h>

Patrick Bellasi
Peter Zijlstra May 18, 2018, 12:08 p.m. UTC | #3
On Fri, May 18, 2018 at 11:17:38AM +0100, Patrick Bellasi wrote:
> On 18-May 11:36, Peter Zijlstra wrote:
> > Replying to the latest version available; given the current interest I
> > figure I'd re-read some of the old threads and look at this stuff again.
> > 
> > On Fri, Apr 28, 2017 at 04:23:55PM +0200, Vincent Guittot wrote:
> > 
> > > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > > index 0978fb7..f8dde36 100644
> > > --- a/include/linux/sched.h
> > > +++ b/include/linux/sched.h
> > > @@ -313,6 +313,7 @@ struct load_weight {
> > >   */
> > >  struct sched_avg {
> > >  	u64				last_update_time;
> > > +	u64				stolen_idle_time;
> > >  	u64				load_sum;
> > >  	u32				util_sum;
> > >  	u32				period_contrib;
> > 
> > Right, so sadly Patrick stole that space with the util_est bits.
> 
> Sorry :(
> 
> However, I remember we already talked about the idea to
> update load_avg and runnable_load_avg to use u32:
> 
>    https://marc.info/?l=linux-kernel&m=151334269426419&w=2
Sadly not, I fear: (runnable_)load for entities is obviously limited in
range, but we use the same struct sched_avg for the cfs_rq aggregates,
and there is no real limit on the total weight of a cfs_rq.

However, I think we can shrink util_avg unconditionally, which would
give us exactly the 4 byte hole we need.

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 959a8588e365..670b53b10a68 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -402,7 +402,8 @@ struct sched_avg {
 	u32				period_contrib;
 	unsigned long			load_avg;
 	unsigned long			runnable_load_avg;
-	unsigned long			util_avg;
+	u32				util_avg;
+	u32				stolen_idle_time;
 	struct util_est			util_est;
 } ____cacheline_aligned;
Peter Zijlstra May 18, 2018, 12:19 p.m. UTC | #4
On Fri, May 18, 2018 at 02:08:51PM +0200, Peter Zijlstra wrote:
> However, I think we can shrink util_avg unconditionally, which would
> give us exactly the 4 byte hole we need.

There is of course the problem that we track rq util_avg as a straight
sum of entity util_avg, such that if you migrate/wake a bunch of tasks
into the rq the sum can exceed the natural bounds, although it will
settle back down eventually.

If we're really worried we could try and detect overflows on the
summing and clip.
Dietmar Eggemann May 18, 2018, 3:16 p.m. UTC | #5
On 05/18/2018 10:36 AM, Peter Zijlstra wrote:
> 
> Replying to the latest version available; given the current interest I
> figure I'd re-read some of the old threads and look at this stuff again.
> 
> On Fri, Apr 28, 2017 at 04:23:55PM +0200, Vincent Guittot wrote:

[...]

> What happened to the proposed changes here:
> 
>    https://marc.info/?l=linux-kernel&m=149383148721909&w=2
> 
> to deal with the load scaling issues?

They are not in this v3 version; I didn't test them back then. I assume
that Vincent will incorporate that bit into his update.
Vincent Guittot May 18, 2018, 3:43 p.m. UTC | #6
On 18 May 2018 at 17:16, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> On 05/18/2018 10:36 AM, Peter Zijlstra wrote:
>> Replying to the latest version available; given the current interest I
>> figure I'd re-read some of the old threads and look at this stuff again.
>>
>> On Fri, Apr 28, 2017 at 04:23:55PM +0200, Vincent Guittot wrote:
>
> [...]
>
>> What happened to the proposed changes here:
>>
>>    https://marc.info/?l=linux-kernel&m=149383148721909&w=2
>>
>> to deal with the load scaling issues?
>
> They are not in this v3 version. I haven't tested them back then. I assume
> that Vincent will incorporate that bit into his update.


Yes, I will integrate all pending changes in the next version. I have
some changes that I haven't fully tested yet.


Patch

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0978fb7..f8dde36 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -313,6 +313,7 @@  struct load_weight {
  */
 struct sched_avg {
 	u64				last_update_time;
+	u64				stolen_idle_time;
 	u64				load_sum;
 	u32				util_sum;
 	u32				period_contrib;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a903276..8b036f1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -729,6 +729,7 @@  void init_entity_runnable_average(struct sched_entity *se)
 	struct sched_avg *sa = &se->avg;
 
 	sa->last_update_time = 0;
+	sa->stolen_idle_time = 0;
 	/*
 	 * sched_avg's period_contrib should be strictly less then 1024, so
 	 * we give it 1023 to make sure it is almost a period (1024us), and
@@ -2804,15 +2805,12 @@  static u32 __accumulate_pelt_segments(u64 periods, u32 d1, u32 d3)
  *                     n=1
  */
 static __always_inline u32
-accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
+accumulate_sum(u64 delta, struct sched_avg *sa,
 	       unsigned long weight, int running, struct cfs_rq *cfs_rq)
 {
-	unsigned long scale_freq, scale_cpu;
 	u32 contrib = (u32)delta; /* p == 0 -> delta < 1024 */
 	u64 periods;
 
-	scale_freq = arch_scale_freq_capacity(NULL, cpu);
-	scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
 
 	delta += sa->period_contrib;
 	periods = delta / 1024; /* A period is 1024us (~1ms) */
@@ -2837,19 +2835,77 @@  accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
 	}
 	sa->period_contrib = delta;
 
-	contrib = cap_scale(contrib, scale_freq);
 	if (weight) {
 		sa->load_sum += weight * contrib;
 		if (cfs_rq)
 			cfs_rq->runnable_load_sum += weight * contrib;
 	}
 	if (running)
-		sa->util_sum += contrib * scale_cpu;
+		sa->util_sum += contrib << SCHED_CAPACITY_SHIFT;
 
 	return periods;
 }
 
 /*
+ * Scale the time to reflect the effective amount of computation done during
+ * this delta time.
+ */
+static __always_inline u64
+scale_time(u64 delta, int cpu, struct sched_avg *sa,
+		unsigned long weight, int running)
+{
+	if (running) {
+		/*
+		 * When an entity runs at a lower compute capacity, it will
+		 * need more time to do the same amount of work than it would
+		 * at max capacity. In order to be invariant, we scale the
+		 * delta to reflect how much work has really been done.
+		 * Running at lower capacity also means running longer to do
+		 * the same amount of work, and this results in stealing some
+		 * idle time that will disturb the load signal compared to
+		 * max capacity; we also track this amount of stolen time to
+		 * reflect it when the entity goes back to sleep.
+		 *
+		 * stolen time = (current run time) - (effective time at max
+		 * capacity)
+		 */
+		sa->stolen_idle_time += delta;
+
+		/*
+		 * scale the elapsed time to reflect the real amount of
+		 * computation
+		 */
+		delta = cap_scale(delta, arch_scale_freq_capacity(NULL, cpu));
+		delta = cap_scale(delta, arch_scale_cpu_capacity(NULL, cpu));
+
+		/*
+		 * Track the amount of stolen idle time due to running at
+		 * lower capacity
+		 */
+		sa->stolen_idle_time -= delta;
+	} else if (!weight) {
+		/*
+		 * The entity is sleeping, so both utilization and load will
+		 * decay and we can safely add the stolen time. Reflecting
+		 * stolen time makes sense only if this idle phase would also
+		 * be present at max capacity. Once the utilization of an
+		 * entity has reached the maximum value, it is considered an
+		 * always-running entity without idle time to steal.
+		 */
+		if (sa->util_avg < (SCHED_CAPACITY_SCALE - 1)) {
+			/*
+			 * Add the idle time stolen by running at lower compute
+			 * capacity
+			 */
+			delta += sa->stolen_idle_time;
+		}
+		sa->stolen_idle_time = 0;
+	}
+
+	return delta;
+}
+
+/*
  * We can represent the historical contribution to runnable average as the
  * coefficients of a geometric series.  To do this we sub-divide our runnable
  * history into segments of approximately 1ms (1024us); label the segment that
@@ -2904,13 +2960,19 @@  ___update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 	sa->last_update_time += delta << 10;
 
 	/*
+	 * Scale time to reflect the amount of computation effectively done
+	 * during the time slot at the current capacity
+	 */
+	delta = scale_time(delta, cpu, sa, weight, running);
+
+	/*
 	 * Now we know we crossed measurement unit boundaries. The *_avg
 	 * accrues by two steps:
 	 *
 	 * Step 1: accumulate *_sum since last_update_time. If we haven't
 	 * crossed period boundaries, finish.
 	 */
-	if (!accumulate_sum(delta, cpu, sa, weight, running, cfs_rq))
+	if (!accumulate_sum(delta, sa, weight, running, cfs_rq))
 		return 0;
 
 	/*