[v2,2/4] sched/fair: add util_est on top of PELT

Message ID	20171205171018.9203-3-patrick.bellasi@arm.com
State	New
Headers	show Delivered-To: patch@linaro.org Received-SPF: pass (google.com: best guess record for domain of linux-pm-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; From: Patrick Bellasi <patrick.bellasi@arm.com> To: linux-kernel@vger.kernel.org, linux-pm@vger.kernel.org Cc: Ingo Molnar <mingo@redhat.com>, Peter Zijlstra <peterz@infradead.org>, "Rafael J . Wysocki" <rafael.j.wysocki@intel.com>, Viresh Kumar <viresh.kumar@linaro.org>, Vincent Guittot <vincent.guittot@linaro.org>, Paul Turner <pjt@google.com>, Dietmar Eggemann <dietmar.eggemann@arm.com>, Morten Rasmussen <morten.rasmussen@arm.com>, Juri Lelli <juri.lelli@redhat.com>, Todd Kjos <tkjos@android.com>, Joel Fernandes <joelaf@google.com> Subject: [PATCH v2 2/4] sched/fair: add util_est on top of PELT Date: Tue, 5 Dec 2017 17:10:16 +0000 Message-Id: <20171205171018.9203-3-patrick.bellasi@arm.com> In-Reply-To: <20171205171018.9203-1-patrick.bellasi@arm.com> References: <20171205171018.9203-1-patrick.bellasi@arm.com> Sender: linux-pm-owner@vger.kernel.org Precedence: bulk
Series	Utilization estimation (util_est) for FAIR tasks \| expand [v2,0/4] Utilization estimation (util_est) for FAIR tasks [v2,1/4] sched/fair: always used unsigned long for utilization [v2,2/4] sched/fair: add util_est on top of PELT [v2,3/4] sched/fair: use util_est in LB and WU paths

Patrick Bellasi Dec. 5, 2017, 5:10 p.m. UTC

The util_avg signal computed by PELT is too variable for some use-cases.
For example, a big task waking up after a long sleep period will have its
utilization almost completely decayed. This introduces some latency before
schedutil will be able to pick the best frequency to run a task.

The same issue can affect task placement. Indeed, since the task
utilization is already decayed at wakeup, when the task is enqueued in a
CPU, this can result in a CPU running a big task as being temporarily
represented as being almost empty. This leads to a race condition where
other tasks can be potentially allocated on a CPU which just started to run
a big task which slept for a relatively long period.

Moreover, the PELT utilization of a task can be updated every [ms], thus
making it a continuously changing value for certain longer running
tasks. This means that the instantaneous PELT utilization of a RUNNING
task is not really meaningful to properly support scheduler decisions.

For all these reasons, a more stable signal can do a better job of
representing the expected/estimated utilization of a task/cfs_rq.
Such a signal can be easily created on top of PELT by still using it as
an estimator which produces values to be aggregated on meaningful
events.

This patch adds a simple implementation of util_est, a new signal built on
top of PELT's util_avg where:

    util_est(task) = max(task::util_avg, f(task::util_avg@dequeue_times))

This allows to remember how big a task has been reported by PELT in its
previous activations via the function: f(task::util_avg@dequeue_times).

If a task should change its behavior and it runs even longer in a new
activation, after a certain time its util_est will just track the
original PELT signal (i.e. task::util_avg).

The estimated utilization of cfs_rq is defined only for root ones.
That's because the only sensible consumer of this signal are the
scheduler and schedutil when looking for the overall CPU utilization due
to FAIR tasks.
For this reason, the estimated utilization of a root cfs_rq is simply
defined as:

    util_est(cfs_rq) = max(cfs_rq::util_avg, cfs_rq::util_est_runnable)

where:

    cfs_rq::util_est_runnable = sum(util_est(task))
                                for each RUNNABLE task on that root cfs_rq

It's worth to note that the estimated utilization is tracked only for
entities of interests, specifically:
 - Tasks: to better support tasks placement decisions
 - root cfs_rqs: to better support both tasks placement decisions as
                 well as frequencies selection

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>

Reviewed-by: Brendan Jackman <brendan.jackman@arm.com>

Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Paul Turner <pjt@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-pm@vger.kernel.org

---
Changes v1->v2:
 - rebase on top of v4.15-rc2
 - tested that overhauled PELT code does not affect the util_est
---
 include/linux/sched.h   |  21 ++++++++++
 kernel/sched/debug.c    |   4 ++
 kernel/sched/fair.c     | 102 +++++++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/features.h |   5 +++
 kernel/sched/sched.h    |   1 +
 5 files changed, 132 insertions(+), 1 deletion(-)

-- 
2.14.1

Peter Zijlstra Dec. 13, 2017, 4:05 p.m. UTC | #1

On Tue, Dec 05, 2017 at 05:10:16PM +0000, Patrick Bellasi wrote:
> +	if (cfs_rq->nr_running > 0) {

> +		util_est  = cfs_rq->util_est_runnable;

> +		util_est -= task_util_est(p);

> +		if (util_est < 0)

> +			util_est = 0;

> +		cfs_rq->util_est_runnable = util_est;

> +	} else {

I'm thinking that's an explicit load-store to avoid intermediate values
landing in cfs_rq->util_esp_runnable, right?

That would need READ_ONCE() / WRITE_ONCE() I think, without that the
compiler is free to munge the lot together.

Peter Zijlstra Dec. 13, 2017, 4:19 p.m. UTC | #2

On Tue, Dec 05, 2017 at 05:10:16PM +0000, Patrick Bellasi wrote:
> @@ -562,6 +577,12 @@ struct task_struct {

>  

>  	const struct sched_class	*sched_class;

>  	struct sched_entity		se;

> +	/*

> +	 * Since we use se.avg.util_avg to update util_est fields,

> +	 * this last can benefit from being close to se which

> +	 * also defines se.avg as cache aligned.

> +	 */

> +	struct util_est			util_est;

>  	struct sched_rt_entity		rt;

>  #ifdef CONFIG_CGROUP_SCHED

>  	struct task_group		*sched_task_group;



> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h

> index b19552a212de..8371839075fa 100644

> --- a/kernel/sched/sched.h

> +++ b/kernel/sched/sched.h

> @@ -444,6 +444,7 @@ struct cfs_rq {

>  	 * CFS load tracking

>  	 */

>  	struct sched_avg avg;

> +	unsigned long util_est_runnable;

>  #ifndef CONFIG_64BIT

>  	u64 load_last_update_time_copy;

>  #endif



So you put the util_est in task_struct (not sched_entity) but the
util_est_runnable in cfs_rq (not rq). Seems inconsistent.

Patrick Bellasi Dec. 13, 2017, 4:36 p.m. UTC | #3

On 13-Dec 17:19, Peter Zijlstra wrote:
> On Tue, Dec 05, 2017 at 05:10:16PM +0000, Patrick Bellasi wrote:

> > @@ -562,6 +577,12 @@ struct task_struct {

> >  

> >  	const struct sched_class	*sched_class;

> >  	struct sched_entity		se;

> > +	/*

> > +	 * Since we use se.avg.util_avg to update util_est fields,

> > +	 * this last can benefit from being close to se which

> > +	 * also defines se.avg as cache aligned.

> > +	 */

> > +	struct util_est			util_est;

> >  	struct sched_rt_entity		rt;

> >  #ifdef CONFIG_CGROUP_SCHED

> >  	struct task_group		*sched_task_group;

> 

> 

> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h

> > index b19552a212de..8371839075fa 100644

> > --- a/kernel/sched/sched.h

> > +++ b/kernel/sched/sched.h

> > @@ -444,6 +444,7 @@ struct cfs_rq {

> >  	 * CFS load tracking

> >  	 */

> >  	struct sched_avg avg;

> > +	unsigned long util_est_runnable;

> >  #ifndef CONFIG_64BIT

> >  	u64 load_last_update_time_copy;

> >  #endif

> 

> 

> So you put the util_est in task_struct (not sched_entity) but the

> util_est_runnable in cfs_rq (not rq). Seems inconsistent.


One goal was to keep util_est variables close to the util_avg used to
load the filter, for caches affinity sakes.

The other goal was to have util_est data only for Tasks and CPU's
RQ, thus avoiding unused data for TG's RQ and SE.

Unfortunately the first goal does not allow to achieve completely the
second and, you right, the solution looks a bit inconsistent.

Do you think we should better disregard cache proximity and move
util_est_runnable to rq?

-- 
#include <best/regards.h>

Patrick Bellasi

Peter Zijlstra Dec. 13, 2017, 5:03 p.m. UTC | #4

On Wed, Dec 13, 2017 at 04:36:53PM +0000, Patrick Bellasi wrote:
> On 13-Dec 17:19, Peter Zijlstra wrote:

> > On Tue, Dec 05, 2017 at 05:10:16PM +0000, Patrick Bellasi wrote:

> > > @@ -562,6 +577,12 @@ struct task_struct {

> > >  

> > >  	const struct sched_class	*sched_class;

> > >  	struct sched_entity		se;

> > > +	/*

> > > +	 * Since we use se.avg.util_avg to update util_est fields,

> > > +	 * this last can benefit from being close to se which

> > > +	 * also defines se.avg as cache aligned.

> > > +	 */

> > > +	struct util_est			util_est;


The thing is, since sched_entity has a member with cacheline alignment,
the whole structure must have cacheline alignment, and this util_est
_will_ start on a new line.

See also:

$ pahole -EC task_struct defconfig/kernel/sched/core.o

...
		struct sched_avg {
                                /* typedef u64 */ long long unsigned int last_update_time; /*   576     8 */
                                /* typedef u64 */ long long unsigned int load_sum;       /*   584     8 */
                                /* typedef u32 */ unsigned int util_sum;                 /*   592     4 */
                                /* typedef u32 */ unsigned int period_contrib;           /*   596     4 */
                                long unsigned int load_avg;                              /*   600     8 */
                                long unsigned int util_avg;                              /*   608     8 */
                        } avg; /*   576    40 */
                        /* --- cacheline 6 boundary (384 bytes) --- */
                } se; /*   192   448 */
                /* --- cacheline 8 boundary (512 bytes) was 24 bytes ago --- */
                struct util_est {
                        long unsigned int last;                                          /*   640     8 */
                        long unsigned int ewma;                                          /*   648     8 */
                } util_est; /*   640    16 */
...

The thing is somewhat confused on which cacheline is which, but you'll
see sched_avg landing at 576 (cacheline #9) and util_est at 640 (line
#10).

> > >  	struct sched_rt_entity		rt;

> > >  #ifdef CONFIG_CGROUP_SCHED

> > >  	struct task_group		*sched_task_group;


> One goal was to keep util_est variables close to the util_avg used to

> load the filter, for caches affinity sakes.

> 

> The other goal was to have util_est data only for Tasks and CPU's

> RQ, thus avoiding unused data for TG's RQ and SE.

> 

> Unfortunately the first goal does not allow to achieve completely the

> second and, you right, the solution looks a bit inconsistent.

> 

> Do you think we should better disregard cache proximity and move

> util_est_runnable to rq?


proximity is likely important; I'd suggest moving util_est into
sched_entity.

Patrick Bellasi Dec. 15, 2017, 12:03 p.m. UTC | #5

On 13-Dec 18:03, Peter Zijlstra wrote:
> On Wed, Dec 13, 2017 at 04:36:53PM +0000, Patrick Bellasi wrote:

> > On 13-Dec 17:19, Peter Zijlstra wrote:

> > > On Tue, Dec 05, 2017 at 05:10:16PM +0000, Patrick Bellasi wrote:

> > > > @@ -562,6 +577,12 @@ struct task_struct {

> > > >  

> > > >  	const struct sched_class	*sched_class;

> > > >  	struct sched_entity		se;

> > > > +	/*

> > > > +	 * Since we use se.avg.util_avg to update util_est fields,

> > > > +	 * this last can benefit from being close to se which

> > > > +	 * also defines se.avg as cache aligned.

> > > > +	 */

> > > > +	struct util_est			util_est;

> 

> The thing is, since sched_entity has a member with cacheline alignment,

> the whole structure must have cacheline alignment, and this util_est

> _will_ start on a new line.


Right, I was not considering that "aligned" affects also the
start of the following data. Thus

> See also:

> 

> $ pahole -EC task_struct defconfig/kernel/sched/core.o

> 

> ...

> 		struct sched_avg {

>                                 /* typedef u64 */ long long unsigned int last_update_time; /*   576     8 */

>                                 /* typedef u64 */ long long unsigned int load_sum;       /*   584     8 */

>                                 /* typedef u32 */ unsigned int util_sum;                 /*   592     4 */

>                                 /* typedef u32 */ unsigned int period_contrib;           /*   596     4 */

>                                 long unsigned int load_avg;                              /*   600     8 */

>                                 long unsigned int util_avg;                              /*   608     8 */

>                         } avg; /*   576    40 */

>                         /* --- cacheline 6 boundary (384 bytes) --- */

>                 } se; /*   192   448 */

>                 /* --- cacheline 8 boundary (512 bytes) was 24 bytes ago --- */

>                 struct util_est {

>                         long unsigned int last;                                          /*   640     8 */

>                         long unsigned int ewma;                                          /*   648     8 */

>                 } util_est; /*   640    16 */

> ...

> 

> The thing is somewhat confused on which cacheline is which, but you'll

> see sched_avg landing at 576 (cacheline #9) and util_est at 640 (line

> #10).

> 

> > > >  	struct sched_rt_entity		rt;

> > > >  #ifdef CONFIG_CGROUP_SCHED

> > > >  	struct task_group		*sched_task_group;

> 

> > One goal was to keep util_est variables close to the util_avg used to

> > load the filter, for caches affinity sakes.

> > 

> > The other goal was to have util_est data only for Tasks and CPU's

> > RQ, thus avoiding unused data for TG's RQ and SE.

> > 

> > Unfortunately the first goal does not allow to achieve completely the

> > second and, you right, the solution looks a bit inconsistent.

> > 

> > Do you think we should better disregard cache proximity and move

> > util_est_runnable to rq?

> 

> proximity is likely important; I'd suggest moving util_est into

> sched_entity.


So, by moving util_est right after sched_avg, here is what we get (with some
lines to better highlight 64B boundaries):

                const struct sched_class  * sched_class;                                 /*   152     8 */
                struct sched_entity {
			[...]
		---[ Line 9 ]-------------------------------------------------------------------------------
                        struct sched_avg {
                                /* typedef u64 */ long long unsigned int last_update_time; /*   576     8 */
                                /* typedef u64 */ long long unsigned int load_sum;       /*   584     8 */
                                /* typedef u64 */ long long unsigned int runnable_load_sum; /*   592     8 */
                                /* typedef u32 */ unsigned int util_sum;                 /*   600     4 */
                                /* typedef u32 */ unsigned int period_contrib;           /*   604     4 */
                                long unsigned int load_avg;                              /*   608     8 */
                                long unsigned int runnable_load_avg;                     /*   616     8 */
                                long unsigned int util_avg;                              /*   624     8 */
                        } avg; /*   576    56 */
                        /* --- cacheline 6 boundary (384 bytes) was 24 bytes ago --- */
                        struct util_est {
                                long unsigned int last;                                  /*   632     8 */
		---[ Line 10 ]------------------------------------------------------------------------------
                                long unsigned int ewma;                                  /*   640     8 */
                        } util_est; /*   632    16 */
                } se; /*   192   512 */
		---[ Line 11 ]------------------------------------------------------------------------------
                /* --- cacheline 9 boundary (576 bytes) was 24 bytes ago --- */
                struct sched_rt_entity {
                        struct list_head {
                                struct list_head * next;                                 /*   704     8 */
                                struct list_head * prev;                                 /*   712     8 */
                        } run_list; /*   704    16 */


As you can see we still end up with util_est spanning acrosss two cache and
even worst with an almost empty Line 10. The point is that sched_avg already
uses 56B... which leave just 8bytes left.

So, I can to move util_est there and use unsigned int for "last" and "ewma"
storage. This should fix the cache alignment but only until we do not add
other stuff to sched_avg.

BTW, should not be possible to use a similar "fasting" approach for load_avg
and runnable_load_avg? Given their range a u32 should be just good enough,
isn't it?

Cheers Patrick

-- 
#include <best/regards.h>

Patrick Bellasi

Patrick Bellasi Dec. 15, 2017, 12:14 p.m. UTC | #6

On 13-Dec 17:16, Peter Zijlstra wrote:
> On Tue, Dec 05, 2017 at 05:10:16PM +0000, Patrick Bellasi wrote:

> > +static inline void util_est_dequeue(struct task_struct *p, int flags)

> > +{

> > +	struct cfs_rq *cfs_rq = &task_rq(p)->cfs;

> > +	unsigned long util_last = task_util(p);

> > +	bool sleep = flags & DEQUEUE_SLEEP;

> > +	unsigned long ewma;

> > +	long util_est;

> > +

> > +	if (!sched_feat(UTIL_EST))

> > +		return;

> > +

> > +	/*

> > +	 * Update root cfs_rq's estimated utilization

> > +	 *

> > +	 * If *p is the last task then the root cfs_rq's estimated utilization

> > +	 * of a CPU is 0 by definition.

> > +	 *

> > +	 * Otherwise, in removing *p's util_est from its cfs_rq's

> > +	 * util_est_runnable we should account for cases where this last

> > +	 * activation of *p was longer then the previous ones.

> > +	 * Also in these cases we need to set 0 the estimated utilization for

> > +	 * the CPU.

> > +	 */

> > +	if (cfs_rq->nr_running > 0) {

> > +		util_est  = cfs_rq->util_est_runnable;

> > +		util_est -= task_util_est(p);

> > +		if (util_est < 0)

> > +			util_est = 0;

> > +		cfs_rq->util_est_runnable = util_est;

> > +	} else {

> > +		cfs_rq->util_est_runnable = 0;

> > +	}

> > +

> > +	/*

> > +	 * Skip update of task's estimated utilization when the task has not

> > +	 * yet completed an activation, e.g. being migrated.

> > +	 */

> > +	if (!sleep)

> > +		return;

> > +

> > +	/*

> > +	 * Skip update of task's estimated utilization when its EWMA is already

> > +	 * ~1% close to its last activation value.

> > +	 */

> > +	util_est = p->util_est.ewma;

> > +	if (abs(util_est - util_last) <= (SCHED_CAPACITY_SCALE / 100))

> > +		return;

> 

> Isn't that computation almost as expensive as the stuff you're trying to

> avoid?


Mmm... maybe slightly simpler. I'll profile it again but I remember
I've added it because it was slightly better on backbench.

This code at the end it's just a "sub" and a "compare to constant" and
it's likely to bail early for all "almost regular" tasks.

Are you worried about the branch overhead?

> > +	/*

> > +	 * Update Task's estimated utilization

> > +	 *

> > +	 * When *p completes an activation we can consolidate another sample

> > +	 * about the task size. This is done by storing the last PELT value

> > +	 * for this task and using this value to load another sample in the

> > +	 * exponential weighted moving average:

> > +	 *

> > +	 *      ewma(t) = w *  task_util(p) + (1 - w) ewma(t-1)

> > +	 *              = w *  task_util(p) + ewma(t-1) - w * ewma(t-1)

> > +	 *              = w * (task_util(p) + ewma(t-1) / w - ewma(t-1))

> > +	 *

> > +	 * Where 'w' is the weight of new samples, which is configured to be

> > +	 * 0.25, thus making w=1/4

> > +	 */

> > +	p->util_est.last = util_last;

> > +	ewma = p->util_est.ewma;

> > +	if (likely(ewma != 0)) {

> 

> Why special case 0? Yes it helps with the initial ramp-on, but would not

> an asymmetric IIR (with a consistent upward bias) be better?


Yes, maybe the fast ramp-up is not really necessary... I'll test it
without on some real use-cases and see if we really get any noticiable
benefit, otheriwse I'll remove it.

Thanks for pointing this out.

> > +		ewma   = util_last + (ewma << UTIL_EST_WEIGHT_SHIFT) - ewma;

> > +		ewma >>= UTIL_EST_WEIGHT_SHIFT;

> > +	} else {

> > +		ewma = util_last;

> > +	}

> > +	p->util_est.ewma = ewma;

> > +}


-- 
#include <best/regards.h>

Patrick Bellasi

Peter Zijlstra Dec. 15, 2017, 12:58 p.m. UTC | #7

On Fri, Dec 15, 2017 at 12:03:31PM +0000, Patrick Bellasi wrote:

> So, by moving util_est right after sched_avg, here is what we get (with some

> lines to better highlight 64B boundaries):

> 

>                 const struct sched_class  * sched_class;                                 /*   152     8 */

>                 struct sched_entity {

> 			[...]

> 		---[ Line 9 ]-------------------------------------------------------------------------------

>                         struct sched_avg {

>                                 /* typedef u64 */ long long unsigned int last_update_time; /*   576     8 */

>                                 /* typedef u64 */ long long unsigned int load_sum;       /*   584     8 */

>                                 /* typedef u64 */ long long unsigned int runnable_load_sum; /*   592     8 */

>                                 /* typedef u32 */ unsigned int util_sum;                 /*   600     4 */

>                                 /* typedef u32 */ unsigned int period_contrib;           /*   604     4 */

>                                 long unsigned int load_avg;                              /*   608     8 */

>                                 long unsigned int runnable_load_avg;                     /*   616     8 */

>                                 long unsigned int util_avg;                              /*   624     8 */

>                         } avg; /*   576    56 */

>                         /* --- cacheline 6 boundary (384 bytes) was 24 bytes ago --- */

>                         struct util_est {

>                                 long unsigned int last;                                  /*   632     8 */

> 		---[ Line 10 ]------------------------------------------------------------------------------

>                                 long unsigned int ewma;                                  /*   640     8 */

>                         } util_est; /*   632    16 */

>                 } se; /*   192   512 */

> 		---[ Line 11 ]------------------------------------------------------------------------------

>                 /* --- cacheline 9 boundary (576 bytes) was 24 bytes ago --- */

>                 struct sched_rt_entity {

>                         struct list_head {

>                                 struct list_head * next;                                 /*   704     8 */

>                                 struct list_head * prev;                                 /*   712     8 */

>                         } run_list; /*   704    16 */

> 

> 

> As you can see we still end up with util_est spanning acrosss two cache and

> even worst with an almost empty Line 10. The point is that sched_avg already

> uses 56B... which leave just 8bytes left.


Yes, that's unfortunate.

> So, I can to move util_est there and use unsigned int for "last" and "ewma"

> storage. This should fix the cache alignment but only until we do not add

> other stuff to sched_avg.

> 

> BTW, should not be possible to use a similar "fasting" approach for load_avg

> and runnable_load_avg? Given their range a u32 should be just good enough,

> isn't it?


Probably, I'd have to page all that stuff back in :/

Another issue is that for tasks load and runnable_load are the exact
same; I just never found a sensible way to collapse that.

Peter Zijlstra Dec. 15, 2017, 2:07 p.m. UTC | #8

On Fri, Dec 15, 2017 at 02:02:18PM +0000, Patrick Bellasi wrote:
> On 13-Dec 17:05, Peter Zijlstra wrote:

> > On Tue, Dec 05, 2017 at 05:10:16PM +0000, Patrick Bellasi wrote:

> > > +	if (cfs_rq->nr_running > 0) {

> > > +		util_est  = cfs_rq->util_est_runnable;

> > > +		util_est -= task_util_est(p);

> > > +		if (util_est < 0)

> > > +			util_est = 0;

> > > +		cfs_rq->util_est_runnable = util_est;

> > > +	} else {

> > 

> > I'm thinking that's an explicit load-store to avoid intermediate values

> > landing in cfs_rq->util_esp_runnable, right?

> 

> Was mainly to have an unsigned util_est for the following "sub"...

> 

> 

> > That would need READ_ONCE() / WRITE_ONCE() I think, without that the

> > compiler is free to munge the lot together.

> 

> ... do we still need the {READ,WRITE}_ONCE() in this case?

> I guess adding them however does not hurts.


I think the compiler is free to munge it into something like:

	cfs_rq->util_est_runnable -= task_util_est();
	if (cfs_rq->util_est_runnable < 0)
		cfs_rq->util_est_runnable = 0

and its a fairly simple optimization at that, it just needs to observe
that util_est is an alias for cfs_rq->util_est_runnable.

Using the volatile load/store completely destroys that optimization
though, so yes, I'd say its definitely needed.

Patrick Bellasi Dec. 15, 2017, 3:22 p.m. UTC | #9

On 15-Dec 15:07, Peter Zijlstra wrote:
> On Fri, Dec 15, 2017 at 02:02:18PM +0000, Patrick Bellasi wrote:

> > On 13-Dec 17:05, Peter Zijlstra wrote:

> > > On Tue, Dec 05, 2017 at 05:10:16PM +0000, Patrick Bellasi wrote:

> > > > +	if (cfs_rq->nr_running > 0) {

> > > > +		util_est  = cfs_rq->util_est_runnable;

> > > > +		util_est -= task_util_est(p);

> > > > +		if (util_est < 0)

> > > > +			util_est = 0;

> > > > +		cfs_rq->util_est_runnable = util_est;

> > > > +	} else {

> > > 

> > > I'm thinking that's an explicit load-store to avoid intermediate values

> > > landing in cfs_rq->util_esp_runnable, right?

> > 

> > Was mainly to have an unsigned util_est for the following "sub"...

> > 

> > 

> > > That would need READ_ONCE() / WRITE_ONCE() I think, without that the

> > > compiler is free to munge the lot together.

> > 

> > ... do we still need the {READ,WRITE}_ONCE() in this case?

> > I guess adding them however does not hurts.

> 


This is just to better understand....

> I think the compiler is free to munge it into something like:

> 

> 	cfs_rq->util_est_runnable -= task_util_est();

> 	if (cfs_rq->util_est_runnable < 0)

> 		cfs_rq->util_est_runnable = 0

> 


I'm still confused, we have:

            long util_est
   unsigned long cfs_rq->util_est_runnable

The optimization above can potentially generate an overflow, isn't it?

> and its a fairly simple optimization at that, it just needs to observe

> that util_est is an alias for cfs_rq->util_est_runnable.


Since the first is signed and the last unsigned, can the compiler still
considered them an alias?

At least on ARM I would expect a load of an unsigned value, some
computations on "signed registers", and finally a store of an unsigned
value. This is what I get:


        if (cfs_rq->nr_running > 0) {
    51e4:       91020000        add     x0, x0, #0x80
    51e8:       b9401802        ldr     w2, [x0,#24]
    51ec:       340004a2        cbz     w2, 5280 <dequeue_task_fair+0xb20>
	// skip branch for nr_running == 0

        return max(p->util_est.ewma, p->util_est.last);
    51f0:       f9403ba2        ldr     x2, [x29,#112]
    51f4:       f9418044        ldr     x4, [x2,#768]
    51f8:       f9418443        ldr     x3, [x2,#776]
	// x3 := p->util_est.ewma
	// x4 := p->util_est.last

                util_est -= task_util_est(p);
    51fc:       f9405002        ldr     x2, [x0,#160]
	// x2 := cfs_rq->util_est_runnable

        return max(p->util_est.ewma, p->util_est.last);
    5200:       eb04007f        cmp     x3, x4
    5204:       9a842063        csel    x3, x3, x4, cs
	// x3 := task_util_est(p) := max(p->util_est.ewma, p->util_est.last)

                cfs_rq->util_est_runnable = util_est;
    5208:       eb030042        subs    x2, x2, x3
	// x2 := util_est -= task_util_est(p);

    520c:       9a9f5042        csel    x2, x2, xzr, pl
	// x2 := max(0, util_est)

    5210:       f9005002        str     x2, [x0,#160]
	// store back into cfs_rq->util_est_runnable


And by adding {READ,WRITE}_ONCE I still get the same code.

While, compiling for x86, I get two different versions, here is
the one without {READ,WRITE}_ONCE:

       if (cfs_rq->nr_running > 0) {
    3e3e:       8b 90 98 00 00 00       mov    0x98(%rax),%edx
    3e44:       85 d2                   test   %edx,%edx
    3e46:       0f 84 e0 00 00 00       je     3f2c <dequeue_task_fair+0xf9c>
                util_est  = cfs_rq->util_est_runnable;
                util_est -= task_util_est(p);
    3e4c:       48 8b 74 24 28          mov    0x28(%rsp),%rsi
    3e51:       48 8b 96 80 02 00 00    mov    0x280(%rsi),%rdx
    3e58:       48 39 96 88 02 00 00    cmp    %rdx,0x288(%rsi)
    3e5f:       48 0f 43 96 88 02 00    cmovae 0x288(%rsi),%rdx
    3e66:       00 
                if (util_est < 0)
                        util_est = 0;
                cfs_rq->util_est_runnable = util_est;
    3e67:       48 8b b0 20 01 00 00    mov    0x120(%rax),%rsi
    3e6e:       48 29 d6                sub    %rdx,%rsi
    3e71:       48 89 f2                mov    %rsi,%rdx
    3e74:       be 00 00 00 00          mov    $0x0,%esi
    3e79:       48 0f 48 d6             cmovs  %rsi,%rdx
    3e7d:       48 89 90 20 01 00 00    mov    %rdx,0x120(%rax)

but I'm not confident on "parsing it"...


> Using the volatile load/store completely destroys that optimization

> though, so yes, I'd say its definitely needed.


Ok, since it's definitively not harmful, I'll add it.

-- 
#include <best/regards.h>

Patrick Bellasi

Patrick Bellasi Dec. 15, 2017, 3:41 p.m. UTC | #10

On 15-Dec 13:53, Peter Zijlstra wrote:
> On Fri, Dec 15, 2017 at 12:14:17PM +0000, Patrick Bellasi wrote:

> > On 13-Dec 17:16, Peter Zijlstra wrote:

> 

> > > > +	/*

> > > > +	 * Skip update of task's estimated utilization when its EWMA is already

> > > > +	 * ~1% close to its last activation value.

> > > > +	 */

> > > > +	util_est = p->util_est.ewma;

> > > > +	if (abs(util_est - util_last) <= (SCHED_CAPACITY_SCALE / 100))

> > > > +		return;

> > > 

> > > Isn't that computation almost as expensive as the stuff you're trying to

> > > avoid?

> > 

> > Mmm... maybe slightly simpler. I'll profile it again but I remember

> > I've added it because it was slightly better on backbench.

> > 

> > This code at the end it's just a "sub" and a "compare to constant" and

> > it's likely to bail early for all "almost regular" tasks.

> > 

> > Are you worried about the branch overhead?

> 

> Its a subtract, a test for sign, a conditional branch on test, a negate,

> a subtract constant and another conditinoal branch.


Close enough, the actual code is:

        util_est = p->util_est.ewma;
    5218:       f9403ba3        ldr     x3, [x29,#112]
    521c:       f9418462        ldr     x2, [x3,#776]
        if (abs(util_est - util_last) <= (SCHED_CAPACITY_SCALE / 100))
    5220:       eb010040        subs    x0, x2, x1
    5224:       da805400        cneg    x0, x0, mi
    5228:       f100281f        cmp     x0, #0xa
    522c:       54fff9cd        b.le    5164 <dequeue_task_fair+0xa04>

> 

> Branch overhead certainly matters too.

> 

> > > > +	p->util_est.last = util_last;

> > > > +	ewma = p->util_est.ewma;

> > > > +	if (likely(ewma != 0)) {

> > > 

> > > Why special case 0? Yes it helps with the initial ramp-on, but would not

> > > an asymmetric IIR (with a consistent upward bias) be better?

> > 

> > Yes, maybe the fast ramp-up is not really necessary... I'll test it

> > without on some real use-cases and see if we really get any noticiable

> > benefit, otheriwse I'll remove it.

> > 

> > Thanks for pointing this out.

> > 

> > > > +		ewma   = util_last + (ewma << UTIL_EST_WEIGHT_SHIFT) - ewma;

> > > > +		ewma >>= UTIL_EST_WEIGHT_SHIFT;

> > > > +	} else {

> > > > +		ewma = util_last;

> > > > +	}

> > > > +	p->util_est.ewma = ewma;

> 

> And this, without the 0 case, is shift, an add, a subtract and another

> shift followed by a store.


Actual code:

       p->util_est.last = util_last;
    5230:       f9018061        str     x1, [x3,#768]
        if (likely(ewma != 0)) {
    5234:       b40000a2        cbz     x2, 5248 <dequeue_task_fair+0xae8>
                ewma   = util_last + (ewma << UTIL_EST_WEIGHT_SHIFT) - ewma;
    5238:       d37ef440        lsl     x0, x2, #2
    523c:       cb020002        sub     x2, x0, x2
    5240:       8b010041        add     x1, x2, x1
                ewma >>= UTIL_EST_WEIGHT_SHIFT;
    5244:       d342fc21        lsr     x1, x1, #2
        p->util_est.ewma = ewma;
    5248:       f9403ba0        ldr     x0, [x29,#112]
    524c:       f9018401        str     x1, [x0,#776]
    5250:       17ffffc5        b       5164 <dequeue_task_fair+0xa04>

> 

> Which is less branches and roughly similar arith ops, some of which can

> be done in parallel.

> 

> I suspect what you saw on the profile is the cacheline hit of the store,

> but I'm not sure.


Yes likely, looking at the two chunks above and considering the
removal of the 0 case, it's probably worth to remove the first check.

I'll give it a try again to measure hackbench overheads with the cache
alignment fixed.

Cheers Patrick

-- 
#include <best/regards.h>

Patrick Bellasi

Peter Zijlstra Dec. 20, 2017, 8:57 a.m. UTC | #11

On Fri, Dec 15, 2017 at 03:41:40PM +0000, Patrick Bellasi wrote:
> Close enough, the actual code is:

> 

>         util_est = p->util_est.ewma;

>     5218:       f9403ba3        ldr     x3, [x29,#112]

>     521c:       f9418462        ldr     x2, [x3,#776]

>         if (abs(util_est - util_last) <= (SCHED_CAPACITY_SCALE / 100))

>     5220:       eb010040        subs    x0, x2, x1

>     5224:       da805400        cneg    x0, x0, mi

>     5228:       f100281f        cmp     x0, #0xa

>     522c:       54fff9cd        b.le    5164 <dequeue_task_fair+0xa04>


Ah, that cneg instruction is cute; on x86 we end up with something like:

bool abs_test(long s)
{
        return abs(s) < 32;
}

        cmpl    $-31, %eax
        jl      .L107
        movq    -8(%rbp), %rax
        cmpl    $31, %eax
        jg      .L107
        movl    $1, %eax
        jmp     .L108
.L107:
        movl    $0, %eax
.L108:


But I figured you can actually do:

	abs(x) < y := (unsigned)(x + y - 1) < (2 * y - 1)

Which, if y is a constant, should result in nicer code, and it does for
x86:

        addq    $31, %rax
        cmpq    $62, %rax
        setbe   %al
        movzbl  %al, %eax

Just not measurably faster, I suppose because of all the dependencies :/

Peter Zijlstra Dec. 20, 2017, 9:02 a.m. UTC | #12

On Wed, Dec 20, 2017 at 09:57:47AM +0100, Peter Zijlstra wrote:
> On Fri, Dec 15, 2017 at 03:41:40PM +0000, Patrick Bellasi wrote:

> > Close enough, the actual code is:

> > 

> >         util_est = p->util_est.ewma;

> >     5218:       f9403ba3        ldr     x3, [x29,#112]

> >     521c:       f9418462        ldr     x2, [x3,#776]

> >         if (abs(util_est - util_last) <= (SCHED_CAPACITY_SCALE / 100))

> >     5220:       eb010040        subs    x0, x2, x1

> >     5224:       da805400        cneg    x0, x0, mi

> >     5228:       f100281f        cmp     x0, #0xa

> >     522c:       54fff9cd        b.le    5164 <dequeue_task_fair+0xa04>

> 

> Ah, that cneg instruction is cute; on x86 we end up with something like:

> 

> bool abs_test(long s)

> {

>         return abs(s) < 32;

> }

> 

>         cmpl    $-31, %eax

>         jl      .L107

>         movq    -8(%rbp), %rax

>         cmpl    $31, %eax

>         jg      .L107

>         movl    $1, %eax

>         jmp     .L108

> .L107:

>         movl    $0, %eax

> .L108:

> 

> 

> But I figured you can actually do:

> 

> 	abs(x) < y := (unsigned)(x + y - 1) < (2 * y - 1)

> 

> Which, if y is a constant, should result in nicer code, and it does for

> x86:

> 

>         addq    $31, %rax

>         cmpq    $62, %rax

>         setbe   %al

>         movzbl  %al, %eax

> 

> Just not measurably faster, I suppose because of all the dependencies :/


Ah no, it actually is, I'm an idiot and used 'long' for return value. If
I use bool we loose that last movzbl and we go from around 4.0 cycles
down to 3.4 cycles.

[v2,2/4] sched/fair: add util_est on top of PELT

Commit Message

Comments

Patch