[v5,05/10] cpufreq/schedutil: get max utilization

Message ID 1527253951-22709-6-git-send-email-vincent.guittot@linaro.org
State New
Headers show
Series
  • track CPU utilization
Related show

Commit Message

Vincent Guittot May 25, 2018, 1:12 p.m.
Now that we have both the dl class bandwidth requirement and the dl class
utilization, we can use the max of the 2 values when agregating the
utilization of the CPU.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

---
 kernel/sched/sched.h | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

-- 
2.7.4

Comments

Juri Lelli May 28, 2018, 10:12 a.m. | #1
Hi Vincent,

On 25/05/18 15:12, Vincent Guittot wrote:
> Now that we have both the dl class bandwidth requirement and the dl class

> utilization, we can use the max of the 2 values when agregating the

> utilization of the CPU.

> 

> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

> ---

>  kernel/sched/sched.h | 6 +++++-

>  1 file changed, 5 insertions(+), 1 deletion(-)

> 

> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h

> index 4526ba6..0eb07a8 100644

> --- a/kernel/sched/sched.h

> +++ b/kernel/sched/sched.h

> @@ -2194,7 +2194,11 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}

>  #ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL

>  static inline unsigned long cpu_util_dl(struct rq *rq)

>  {

> -	return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;

> +	unsigned long util = (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;


I'd be tempted to say the we actually want to cap to this one above
instead of using the max (as you are proposing below) or the
(theoretical) power reduction benefits of using DEADLINE for certain
tasks might vanish.

> +

> +	util = max_t(unsigned long, util, READ_ONCE(rq->avg_dl.util_avg));

> +

> +	return util;


Anyway, just a quick thought. I guess we should experiment with this a
bit. Now, I don't unfortunately have a Arm platform at hand for testing.
Claudio, Luca (now Cc-ed), would you be able to fire some tests with
this change?

Oh, adding Joel and Alessio as well that experimented with DEADLINE
lately.

Thanks,

- Juri
Vincent Guittot May 28, 2018, 2:57 p.m. | #2
Hi Juri,

On 28 May 2018 at 12:12, Juri Lelli <juri.lelli@redhat.com> wrote:
> Hi Vincent,

>

> On 25/05/18 15:12, Vincent Guittot wrote:

>> Now that we have both the dl class bandwidth requirement and the dl class

>> utilization, we can use the max of the 2 values when agregating the

>> utilization of the CPU.

>>

>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

>> ---

>>  kernel/sched/sched.h | 6 +++++-

>>  1 file changed, 5 insertions(+), 1 deletion(-)

>>

>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h

>> index 4526ba6..0eb07a8 100644

>> --- a/kernel/sched/sched.h

>> +++ b/kernel/sched/sched.h

>> @@ -2194,7 +2194,11 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}

>>  #ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL

>>  static inline unsigned long cpu_util_dl(struct rq *rq)

>>  {

>> -     return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;

>> +     unsigned long util = (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;

>

> I'd be tempted to say the we actually want to cap to this one above

> instead of using the max (as you are proposing below) or the

> (theoretical) power reduction benefits of using DEADLINE for certain

> tasks might vanish.


The problem that I'm facing is that the sched_entity bandwidth is
removed after the 0-lag time and the rq->dl.running_bw goes back to
zero but if the DL task has preempted a CFS task, the utilization of
the CFS task will be lower than reality and schedutil will set a lower
OPP whereas the CPU is always running. The example with a RT task
described in the cover letter can be run with a DL task and will give
similar results.
avg_dl.util_avg tracks the utilization of the rq seen by the scheduler
whereas rq->dl.running_bw gives the minimum to match DL requirement.

>

>> +

>> +     util = max_t(unsigned long, util, READ_ONCE(rq->avg_dl.util_avg));

>> +

>> +     return util;

>

> Anyway, just a quick thought. I guess we should experiment with this a

> bit. Now, I don't unfortunately have a Arm platform at hand for testing.

> Claudio, Luca (now Cc-ed), would you be able to fire some tests with

> this change?

>

> Oh, adding Joel and Alessio as well that experimented with DEADLINE

> lately.

>

> Thanks,

>

> - Juri
Juri Lelli May 28, 2018, 3:22 p.m. | #3
On 28/05/18 16:57, Vincent Guittot wrote:
> Hi Juri,

> 

> On 28 May 2018 at 12:12, Juri Lelli <juri.lelli@redhat.com> wrote:

> > Hi Vincent,

> >

> > On 25/05/18 15:12, Vincent Guittot wrote:

> >> Now that we have both the dl class bandwidth requirement and the dl class

> >> utilization, we can use the max of the 2 values when agregating the

> >> utilization of the CPU.

> >>

> >> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

> >> ---

> >>  kernel/sched/sched.h | 6 +++++-

> >>  1 file changed, 5 insertions(+), 1 deletion(-)

> >>

> >> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h

> >> index 4526ba6..0eb07a8 100644

> >> --- a/kernel/sched/sched.h

> >> +++ b/kernel/sched/sched.h

> >> @@ -2194,7 +2194,11 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}

> >>  #ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL

> >>  static inline unsigned long cpu_util_dl(struct rq *rq)

> >>  {

> >> -     return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;

> >> +     unsigned long util = (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;

> >

> > I'd be tempted to say the we actually want to cap to this one above

> > instead of using the max (as you are proposing below) or the

> > (theoretical) power reduction benefits of using DEADLINE for certain

> > tasks might vanish.

> 

> The problem that I'm facing is that the sched_entity bandwidth is

> removed after the 0-lag time and the rq->dl.running_bw goes back to

> zero but if the DL task has preempted a CFS task, the utilization of

> the CFS task will be lower than reality and schedutil will set a lower

> OPP whereas the CPU is always running. The example with a RT task

> described in the cover letter can be run with a DL task and will give

> similar results.

> avg_dl.util_avg tracks the utilization of the rq seen by the scheduler

> whereas rq->dl.running_bw gives the minimum to match DL requirement.


Mmm, I see. Note that I'm only being cautious, what you propose might
work OK, but it seems to me that we might lose some of the benefits of
running tasks with DEADLINE if we start selecting frequency as you
propose even when such tasks are running.

An idea might be to copy running_bw util into dl.util_avg when a DL task
goes to sleep, and then decay the latter as for RT contribution. What
you think?
Vincent Guittot May 28, 2018, 4:34 p.m. | #4
On 28 May 2018 at 17:22, Juri Lelli <juri.lelli@redhat.com> wrote:
> On 28/05/18 16:57, Vincent Guittot wrote:

>> Hi Juri,

>>

>> On 28 May 2018 at 12:12, Juri Lelli <juri.lelli@redhat.com> wrote:

>> > Hi Vincent,

>> >

>> > On 25/05/18 15:12, Vincent Guittot wrote:

>> >> Now that we have both the dl class bandwidth requirement and the dl class

>> >> utilization, we can use the max of the 2 values when agregating the

>> >> utilization of the CPU.

>> >>

>> >> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

>> >> ---

>> >>  kernel/sched/sched.h | 6 +++++-

>> >>  1 file changed, 5 insertions(+), 1 deletion(-)

>> >>

>> >> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h

>> >> index 4526ba6..0eb07a8 100644

>> >> --- a/kernel/sched/sched.h

>> >> +++ b/kernel/sched/sched.h

>> >> @@ -2194,7 +2194,11 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}

>> >>  #ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL

>> >>  static inline unsigned long cpu_util_dl(struct rq *rq)

>> >>  {

>> >> -     return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;

>> >> +     unsigned long util = (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;

>> >

>> > I'd be tempted to say the we actually want to cap to this one above

>> > instead of using the max (as you are proposing below) or the

>> > (theoretical) power reduction benefits of using DEADLINE for certain

>> > tasks might vanish.

>>

>> The problem that I'm facing is that the sched_entity bandwidth is

>> removed after the 0-lag time and the rq->dl.running_bw goes back to

>> zero but if the DL task has preempted a CFS task, the utilization of

>> the CFS task will be lower than reality and schedutil will set a lower

>> OPP whereas the CPU is always running. The example with a RT task

>> described in the cover letter can be run with a DL task and will give

>> similar results.

>> avg_dl.util_avg tracks the utilization of the rq seen by the scheduler

>> whereas rq->dl.running_bw gives the minimum to match DL requirement.

>

> Mmm, I see. Note that I'm only being cautious, what you propose might

> work OK, but it seems to me that we might lose some of the benefits of

> running tasks with DEADLINE if we start selecting frequency as you

> propose even when such tasks are running.


I see your point. Taking into account the number cfs running task to
choose between rq->dl.running_bw and avg_dl.util_avg could help

>

> An idea might be to copy running_bw util into dl.util_avg when a DL task

> goes to sleep, and then decay the latter as for RT contribution. What

> you think?


Not sure that this will work because you will overwrite the value each
time a DL task goes to sleep and the decay will mainly happen on the
update when last DL task goes to sleep which might not reflect what
has been used by DL tasks but only the requirement of the last running
DL task. This other interest of the PELT is to have an utilization
tracking which uses the same curve as CFS so the increase of
cfs_rq->avg.util_avg and the decrease of avg_dl.util_avg will
compensate themselves (or the opposite)
Joel Fernandes May 29, 2018, 5:08 a.m. | #5
On Mon, May 28, 2018 at 12:12:34PM +0200, Juri Lelli wrote:
[..] 
> > +

> > +	util = max_t(unsigned long, util, READ_ONCE(rq->avg_dl.util_avg));

> > +

> > +	return util;

> 

> Anyway, just a quick thought. I guess we should experiment with this a

> bit. Now, I don't unfortunately have a Arm platform at hand for testing.

> Claudio, Luca (now Cc-ed), would you be able to fire some tests with

> this change?

> 

> Oh, adding Joel and Alessio as well that experimented with DEADLINE

> lately.


I also feel that for power reasons, dl.util_avg shouldn't drive the OPP
beyond what the running bandwidth is, or atleast do that only if CFS tasks
are running and being preempted as you/Vincent mentioned in one of the
threads.

With our DL experiments, I didn't measure power but got it to a point where
the OPP is scaling correctly based on DL parameters. I think Alessio did
measure power at his setup but I can't recall now. Alessio?

thanks,

 - Joel
Juri Lelli May 29, 2018, 6:31 a.m. | #6
On 28/05/18 22:08, Joel Fernandes wrote:
> On Mon, May 28, 2018 at 12:12:34PM +0200, Juri Lelli wrote:

> [..] 

> > > +

> > > +	util = max_t(unsigned long, util, READ_ONCE(rq->avg_dl.util_avg));

> > > +

> > > +	return util;

> > 

> > Anyway, just a quick thought. I guess we should experiment with this a

> > bit. Now, I don't unfortunately have a Arm platform at hand for testing.

> > Claudio, Luca (now Cc-ed), would you be able to fire some tests with

> > this change?

> > 

> > Oh, adding Joel and Alessio as well that experimented with DEADLINE

> > lately.

> 

> I also feel that for power reasons, dl.util_avg shouldn't drive the OPP

> beyond what the running bandwidth is, or atleast do that only if CFS tasks

> are running and being preempted as you/Vincent mentioned in one of the

> threads.


It's however a bit awkward that we might be running at a higher OPP when
CFS tasks are running (even though they are of less priority). :/

> With our DL experiments, I didn't measure power but got it to a point where

> the OPP is scaling correctly based on DL parameters. I think Alessio did

> measure power at his setup but I can't recall now. Alessio?


I see. Thanks.
Vincent Guittot May 29, 2018, 6:48 a.m. | #7
On 29 May 2018 at 08:31, Juri Lelli <juri.lelli@redhat.com> wrote:
> On 28/05/18 22:08, Joel Fernandes wrote:

>> On Mon, May 28, 2018 at 12:12:34PM +0200, Juri Lelli wrote:

>> [..]

>> > > +

>> > > + util = max_t(unsigned long, util, READ_ONCE(rq->avg_dl.util_avg));

>> > > +

>> > > + return util;

>> >

>> > Anyway, just a quick thought. I guess we should experiment with this a

>> > bit. Now, I don't unfortunately have a Arm platform at hand for testing.

>> > Claudio, Luca (now Cc-ed), would you be able to fire some tests with

>> > this change?

>> >

>> > Oh, adding Joel and Alessio as well that experimented with DEADLINE

>> > lately.

>>

>> I also feel that for power reasons, dl.util_avg shouldn't drive the OPP

>> beyond what the running bandwidth is, or atleast do that only if CFS tasks

>> are running and being preempted as you/Vincent mentioned in one of the

>> threads.

>

> It's however a bit awkward that we might be running at a higher OPP when

> CFS tasks are running (even though they are of less priority). :/


Even if cfs task has lower priority that doesn't mean that we should
not take their needs into account.
In the same way, we run at max OPP as soon as a RT task is runnable

>

>> With our DL experiments, I didn't measure power but got it to a point where

>> the OPP is scaling correctly based on DL parameters. I think Alessio did

>> measure power at his setup but I can't recall now. Alessio?

>

> I see. Thanks.
Quentin Perret May 29, 2018, 8:40 a.m. | #8
Hi Vincent,

On Friday 25 May 2018 at 15:12:26 (+0200), Vincent Guittot wrote:
> Now that we have both the dl class bandwidth requirement and the dl class

> utilization, we can use the max of the 2 values when agregating the

> utilization of the CPU.

> 

> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

> ---

>  kernel/sched/sched.h | 6 +++++-

>  1 file changed, 5 insertions(+), 1 deletion(-)

> 

> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h

> index 4526ba6..0eb07a8 100644

> --- a/kernel/sched/sched.h

> +++ b/kernel/sched/sched.h

> @@ -2194,7 +2194,11 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}

>  #ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL

>  static inline unsigned long cpu_util_dl(struct rq *rq)

>  {

> -	return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;

> +	unsigned long util = (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;

> +

> +	util = max_t(unsigned long, util, READ_ONCE(rq->avg_dl.util_avg));


Would it make sense to use a UTIL_EST version of that signal here ? I
don't think that would make sense for the RT class with your patch-set
since you only really use the blocked part of the signal for RT IIUC,
but would that work for DL ?
> +

> +	return util;

>  }

>  

>  static inline unsigned long cpu_util_cfs(struct rq *rq)

> -- 

> 2.7.4

>
Juri Lelli May 29, 2018, 9:47 a.m. | #9
On 29/05/18 08:48, Vincent Guittot wrote:
> On 29 May 2018 at 08:31, Juri Lelli <juri.lelli@redhat.com> wrote:

> > On 28/05/18 22:08, Joel Fernandes wrote:

> >> On Mon, May 28, 2018 at 12:12:34PM +0200, Juri Lelli wrote:

> >> [..]

> >> > > +

> >> > > + util = max_t(unsigned long, util, READ_ONCE(rq->avg_dl.util_avg));

> >> > > +

> >> > > + return util;

> >> >

> >> > Anyway, just a quick thought. I guess we should experiment with this a

> >> > bit. Now, I don't unfortunately have a Arm platform at hand for testing.

> >> > Claudio, Luca (now Cc-ed), would you be able to fire some tests with

> >> > this change?

> >> >

> >> > Oh, adding Joel and Alessio as well that experimented with DEADLINE

> >> > lately.

> >>

> >> I also feel that for power reasons, dl.util_avg shouldn't drive the OPP

> >> beyond what the running bandwidth is, or atleast do that only if CFS tasks

> >> are running and being preempted as you/Vincent mentioned in one of the

> >> threads.

> >

> > It's however a bit awkward that we might be running at a higher OPP when

> > CFS tasks are running (even though they are of less priority). :/

> 

> Even if cfs task has lower priority that doesn't mean that we should

> not take their needs into account.

> In the same way, we run at max OPP as soon as a RT task is runnable


Sure. What I fear is a little CFS utilization generating spikes because
dl.util_avg became big when running DL tasks. Not sure that can happen
though because such DL tasks should be throttled anyway.
Juri Lelli May 29, 2018, 9:52 a.m. | #10
On 29/05/18 09:40, Quentin Perret wrote:
> Hi Vincent,

> 

> On Friday 25 May 2018 at 15:12:26 (+0200), Vincent Guittot wrote:

> > Now that we have both the dl class bandwidth requirement and the dl class

> > utilization, we can use the max of the 2 values when agregating the

> > utilization of the CPU.

> > 

> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

> > ---

> >  kernel/sched/sched.h | 6 +++++-

> >  1 file changed, 5 insertions(+), 1 deletion(-)

> > 

> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h

> > index 4526ba6..0eb07a8 100644

> > --- a/kernel/sched/sched.h

> > +++ b/kernel/sched/sched.h

> > @@ -2194,7 +2194,11 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}

> >  #ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL

> >  static inline unsigned long cpu_util_dl(struct rq *rq)

> >  {

> > -	return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;

> > +	unsigned long util = (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;

> > +

> > +	util = max_t(unsigned long, util, READ_ONCE(rq->avg_dl.util_avg));

> 

> Would it make sense to use a UTIL_EST version of that signal here ? I

> don't think that would make sense for the RT class with your patch-set

> since you only really use the blocked part of the signal for RT IIUC,

> but would that work for DL ?


Well, UTIL_EST for DL looks pretty much what we already do by computing
utilization based on dl.running_bw. That's why I was thinking of using
that as a starting point for dl.util_avg decay phase.
Quentin Perret May 30, 2018, 8:37 a.m. | #11
On Tuesday 29 May 2018 at 11:52:03 (+0200), Juri Lelli wrote:
> On 29/05/18 09:40, Quentin Perret wrote:

> > Hi Vincent,

> > 

> > On Friday 25 May 2018 at 15:12:26 (+0200), Vincent Guittot wrote:

> > > Now that we have both the dl class bandwidth requirement and the dl class

> > > utilization, we can use the max of the 2 values when agregating the

> > > utilization of the CPU.

> > > 

> > > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

> > > ---

> > >  kernel/sched/sched.h | 6 +++++-

> > >  1 file changed, 5 insertions(+), 1 deletion(-)

> > > 

> > > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h

> > > index 4526ba6..0eb07a8 100644

> > > --- a/kernel/sched/sched.h

> > > +++ b/kernel/sched/sched.h

> > > @@ -2194,7 +2194,11 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}

> > >  #ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL

> > >  static inline unsigned long cpu_util_dl(struct rq *rq)

> > >  {

> > > -	return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;

> > > +	unsigned long util = (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;

> > > +

> > > +	util = max_t(unsigned long, util, READ_ONCE(rq->avg_dl.util_avg));

> > 

> > Would it make sense to use a UTIL_EST version of that signal here ? I

> > don't think that would make sense for the RT class with your patch-set

> > since you only really use the blocked part of the signal for RT IIUC,

> > but would that work for DL ?

> 

> Well, UTIL_EST for DL looks pretty much what we already do by computing

> utilization based on dl.running_bw. That's why I was thinking of using

> that as a starting point for dl.util_avg decay phase.


Hmmm I see your point, but running_bw and the util_avg are fundamentally
different ... I mean, the util_avg doesn't know about the period, which is
an issue in itself I guess ...

If you have a long running DL task (say 100ms runtime) with a long period
(say 1s), the running_bw should represent ~1/10 of the CPU capacity, but
the util_avg can go quite high, which means that you might end up
executing this task at max OPP. So if we really want to drive OPPs like
that for deadline, a util_est-like version of this util_avg signal
should help. Now, you can also argue that going to max OPP for a task
that _we know_ uses 1/10 of the CPU capacity isn't right ...
Juri Lelli May 30, 2018, 8:51 a.m. | #12
On 30/05/18 09:37, Quentin Perret wrote:
> On Tuesday 29 May 2018 at 11:52:03 (+0200), Juri Lelli wrote:

> > On 29/05/18 09:40, Quentin Perret wrote:

> > > Hi Vincent,

> > > 

> > > On Friday 25 May 2018 at 15:12:26 (+0200), Vincent Guittot wrote:

> > > > Now that we have both the dl class bandwidth requirement and the dl class

> > > > utilization, we can use the max of the 2 values when agregating the

> > > > utilization of the CPU.

> > > > 

> > > > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

> > > > ---

> > > >  kernel/sched/sched.h | 6 +++++-

> > > >  1 file changed, 5 insertions(+), 1 deletion(-)

> > > > 

> > > > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h

> > > > index 4526ba6..0eb07a8 100644

> > > > --- a/kernel/sched/sched.h

> > > > +++ b/kernel/sched/sched.h

> > > > @@ -2194,7 +2194,11 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}

> > > >  #ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL

> > > >  static inline unsigned long cpu_util_dl(struct rq *rq)

> > > >  {

> > > > -	return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;

> > > > +	unsigned long util = (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;

> > > > +

> > > > +	util = max_t(unsigned long, util, READ_ONCE(rq->avg_dl.util_avg));

> > > 

> > > Would it make sense to use a UTIL_EST version of that signal here ? I

> > > don't think that would make sense for the RT class with your patch-set

> > > since you only really use the blocked part of the signal for RT IIUC,

> > > but would that work for DL ?

> > 

> > Well, UTIL_EST for DL looks pretty much what we already do by computing

> > utilization based on dl.running_bw. That's why I was thinking of using

> > that as a starting point for dl.util_avg decay phase.

> 

> Hmmm I see your point, but running_bw and the util_avg are fundamentally

> different ... I mean, the util_avg doesn't know about the period, which is

> an issue in itself I guess ...

> 

> If you have a long running DL task (say 100ms runtime) with a long period

> (say 1s), the running_bw should represent ~1/10 of the CPU capacity, but

> the util_avg can go quite high, which means that you might end up

> executing this task at max OPP. So if we really want to drive OPPs like

> that for deadline, a util_est-like version of this util_avg signal

> should help. Now, you can also argue that going to max OPP for a task

> that _we know_ uses 1/10 of the CPU capacity isn't right ...


Yep, that's my point. :)
Patrick Bellasi May 31, 2018, 10:27 a.m. | #13
Hi Vincent, Juri,

On 28-May 18:34, Vincent Guittot wrote:
> On 28 May 2018 at 17:22, Juri Lelli <juri.lelli@redhat.com> wrote:

> > On 28/05/18 16:57, Vincent Guittot wrote:

> >> Hi Juri,

> >>

> >> On 28 May 2018 at 12:12, Juri Lelli <juri.lelli@redhat.com> wrote:

> >> > Hi Vincent,

> >> >

> >> > On 25/05/18 15:12, Vincent Guittot wrote:

> >> >> Now that we have both the dl class bandwidth requirement and the dl class

> >> >> utilization, we can use the max of the 2 values when agregating the

> >> >> utilization of the CPU.

> >> >>

> >> >> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

> >> >> ---

> >> >>  kernel/sched/sched.h | 6 +++++-

> >> >>  1 file changed, 5 insertions(+), 1 deletion(-)

> >> >>

> >> >> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h

> >> >> index 4526ba6..0eb07a8 100644

> >> >> --- a/kernel/sched/sched.h

> >> >> +++ b/kernel/sched/sched.h

> >> >> @@ -2194,7 +2194,11 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}

> >> >>  #ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL

> >> >>  static inline unsigned long cpu_util_dl(struct rq *rq)

> >> >>  {

> >> >> -     return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;

> >> >> +     unsigned long util = (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;

> >> >

> >> > I'd be tempted to say the we actually want to cap to this one above

> >> > instead of using the max (as you are proposing below) or the

> >> > (theoretical) power reduction benefits of using DEADLINE for certain

> >> > tasks might vanish.

> >>

> >> The problem that I'm facing is that the sched_entity bandwidth is

> >> removed after the 0-lag time and the rq->dl.running_bw goes back to

> >> zero but if the DL task has preempted a CFS task, the utilization of

> >> the CFS task will be lower than reality and schedutil will set a lower

> >> OPP whereas the CPU is always running.


With UTIL_EST enabled I don't expect an OPP reduction below the
expected utilization of a CFS task.

IOW, when a periodic CFS task is preempted by a DL one, what we use
for OPP selection once the DL task is over is still the estimated
utilization for the CFS task itself. Thus, schedutil will eventually
(since we have quite conservative down scaling thresholds) go down to
the right OPP to serve that task.

> >> The example with a RT task described in the cover letter can be

> >> run with a DL task and will give similar results.


In the cover letter you says:

   A rt-app use case which creates an always running cfs thread and a
   rt threads that wakes up periodically with both threads pinned on
   same CPU, show lot of frequency switches of the CPU whereas the CPU
   never goes idles during the test.

I would say that's a quite specific corner case where your always
running CFS task has never accumulated a util_est sample.

Do we really have these cases in real systems?

Otherwise, it seems to me that we are trying to solve quite specific
corner cases by adding a not negligible level of "complexity".

Moreover, I also have the impression that we can fix these
use-cases by:

  - improving the way we accumulate samples in util_est
    i.e. by discarding preemption time

  - maybe by improving the utilization aggregation in schedutil to
    better understand DL requirements
    i.e. a 10% utilization with a 100ms running time is way different
    then the same utilization with a 1ms running time


-- 
#include <best/regards.h>

Patrick Bellasi
Vincent Guittot May 31, 2018, 1:02 p.m. | #14
On 31 May 2018 at 12:27, Patrick Bellasi <patrick.bellasi@arm.com> wrote:
>

> Hi Vincent, Juri,

>

> On 28-May 18:34, Vincent Guittot wrote:

>> On 28 May 2018 at 17:22, Juri Lelli <juri.lelli@redhat.com> wrote:

>> > On 28/05/18 16:57, Vincent Guittot wrote:

>> >> Hi Juri,

>> >>

>> >> On 28 May 2018 at 12:12, Juri Lelli <juri.lelli@redhat.com> wrote:

>> >> > Hi Vincent,

>> >> >

>> >> > On 25/05/18 15:12, Vincent Guittot wrote:

>> >> >> Now that we have both the dl class bandwidth requirement and the dl class

>> >> >> utilization, we can use the max of the 2 values when agregating the

>> >> >> utilization of the CPU.

>> >> >>

>> >> >> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

>> >> >> ---

>> >> >>  kernel/sched/sched.h | 6 +++++-

>> >> >>  1 file changed, 5 insertions(+), 1 deletion(-)

>> >> >>

>> >> >> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h

>> >> >> index 4526ba6..0eb07a8 100644

>> >> >> --- a/kernel/sched/sched.h

>> >> >> +++ b/kernel/sched/sched.h

>> >> >> @@ -2194,7 +2194,11 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}

>> >> >>  #ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL

>> >> >>  static inline unsigned long cpu_util_dl(struct rq *rq)

>> >> >>  {

>> >> >> -     return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;

>> >> >> +     unsigned long util = (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;

>> >> >

>> >> > I'd be tempted to say the we actually want to cap to this one above

>> >> > instead of using the max (as you are proposing below) or the

>> >> > (theoretical) power reduction benefits of using DEADLINE for certain

>> >> > tasks might vanish.

>> >>

>> >> The problem that I'm facing is that the sched_entity bandwidth is

>> >> removed after the 0-lag time and the rq->dl.running_bw goes back to

>> >> zero but if the DL task has preempted a CFS task, the utilization of

>> >> the CFS task will be lower than reality and schedutil will set a lower

>> >> OPP whereas the CPU is always running.

>

> With UTIL_EST enabled I don't expect an OPP reduction below the

> expected utilization of a CFS task.


I'm not sure to fully catch what you mean but all tests that I ran,
are using util_est (which is enable by default  if i'm not wrong). So
all OPP drops that have been observed, were with util_est

>

> IOW, when a periodic CFS task is preempted by a DL one, what we use

> for OPP selection once the DL task is over is still the estimated

> utilization for the CFS task itself. Thus, schedutil will eventually

> (since we have quite conservative down scaling thresholds) go down to

> the right OPP to serve that task.

>

>> >> The example with a RT task described in the cover letter can be

>> >> run with a DL task and will give similar results.

>

> In the cover letter you says:

>

>    A rt-app use case which creates an always running cfs thread and a

>    rt threads that wakes up periodically with both threads pinned on

>    same CPU, show lot of frequency switches of the CPU whereas the CPU

>    never goes idles during the test.

>

> I would say that's a quite specific corner case where your always

> running CFS task has never accumulated a util_est sample.

>

> Do we really have these cases in real systems?


My example is voluntary an extreme one because it's easier to
highlight the problem

>

> Otherwise, it seems to me that we are trying to solve quite specific

> corner cases by adding a not negligible level of "complexity".


By complexity, do you mean :

Taking into account the number cfs running task to choose between
rq->dl.running_bw and avg_dl.util_avg

I'm preparing a patchset that will provide the cfs waiting time in
addition to dl/rt util_avg for almost no additional cost. I will try
to sent the proposal later today

>

> Moreover, I also have the impression that we can fix these

> use-cases by:

>

>   - improving the way we accumulate samples in util_est

>     i.e. by discarding preemption time

>

>   - maybe by improving the utilization aggregation in schedutil to

>     better understand DL requirements

>     i.e. a 10% utilization with a 100ms running time is way different

>     then the same utilization with a 1ms running time

>

>

> --

> #include <best/regards.h>

>

> Patrick Bellasi
Vincent Guittot June 1, 2018, 1:53 p.m. | #15
Le Thursday 31 May 2018 à 15:02:04 (+0200), Vincent Guittot a écrit :
> On 31 May 2018 at 12:27, Patrick Bellasi <patrick.bellasi@arm.com> wrote:

> >

> > Hi Vincent, Juri,

> >

> > On 28-May 18:34, Vincent Guittot wrote:

> >> On 28 May 2018 at 17:22, Juri Lelli <juri.lelli@redhat.com> wrote:

> >> > On 28/05/18 16:57, Vincent Guittot wrote:

> >> >> Hi Juri,

> >> >>

> >> >> On 28 May 2018 at 12:12, Juri Lelli <juri.lelli@redhat.com> wrote:

> >> >> > Hi Vincent,

> >> >> >

> >> >> > On 25/05/18 15:12, Vincent Guittot wrote:

> >> >> >> Now that we have both the dl class bandwidth requirement and the dl class

> >> >> >> utilization, we can use the max of the 2 values when agregating the

> >> >> >> utilization of the CPU.

> >> >> >>

> >> >> >> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

> >> >> >> ---

> >> >> >>  kernel/sched/sched.h | 6 +++++-

> >> >> >>  1 file changed, 5 insertions(+), 1 deletion(-)

> >> >> >>

> >> >> >> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h

> >> >> >> index 4526ba6..0eb07a8 100644

> >> >> >> --- a/kernel/sched/sched.h

> >> >> >> +++ b/kernel/sched/sched.h

> >> >> >> @@ -2194,7 +2194,11 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}

> >> >> >>  #ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL

> >> >> >>  static inline unsigned long cpu_util_dl(struct rq *rq)

> >> >> >>  {

> >> >> >> -     return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;

> >> >> >> +     unsigned long util = (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;

> >> >> >

> >> >> > I'd be tempted to say the we actually want to cap to this one above

> >> >> > instead of using the max (as you are proposing below) or the

> >> >> > (theoretical) power reduction benefits of using DEADLINE for certain

> >> >> > tasks might vanish.

> >> >>

> >> >> The problem that I'm facing is that the sched_entity bandwidth is

> >> >> removed after the 0-lag time and the rq->dl.running_bw goes back to

> >> >> zero but if the DL task has preempted a CFS task, the utilization of

> >> >> the CFS task will be lower than reality and schedutil will set a lower

> >> >> OPP whereas the CPU is always running.

> >

> > With UTIL_EST enabled I don't expect an OPP reduction below the

> > expected utilization of a CFS task.

> 

> I'm not sure to fully catch what you mean but all tests that I ran,

> are using util_est (which is enable by default  if i'm not wrong). So

> all OPP drops that have been observed, were with util_est

> 

> >

> > IOW, when a periodic CFS task is preempted by a DL one, what we use

> > for OPP selection once the DL task is over is still the estimated

> > utilization for the CFS task itself. Thus, schedutil will eventually

> > (since we have quite conservative down scaling thresholds) go down to

> > the right OPP to serve that task.

> >

> >> >> The example with a RT task described in the cover letter can be

> >> >> run with a DL task and will give similar results.

> >

> > In the cover letter you says:

> >

> >    A rt-app use case which creates an always running cfs thread and a

> >    rt threads that wakes up periodically with both threads pinned on

> >    same CPU, show lot of frequency switches of the CPU whereas the CPU

> >    never goes idles during the test.

> >

> > I would say that's a quite specific corner case where your always

> > running CFS task has never accumulated a util_est sample.

> >

> > Do we really have these cases in real systems?

> 

> My example is voluntary an extreme one because it's easier to

> highlight the problem

> 

> >

> > Otherwise, it seems to me that we are trying to solve quite specific

> > corner cases by adding a not negligible level of "complexity".

> 

> By complexity, do you mean :

> 

> Taking into account the number cfs running task to choose between

> rq->dl.running_bw and avg_dl.util_avg

> 

> I'm preparing a patchset that will provide the cfs waiting time in

> addition to dl/rt util_avg for almost no additional cost. I will try

> to sent the proposal later today



The code below adds the tracking of the waiting level of cfs tasks because of
rt/dl preemption. This waiting time can then be used when selecting an OPP
instead of the dl util_avg which could become higher than dl bandwidth with
"long" runtime

We need only one new call for the 1st cfs task that is enqueued to get these additional metrics
the call to arch_scale_cpu_capacity() can be removed once the later will be
taken into account when computing the load (which scales only with freq
currently)

For rt task, we must keep to take into account util_avg to have an idea of the
rt level on the cpu which is given by the badnwodth for dl

---
 kernel/sched/fair.c  | 27 +++++++++++++++++++++++++++
 kernel/sched/pelt.c  |  8 ++++++--
 kernel/sched/sched.h |  4 +++-
 3 files changed, 36 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index eac1f9a..1682ea7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5148,6 +5148,30 @@ static inline void hrtick_update(struct rq *rq)
 }
 #endif
 
+static inline void update_cfs_wait_util_avg(struct rq *rq)
+{
+	/*
+	 * If cfs is already enqueued, we don't have anything to do because
+	 * we already updated the non waiting time
+	 */
+	if (rq->cfs.h_nr_running)
+		return;
+
+	/*
+	 * If rt is running, we update the non wait time before increasing
+	 * cfs.h_nr_running)
+	 */
+	if (rq->curr->sched_class == &rt_sched_class)
+		update_rt_rq_load_avg(rq_clock_task(rq), rq, 1);
+
+	/*
+	 * If dl is running, we update the non time before increasing
+	 * cfs.h_nr_running)
+	 */
+	if (rq->curr->sched_class == &dl_sched_class)
+		update_dl_rq_load_avg(rq_clock_task(rq), rq, 1);
+}
+
 /*
  * The enqueue_task method is called before nr_running is
  * increased. Here we update the fair scheduling stats and
@@ -5159,6 +5183,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	struct cfs_rq *cfs_rq;
 	struct sched_entity *se = &p->se;
 
+	/* Update the tracking of waiting time */
+	update_cfs_wait_util_avg(rq);
+
 	/*
 	 * The code below (indirectly) updates schedutil which looks at
 	 * the cfs_rq utilization to select a frequency.
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index a559a53..ef8905e 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -322,9 +322,11 @@ int __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq)
 
 int update_rt_rq_load_avg(u64 now, struct rq *rq, int running)
 {
+	unsigned long waiting = rq->cfs.h_nr_running ? arch_scale_cpu_capacity(NULL, rq->cpu) : 0;
+
 	if (___update_load_sum(now, rq->cpu, &rq->avg_rt,
 				running,
-				running,
+				waiting,
 				running)) {
 
 		___update_load_avg(&rq->avg_rt, 1, 1);
@@ -345,9 +347,11 @@ int update_rt_rq_load_avg(u64 now, struct rq *rq, int running)
 
 int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
 {
+	unsigned long waiting = rq->cfs.h_nr_running ? arch_scale_cpu_capacity(NULL, rq->cpu) : 0;
+
 	if (___update_load_sum(now, rq->cpu, &rq->avg_dl,
 				running,
-				running,
+				waiting,
 				running)) {
 
 		___update_load_avg(&rq->avg_dl, 1, 1);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0ea94de..9f72a05 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2184,7 +2184,9 @@ static inline unsigned long cpu_util_dl(struct rq *rq)
 {
 	unsigned long util = (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
 
-	util = max_t(unsigned long, util, READ_ONCE(rq->avg_dl.util_avg));
+	util = max_t(unsigned long, util,
+			     READ_ONCE(rq->avg_dl.runnable_load_avg));
+
 	trace_printk("cpu_util_dl cpu%d %u %lu %llu", rq->cpu,
 						   rq->cfs.h_nr_running,
 						   READ_ONCE(rq->avg_dl.util_avg),
-- 
2.7.4



> 

> >

> > Moreover, I also have the impression that we can fix these

> > use-cases by:

> >

> >   - improving the way we accumulate samples in util_est

> >     i.e. by discarding preemption time

> >

> >   - maybe by improving the utilization aggregation in schedutil to

> >     better understand DL requirements

> >     i.e. a 10% utilization with a 100ms running time is way different

> >     then the same utilization with a 1ms running time

> >

> >

> > --

> > #include <best/regards.h>

> >

> > Patrick Bellasi
Joel Fernandes June 1, 2018, 5:45 p.m. | #16
On Fri, Jun 01, 2018 at 03:53:07PM +0200, Vincent Guittot wrote:
> > >> >> The example with a RT task described in the cover letter can be

> > >> >> run with a DL task and will give similar results.

> > >

> > > In the cover letter you says:

> > >

> > >    A rt-app use case which creates an always running cfs thread and a

> > >    rt threads that wakes up periodically with both threads pinned on

> > >    same CPU, show lot of frequency switches of the CPU whereas the CPU

> > >    never goes idles during the test.

> > >

> > > I would say that's a quite specific corner case where your always

> > > running CFS task has never accumulated a util_est sample.

> > >

> > > Do we really have these cases in real systems?

> > 

> > My example is voluntary an extreme one because it's easier to

> > highlight the problem

> > 

> > >

> > > Otherwise, it seems to me that we are trying to solve quite specific

> > > corner cases by adding a not negligible level of "complexity".

> > 

> > By complexity, do you mean :

> > 

> > Taking into account the number cfs running task to choose between

> > rq->dl.running_bw and avg_dl.util_avg

> > 

> > I'm preparing a patchset that will provide the cfs waiting time in

> > addition to dl/rt util_avg for almost no additional cost. I will try

> > to sent the proposal later today

> 

> 

> The code below adds the tracking of the waiting level of cfs tasks because of

> rt/dl preemption. This waiting time can then be used when selecting an OPP

> instead of the dl util_avg which could become higher than dl bandwidth with

> "long" runtime

> 

> We need only one new call for the 1st cfs task that is enqueued to get these additional metrics

> the call to arch_scale_cpu_capacity() can be removed once the later will be

> taken into account when computing the load (which scales only with freq

> currently)

> 

> For rt task, we must keep to take into account util_avg to have an idea of the

> rt level on the cpu which is given by the badnwodth for dl

> 

> ---

>  kernel/sched/fair.c  | 27 +++++++++++++++++++++++++++

>  kernel/sched/pelt.c  |  8 ++++++--

>  kernel/sched/sched.h |  4 +++-

>  3 files changed, 36 insertions(+), 3 deletions(-)

> 

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c

> index eac1f9a..1682ea7 100644

> --- a/kernel/sched/fair.c

> +++ b/kernel/sched/fair.c

> @@ -5148,6 +5148,30 @@ static inline void hrtick_update(struct rq *rq)

>  }

>  #endif

>  

> +static inline void update_cfs_wait_util_avg(struct rq *rq)

> +{

> +	/*

> +	 * If cfs is already enqueued, we don't have anything to do because

> +	 * we already updated the non waiting time

> +	 */

> +	if (rq->cfs.h_nr_running)

> +		return;

> +

> +	/*

> +	 * If rt is running, we update the non wait time before increasing

> +	 * cfs.h_nr_running)

> +	 */

> +	if (rq->curr->sched_class == &rt_sched_class)

> +		update_rt_rq_load_avg(rq_clock_task(rq), rq, 1);

> +

> +	/*

> +	 * If dl is running, we update the non time before increasing

> +	 * cfs.h_nr_running)

> +	 */

> +	if (rq->curr->sched_class == &dl_sched_class)

> +		update_dl_rq_load_avg(rq_clock_task(rq), rq, 1);

> +}

> +


Please correct me if I'm wrong but the CFS preemption-decay happens in
set_next_entity -> update_load_avg when the CFS task is scheduled again after
the preemption. Then can we not fix this issue by doing our UTIL_EST magic
from set_next_entity? But yeah probably we need to be careful with overhead..

IMO I feel its overkill to account dl_avg when we already have DL's running
bandwidth we can use. I understand it may be too instanenous, but perhaps we
can fix CFS's problems within CFS itself and not have to do this kind of
extra external accounting ?

I also feel its better if we don't have to call update_{rt,dl}_rq_load_avg
from within CFS class as being done above.

thanks,

 - Joel
Vincent Guittot June 4, 2018, 6:41 a.m. | #17
On 1 June 2018 at 19:45, Joel Fernandes <joelaf@google.com> wrote:
> On Fri, Jun 01, 2018 at 03:53:07PM +0200, Vincent Guittot wrote:

>> > >> >> The example with a RT task described in the cover letter can be

>> > >> >> run with a DL task and will give similar results.

>> > >

>> > > In the cover letter you says:

>> > >

>> > >    A rt-app use case which creates an always running cfs thread and a

>> > >    rt threads that wakes up periodically with both threads pinned on

>> > >    same CPU, show lot of frequency switches of the CPU whereas the CPU

>> > >    never goes idles during the test.

>> > >

>> > > I would say that's a quite specific corner case where your always

>> > > running CFS task has never accumulated a util_est sample.

>> > >

>> > > Do we really have these cases in real systems?

>> >

>> > My example is voluntary an extreme one because it's easier to

>> > highlight the problem

>> >

>> > >

>> > > Otherwise, it seems to me that we are trying to solve quite specific

>> > > corner cases by adding a not negligible level of "complexity".

>> >

>> > By complexity, do you mean :

>> >

>> > Taking into account the number cfs running task to choose between

>> > rq->dl.running_bw and avg_dl.util_avg

>> >

>> > I'm preparing a patchset that will provide the cfs waiting time in

>> > addition to dl/rt util_avg for almost no additional cost. I will try

>> > to sent the proposal later today

>>

>>

>> The code below adds the tracking of the waiting level of cfs tasks because of

>> rt/dl preemption. This waiting time can then be used when selecting an OPP

>> instead of the dl util_avg which could become higher than dl bandwidth with

>> "long" runtime

>>

>> We need only one new call for the 1st cfs task that is enqueued to get these additional metrics

>> the call to arch_scale_cpu_capacity() can be removed once the later will be

>> taken into account when computing the load (which scales only with freq

>> currently)

>>

>> For rt task, we must keep to take into account util_avg to have an idea of the

>> rt level on the cpu which is given by the badnwodth for dl

>>

>> ---

>>  kernel/sched/fair.c  | 27 +++++++++++++++++++++++++++

>>  kernel/sched/pelt.c  |  8 ++++++--

>>  kernel/sched/sched.h |  4 +++-

>>  3 files changed, 36 insertions(+), 3 deletions(-)

>>

>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c

>> index eac1f9a..1682ea7 100644

>> --- a/kernel/sched/fair.c

>> +++ b/kernel/sched/fair.c

>> @@ -5148,6 +5148,30 @@ static inline void hrtick_update(struct rq *rq)

>>  }

>>  #endif

>>

>> +static inline void update_cfs_wait_util_avg(struct rq *rq)

>> +{

>> +     /*

>> +      * If cfs is already enqueued, we don't have anything to do because

>> +      * we already updated the non waiting time

>> +      */

>> +     if (rq->cfs.h_nr_running)

>> +             return;

>> +

>> +     /*

>> +      * If rt is running, we update the non wait time before increasing

>> +      * cfs.h_nr_running)

>> +      */

>> +     if (rq->curr->sched_class == &rt_sched_class)

>> +             update_rt_rq_load_avg(rq_clock_task(rq), rq, 1);

>> +

>> +     /*

>> +      * If dl is running, we update the non time before increasing

>> +      * cfs.h_nr_running)

>> +      */

>> +     if (rq->curr->sched_class == &dl_sched_class)

>> +             update_dl_rq_load_avg(rq_clock_task(rq), rq, 1);

>> +}

>> +

>

> Please correct me if I'm wrong but the CFS preemption-decay happens in

> set_next_entity -> update_load_avg when the CFS task is scheduled again after

> the preemption. Then can we not fix this issue by doing our UTIL_EST magic

> from set_next_entity? But yeah probably we need to be careful with overhead..


util_est is there to keep track of the last max. I'm not sure that
trying to add some magics to take into account preemption is the right
way to do. Mixing several information in the same metric just add more
fuzzy in the meaning of the metric.

>

> IMO I feel its overkill to account dl_avg when we already have DL's running

> bandwidth we can use. I understand it may be too instanenous, but perhaps we


We keep using dl bandwidth which is quite correct for dl needs but
doesn't reflect how it has disturbed other classes

> can fix CFS's problems within CFS itself and not have to do this kind of

> extra external accounting ?

>

> I also feel its better if we don't have to call update_{rt,dl}_rq_load_avg

> from within CFS class as being done above.

>

> thanks,

>

>  - Joel

>
Juri Lelli June 4, 2018, 7:04 a.m. | #18
Hi Vincent,

On 04/06/18 08:41, Vincent Guittot wrote:
> On 1 June 2018 at 19:45, Joel Fernandes <joelaf@google.com> wrote:

> > On Fri, Jun 01, 2018 at 03:53:07PM +0200, Vincent Guittot wrote:


[...]

> > IMO I feel its overkill to account dl_avg when we already have DL's running

> > bandwidth we can use. I understand it may be too instanenous, but perhaps we

> 

> We keep using dl bandwidth which is quite correct for dl needs but

> doesn't reflect how it has disturbed other classes

> 

> > can fix CFS's problems within CFS itself and not have to do this kind of

> > extra external accounting ?


I would also keep accounting for waiting time due to higher prio classes
all inside CFS. My impression, when discussing it with you on IRC, was
that we should be able to do that by not decaying cfs.util_avg when CFS
is preempted (creating a new signal for it). Is not this enough?

I feel we should try to keep cross-class accounting/interaction at a
minimum.

Thanks,

- Juri
Vincent Guittot June 4, 2018, 7:14 a.m. | #19
On 4 June 2018 at 09:04, Juri Lelli <juri.lelli@redhat.com> wrote:
> Hi Vincent,

>

> On 04/06/18 08:41, Vincent Guittot wrote:

>> On 1 June 2018 at 19:45, Joel Fernandes <joelaf@google.com> wrote:

>> > On Fri, Jun 01, 2018 at 03:53:07PM +0200, Vincent Guittot wrote:

>

> [...]

>

>> > IMO I feel its overkill to account dl_avg when we already have DL's running

>> > bandwidth we can use. I understand it may be too instanenous, but perhaps we

>>

>> We keep using dl bandwidth which is quite correct for dl needs but

>> doesn't reflect how it has disturbed other classes

>>

>> > can fix CFS's problems within CFS itself and not have to do this kind of

>> > extra external accounting ?

>

> I would also keep accounting for waiting time due to higher prio classes

> all inside CFS. My impression, when discussing it with you on IRC, was

> that we should be able to do that by not decaying cfs.util_avg when CFS

> is preempted (creating a new signal for it). Is not this enough?


We don't just want to not decay a signal but increase the signal to
reflect the amount of preemption
Then, we can't do that in a current signal. So you would like to add
another metrics in cfs_rq ?
The place doesn't really matter to be honest in cfs_rq or in dl_rq but
you will not prevent to add call in dl class to start/stop the
accounting of the preemption

>

> I feel we should try to keep cross-class accounting/interaction at a

> minimum.


accounting for cross class preemption can't be done without
cross-class accounting

Regards,
Vincent

>

> Thanks,

>

> - Juri
Juri Lelli June 4, 2018, 10:12 a.m. | #20
On 04/06/18 09:14, Vincent Guittot wrote:
> On 4 June 2018 at 09:04, Juri Lelli <juri.lelli@redhat.com> wrote:

> > Hi Vincent,

> >

> > On 04/06/18 08:41, Vincent Guittot wrote:

> >> On 1 June 2018 at 19:45, Joel Fernandes <joelaf@google.com> wrote:

> >> > On Fri, Jun 01, 2018 at 03:53:07PM +0200, Vincent Guittot wrote:

> >

> > [...]

> >

> >> > IMO I feel its overkill to account dl_avg when we already have DL's running

> >> > bandwidth we can use. I understand it may be too instanenous, but perhaps we

> >>

> >> We keep using dl bandwidth which is quite correct for dl needs but

> >> doesn't reflect how it has disturbed other classes

> >>

> >> > can fix CFS's problems within CFS itself and not have to do this kind of

> >> > extra external accounting ?

> >

> > I would also keep accounting for waiting time due to higher prio classes

> > all inside CFS. My impression, when discussing it with you on IRC, was

> > that we should be able to do that by not decaying cfs.util_avg when CFS

> > is preempted (creating a new signal for it). Is not this enough?

> 

> We don't just want to not decay a signal but increase the signal to

> reflect the amount of preemption


OK.

> Then, we can't do that in a current signal. So you would like to add

> another metrics in cfs_rq ?


Since it's CFS related, I'd say it should fit in CFS.

> The place doesn't really matter to be honest in cfs_rq or in dl_rq but

> you will not prevent to add call in dl class to start/stop the

> accounting of the preemption

> 

> >

> > I feel we should try to keep cross-class accounting/interaction at a

> > minimum.

> 

> accounting for cross class preemption can't be done without

> cross-class accounting


Mmm, can't we distinguish in, say, pick_next_task_fair() if prev was of
higher prio class and act accordingly?

Thanks,

- Juri
Vincent Guittot June 4, 2018, 12:35 p.m. | #21
On 4 June 2018 at 12:12, Juri Lelli <juri.lelli@redhat.com> wrote:
> On 04/06/18 09:14, Vincent Guittot wrote:

>> On 4 June 2018 at 09:04, Juri Lelli <juri.lelli@redhat.com> wrote:

>> > Hi Vincent,

>> >

>> > On 04/06/18 08:41, Vincent Guittot wrote:

>> >> On 1 June 2018 at 19:45, Joel Fernandes <joelaf@google.com> wrote:

>> >> > On Fri, Jun 01, 2018 at 03:53:07PM +0200, Vincent Guittot wrote:

>> >

>> > [...]

>> >

>> >> > IMO I feel its overkill to account dl_avg when we already have DL's running

>> >> > bandwidth we can use. I understand it may be too instanenous, but perhaps we

>> >>

>> >> We keep using dl bandwidth which is quite correct for dl needs but

>> >> doesn't reflect how it has disturbed other classes

>> >>

>> >> > can fix CFS's problems within CFS itself and not have to do this kind of

>> >> > extra external accounting ?

>> >

>> > I would also keep accounting for waiting time due to higher prio classes

>> > all inside CFS. My impression, when discussing it with you on IRC, was

>> > that we should be able to do that by not decaying cfs.util_avg when CFS

>> > is preempted (creating a new signal for it). Is not this enough?

>>

>> We don't just want to not decay a signal but increase the signal to

>> reflect the amount of preemption

>

> OK.

>

>> Then, we can't do that in a current signal. So you would like to add

>> another metrics in cfs_rq ?

>

> Since it's CFS related, I'd say it should fit in CFS.


It's dl and cfs as the goal is to track cfs preempted by dl
This means creating a new struct whereas some fields are unused in avg_dl struct
And duplicate some call to ___update_load_sum as we track avg_dl for
removing sched_rt_avg_update
and update_dl/rt_rq_load_avg are already call in fair.c for updating
blocked load

>

>> The place doesn't really matter to be honest in cfs_rq or in dl_rq but

>> you will not prevent to add call in dl class to start/stop the

>> accounting of the preemption

>>

>> >

>> > I feel we should try to keep cross-class accounting/interaction at a

>> > minimum.

>>

>> accounting for cross class preemption can't be done without

>> cross-class accounting

>

> Mmm, can't we distinguish in, say, pick_next_task_fair() if prev was of

> higher prio class and act accordingly?


we will not be able to make the difference between rt/dl/stop
preemption by using only pick_next_task_fair

Thanks

>

> Thanks,

>

> - Juri

Patch

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 4526ba6..0eb07a8 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2194,7 +2194,11 @@  static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
 #ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
 static inline unsigned long cpu_util_dl(struct rq *rq)
 {
-	return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
+	unsigned long util = (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
+
+	util = max_t(unsigned long, util, READ_ONCE(rq->avg_dl.util_avg));
+
+	return util;
 }
 
 static inline unsigned long cpu_util_cfs(struct rq *rq)