diff mbox series

[v3,1/6] cpufreq: schedutil: reset sg_cpus's flags at IDLE enter

Message ID 20171130114723.29210-2-patrick.bellasi@arm.com
State New
Headers show
Series [v3,1/6] cpufreq: schedutil: reset sg_cpus's flags at IDLE enter | expand

Commit Message

Patrick Bellasi Nov. 30, 2017, 11:47 a.m. UTC
Currently, sg_cpu's flags are set to the value defined by the last call
to cpufreq_update_util(); for the RT/DL classes this corresponds to the
SCHED_CPUFREQ_{RT/DL} flags always being set.

When multiple CPUs share the same frequency domain it might happen that
a CPU which executed an RT task, right before entering IDLE, has one of
the SCHED_CPUFREQ_RT_DL flags set, permanently, until it exits IDLE.

Although such an idle CPU is _going to be_ ignored by
sugov_next_freq_shared():
  1. this kind of "useless RT request" is ignored only if more than
     TICK_NSEC has elapsed since the last update
  2. we can still potentially trigger an already too late switch to
     MAX, which also starts a new throttling interval
  3. the internal state machine is not consistent with what the
     scheduler knows, i.e. the CPU is now actually idle

Thus, in sugov_next_freq_shared(), where utilisation and flags are
aggregated across all the CPUs of a frequency domain, it can turn out
that all the CPUs of that domain run unnecessarily at the maximum OPP
until another event happens on the idle CPU, which eventually clears the
SCHED_CPUFREQ_{RT/DL} flag, or until the idle CPU gets ignored because
more than TICK_NSEC [ns] have elapsed since it entered IDLE.

Such a behaviour can harm the energy efficiency of systems where RT
workloads are not so frequent and other CPUs in the same frequency
domain are running small utilisation workloads, which is a quite common
scenario in mobile embedded systems.

This patch proposes a solution aligned with the current principle of
updating the flags each time a scheduling event happens. The scheduling
of the idle_task on a CPU is considered one such meaningful event.
That's why, when the idle_task is selected for execution, we poke the
schedutil policy to reset the flags for that CPU.

No frequency transition is triggered at that point, which is fair in
case the RT workload comes back in the future. However, this still
allows other CPUs in the same frequency domain to scale down the
frequency, in case that is possible.
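The intended behaviour can be sketched as a small userspace model (a simplified illustration only; the flag values follow the patch, while `struct sg_cpu_model` and the `freq_updates` counter are made up for the example):

```c
#include <assert.h>

/* Flag values as defined by the patch (include/linux/sched/cpufreq.h). */
#define SCHED_CPUFREQ_RT     (1U << 0)
#define SCHED_CPUFREQ_DL     (1U << 1)
#define SCHED_CPUFREQ_IOWAIT (1U << 2)
#define SCHED_CPUFREQ_IDLE   (1U << 3)

/* Made-up model of a schedutil per-CPU state, just for illustration. */
struct sg_cpu_model {
	unsigned int flags;	/* last-seen scheduling class flags */
	int freq_updates;	/* number of simulated OPP changes */
};

/*
 * Simplified model of sugov_update_shared(): an IDLE notification only
 * clears the flags and triggers no frequency transition; any other
 * event stores the flags and (in this toy model) always triggers one.
 */
void update_shared(struct sg_cpu_model *sg_cpu, unsigned int flags)
{
	if (flags & SCHED_CPUFREQ_IDLE) {
		sg_cpu->flags = 0;
		return;	/* no OPP change on idle entry */
	}
	sg_cpu->flags = flags;
	sg_cpu->freq_updates++;
}
```

With this model, a CPU that ran an RT task and then enters IDLE no longer contributes a stale SCHED_CPUFREQ_RT flag to the shared-policy aggregation, while the idle notification itself leaves the OPP untouched.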

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>

Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: linux-kernel@vger.kernel.org
Cc: linux-pm@vger.kernel.org

---
Changes from v2:
- use cpufreq_update_util() instead of cpufreq_update_this_cpu()
- rebased on v4.15-rc1

Changes from v1:
- added "unlikely()" around the statement (SteveR)

Change-Id: I1192ca9a3acb767cb3a745967a7a23a17e1af7b7
---
 include/linux/sched/cpufreq.h    | 1 +
 kernel/sched/cpufreq_schedutil.c | 7 +++++++
 kernel/sched/idle_task.c         | 4 ++++
 3 files changed, 12 insertions(+)

-- 
2.14.1

Comments

Juri Lelli Nov. 30, 2017, 1:12 p.m. UTC | #1
Hi,

On 30/11/17 11:47, Patrick Bellasi wrote:

[...]

> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index 2f52ec0f1539..67339ccb5595 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -347,6 +347,12 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
>  
>  	sg_cpu->util = util;
>  	sg_cpu->max = max;
> +
> +	/* CPU is entering IDLE, reset flags without triggering an update */
> +	if (unlikely(flags & SCHED_CPUFREQ_IDLE)) {
> +		sg_cpu->flags = 0;
> +		goto done;
> +	}


Looks good for now. I'm just wondering what will happen for DL, as a
CPU that still "has" a sleeping task is not going to be really idle
until the 0-lag time. I guess we could move this to that point in time?

>  	sg_cpu->flags = flags;
>  
>  	sugov_set_iowait_boost(sg_cpu, time, flags);
> @@ -361,6 +367,7 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
>  		sugov_update_commit(sg_policy, time, next_f);
>  	}
>  
> +done:
>  	raw_spin_unlock(&sg_policy->update_lock);
>  }
>  

> diff --git a/kernel/sched/idle_task.c b/kernel/sched/idle_task.c
> index d518664cce4f..6e8ae2aa7a13 100644
> --- a/kernel/sched/idle_task.c
> +++ b/kernel/sched/idle_task.c
> @@ -30,6 +30,10 @@ pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
>  	put_prev_task(rq, prev);
>  	update_idle_core(rq);
>  	schedstat_inc(rq->sched_goidle);
> +
> +	/* kick cpufreq (see the comment in kernel/sched/sched.h). */
> +	cpufreq_update_util(rq, SCHED_CPUFREQ_IDLE);


Don't know if it makes things any cleaner, but you could add to the
comment that we don't actually trigger a frequency update with this
call.

Best,

Juri
Patrick Bellasi Nov. 30, 2017, 3:41 p.m. UTC | #2
On 30-Nov 14:12, Juri Lelli wrote:
> Hi,
> 
> On 30/11/17 11:47, Patrick Bellasi wrote:
> 
> [...]
> 
> > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > index 2f52ec0f1539..67339ccb5595 100644
> > --- a/kernel/sched/cpufreq_schedutil.c
> > +++ b/kernel/sched/cpufreq_schedutil.c
> > @@ -347,6 +347,12 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
> >  
> >  	sg_cpu->util = util;
> >  	sg_cpu->max = max;
> > +
> > +	/* CPU is entering IDLE, reset flags without triggering an update */
> > +	if (unlikely(flags & SCHED_CPUFREQ_IDLE)) {
> > +		sg_cpu->flags = 0;
> > +		goto done;
> > +	}
> 
> Looks good for now. I'm just thinking that we will happen for DL, as a
> CPU that still "has" a sleeping task is not going to be really idle
> until the 0-lag time.


AFAIU, for the time being, DL already cannot really rely on this flag
for its behaviour to be correct. Indeed, the flags are reset as soon as
a FAIR task wakes up and is enqueued.

Only once your DL integration patches are in will we no longer depend
on the flags, since DL will report a certain utilization up to the
0-lag time, isn't it?

If that's the case, I would say that the flags will be used only to
jump to the max OPP for RT tasks. Thus, this patch should still be valid.

> I guess we could move this at that point in time?


Not sure what you mean here. Right now the new SCHED_CPUFREQ_IDLE flag
is notified only by idle tasks. That's the only code path where we are
sure the CPU is entering IDLE.

> >  	sg_cpu->flags = flags;
> >  
> >  	sugov_set_iowait_boost(sg_cpu, time, flags);
> > @@ -361,6 +367,7 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
> >  		sugov_update_commit(sg_policy, time, next_f);
> >  	}
> >  
> > +done:
> >  	raw_spin_unlock(&sg_policy->update_lock);
> >  }
> >  
> > diff --git a/kernel/sched/idle_task.c b/kernel/sched/idle_task.c
> > index d518664cce4f..6e8ae2aa7a13 100644
> > --- a/kernel/sched/idle_task.c
> > +++ b/kernel/sched/idle_task.c
> > @@ -30,6 +30,10 @@ pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
> >  	put_prev_task(rq, prev);
> >  	update_idle_core(rq);
> >  	schedstat_inc(rq->sched_goidle);
> > +
> > +	/* kick cpufreq (see the comment in kernel/sched/sched.h). */
> > +	cpufreq_update_util(rq, SCHED_CPUFREQ_IDLE);
> 
> Don't know if it make things any cleaner, but you could add to the
> comment that we don't actually trigger a frequency update with this
> call.


Right, will add on next posting.

> Best,
> 
> Juri


Cheers Patrick

-- 
#include <best/regards.h>

Patrick Bellasi
Juri Lelli Nov. 30, 2017, 4:02 p.m. UTC | #3
On 30/11/17 15:41, Patrick Bellasi wrote:
> On 30-Nov 14:12, Juri Lelli wrote:
> > Hi,
> > 
> > On 30/11/17 11:47, Patrick Bellasi wrote:
> > 
> > [...]
> > 
> > > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > > index 2f52ec0f1539..67339ccb5595 100644
> > > --- a/kernel/sched/cpufreq_schedutil.c
> > > +++ b/kernel/sched/cpufreq_schedutil.c
> > > @@ -347,6 +347,12 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
> > >  
> > >  	sg_cpu->util = util;
> > >  	sg_cpu->max = max;
> > > +
> > > +	/* CPU is entering IDLE, reset flags without triggering an update */
> > > +	if (unlikely(flags & SCHED_CPUFREQ_IDLE)) {
> > > +		sg_cpu->flags = 0;
> > > +		goto done;
> > > +	}
> > 
> > Looks good for now. I'm just thinking that we will happen for DL, as a
> > CPU that still "has" a sleeping task is not going to be really idle
> > until the 0-lag time.

> 
> AFAIU, for the time being, DL already cannot really rely on this flag
> for its behaviors to be correct. Indeed, flags are reset as soon as
> a FAIR task wakes up and it's enqueued.


Right, and your flags ORing patch should help with this.

> 
> Only once your DL integration patches are in, we do not depends on
> flags anymore since DL will report a ceratain utilization up to the
> 0-lag time, isn't it?


Utilization won't decrease until 0-lag time, correct. I was just
wondering if resetting flags before that time (when a CPU enters idle)
might be an issue.

> 
> If that's the case, I would say that the flags will be used only to
> jump to the max OPP for RT tasks. Thus, this patch should still be valid.
> 
> > I guess we could move this at that point in time?
> 
> Not sure what you mean here. Right now the new SCHED_CPUFREQ_IDLE flag
> is notified only by idle tasks. That's the only code path where we are
> sure the CPU is entering IDLE.
> 


W.r.t. the possible issue above, I was thinking that we might want to
reset flags at 0-lag time for DL (if CPU is still idle). Anyway, two
distinct set of patches. Who gets in last will have to ponder the thing
a little bit more. :)

Best,

Juri
Patrick Bellasi Nov. 30, 2017, 4:19 p.m. UTC | #4
On 30-Nov 17:02, Juri Lelli wrote:
> On 30/11/17 15:41, Patrick Bellasi wrote:
> > On 30-Nov 14:12, Juri Lelli wrote:
> > > Hi,
> > > 
> > > On 30/11/17 11:47, Patrick Bellasi wrote:
> > > 
> > > [...]
> > > 
> > > > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > > > index 2f52ec0f1539..67339ccb5595 100644
> > > > --- a/kernel/sched/cpufreq_schedutil.c
> > > > +++ b/kernel/sched/cpufreq_schedutil.c
> > > > @@ -347,6 +347,12 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
> > > >  
> > > >  	sg_cpu->util = util;
> > > >  	sg_cpu->max = max;
> > > > +
> > > > +	/* CPU is entering IDLE, reset flags without triggering an update */
> > > > +	if (unlikely(flags & SCHED_CPUFREQ_IDLE)) {
> > > > +		sg_cpu->flags = 0;
> > > > +		goto done;
> > > > +	}

> > > 
> > > Looks good for now. I'm just thinking that we will happen for DL, as a
> > > CPU that still "has" a sleeping task is not going to be really idle
> > > until the 0-lag time.
> > 
> > AFAIU, for the time being, DL already cannot really rely on this flag
> > for its behaviors to be correct. Indeed, flags are reset as soon as
> > a FAIR task wakes up and it's enqueued.
> 
> Right, and your flags ORing patch should help with this.
> 
> > 
> > Only once your DL integration patches are in, we do not depends on
> > flags anymore since DL will report a ceratain utilization up to the
> > 0-lag time, isn't it?
> 
> Utilization won't decrease until 0-lag time, correct.


Then, IMO, with your DL patches the DL class doesn't need the flags
anymore, since schedutil will know (and account for) the
utilization required by the DL tasks. Isn't it?

> I was just wondering if resetting flags before that time (when a CPU
> enters idle) might be an issue.


If the above is correct, then flags will be used only for the RT class (and
IO boosting)... and thus this patch will still be useful as it is now:
meaning that once the idle task is selected we do not care anymore
about RT and IOBoosting (only).

> > If that's the case, I would say that the flags will be used only to
> > jump to the max OPP for RT tasks. Thus, this patch should still be valid.
> > 
> > > I guess we could move this at that point in time?
> > 
> > Not sure what you mean here. Right now the new SCHED_CPUFREQ_IDLE flag
> > is notified only by idle tasks. That's the only code path where we are
> > sure the CPU is entering IDLE.
> > 

> 
> W.r.t. the possible issue above, I was thinking that we might want to
> reset flags at 0-lag time for DL (if CPU is still idle). Anyway, two
> distinct set of patches. Who gets in last will have to ponder the thing
> a little bit more. :)


Perhaps I'm still a bit confused but, to me, it seems that with your
patches we completely fix DL but we still can use this exact same
patch just for RT tasks.

> Best,
> 
> Juri


-- 
#include <best/regards.h>

Patrick Bellasi
Juri Lelli Nov. 30, 2017, 4:45 p.m. UTC | #5
On 30/11/17 16:19, Patrick Bellasi wrote:
> On 30-Nov 17:02, Juri Lelli wrote:
> > On 30/11/17 15:41, Patrick Bellasi wrote:
> > > On 30-Nov 14:12, Juri Lelli wrote:
> > > > Hi,
> > > > 
> > > > On 30/11/17 11:47, Patrick Bellasi wrote:
> > > > 
> > > > [...]
> > > > 
> > > > > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > > > > index 2f52ec0f1539..67339ccb5595 100644
> > > > > --- a/kernel/sched/cpufreq_schedutil.c
> > > > > +++ b/kernel/sched/cpufreq_schedutil.c
> > > > > @@ -347,6 +347,12 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
> > > > >  
> > > > >  	sg_cpu->util = util;
> > > > >  	sg_cpu->max = max;
> > > > > +
> > > > > +	/* CPU is entering IDLE, reset flags without triggering an update */
> > > > > +	if (unlikely(flags & SCHED_CPUFREQ_IDLE)) {
> > > > > +		sg_cpu->flags = 0;
> > > > > +		goto done;
> > > > > +	}
> > > > 
> > > > Looks good for now. I'm just thinking that we will happen for DL, as a
> > > > CPU that still "has" a sleeping task is not going to be really idle
> > > > until the 0-lag time.
> > > 
> > > AFAIU, for the time being, DL already cannot really rely on this flag
> > > for its behaviors to be correct. Indeed, flags are reset as soon as
> > > a FAIR task wakes up and it's enqueued.
> > 
> > Right, and your flags ORing patch should help with this.
> > 
> > > 
> > > Only once your DL integration patches are in, we do not depends on
> > > flags anymore since DL will report a ceratain utilization up to the
> > > 0-lag time, isn't it?
> > 
> > Utilization won't decrease until 0-lag time, correct.
> 
> Then IMO with your DL patches the DL class don't need the flags
> anymore since schedutil will know (and account) for the
> utlization required by the DL tasks. Isn't it?
> 
> > I was just wondering if resetting flags before that time (when a CPU
> > enters idle) might be an issue.
> 
> If the above is correct, then flags will be used only for the RT class (and
> IO boosting)... and thus this patch will still be useful as it is now:
> meaning that once the idle task is selected we do not care anymore
> about RT and IOBoosting (only).
> 
> > > If that's the case, I would say that the flags will be used only to
> > > jump to the max OPP for RT tasks. Thus, this patch should still be valid.
> > > 
> > > > I guess we could move this at that point in time?
> > > 
> > > Not sure what you mean here. Right now the new SCHED_CPUFREQ_IDLE flag
> > > is notified only by idle tasks. That's the only code path where we are
> > > sure the CPU is entering IDLE.
> > > 
> > 
> > W.r.t. the possible issue above, I was thinking that we might want to
> > reset flags at 0-lag time for DL (if CPU is still idle). Anyway, two
> > distinct set of patches. Who gets in last will have to ponder the thing
> > a little bit more. :)
> 
> Perhaps I'm still a bit confused but, to me, it seems that with your
> patches we completely fix DL but we still can use this exact same
> patch just for RT tasks.


We don't use the flags for bailing out during aggregation, so it should
be ok for DL yes.

Thanks,

Juri
Viresh Kumar Dec. 7, 2017, 5:01 a.m. UTC | #6
On 30-11-17, 11:47, Patrick Bellasi wrote:
> diff --git a/include/linux/sched/cpufreq.h b/include/linux/sched/cpufreq.h
> index d1ad3d825561..bb5f778db023 100644
> --- a/include/linux/sched/cpufreq.h
> +++ b/include/linux/sched/cpufreq.h
> @@ -11,6 +11,7 @@
>  #define SCHED_CPUFREQ_RT	(1U << 0)
>  #define SCHED_CPUFREQ_DL	(1U << 1)
>  #define SCHED_CPUFREQ_IOWAIT	(1U << 2)
> +#define SCHED_CPUFREQ_IDLE	(1U << 3)
>  
>  #define SCHED_CPUFREQ_RT_DL	(SCHED_CPUFREQ_RT | SCHED_CPUFREQ_DL)
>  
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index 2f52ec0f1539..67339ccb5595 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -347,6 +347,12 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
>  
>  	sg_cpu->util = util;
>  	sg_cpu->max = max;
> +
> +	/* CPU is entering IDLE, reset flags without triggering an update */
> +	if (unlikely(flags & SCHED_CPUFREQ_IDLE)) {
> +		sg_cpu->flags = 0;
> +		goto done;
> +	}
>  	sg_cpu->flags = flags;
>  
>  	sugov_set_iowait_boost(sg_cpu, time, flags);
> @@ -361,6 +367,7 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
>  		sugov_update_commit(sg_policy, time, next_f);
>  	}
>  
> +done:
>  	raw_spin_unlock(&sg_policy->update_lock);
>  }
>  
> diff --git a/kernel/sched/idle_task.c b/kernel/sched/idle_task.c
> index d518664cce4f..6e8ae2aa7a13 100644
> --- a/kernel/sched/idle_task.c
> +++ b/kernel/sched/idle_task.c
> @@ -30,6 +30,10 @@ pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
>  	put_prev_task(rq, prev);
>  	update_idle_core(rq);
>  	schedstat_inc(rq->sched_goidle);
> +
> +	/* kick cpufreq (see the comment in kernel/sched/sched.h). */
> +	cpufreq_update_util(rq, SCHED_CPUFREQ_IDLE);


We posted some comments on v2 for this particular patch suggesting
some improvements. The patch hasn't changed at all and you haven't
replied to a few of those suggestions either. Any particular reason for
that?

For example:
- I suggested to get rid of the conditional expression in
  cpufreq_schedutil.c file that you have added.
- And Joel suggested to clear the RT/DL flags from dequeue path to
  avoid adding SCHED_CPUFREQ_IDLE flag.

-- 
viresh
Patrick Bellasi Dec. 7, 2017, 12:45 p.m. UTC | #7
Hi Viresh,

On 07-Dec 10:31, Viresh Kumar wrote:
> On 30-11-17, 11:47, Patrick Bellasi wrote:
> > diff --git a/include/linux/sched/cpufreq.h b/include/linux/sched/cpufreq.h
> > index d1ad3d825561..bb5f778db023 100644
> > --- a/include/linux/sched/cpufreq.h
> > +++ b/include/linux/sched/cpufreq.h
> > @@ -11,6 +11,7 @@
> >  #define SCHED_CPUFREQ_RT	(1U << 0)
> >  #define SCHED_CPUFREQ_DL	(1U << 1)
> >  #define SCHED_CPUFREQ_IOWAIT	(1U << 2)
> > +#define SCHED_CPUFREQ_IDLE	(1U << 3)
> >  
> >  #define SCHED_CPUFREQ_RT_DL	(SCHED_CPUFREQ_RT | SCHED_CPUFREQ_DL)
> >  
> > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > index 2f52ec0f1539..67339ccb5595 100644
> > --- a/kernel/sched/cpufreq_schedutil.c
> > +++ b/kernel/sched/cpufreq_schedutil.c
> > @@ -347,6 +347,12 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
> >  
> >  	sg_cpu->util = util;
> >  	sg_cpu->max = max;
> > +
> > +	/* CPU is entering IDLE, reset flags without triggering an update */
> > +	if (unlikely(flags & SCHED_CPUFREQ_IDLE)) {
> > +		sg_cpu->flags = 0;
> > +		goto done;
> > +	}
> >  	sg_cpu->flags = flags;
> >  
> >  	sugov_set_iowait_boost(sg_cpu, time, flags);
> > @@ -361,6 +367,7 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
> >  		sugov_update_commit(sg_policy, time, next_f);
> >  	}
> >  
> > +done:
> >  	raw_spin_unlock(&sg_policy->update_lock);
> >  }
> >  
> > diff --git a/kernel/sched/idle_task.c b/kernel/sched/idle_task.c
> > index d518664cce4f..6e8ae2aa7a13 100644
> > --- a/kernel/sched/idle_task.c
> > +++ b/kernel/sched/idle_task.c
> > @@ -30,6 +30,10 @@ pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
> >  	put_prev_task(rq, prev);
> >  	update_idle_core(rq);
> >  	schedstat_inc(rq->sched_goidle);
> > +
> > +	/* kick cpufreq (see the comment in kernel/sched/sched.h). */
> > +	cpufreq_update_util(rq, SCHED_CPUFREQ_IDLE);
> 
> We posted some comments on V2 for this particular patch suggesting
> some improvements. The patch hasn't changed at all and you haven't
> replied to few of those suggestions as well. Any particular reason for
> that?


You're right; since the previous posting was a long time ago, with
this one I mainly wanted to refresh the discussion. Thanks for
highlighting hereafter which ones were the main discussion points.


> For example:
> - I suggested to get rid of the conditional expression in
>   cpufreq_schedutil.c file that you have added.


We can probably set flags to SCHED_CPUFREQ_IDLE (instead of resetting
them), however I think we still need an if condition somewhere.

Indeed, when SCHED_CPUFREQ_IDLE is asserted we don't want to trigger
an OPP change (reasons described in the changelog).

If that's still the goal, then we need to check this flag and bail
out from sugov_update_shared() straight away. That's why I've added a
check at the beginning, and also marked it as unlikely so as not to
impact all the cases where we call a schedutil update with runnable
tasks.

Does this make sense?

> - And Joel suggested to clear the RT/DL flags from dequeue path to
>   avoid adding SCHED_CPUFREQ_IDLE flag.


I had a thought about Joel's proposal:

>> wouldn't another way be to just clear the flag from the RT scheduling
>> class with an extra call to cpufreq_update_util with flags = 0 during
>> dequeue_rt_entity?


The main concern for me was that the current API is completely
transparent about which scheduling class is calling schedutil for
updates.

Thus, at dequeue time of an RT task we cannot really clear
all the flags (e.g. the IOWAIT flag of a fair task); we should clear
only the RT related flags.

This means that we would likely need to implement Joel's idea by:

1. adding a new set of flags like:
   SCHED_CPUFREQ_RT_IDLE, SCHED_CPUFREQ_DL_IDLE, etc...

2. adding an operation flag, e.g.
   SCHED_CPUFREQ_SET, SCHED_CPUFREQ_RESET, to be ORed with the class
   flag, e.g.
   cpufreq_update_util(rq, SCHED_CPUFREQ_SET|SCHED_CPUFREQ_RT);

3. changing the API to carry the operation required for a flag, e.g.:
   cpufreq_update_util(rq, flag, set={true, false});

To be honest I don't like any of those, especially compared to the
simplicity of the one proposed by this patch. :)
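As an illustration only, the SET/RESET variant mentioned above might look something like this (the SCHED_CPUFREQ_SET/SCHED_CPUFREQ_RESET names and the helper are hypothetical, not an existing kernel API):

```c
#include <assert.h>

/* Existing class flags (include/linux/sched/cpufreq.h). */
#define SCHED_CPUFREQ_RT      (1U << 0)
#define SCHED_CPUFREQ_DL      (1U << 1)
#define SCHED_CPUFREQ_IOWAIT  (1U << 2)

/* Hypothetical operation bits, ORed with a class flag by the caller. */
#define SCHED_CPUFREQ_SET     (1U << 30)
#define SCHED_CPUFREQ_RESET   (1U << 31)
#define SCHED_CPUFREQ_OP_MASK (SCHED_CPUFREQ_SET | SCHED_CPUFREQ_RESET)

/*
 * Hypothetical helper: applies a set/reset request to the currently
 * tracked flags, touching only the class bits named in the request.
 */
unsigned int apply_flags_op(unsigned int cur, unsigned int request)
{
	unsigned int class_flags = request & ~SCHED_CPUFREQ_OP_MASK;

	if (request & SCHED_CPUFREQ_RESET)
		return cur & ~class_flags;	/* clear only the named class */
	return cur | class_flags;		/* set the named class */
}
```

This sketches why the option is unattractive: every call site has to carry an extra operation bit, and the governor still needs per-class bookkeeping.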

IMO, the only pitfall of this patch is that (as Juri pointed out in
v2) for DL it can happen that we do not want to reset the flag right
when a CPU enters IDLE. We need instead a specific call to reset the
DL flag at the 0-lag time.

However, AFAIU, this special case for DL will disappear once Juri's
latest set [1] is in. Indeed, at that point, schedutil will
always and only need to know the utilization required by DL.

[1] https://lkml.org/lkml/2017/12/4/173

Cheers Patrick

-- 
#include <best/regards.h>

Patrick Bellasi
Dietmar Eggemann Dec. 7, 2017, 3:55 p.m. UTC | #8
On 12/07/2017 01:45 PM, Patrick Bellasi wrote:
> Hi Viresh,
> 
> On 07-Dec 10:31, Viresh Kumar wrote:
>> On 30-11-17, 11:47, Patrick Bellasi wrote:

[...]

>> We posted some comments on V2 for this particular patch suggesting
>> some improvements. The patch hasn't changed at all and you haven't
>> replied to few of those suggestions as well. Any particular reason for
>> that?
> 
> You right, since the previous posting has been a long time ago, with
> this one I mainly wanted to refresh the discussion. Thanks for
> highlighting hereafter which one was the main discussion points.
> 
> 
>> For example:
>> - I suggested to get rid of the conditional expression in
>>    cpufreq_schedutil.c file that you have added.
> 
> We can probably set flags to SCHED_CPUFREQ_IDLE (instead of resetting
> them), however I think we still need an if condition somewhere.
> 
> Indeed, when SCHED_CPUFREQ_IDLE is asserted we don't want to trigger
> an OPP change (reasons described in the changelog).
> 
> If that's still a goal, then we will need to check this flag and bail
> out from sugov_update_shared straight away. That's why I've added a
> check at the beginning and also defined it as unlikely to have not
> impact on all cases where we call a schedutil update with runnable
> tasks.
> 
> Does this makes sense?


IIRC, there was also the question of doing this not only in the shared
case but also in the single case ...

[...]
Juri Lelli Dec. 12, 2017, 1:38 p.m. UTC | #9
Hi Viresh,

On 12/12/17 17:07, Viresh Kumar wrote:

[...]

> From: Viresh Kumar <viresh.kumar@linaro.org>
> Date: Tue, 12 Dec 2017 15:43:26 +0530
> Subject: [PATCH] sched: Keep track of cpufreq utilization update flags
> 
> Currently the schedutil governor overwrites the sg_cpu->flags field on
> every call to the utilization handler. It was pretty good as the initial
> implementation of utilization handlers, there are several drawbacks
> though.
> 
> The biggest drawback is that the sg_cpu->flags field doesn't always
> represent the correct type of tasks that are enqueued on a CPU's rq. For
> example, if a fair task is enqueued while a RT or DL task is running, we
> will overwrite the flags with value 0 and that may take the CPU to lower
> OPPs unintentionally. There can be other corner cases as well which we
> aren't aware of currently.
> 
> This patch changes the current implementation to keep track of all the
> task types that are currently enqueued to the CPUs rq. There are two
> flags for every scheduling class now, one to set the flag and other one
> to clear it. The flag is set by the scheduling classes from the existing
> set of calls to cpufreq_update_util(), and the flag is cleared when the
> last task of the scheduling class is dequeued. For now, the util update
> handlers return immediately if they were called to clear the flag.
> 
> We can add more optimizations over this patch separately.
> 
> The last parameter of sugov_set_iowait_boost() is also dropped as the
> function can get it from sg_cpu anyway.
> 
> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>


[...]

> @@ -655,7 +669,7 @@ static int sugov_start(struct cpufreq_policy *policy)
>  		memset(sg_cpu, 0, sizeof(*sg_cpu));
>  		sg_cpu->cpu = cpu;
>  		sg_cpu->sg_policy = sg_policy;
> -		sg_cpu->flags = SCHED_CPUFREQ_RT;
> +		sg_cpu->flags = 0;
>  		sg_cpu->iowait_boost_max = policy->cpuinfo.max_freq;
>  	}

Why this change during initialization?

Thanks,

- Juri
Viresh Kumar Dec. 12, 2017, 2:40 p.m. UTC | #10
On 12-12-17, 14:38, Juri Lelli wrote:
> Hi Viresh,
> 
> On 12/12/17 17:07, Viresh Kumar wrote:
> 
> [...]
> 
> > From: Viresh Kumar <viresh.kumar@linaro.org>
> > Date: Tue, 12 Dec 2017 15:43:26 +0530
> > Subject: [PATCH] sched: Keep track of cpufreq utilization update flags
> > 
> > Currently the schedutil governor overwrites the sg_cpu->flags field on
> > every call to the utilization handler. It was pretty good as the initial
> > implementation of utilization handlers, there are several drawbacks
> > though.
> > 
> > The biggest drawback is that the sg_cpu->flags field doesn't always
> > represent the correct type of tasks that are enqueued on a CPU's rq. For
> > example, if a fair task is enqueued while a RT or DL task is running, we
> > will overwrite the flags with value 0 and that may take the CPU to lower
> > OPPs unintentionally. There can be other corner cases as well which we
> > aren't aware of currently.
> > 
> > This patch changes the current implementation to keep track of all the
> > task types that are currently enqueued to the CPUs rq. There are two
> > flags for every scheduling class now, one to set the flag and other one
> > to clear it. The flag is set by the scheduling classes from the existing
> > set of calls to cpufreq_update_util(), and the flag is cleared when the
> > last task of the scheduling class is dequeued. For now, the util update
> > handlers return immediately if they were called to clear the flag.
> > 
> > We can add more optimizations over this patch separately.
> > 
> > The last parameter of sugov_set_iowait_boost() is also dropped as the
> > function can get it from sg_cpu anyway.
> > 
> > Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
> 
> [...]
> 
> > @@ -655,7 +669,7 @@ static int sugov_start(struct cpufreq_policy *policy)
> >  		memset(sg_cpu, 0, sizeof(*sg_cpu));
> >  		sg_cpu->cpu = cpu;
> >  		sg_cpu->sg_policy = sg_policy;
> > -		sg_cpu->flags = SCHED_CPUFREQ_RT;
> > +		sg_cpu->flags = 0;
> >  		sg_cpu->iowait_boost_max = policy->cpuinfo.max_freq;
> >  	}
> 
> Why this change during initialization?


Firstly, I am not sure why it was set to SCHED_CPUFREQ_RT, as schedutil wouldn't
change the frequency until the first time the util handler is called. And once
that is called we were updating the flag anyway. So, unless I misunderstood its
purpose, it wasn't doing anything helpful.

I need to remove it, otherwise the RT flag may remain set for a very long time
unnecessarily: until the last RT task is dequeued.
Consider this for example: we are at max freq when sugov_start() is called and
it sets the RT flag, but there is no RT task to run. Now, we have tons of CFS
tasks but we always keep running at max because of the flag. Even the schedutil
RT thread doesn't get a chance to run/be dequeued, because we never want a freq
change with the RT flag set and so stay at max.

Makes sense?
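The stuck-at-max scenario described above can be modelled with a toy frequency-selection function (the frequencies and the linear scaling rule are made up for the example; the real selection lives in get_next_freq()):

```c
#include <assert.h>

#define FREQ_MIN 500	/* made-up OPPs, in MHz */
#define FREQ_MAX 2000
#define FLAG_RT  (1U << 0)

/*
 * Toy model: if the RT flag is set, schedutil requests the max OPP;
 * otherwise the frequency scales linearly with the CFS utilization
 * (util in [0..1024]).
 */
unsigned int next_freq(unsigned int flags, unsigned int util)
{
	if (flags & FLAG_RT)
		return FREQ_MAX;
	return FREQ_MIN + (FREQ_MAX - FREQ_MIN) * util / 1024;
}
```

With a stale RT flag left over from initialization, even a tiny CFS load keeps requesting FREQ_MAX, which is exactly the pinned-at-max behaviour the initialization change avoids.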

-- 
viresh
Patrick Bellasi Dec. 20, 2017, 2:51 p.m. UTC | #11
On 20-Dec 15:33, Peter Zijlstra wrote:
> On Thu, Nov 30, 2017 at 11:47:18AM +0000, Patrick Bellasi wrote:
> > Currently, sg_cpu's flags are set to the value defined by the last call
> > of the cpufreq_update_util(); for RT/DL classes this corresponds to the
> > SCHED_CPUFREQ_{RT/DL} flags always being set.
> > 
> > When multiple CPUs share the same frequency domain it might happen that
> > a CPU which executed an RT task, right before entering IDLE, has one of
> > the SCHED_CPUFREQ_RT_DL flags set, permanently, until it exits IDLE.
> > 
> > Although such an idle CPU is _going to be_ ignored by the
> > sugov_next_freq_shared():
> >   1. this kind of "useless RT requests" are ignored only if more then
> >      TICK_NSEC have elapsed since the last update
> >   2. we can still potentially trigger an already too late switch to
> >      MAX, which starts also a new throttling interval
> >   3. the internal state machine is not consistent with what the
> >      scheduler knows, i.e. the CPU is now actually idle
> 
> So I _really_ hate having to clutter the idle path for this shared case
> :/


:)

We would like to have per-CPU frequency domains... but the HW guys
always complain that's too costly from an HW/power standpoint...
and they are likely right :-/

So, here we are, just trying hard to have a SW status matching
the HW status... which is just another pain :-/

> 1, can obviously be fixed by short-circuiting the timeout when idle.


Mmm.. right... it should be possible for schedutil to detect that a
certain CPU is currently idle.

Can we use core.c::idle_cpu() from cpufreq_schedutil?

> 2. not sure how if you do 1; anybody doing a switch will go through
>    sugov_next_freq_shared() which will poll all relevant CPUs and per 1
>    will see its idle, no?


Right, that should work...

> Not sure what that leaves for 3.


When a CPU is detected idle, perhaps we can still clear the RT flags...
... just for "consistency" of current status representation.

> 
> > diff --git a/kernel/sched/idle_task.c b/kernel/sched/idle_task.c
> > index d518664cce4f..6e8ae2aa7a13 100644
> > --- a/kernel/sched/idle_task.c
> > +++ b/kernel/sched/idle_task.c
> > @@ -30,6 +30,10 @@ pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
> >  	put_prev_task(rq, prev);
> >  	update_idle_core(rq);
> >  	schedstat_inc(rq->sched_goidle);
> > +
> > +	/* kick cpufreq (see the comment in kernel/sched/sched.h). */
> > +	cpufreq_update_util(rq, SCHED_CPUFREQ_IDLE);
> > +
> >  	return rq->idle;
> >  }
> >  
> > -- 
> > 2.14.1
> > 


-- 
#include <best/regards.h>

Patrick Bellasi

Patch

diff --git a/include/linux/sched/cpufreq.h b/include/linux/sched/cpufreq.h
index d1ad3d825561..bb5f778db023 100644
--- a/include/linux/sched/cpufreq.h
+++ b/include/linux/sched/cpufreq.h
@@ -11,6 +11,7 @@ 
 #define SCHED_CPUFREQ_RT	(1U << 0)
 #define SCHED_CPUFREQ_DL	(1U << 1)
 #define SCHED_CPUFREQ_IOWAIT	(1U << 2)
+#define SCHED_CPUFREQ_IDLE	(1U << 3)
 
 #define SCHED_CPUFREQ_RT_DL	(SCHED_CPUFREQ_RT | SCHED_CPUFREQ_DL)
 
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 2f52ec0f1539..67339ccb5595 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -347,6 +347,12 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
 
 	sg_cpu->util = util;
 	sg_cpu->max = max;
+
+	/* CPU is entering IDLE, reset flags without triggering an update */
+	if (unlikely(flags & SCHED_CPUFREQ_IDLE)) {
+		sg_cpu->flags = 0;
+		goto done;
+	}
 	sg_cpu->flags = flags;
 
 	sugov_set_iowait_boost(sg_cpu, time, flags);
@@ -361,6 +367,7 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
 		sugov_update_commit(sg_policy, time, next_f);
 	}
 
+done:
 	raw_spin_unlock(&sg_policy->update_lock);
 }
 
diff --git a/kernel/sched/idle_task.c b/kernel/sched/idle_task.c
index d518664cce4f..6e8ae2aa7a13 100644
--- a/kernel/sched/idle_task.c
+++ b/kernel/sched/idle_task.c
@@ -30,6 +30,10 @@ pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
 	put_prev_task(rq, prev);
 	update_idle_core(rq);
 	schedstat_inc(rq->sched_goidle);
+
+	/* kick cpufreq (see the comment in kernel/sched/sched.h). */
+	cpufreq_update_util(rq, SCHED_CPUFREQ_IDLE);
+
 	return rq->idle;
 }