[2/4] sched: cpufreq: Keep track of cpufreq utilization update flags

Message ID 17ff0b5d83a1275a98f0d1b87daf275f3e964af3.1513158452.git.viresh.kumar@linaro.org
State New
Headers show
Series
  • sched: cpufreq: Track util update flags
Related show

Commit Message

Viresh Kumar Dec. 13, 2017, 9:53 a.m.
Currently the schedutil governor overwrites the sg_cpu->flags field on
every call to the utilization handler. It was pretty good as the initial
implementation of utilization handlers, there are several drawbacks
though.

The biggest drawback is that the sg_cpu->flags field doesn't always
represent the correct type of tasks that are enqueued on a CPU's rq. For
example, if a fair task is enqueued while a RT or DL task is running, we
will overwrite the flags with value 0 and that may take the CPU to lower
OPPs unintentionally. There can be other corner cases as well which we
aren't aware of currently.

This patch changes the current implementation to keep track of all the
task types that are currently enqueued to the CPUs rq. A new flag: CLEAR
is introduced and is set by the scheduling classes when their last task
is dequeued. When the CLEAR flag bit is set, the schedutil governor resets
all the other flag bits that are present in the flags parameter. For
now, the util update handlers return immediately if they were called to
clear the flag.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>

---
 include/linux/sched/cpufreq.h    |  7 ++++++-
 kernel/sched/cpufreq_schedutil.c | 21 ++++++++++++++++++---
 kernel/sched/deadline.c          |  4 ++++
 kernel/sched/fair.c              |  8 ++++++--
 kernel/sched/rt.c                |  4 ++++
 5 files changed, 38 insertions(+), 6 deletions(-)

-- 
2.15.0.194.g9af6a3dea062

Comments

Juri Lelli Dec. 13, 2017, 11:26 a.m. | #1
Hi Viresh,

On 13/12/17 15:23, Viresh Kumar wrote:
> Currently the schedutil governor overwrites the sg_cpu->flags field on

> every call to the utilization handler. It was pretty good as the initial

> implementation of utilization handlers, there are several drawbacks

> though.

> 

> The biggest drawback is that the sg_cpu->flags field doesn't always

> represent the correct type of tasks that are enqueued on a CPU's rq. For

> example, if a fair task is enqueued while a RT or DL task is running, we

> will overwrite the flags with value 0 and that may take the CPU to lower

> OPPs unintentionally. There can be other corner cases as well which we

> aren't aware of currently.

> 

> This patch changes the current implementation to keep track of all the

> task types that are currently enqueued to the CPUs rq. A new flag: CLEAR

> is introduced and is set by the scheduling classes when their last task

> is dequeued. When the CLEAR flag bit is set, the schedutil governor resets

> all the other flag bits that are present in the flags parameter. For

> now, the util update handlers return immediately if they were called to

> clear the flag.

> 

> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>

> ---

>  include/linux/sched/cpufreq.h    |  7 ++++++-

>  kernel/sched/cpufreq_schedutil.c | 21 ++++++++++++++++++---

>  kernel/sched/deadline.c          |  4 ++++

>  kernel/sched/fair.c              |  8 ++++++--

>  kernel/sched/rt.c                |  4 ++++

>  5 files changed, 38 insertions(+), 6 deletions(-)

> 

> diff --git a/include/linux/sched/cpufreq.h b/include/linux/sched/cpufreq.h

> index d1ad3d825561..6f6641e61236 100644

> --- a/include/linux/sched/cpufreq.h

> +++ b/include/linux/sched/cpufreq.h

> @@ -8,10 +8,15 @@

>   * Interface between cpufreq drivers and the scheduler:

>   */

>  

> +#define SCHED_CPUFREQ_CLEAR	(1U << 31)

>  #define SCHED_CPUFREQ_RT	(1U << 0)

>  #define SCHED_CPUFREQ_DL	(1U << 1)

> -#define SCHED_CPUFREQ_IOWAIT	(1U << 2)

> +#define SCHED_CPUFREQ_CFS	(1U << 2)


This flag doesn't do much, does it? I mean RT/DL/IOWAIT are used to bump
up frequency, while you are adding CFS for the sake of simmetry, right?
And with my patches DL will hopefully soon be in the same situation.
If my understanding is correct, maybe add a comment?

Apart from this, patch looks good to me; couldn't test it though, sorry.

Reviewed-by: Juri Lelli <juri.lelli@redhat.com>


Thanks,

- Juri
Rafael J. Wysocki Dec. 16, 2017, 4:40 p.m. | #2
On Wed, Dec 13, 2017 at 10:53 AM, Viresh Kumar <viresh.kumar@linaro.org> wrote:
> Currently the schedutil governor overwrites the sg_cpu->flags field on

> every call to the utilization handler. It was pretty good as the initial

> implementation of utilization handlers, there are several drawbacks

> though.

>

> The biggest drawback is that the sg_cpu->flags field doesn't always

> represent the correct type of tasks that are enqueued on a CPU's rq. For

> example, if a fair task is enqueued while a RT or DL task is running, we

> will overwrite the flags with value 0 and that may take the CPU to lower

> OPPs unintentionally. There can be other corner cases as well which we

> aren't aware of currently.

>

> This patch changes the current implementation to keep track of all the

> task types that are currently enqueued to the CPUs rq. A new flag: CLEAR

> is introduced and is set by the scheduling classes when their last task

> is dequeued. When the CLEAR flag bit is set, the schedutil governor resets

> all the other flag bits that are present in the flags parameter. For

> now, the util update handlers return immediately if they were called to

> clear the flag.

>

> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>

> ---

>  include/linux/sched/cpufreq.h    |  7 ++++++-

>  kernel/sched/cpufreq_schedutil.c | 21 ++++++++++++++++++---

>  kernel/sched/deadline.c          |  4 ++++

>  kernel/sched/fair.c              |  8 ++++++--

>  kernel/sched/rt.c                |  4 ++++

>  5 files changed, 38 insertions(+), 6 deletions(-)

>

> diff --git a/include/linux/sched/cpufreq.h b/include/linux/sched/cpufreq.h

> index d1ad3d825561..6f6641e61236 100644

> --- a/include/linux/sched/cpufreq.h

> +++ b/include/linux/sched/cpufreq.h

> @@ -8,10 +8,15 @@

>   * Interface between cpufreq drivers and the scheduler:

>   */

>

> +#define SCHED_CPUFREQ_CLEAR    (1U << 31)


I'm not thrilled by this, because schedutil is not the only user of
the flags and it's totally unclear what the other user(s) should do
when this is set.

Thanks,
Rafael
Viresh Kumar Dec. 16, 2017, 4:47 p.m. | #3
On 16 December 2017 at 22:10, Rafael J. Wysocki <rafael@kernel.org> wrote:
> On Wed, Dec 13, 2017 at 10:53 AM, Viresh Kumar <viresh.kumar@linaro.org> wrote:


>> +#define SCHED_CPUFREQ_CLEAR    (1U << 31)

>

> I'm not thrilled by this, because schedutil is not the only user of

> the flags and it's totally unclear what the other user(s) should do

> when this is set.


intel-pstate is the only other user of the IOWAIT flag, right? In order
not to change the current behavior, we can update that to return early
for now ?

--
viresh
Rafael J. Wysocki Dec. 17, 2017, 12:19 a.m. | #4
On Saturday, December 16, 2017 5:47:07 PM CET Viresh Kumar wrote:
> On 16 December 2017 at 22:10, Rafael J. Wysocki <rafael@kernel.org> wrote:

> > On Wed, Dec 13, 2017 at 10:53 AM, Viresh Kumar <viresh.kumar@linaro.org> wrote:

> 

> >> +#define SCHED_CPUFREQ_CLEAR    (1U << 31)

> >

> > I'm not thrilled by this, because schedutil is not the only user of

> > the flags and it's totally unclear what the other user(s) should do

> > when this is set.

> 

> intel-pstate is the only other user of the IOWAIT flag, right? In order

> not to change the current behavior, we can update that to return early

> for now ?


We can do that in principle, but why should it return early?  Maybe it's
a good time to update things, incidentally?

I actually don't like the SCHED_CPUFRREQ_CLEAR flag *concept* as it is very
much specific to schedutil and blatantly ignores everybody else.

Alternatively, you could add two flags for clearing SCHED_CPUFREQ_RT and
SCHED_CPUFREQ_DL that could just be ingored entirely by intel_pstate.

So, why don't you make SCHED_CPUFREQ_RT and SCHED_CPUFREQ_DL "sticky" until,
say, SCHED_CPUFREQ_NO_RT and SCHED_CPUFREQ_NO_DL are passed, respectively?

Thanks,
Rafael
Viresh Kumar Dec. 18, 2017, 4:59 a.m. | #5
On 17-12-17, 01:19, Rafael J. Wysocki wrote:
> We can do that in principle, but why should it return early?  Maybe it's

> a good time to update things, incidentally?

> 

> I actually don't like the SCHED_CPUFRREQ_CLEAR flag *concept* as it is very

> much specific to schedutil and blatantly ignores everybody else.

> 

> Alternatively, you could add two flags for clearing SCHED_CPUFREQ_RT and

> SCHED_CPUFREQ_DL that could just be ingored entirely by intel_pstate.

> 

> So, why don't you make SCHED_CPUFREQ_RT and SCHED_CPUFREQ_DL "sticky" until,

> say, SCHED_CPUFREQ_NO_RT and SCHED_CPUFREQ_NO_DL are passed, respectively?


I didn't like adding scheduling class specific flags, and wanted the code to
treat all of them in the same way. And then the governors can make a policy over
that, on what to ignore and what not to. For example with the current patchset,
the governors can know when nothing else is queued on a CPU and CPU is going to
get into idle loop. They can choose to (or not to) do something in that case.

I just thought that writing consistent (i.e. no special code) code across all
classes would be better.

-- 
viresh
Rafael J. Wysocki Dec. 18, 2017, 11:35 a.m. | #6
On Mon, Dec 18, 2017 at 5:59 AM, Viresh Kumar <viresh.kumar@linaro.org> wrote:
> On 17-12-17, 01:19, Rafael J. Wysocki wrote:

>> We can do that in principle, but why should it return early?  Maybe it's

>> a good time to update things, incidentally?

>>

>> I actually don't like the SCHED_CPUFRREQ_CLEAR flag *concept* as it is very

>> much specific to schedutil and blatantly ignores everybody else.

>>

>> Alternatively, you could add two flags for clearing SCHED_CPUFREQ_RT and

>> SCHED_CPUFREQ_DL that could just be ingored entirely by intel_pstate.

>>

>> So, why don't you make SCHED_CPUFREQ_RT and SCHED_CPUFREQ_DL "sticky" until,

>> say, SCHED_CPUFREQ_NO_RT and SCHED_CPUFREQ_NO_DL are passed, respectively?

>

> I didn't like adding scheduling class specific flags, and wanted the code to

> treat all of them in the same way. And then the governors can make a policy over

> that, on what to ignore and what not to. For example with the current patchset,

> the governors can know when nothing else is queued on a CPU and CPU is going to

> get into idle loop. They can choose to (or not to) do something in that case.


Well, if SCHED_CPUFRREQ_CLEAR means "this CPU is going to enter the
idle loop" really, then it is better to call it
SCHED_CPUFRREQ_ENTER_IDLE, for example.

SCHED_CPUFRREQ_CLEAR meaning basically "you should clear these flags
now" doesn't seem to convey any information to whoever doesn't
squirrel the flags in the first place.
Patrick Bellasi Dec. 18, 2017, 12:14 p.m. | #7
On 18-Dec 17:29, Viresh Kumar wrote:
> On 18-12-17, 12:35, Rafael J. Wysocki wrote:

> > Well, if SCHED_CPUFRREQ_CLEAR means "this CPU is going to enter the

> > idle loop" really, then it is better to call it

> > SCHED_CPUFRREQ_ENTER_IDLE, for example.

> > 

> > SCHED_CPUFRREQ_CLEAR meaning basically "you should clear these flags

> > now" doesn't seem to convey any information to whoever doesn't

> > squirrel the flags in the first place.

> 

> Right, but when all the flags are cleared, then we can infer that we

> are going to idle in the most probable case.

> 

> Anyway, I will implement RT and DL clear flags as you suggested in the

> next version.


I think Rafael is right, the current API is a big odd since we cannot
really set the CLEAR flags by itself. I mean, you can but it will not
have effects.

Thus, it's probably better from an API standpoint to have a dedicated
single clear flag... unfortunately it has to be one per class, which
is still not optimal and will likely make the policy code a bit odd.

What about extending the signature of the callback method?

For example, swithing from:

 - void (*func)(struct update_util_data *data, u64 time,
 -              unsigned int flags))
 + void (*func)(struct update_util_data *data, u64 time,
 +              unsigned int flags, bool set))

Where the additional boolean is actually used to define which
operation we wanna perform on the flags?

-- 
#include <best/regards.h>

Patrick Bellasi
Rafael J. Wysocki Dec. 18, 2017, 5:34 p.m. | #8
On Mon, Dec 18, 2017 at 12:59 PM, Viresh Kumar <viresh.kumar@linaro.org> wrote:
> On 18-12-17, 12:35, Rafael J. Wysocki wrote:

>> Well, if SCHED_CPUFRREQ_CLEAR means "this CPU is going to enter the

>> idle loop" really, then it is better to call it

>> SCHED_CPUFRREQ_ENTER_IDLE, for example.

>>

>> SCHED_CPUFRREQ_CLEAR meaning basically "you should clear these flags

>> now" doesn't seem to convey any information to whoever doesn't

>> squirrel the flags in the first place.

>

> Right, but when all the flags are cleared, then we can infer that we

> are going to idle in the most probable case.

>

> Anyway, I will implement RT and DL clear flags as you suggested in the

> next version.


Cool, thanks!
Viresh Kumar Dec. 19, 2017, 3:12 a.m. | #9
On 18-12-17, 12:14, Patrick Bellasi wrote:
> For example, swithing from:

> 

>  - void (*func)(struct update_util_data *data, u64 time,

>  -              unsigned int flags))

>  + void (*func)(struct update_util_data *data, u64 time,

>  +              unsigned int flags, bool set))

> 

> Where the additional boolean is actually used to define which

> operation we wanna perform on the flags?


The code will eventually have the same complexity or ugliness in both
the cases. I would like to start with another flag for now and see if
people prefer another parameter.

-- 
viresh
Joel Fernandes Dec. 19, 2017, 3:18 a.m. | #10
Hi Viresh,

On Mon, Dec 18, 2017 at 7:12 PM, Viresh Kumar <viresh.kumar@linaro.org> wrote:
> On 18-12-17, 12:14, Patrick Bellasi wrote:

>> For example, swithing from:

>>

>>  - void (*func)(struct update_util_data *data, u64 time,

>>  -              unsigned int flags))

>>  + void (*func)(struct update_util_data *data, u64 time,

>>  +              unsigned int flags, bool set))

>>

>> Where the additional boolean is actually used to define which

>> operation we wanna perform on the flags?

>

> The code will eventually have the same complexity or ugliness in both

> the cases. I would like to start with another flag for now and see if

> people prefer another parameter.


Though I think that will solve Rafael's concern of polluting the flags
for something schedutil specific. I also feel adding extra callback
parameter is cleaner than 2 new clear flags.

Thanks,

- Joel
Viresh Kumar Dec. 19, 2017, 3:26 a.m. | #11
On 19-12-17, 08:52, Viresh Kumar wrote:
> On 18-12-17, 19:18, Joel Fernandes wrote:

> > Hi Viresh,

> > 

> > On Mon, Dec 18, 2017 at 7:12 PM, Viresh Kumar <viresh.kumar@linaro.org> wrote:

> > > On 18-12-17, 12:14, Patrick Bellasi wrote:

> > >> For example, swithing from:

> > >>

> > >>  - void (*func)(struct update_util_data *data, u64 time,

> > >>  -              unsigned int flags))

> > >>  + void (*func)(struct update_util_data *data, u64 time,

> > >>  +              unsigned int flags, bool set))

> > >>

> > >> Where the additional boolean is actually used to define which

> > >> operation we wanna perform on the flags?

> > >

> > > The code will eventually have the same complexity or ugliness in both

> > > the cases. I would like to start with another flag for now and see if

> > > people prefer another parameter.

> > 

> > Though I think that will solve Rafael's concern of polluting the flags

> > for something schedutil specific. I also feel adding extra callback

> > parameter is cleaner than 2 new clear flags.

> 

> Okay, I will then wait for Rafael to come online and comment on what

> he would prefer before posting.


I thought about it once more. If we decide eventually to add another
parameter, then why isn't the approach that this patch takes better
than that? i.e. Use the 31st bit of flags for clear bit ? We can
remove setting/clearing flags for CFS, that's it.

-- 
viresh
Joel Fernandes Dec. 19, 2017, 3:30 a.m. | #12
On Mon, Dec 18, 2017 at 7:26 PM, Viresh Kumar <viresh.kumar@linaro.org> wrote:
> On 19-12-17, 08:52, Viresh Kumar wrote:

>> On 18-12-17, 19:18, Joel Fernandes wrote:

>> > Hi Viresh,

>> >

>> > On Mon, Dec 18, 2017 at 7:12 PM, Viresh Kumar <viresh.kumar@linaro.org> wrote:

>> > > On 18-12-17, 12:14, Patrick Bellasi wrote:

>> > >> For example, swithing from:

>> > >>

>> > >>  - void (*func)(struct update_util_data *data, u64 time,

>> > >>  -              unsigned int flags))

>> > >>  + void (*func)(struct update_util_data *data, u64 time,

>> > >>  +              unsigned int flags, bool set))

>> > >>

>> > >> Where the additional boolean is actually used to define which

>> > >> operation we wanna perform on the flags?

>> > >

>> > > The code will eventually have the same complexity or ugliness in both

>> > > the cases. I would like to start with another flag for now and see if

>> > > people prefer another parameter.

>> >

>> > Though I think that will solve Rafael's concern of polluting the flags

>> > for something schedutil specific. I also feel adding extra callback

>> > parameter is cleaner than 2 new clear flags.

>>

>> Okay, I will then wait for Rafael to come online and comment on what

>> he would prefer before posting.

>

> I thought about it once more. If we decide eventually to add another

> parameter, then why isn't the approach that this patch takes better

> than that? i.e. Use the 31st bit of flags for clear bit ? We can

> remove setting/clearing flags for CFS, that's it.


Yes that's clean to me but then as Rafael said, the use of this flag
will be too specific for schedutil-only sg_cpu->flags clearing purpose
right?

If adding the single flag is OK in the grand cpufreq scheme of things,
then that's fine with me.

Thanks,

- Joel
Viresh Kumar Dec. 19, 2017, 3:41 a.m. | #13
On 18-12-17, 19:30, Joel Fernandes wrote:
> Yes that's clean to me but then as Rafael said, the use of this flag

> will be too specific for schedutil-only sg_cpu->flags clearing purpose

> right?


And so would be the extra parameter ?

-- 
viresh
Peter Zijlstra Dec. 19, 2017, 7:25 p.m. | #14
On Wed, Dec 13, 2017 at 03:23:21PM +0530, Viresh Kumar wrote:
> Currently the schedutil governor overwrites the sg_cpu->flags field on

> every call to the utilization handler. It was pretty good as the initial

> implementation of utilization handlers, there are several drawbacks

> though.

> 

> The biggest drawback is that the sg_cpu->flags field doesn't always

> represent the correct type of tasks that are enqueued on a CPU's rq. For

> example, if a fair task is enqueued while a RT or DL task is running, we

> will overwrite the flags with value 0 and that may take the CPU to lower

> OPPs unintentionally. There can be other corner cases as well which we

> aren't aware of currently.

> 

> This patch changes the current implementation to keep track of all the

> task types that are currently enqueued to the CPUs rq. A new flag: CLEAR

> is introduced and is set by the scheduling classes when their last task

> is dequeued. When the CLEAR flag bit is set, the schedutil governor resets

> all the other flag bits that are present in the flags parameter. For

> now, the util update handlers return immediately if they were called to

> clear the flag.


Yeah, not happy about this either; we had code that did the right thing
without this extra tracking I think.

Also, we can look at the rq state if we want to, we don't need to
duplicate that state.
Viresh Kumar Dec. 20, 2017, 4:04 a.m. | #15
On 19-12-17, 20:25, Peter Zijlstra wrote:
> Yeah, not happy about this either; we had code that did the right thing

> without this extra tracking I think.


Sure, but how do you suggest we fix the problems we are facing with
the current design? Patrick had a completely different proposal for
solving those problems, which I didn't like very much. This patchset
replaced these patches from Patrick:

- [PATCH v3 1/6] cpufreq: schedutil: reset sg_cpus's flags at IDLE enter
  https://marc.info/?l=linux-kernel&m=151204247801633&w=2

- [PATCH v3 2/6] cpufreq: schedutil: ensure max frequency while
  running RT/DL tasks
  https://marc.info/?l=linux-kernel&m=151204253801657&w=2

- [PATCH v3 6/6] cpufreq: schedutil: ignore sugov kthreads
  https://marc.info/?l=linux-kernel&m=151204251501647&w=2

> Also, we can look at the rq state if we want to, we don't need to

> duplicate that state.


Well that also looks fine to me, and that would mean this:

- We remove SCHED_CPUFREQ_RT and SCHED_CPUFREQ_DL flags, but still
  call the utilization callbacks from RT and DL classes.

- From the utilization handler, we check runqueues of all three sched
  classes to see if they have some work pending (this can be done
  smartly by checking only RT first and skipping other checks if RT
  has some work).

Will that be acceptable ?

-- 
viresh
Peter Zijlstra Dec. 20, 2017, 8:31 a.m. | #16
On Wed, Dec 20, 2017 at 09:34:46AM +0530, Viresh Kumar wrote:
> On 19-12-17, 20:25, Peter Zijlstra wrote:

> > Yeah, not happy about this either; we had code that did the right thing

> > without this extra tracking I think.

> 

> Sure, but how do you suggest we fix the problems we are facing with

> the current design? Patrick had a completely different proposal for

> solving those problems, which I didn't like very much. This patchset

> replaced these patches from Patrick:


Please use the normal link format:

  https://lkml.kernel.org/r/$MSGID

Then I can find them without having to resort to a frigging browser
thing.

I'll try and dig through the email I have.

> > Also, we can look at the rq state if we want to, we don't need to

> > duplicate that state.

> 

> Well that also looks fine to me, and that would mean this:

> 

> - We remove SCHED_CPUFREQ_RT and SCHED_CPUFREQ_DL flags, but still

>   call the utilization callbacks from RT and DL classes.


Didn't juri have patches to make DL do something sane? But yes, I think
those flags are part of the problem.

> - From the utilization handler, we check runqueues of all three sched

>   classes to see if they have some work pending (this can be done

>   smartly by checking only RT first and skipping other checks if RT

>   has some work).


No that's wrong. DL should provide a minimum required based on existing
reservations, we can add the expected CFS average on top and request
that.

And for RT all we need to know is if current is of that class, otherwise
we don't care.
Viresh Kumar Dec. 20, 2017, 8:48 a.m. | #17
On 20-12-17, 09:31, Peter Zijlstra wrote:
> On Wed, Dec 20, 2017 at 09:34:46AM +0530, Viresh Kumar wrote:


> Please use the normal link format:

> 

>   https://lkml.kernel.org/r/$MSGID

> 

> Then I can find them without having to resort to a frigging browser

> thing.


Sure, and that would be much easier for me as well as that's how I got
to those links. Here they are again ..

https://lkml.kernel.org/r/20171130114723.29210-2-patrick.bellasi@arm.com
https://lkml.kernel.org/r/20171130114723.29210-3-patrick.bellasi@arm.com
https://lkml.kernel.org/r/20171130114723.29210-7-patrick.bellasi@arm.com

> I'll try and dig through the email I have.


Thanks.

> > Well that also looks fine to me, and that would mean this:

> > 

> > - We remove SCHED_CPUFREQ_RT and SCHED_CPUFREQ_DL flags, but still

> >   call the utilization callbacks from RT and DL classes.

> 

> Didn't juri have patches to make DL do something sane? But yes, I think

> those flags are part of the problem.


Sure, DL will be more like CFS going forward. I was just commenting
based on what we have upstream today.

> > - From the utilization handler, we check runqueues of all three sched

> >   classes to see if they have some work pending (this can be done

> >   smartly by checking only RT first and skipping other checks if RT

> >   has some work).

> 

> No that's wrong. DL should provide a minimum required based on existing

> reservations, we can add the expected CFS average on top and request

> that.


Right, that should be the case after Juri's patches.

> And for RT all we need to know is if current is of that class, otherwise

> we don't care.


What about this case: A CFS task is running currently and an RT task
is enqueued.

- Is it always the case that the CFS task is preempted immediately and
  the CPU is given to RT task ? I was talking to Vincent earlier and
  he told me that for certain configurations the CFS task may keep
  running until the next tick.

- What if the CFS task has disabled preemption ?

- More corner cases like this ?

Above cases may not let schedutil to raise frequency to MAX even when
we have RT stuff enqueued. And that's why I tried to track all sched
classes for which we have work enqueued for. There are great chances
that my understanding here is wrong though :)

-- 
viresh
Peter Zijlstra Dec. 20, 2017, 9:17 a.m. | #18
On Wed, Dec 20, 2017 at 02:18:59PM +0530, Viresh Kumar wrote:
> On 20-12-17, 09:31, Peter Zijlstra wrote:

> What about this case: A CFS task is running currently and an RT task

> is enqueued.

> 

> - Is it always the case that the CFS task is preempted immediately and

>   the CPU is given to RT task ? I was talking to Vincent earlier and

>   he told me that for certain configurations the CFS task may keep

>   running until the next tick.

> 

> - What if the CFS task has disabled preemption ?

> 

> - More corner cases like this ?


If there is a runnable RT task we'll schedule to it ASAP. This means
for:

 CONFIG_PREEMPT=y

   either immediately or when next preempt_enable() hits a 0
   preempt_count.

 CONFIG_PREEMPT=n

   on any next kernel preemption point or when returning to userspace.

But this is not something we should worry about.
Patrick Bellasi Dec. 20, 2017, 12:55 p.m. | #19
On 20-Dec 09:31, Peter Zijlstra wrote:
> On Wed, Dec 20, 2017 at 09:34:46AM +0530, Viresh Kumar wrote:

> > On 19-12-17, 20:25, Peter Zijlstra wrote:

> > > Yeah, not happy about this either; we had code that did the right thing

> > > without this extra tracking I think.

> > 

> > Sure, but how do you suggest we fix the problems we are facing with

> > the current design? Patrick had a completely different proposal for

> > solving those problems, which I didn't like very much. This patchset

> > replaced these patches from Patrick:

> 

> Please use the normal link format:

> 

>   https://lkml.kernel.org/r/$MSGID

> 

> Then I can find them without having to resort to a frigging browser

> thing.

> 

> I'll try and dig through the email I have.

> 

> > > Also, we can look at the rq state if we want to, we don't need to

> > > duplicate that state.

> > 

> > Well that also looks fine to me, and that would mean this:

> > 

> > - We remove SCHED_CPUFREQ_RT and SCHED_CPUFREQ_DL flags, but still

> >   call the utilization callbacks from RT and DL classes.

> 

> Didn't juri have patches to make DL do something sane? But yes, I think

> those flags are part of the problem.


He recently reposted them here:

  https://lkml.kernel.org/r/20171204102325.5110-1-juri.lelli@redhat.com
 
> > - From the utilization handler, we check runqueues of all three sched

> >   classes to see if they have some work pending (this can be done

> >   smartly by checking only RT first and skipping other checks if RT

> >   has some work).

> 

> No that's wrong. DL should provide a minimum required based on existing

> reservations, we can add the expected CFS average on top and request

> that.

> 

> And for RT all we need to know is if current is of that class, otherwise

> we don't care.


So, this:

   https://marc.info/?i=20171130114723.29210-3-patrick.bellasi%40arm.com

was actually going in this direction, although still working on top of
flags to not change the existing interface too much.

IMO, the advantage of flags is that they are a sort-of "pro-active"
approach, where the scheduler notify sensible events to schedutil.
But keep adding flags seems to overkilling to me too.

If we remove flags then we have to query the scheduler classes "on
demand"... but, as Peter suggests, once we have DL bits Juri posted,
the only issue if to know if an RT task is running.
This the patch above can be just good enough, with no flags at all and
with just a check for current being RT (or DL for the time being).


-- 
#include <best/regards.h>

Patrick Bellasi
Peter Zijlstra Dec. 20, 2017, 1:28 p.m. | #20
On Wed, Dec 20, 2017 at 12:55:46PM +0000, Patrick Bellasi wrote:
> On 20-Dec 09:31, Peter Zijlstra wrote:


> > Didn't juri have patches to make DL do something sane? But yes, I think

> > those flags are part of the problem.

> 

> He recently reposted them here:

> 

>   https://lkml.kernel.org/r/20171204102325.5110-1-juri.lelli@redhat.com


Yeah, just found them and actually munged them into my queue; did all
the modifications you suggested too. Lets see if it comes apart.

> > > - From the utilization handler, we check runqueues of all three sched

> > >   classes to see if they have some work pending (this can be done

> > >   smartly by checking only RT first and skipping other checks if RT

> > >   has some work).

> > 

> > No that's wrong. DL should provide a minimum required based on existing

> > reservations, we can add the expected CFS average on top and request

> > that.

> > 

> > And for RT all we need to know is if current is of that class, otherwise

> > we don't care.

> 

> So, this:

> 

>    https://marc.info/?i=20171130114723.29210-3-patrick.bellasi%40arm.com


Right, I was actually looking for those patches, but I'm searching
backwards and hit upon Juri's patches first.

> was actually going in this direction, although still working on top of

> flags to not change the existing interface too much.

> 

> IMO, the advantage of flags is that they are a sort-of "pro-active"

> approach, where the scheduler notify sensible events to schedutil.

> But keep adding flags seems to overkilling to me too.

> 

> If we remove flags then we have to query the scheduler classes "on

> demand"... but, as Peter suggests, once we have DL bits Juri posted,

> the only issue if to know if an RT task is running.

> This the patch above can be just good enough, with no flags at all and

> with just a check for current being RT (or DL for the time being).


Well, we still need flags for crap like IO-WAIT IIRC. That's sugov
internal state and not something the scheduler actually already knows.

But let me continue searching for patches..

Ooh, I found patches from Brendan... should be very close to yours
though, going by that msgid you posted on Nov 30th and I'm now on Dec
1st, soooon... :-)
Peter Zijlstra Dec. 20, 2017, 2:51 p.m. | #21
On Wed, Dec 20, 2017 at 03:47:23PM +0100, Rafael J. Wysocki wrote:
> > Well, we still need flags for crap like IO-WAIT IIRC. That's sugov

> > internal state and not something the scheduler actually already knows.

> 

> Not only sugov to be precise, but yes.


Oh, right, intel_pstate also consumes that one..
Rafael J. Wysocki Dec. 20, 2017, 2:52 p.m. | #22
On Wednesday, December 20, 2017 3:31:00 PM CET Patrick Bellasi wrote:
> On 20-Dec 14:28, Peter Zijlstra wrote:

> > On Wed, Dec 20, 2017 at 12:55:46PM +0000, Patrick Bellasi wrote:

> > > On 20-Dec 09:31, Peter Zijlstra wrote:

> > 

> > > > Didn't juri have patches to make DL do something sane? But yes, I think

> > > > those flags are part of the problem.

> > > 

> > > He recently reposted them here:

> > > 

> > >   https://lkml.kernel.org/r/20171204102325.5110-1-juri.lelli@redhat.com

> > 

> > Yeah, just found them and actually munged them into my queue; did all

> > the modifications you suggested too. Lets see if it comes apart.

> > 

> > > > > - From the utilization handler, we check runqueues of all three sched

> > > > >   classes to see if they have some work pending (this can be done

> > > > >   smartly by checking only RT first and skipping other checks if RT

> > > > >   has some work).

> > > > 

> > > > No that's wrong. DL should provide a minimum required based on existing

> > > > reservations, we can add the expected CFS average on top and request

> > > > that.

> > > > 

> > > > And for RT all we need to know is if current is of that class, otherwise

> > > > we don't care.

> > > 

> > > So, this:

> > > 

> > >    https://marc.info/?i=20171130114723.29210-3-patrick.bellasi%40arm.com

> > 

> > Right, I was actually looking for those patches, but I'm searching

> > backwards and hit upon Juri's patches first.

> > 

> > > was actually going in this direction, although still working on top of

> > > flags to not change the existing interface too much.

> > > 

> > > IMO, the advantage of flags is that they are a sort-of "pro-active"

> > > approach, where the scheduler notify sensible events to schedutil.

> > > But keep adding flags seems to overkilling to me too.

> > > 

> > > If we remove flags then we have to query the scheduler classes "on

> > > demand"... but, as Peter suggests, once we have DL bits Juri posted,

> > > the only issue if to know if an RT task is running.

> > > This the patch above can be just good enough, with no flags at all and

> > > with just a check for current being RT (or DL for the time being).

> > 

> > Well, we still need flags for crap like IO-WAIT IIRC. That's sugov

> > internal state and not something the scheduler actually already knows.

> 

> Right, that flag is set from:

> 

>     core.c::io_schedule_prepare()

> 

> for the current task, which is going to be dequeued soon.

> 

> Once it wakes up the next time, at enqueue time we trigger a boosting

> by passing schedutil that flag.

> 

> Thus, unless we are happy to delay the boosting until the task is

> actually picked for execution (don't think so), then we need to keep

> the flag and signal schedutil at enqueue time.

> 

> However, was wondering one thing: should't we already have a vruntime

> bonus for IO sleeping tasks? Because in that case, the task is likely

> to be on CPU quite soon... and thus, perhaps by removing the flag and

> moving the schedutil notification into core.c at the end of

> __schedule() should be working to detect both RT and FAIR::IOWAIT

> boosted tasks.


schedutil is not the only user of this flag.

Thanks,
Rafael
Patrick Bellasi Dec. 20, 2017, 3:01 p.m. | #23
On 20-Dec 15:52, Rafael J. Wysocki wrote:
> On Wednesday, December 20, 2017 3:31:00 PM CET Patrick Bellasi wrote:

> > On 20-Dec 14:28, Peter Zijlstra wrote:

> > > On Wed, Dec 20, 2017 at 12:55:46PM +0000, Patrick Bellasi wrote:

> > > > On 20-Dec 09:31, Peter Zijlstra wrote:

> > > 

> > > > > Didn't juri have patches to make DL do something sane? But yes, I think

> > > > > those flags are part of the problem.

> > > > 

> > > > He recently reposted them here:

> > > > 

> > > >   https://lkml.kernel.org/r/20171204102325.5110-1-juri.lelli@redhat.com

> > > 

> > > Yeah, just found them and actually munged them into my queue; did all

> > > the modifications you suggested too. Lets see if it comes apart.

> > > 

> > > > > > - From the utilization handler, we check runqueues of all three sched

> > > > > >   classes to see if they have some work pending (this can be done

> > > > > >   smartly by checking only RT first and skipping other checks if RT

> > > > > >   has some work).

> > > > > 

> > > > > No that's wrong. DL should provide a minimum required based on existing

> > > > > reservations, we can add the expected CFS average on top and request

> > > > > that.

> > > > > 

> > > > > And for RT all we need to know is if current is of that class, otherwise

> > > > > we don't care.

> > > > 

> > > > So, this:

> > > > 

> > > >    https://marc.info/?i=20171130114723.29210-3-patrick.bellasi%40arm.com

> > > 

> > > Right, I was actually looking for those patches, but I'm searching

> > > backwards and hit upon Juri's patches first.

> > > 

> > > > was actually going in this direction, although still working on top of

> > > > flags to not change the existing interface too much.

> > > > 

> > > > IMO, the advantage of flags is that they are a sort-of "pro-active"

> > > > approach, where the scheduler notify sensible events to schedutil.

> > > > But keep adding flags seems to overkilling to me too.

> > > > 

> > > > If we remove flags then we have to query the scheduler classes "on

> > > > demand"... but, as Peter suggests, once we have DL bits Juri posted,

> > > > the only issue if to know if an RT task is running.

> > > > This the patch above can be just good enough, with no flags at all and

> > > > with just a check for current being RT (or DL for the time being).

> > > 

> > > Well, we still need flags for crap like IO-WAIT IIRC. That's sugov

> > > internal state and not something the scheduler actually already knows.

> > 

> > Right, that flag is set from:

> > 

> >     core.c::io_schedule_prepare()

> > 

> > for the current task, which is going to be dequeued soon.

> > 

> > Once it wakes up the next time, at enqueue time we trigger a boosting

> > by passing schedutil that flag.

> > 

> > Thus, unless we are happy to delay the boosting until the task is

> > actually picked for execution (don't think so), then we need to keep

> > the flag and signal schedutil at enqueue time.

> > 

> > However, was wondering one thing: should't we already have a vruntime

> > bonus for IO sleeping tasks? Because in that case, the task is likely

> > to be on CPU quite soon... and thus, perhaps by removing the flag and

> > moving the schedutil notification into core.c at the end of

> > __schedule() should be working to detect both RT and FAIR::IOWAIT

> > boosted tasks.

> 

> schedutil is not the only user of this flag.


Sure, but with the idea above (not completely sure it makes sense)
intel_pstate_update_util() can still get the IIOWAIT information.

We just get that info from current->in_iowait instead of checking a
flag which is passed in via callback.

> Thanks,

> Rafael

> 


-- 
#include <best/regards.h>

Patrick Bellasi
Peter Zijlstra Dec. 20, 2017, 6:17 p.m. | #24
On Wed, Dec 20, 2017 at 06:27:17PM +0100, Juri Lelli wrote:

> Thanks Peter for taking the patches. I was actually waiting for the flag

> thing to be resolved to post again. :/


Yeah, I took them because it made sorting that easier, n/p.

Patch

diff --git a/include/linux/sched/cpufreq.h b/include/linux/sched/cpufreq.h
index d1ad3d825561..6f6641e61236 100644
--- a/include/linux/sched/cpufreq.h
+++ b/include/linux/sched/cpufreq.h
@@ -8,10 +8,15 @@ 
  * Interface between cpufreq drivers and the scheduler:
  */
 
+#define SCHED_CPUFREQ_CLEAR	(1U << 31)
 #define SCHED_CPUFREQ_RT	(1U << 0)
 #define SCHED_CPUFREQ_DL	(1U << 1)
-#define SCHED_CPUFREQ_IOWAIT	(1U << 2)
+#define SCHED_CPUFREQ_CFS	(1U << 2)
+#define SCHED_CPUFREQ_IOWAIT	(1U << 3)
 
+#define SCHED_CPUFREQ_RT_CLEAR	(SCHED_CPUFREQ_RT | SCHED_CPUFREQ_CLEAR)
+#define SCHED_CPUFREQ_DL_CLEAR	(SCHED_CPUFREQ_DL | SCHED_CPUFREQ_CLEAR)
+#define SCHED_CPUFREQ_CFS_CLEAR	(SCHED_CPUFREQ_CFS | SCHED_CPUFREQ_CLEAR)
 #define SCHED_CPUFREQ_RT_DL	(SCHED_CPUFREQ_RT | SCHED_CPUFREQ_DL)
 
 #ifdef CONFIG_CPU_FREQ
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index e8ccfa30f01a..60a2dea4c8cc 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -191,6 +191,8 @@  static void sugov_set_iowait_boost(struct sugov_cpu *sg_cpu, u64 time,
 				   unsigned int flags)
 {
 	if (flags & SCHED_CPUFREQ_IOWAIT) {
+		sg_cpu->flags &= ~SCHED_CPUFREQ_IOWAIT;
+
 		if (sg_cpu->iowait_boost_pending)
 			return;
 
@@ -264,6 +266,13 @@  static void sugov_update_single(struct update_util_data *hook, u64 time,
 	unsigned int next_f;
 	bool busy;
 
+	if (unlikely(flags & SCHED_CPUFREQ_CLEAR)) {
+		sg_cpu->flags &= ~flags;
+		return;
+	}
+
+	sg_cpu->flags |= flags;
+
 	sugov_set_iowait_boost(sg_cpu, time, flags);
 	sg_cpu->last_update = time;
 
@@ -272,7 +281,7 @@  static void sugov_update_single(struct update_util_data *hook, u64 time,
 
 	busy = sugov_cpu_is_busy(sg_cpu);
 
-	if (flags & SCHED_CPUFREQ_RT_DL) {
+	if (sg_cpu->flags & SCHED_CPUFREQ_RT_DL) {
 		next_f = policy->cpuinfo.max_freq;
 	} else {
 		sugov_get_util(&util, &max, sg_cpu->cpu);
@@ -345,15 +354,20 @@  static void sugov_update_shared(struct update_util_data *hook, u64 time,
 
 	raw_spin_lock(&sg_policy->update_lock);
 
+	if (unlikely(flags & SCHED_CPUFREQ_CLEAR)) {
+		sg_cpu->flags &= ~flags;
+		goto unlock;
+	}
+
 	sg_cpu->util = util;
 	sg_cpu->max = max;
-	sg_cpu->flags = flags;
+	sg_cpu->flags |= flags;
 
 	sugov_set_iowait_boost(sg_cpu, time, flags);
 	sg_cpu->last_update = time;
 
 	if (sugov_should_update_freq(sg_policy, time)) {
-		if (flags & SCHED_CPUFREQ_RT_DL)
+		if (sg_cpu->flags & SCHED_CPUFREQ_RT_DL)
 			next_f = sg_policy->policy->cpuinfo.max_freq;
 		else
 			next_f = sugov_next_freq_shared(sg_cpu, time);
@@ -361,6 +375,7 @@  static void sugov_update_shared(struct update_util_data *hook, u64 time,
 		sugov_update_commit(sg_policy, time, next_f);
 	}
 
+unlock:
 	raw_spin_unlock(&sg_policy->update_lock);
 }
 
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 2473736c7616..d9c7c6887493 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1472,6 +1472,10 @@  static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 	 */
 	if (flags & DEQUEUE_SLEEP)
 		task_non_contending(p);
+
+	/* Clear cpufreq flags after last deadline task is dequeued */
+	if (!rq->dl.dl_nr_running)
+		cpufreq_update_util(rq, SCHED_CPUFREQ_DL_CLEAR);
 }
 
 /*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2915c0d95107..492188c3ee2d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3033,7 +3033,7 @@  static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq)
 		 *
 		 * See cpu_util().
 		 */
-		cpufreq_update_util(rq, 0);
+		cpufreq_update_util(rq, SCHED_CPUFREQ_CFS);
 	}
 }
 
@@ -5214,7 +5214,7 @@  enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	 * passed.
 	 */
 	if (p->in_iowait)
-		cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT);
+		cpufreq_update_util(rq, SCHED_CPUFREQ_CFS | SCHED_CPUFREQ_IOWAIT);
 
 	for_each_sched_entity(se) {
 		if (se->on_rq)
@@ -5309,6 +5309,10 @@  static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		sub_nr_running(rq, 1);
 
 	hrtick_update(rq);
+
+	/* Clear cpufreq flags after last CFS task is dequeued */
+	if (!rq->cfs.nr_running)
+		cpufreq_update_util(rq, SCHED_CPUFREQ_CFS_CLEAR);
 }
 
 #ifdef CONFIG_SMP
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 862a513adca3..c9e8a8e5641b 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1337,6 +1337,10 @@  static void dequeue_task_rt(struct rq *rq, struct task_struct *p, int flags)
 	dequeue_rt_entity(rt_se, flags);
 
 	dequeue_pushable_task(rq, p);
+
+	/* Clear cpufreq flags after last rt task is dequeued */
+	if (!rq->rt.rt_nr_running)
+		cpufreq_update_util(rq, SCHED_CPUFREQ_RT_CLEAR);
 }
 
 /*