cpufreq: schedutil: Don't skip freq update when limits change

Message ID 8091ef83f264feb2feaa827fbeefe08348bcd05d.1563778071.git.viresh.kumar@linaro.org
State New
Headers show
Series
  • cpufreq: schedutil: Don't skip freq update when limits change
Related show

Commit Message

Viresh Kumar July 22, 2019, 6:51 a.m.
To avoid reducing the frequency of a CPU prematurely, we skip reducing
the frequency if the CPU had been busy recently.

This should not be done when the limits of the policy are changed, for
example due to thermal throttling. We should always get the frequency
within limits as soon as possible.

Fixes: ecd288429126 ("cpufreq: schedutil: Don't set next_freq to UINT_MAX")
Cc: v4.18+ <stable@vger.kernel.org> # v4.18+
Reported-by: Doug Smythies <doug.smythies@gmail.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>

---
@Doug: Please try this patch, it must fix the issue you reported.

 kernel/sched/cpufreq_schedutil.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

-- 
2.21.0.rc0.269.g1a574e7a288b

Comments

Rafael J. Wysocki July 23, 2019, 9:13 a.m. | #1
On Tue, Jul 23, 2019 at 9:10 AM Doug Smythies <dsmythies@telus.net> wrote:
>

> On 2019.07.21 23:52 Viresh Kumar wrote:

>

> > To avoid reducing the frequency of a CPU prematurely, we skip reducing

> > the frequency if the CPU had been busy recently.

> >

> > This should not be done when the limits of the policy are changed, for

> > example due to thermal throttling. We should always get the frequency

> > within limits as soon as possible.

> >

> > Fixes: ecd288429126 ("cpufreq: schedutil: Don't set next_freq to UINT_MAX")

> > Cc: v4.18+ <stable@vger.kernel.org> # v4.18+

> > Reported-by: Doug Smythies <doug.smythies@gmail.com>

> > Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>

> > ---

> > @Doug: Please try this patch, it must fix the issue you reported.

>

> It fixes the driver = acpi-cpufreq ; governor = schedutil test case

> It does not fix the driver = intel_cpufreq ; governor = schedutil test case


So what's the difference between them, with the patch applied?

> I have checked my results twice, but will check again in the day or two.


OK, thanks!
Viresh Kumar July 23, 2019, 9:15 a.m. | #2
On 23-07-19, 00:10, Doug Smythies wrote:
> On 2019.07.21 23:52 Viresh Kumar wrote:

> 

> > To avoid reducing the frequency of a CPU prematurely, we skip reducing

> > the frequency if the CPU had been busy recently.

> > 

> > This should not be done when the limits of the policy are changed, for

> > example due to thermal throttling. We should always get the frequency

> > within limits as soon as possible.

> >

> > Fixes: ecd288429126 ("cpufreq: schedutil: Don't set next_freq to UINT_MAX")

> > Cc: v4.18+ <stable@vger.kernel.org> # v4.18+

> > Reported-by: Doug Smythies <doug.smythies@gmail.com>

> > Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>

> > ---

> > @Doug: Please try this patch, it must fix the issue you reported.

> 

> It fixes the driver = acpi-cpufreq ; governor = schedutil test case

> It does not fix the driver = intel_cpufreq ; governor = schedutil test case

> 

> I have checked my results twice, but will check again in the day or two.


The patch you tried to revert wasn't doing any driver specific stuff
but only schedutil. If that revert fixes your issue with both the
drivers, then this patch should do it as well.

I am clueless now on what can go wrong with intel_cpufreq driver with
schedutil now.

Though there is one difference between intel_cpufreq and acpi_cpufreq,
intel_cpufreq has fast_switch_possible=true and so it uses slightly
different path in schedutil. I tried to look from that perspective as
well but couldn't find anything wrong.

If you still find intel_cpufreq to be broken, even with this patch,
please set fast_switch_possible=false instead of true in
__intel_pstate_cpu_init() and try tests again. That shall make it very
much similar to acpi-cpufreq driver.

-- 
viresh
Viresh Kumar July 26, 2019, 6:57 a.m. | #3
On 25-07-19, 08:20, Doug Smythies wrote:
> I tried the patch ("patch2"). It did not fix the issue.

> 

> To summarize, all kernel 5.2 based, all intel_cpufreq driver and schedutil governor:

> 

> Test: Does a busy system respond to maximum CPU clock frequency reduction?

> 

> stock, unaltered: No.

> revert ecd2884291261e3fddbc7651ee11a20d596bb514: Yes

> viresh patch: No.

> fast_switch edit: No.

> viresh patch2: No.


Hmm, so I tried to reproduce your setup on my ARM board.
- booted only with CPU0 so I hit the sugov_update_single() routine
- And applied below diff to make CPU look permanently busy:

-------------------------8<-------------------------
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 2f382b0959e5..afb47490e5dc 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -121,6 +121,7 @@ static void sugov_fast_switch(struct sugov_policy *sg_policy, u64 time,
        if (!sugov_update_next_freq(sg_policy, time, next_freq))
                return;
 
+       pr_info("%s: %d: %u\n", __func__, __LINE__, freq);
        next_freq = cpufreq_driver_fast_switch(policy, next_freq);
        if (!next_freq)
                return;
@@ -424,14 +425,10 @@ static unsigned long sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time,
 #ifdef CONFIG_NO_HZ_COMMON
 static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu)
 {
-       unsigned long idle_calls = tick_nohz_get_idle_calls_cpu(sg_cpu->cpu);
-       bool ret = idle_calls == sg_cpu->saved_idle_calls;
-
-       sg_cpu->saved_idle_calls = idle_calls;
-       return ret;
+       return true;
 }
 #else
-static inline bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) { return false; }
+static inline bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) { return true; }
 #endif /* CONFIG_NO_HZ_COMMON */
 
 /*
@@ -565,6 +562,7 @@ static void sugov_work(struct kthread_work *work)
        sg_policy->work_in_progress = false;
        raw_spin_unlock_irqrestore(&sg_policy->update_lock, flags);
 
+       pr_info("%s: %d: %u\n", __func__, __LINE__, freq);
        mutex_lock(&sg_policy->work_lock);
        __cpufreq_driver_target(sg_policy->policy, freq, CPUFREQ_RELATION_L);
        mutex_unlock(&sg_policy->work_lock);

-------------------------8<-------------------------

Now, the frequency never gets down and so gets set to the maximum
possible after a bit.

- Then I did:

echo <any-low-freq-value> > /sys/devices/system/cpu/cpufreq/policy0/scaling_max_freq

Without my patch applied:
        The print never gets printed and so frequency doesn't go down.

With my patch applied:
        The print gets printed immediately from sugov_work() and so
        the frequency reduces.

Can you try with this diff along with my Patch2 ? I suspect there may
be something wrong with the intel_cpufreq driver as the patch fixes
the only path we have in the schedutil governor which takes busyness
of a CPU into account.

-- 
viresh
Viresh Kumar July 29, 2019, 8:32 a.m. | #4
On 29-07-19, 00:55, Doug Smythies wrote:
> On 2019.07.25 23:58 Viresh Kumar wrote:

> > Hmm, so I tried to reproduce your setup on my ARM board.

> > - booted only with CPU0 so I hit the sugov_update_single() routine

> > - And applied below diff to make CPU look permanently busy:

> >

> > -------------------------8<-------------------------

> >diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c

> > index 2f382b0959e5..afb47490e5dc 100644

> > --- a/kernel/sched/cpufreq_schedutil.c

> > +++ b/kernel/sched/cpufreq_schedutil.c

> > @@ -121,6 +121,7 @@ static void sugov_fast_switch(struct sugov_policy *sg_policy, u64 time,

> >         if (!sugov_update_next_freq(sg_policy, time, next_freq))

> >                return;

> > 

> > +       pr_info("%s: %d: %u\n", __func__, __LINE__, freq);

> 

> ?? there is no "freq" variable here, and so this doesn't compile. However this works:

> 

> +       pr_info("%s: %d: %u\n", __func__, __LINE__, next_freq);


There are two paths we can take to change the frequency, normal
sleep-able path (sugov_work) or fast path. Only one of them is taken
by any driver ever. In your case it is the fast path always and in
mine it was the slow path.

I only tested the diff with slow-path and copy pasted to fast path
while giving out to you and so the build issue. Sorry about that.

Also make sure that the print is added after sugov_update_next_freq()
is called, not before it.

> >         next_freq = cpufreq_driver_fast_switch(policy, next_freq);

> >        if (!next_freq)

> >                return;

> > @@ -424,14 +425,10 @@ static unsigned long sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time,

> > #ifdef CONFIG_NO_HZ_COMMON

> > static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu)

> > {

> > -       unsigned long idle_calls = tick_nohz_get_idle_calls_cpu(sg_cpu->cpu);

> > -       bool ret = idle_calls == sg_cpu->saved_idle_calls;

> > -

> > -       sg_cpu->saved_idle_calls = idle_calls;

> > -       return ret;

> > +       return true;

> >  }

> >  #else

> > -static inline bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) { return false; }

> > +static inline bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) { return true; }

> > #endif /* CONFIG_NO_HZ_COMMON */

> > 

> >  /*

> > @@ -565,6 +562,7 @@ static void sugov_work(struct kthread_work *work)

> >         sg_policy->work_in_progress = false;

> >         raw_spin_unlock_irqrestore(&sg_policy->update_lock, flags);

> > 

> > +       pr_info("%s: %d: %u\n", __func__, __LINE__, freq);

> >         mutex_lock(&sg_policy->work_lock);

> >         __cpufreq_driver_target(sg_policy->policy, freq, CPUFREQ_RELATION_L);

> >         mutex_unlock(&sg_policy->work_lock);

> > 

> > -------------------------8<-------------------------

> >

> > Now, the frequency never gets down and so gets set to the maximum

> > possible after a bit.

> >

> > - Then I did:

> >

> > echo <any-low-freq-value> > /sys/devices/system/cpu/cpufreq/policy0/scaling_max_freq

> >

> > Without my patch applied:

> >        The print never gets printed and so frequency doesn't go down.

> >

> > With my patch applied:

> >        The print gets printed immediately from sugov_work() and so

> >        the frequency reduces.

> > 

> > Can you try with this diff along with my Patch2 ? I suspect there may

> > be something wrong with the intel_cpufreq driver as the patch fixes

> > the only path we have in the schedutil governor which takes busyness

> > of a CPU into account.

> 

> With this diff along with your patch2 There is never a print message

> from sugov_work. There are from sugov_fast_switch.


Which is okay. sugov_work won't get hit in your case as I explained
above.

> Note that for the intel_cpufreq CPU scaling driver and the schedutil

> governor I adjust the maximum clock frequency this way:

> 

> echo <any-low-percent> > /sys/devices/system/cpu/intel_pstate/max_perf_pct


This should eventually call sugov_limits() in schedutil governor, this
can be easily checked with another print message.

> I also applied the pr_info messages to the reverted kernel, and re-did

> my tests (where everything works as expected). There is never a print

> message from sugov_work. There are from sugov_fast_switch.


that's fine.

> Notes:

> 

> I do not know if:

> /sys/devices/system/cpu/cpufreq/policy*/scaling_max_freq

> /sys/devices/system/cpu/cpufreq/policy*/scaling_min_freq

> Need to be accurate when using the intel_pstate driver in passive mode.

> They are not.

> The commit comment for 9083e4986124389e2a7c0ffca95630a4983887f0

> suggests that they might need to be representative.

> I wonder if something similar to that commit is needed

> for other global changes, such as max_perf_pct and min_perf_pct?


We are already calling intel_pstate_update_policies() in that case, so
it should be fine I believe.

> intel_cpufreq/ondemand doesn't work properly on the reverted kernel.


reverted kernel ? The patch you reverted was only for schedutil and it
shouldn't have anything to do with ondemand.

> (just discovered, not investigated)

> I don't know about other governors.


When you do:

echo <any-low-percent> > /sys/devices/system/cpu/intel_pstate/max_perf_pct

How soon does the print from sugov_fast_switch() gets printed ?
Immediately ? Check with both the kernels, with my patch and with the
reverted patch.

Also see if there is any difference in the next_freq value in both the
kernels when you change max_perf_pct.

FWIW, we now know the difference between intel-pstate and
acpi-cpufreq/my testcase and why we see differences here. In the cases
where my patch fixed the issue (acpi/ARM), we were really changing the
limits, i.e. policy->min/max. This happened because we touched
scaling_max_freq directly.

For the case of intel-pstate, you are changing max_perf_pct which
doesn't change policy->max directly. I am not very sure how all of it
work really, but at least schedutil will not see policy->max changing.

@Rafael: Do you understand why things don't work properly with
intel_cpufreq driver ?

-- 
viresh
Viresh Kumar Aug. 1, 2019, 7:55 a.m. | #5
On 01-08-19, 09:47, Rafael J. Wysocki wrote:
> On Thu, Aug 1, 2019 at 8:17 AM Viresh Kumar <viresh.kumar@linaro.org> wrote:

> >

> > On 31-07-19, 17:20, Doug Smythies wrote:

> > > Hi Viresh,

> > >

> > > Summary:

> > >

> > > The old way, using UINT_MAX had two purposes: first,

> > > as a "need to do a frequency update" flag; but also second, to

> > > force any subsequent old/new frequency comparison to NOT be "the same,

> > > so why bother actually updating" (see: sugov_update_next_freq). All

> > > patches so far have been dealing with the flag, but only partially

> > > the comparisons. In a busy system, and when schedutil.c doesn't actually

> > > know the currently set system limits, the new frequency is dominated by

> > > values the same as the old frequency. So, when sugov_fast_switch calls

> > > sugov_update_next_freq, false is usually returned.

> >

> > And finally we know "Why" :)

> >

> > Good work Doug. Thanks for taking it to the end.

> >

> > > However, if we move the resetting of the flag and add another condition

> > > to the "no need to actually update" decision, then perhaps this patch

> > > version 1 will be O.K. It seems to be. (see way later in this e-mail).

> >

> > > With all this new knowledge, how about going back to

> > > version 1 of this patch, and then adding this:

> > >

> > > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c

> > > index 808d32b..f9156db 100644

> > > --- a/kernel/sched/cpufreq_schedutil.c

> > > +++ b/kernel/sched/cpufreq_schedutil.c

> > > @@ -100,7 +100,12 @@ static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)

> > >  static bool sugov_update_next_freq(struct sugov_policy *sg_policy, u64 time,

> > >                                    unsigned int next_freq)

> > >  {

> > > -       if (sg_policy->next_freq == next_freq)

> > > +       /*

> > > +        * Always force an update if the flag is set, regardless.

> > > +        * In some implementations (intel_cpufreq) the frequency is clamped

> > > +        * further downstream, and might not actually be different here.

> > > +        */

> > > +       if (sg_policy->next_freq == next_freq && !sg_policy->need_freq_update)

> > >                 return false;

> >

> > This is not correct because this is an optimization we have in place

> > to make things more efficient. And it was working by luck earlier and

> > my patch broke it for good :)

> 

> OK, so since we know why it was wrong now, why don't we just revert

> it?  Plus maybe add some comment explaining the rationale in there?


Because the patch [1] which caused these issues was almost correct,
just that it missed the busy accounting for single CPU case.

The main idea behind the original patch [1] was to avoid any
unwanted/hidden side-affects by overriding the value of next_freq.
What we see above is exactly the case for that. Because we override
the value of next_freq, we made intel-pstate work by chance,
unintentionally. Which is wrong. And who knows what other side affects
it had, we already found two (this one and the one fixed by [1]).

I would strongly suggest that we don't override the value of next_freq
with special meaning, as it is used at so many places we don't know
what it may result in.

-- 
viresh

[1] ecd288429126 cpufreq: schedutil: Don't set next_freq to UINT_MAX
Viresh Kumar Aug. 2, 2019, 3:48 a.m. | #6
On 01-08-19, 10:57, Doug Smythies wrote:
> On 2019.07.31 23:17 Viresh Kumar wrote:

> > On 31-07-19, 17:20, Doug Smythies wrote:

> >> Summary:

> >> 

> >> The old way, using UINT_MAX had two purposes: first,

> >> as a "need to do a frequency update" flag; but also second, to

> >> force any subsequent old/new frequency comparison to NOT be "the same,

> >> so why bother actually updating" (see: sugov_update_next_freq). All

> >> patches so far have been dealing with the flag, but only partially

> >> the comparisons. In a busy system, and when schedutil.c doesn't actually

> >> know the currently set system limits, the new frequency is dominated by

> >> values the same as the old frequency. So, when sugov_fast_switch calls 

> >> sugov_update_next_freq, false is usually returned.

> >

> > And finally we know "Why" :)

> >

> > Good work Doug. Thanks for taking it to the end.

> >

> >> However, if we move the resetting of the flag and add another condition

> >> to the "no need to actually update" decision, then perhaps this patch

> >> version 1 will be O.K. It seems to be. (see way later in this e-mail).

> >

> >> With all this new knowledge, how about going back to

> >> version 1 of this patch, and then adding this:

> >> 

> >> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c

> >> index 808d32b..f9156db 100644

> >> --- a/kernel/sched/cpufreq_schedutil.c

> >> +++ b/kernel/sched/cpufreq_schedutil.c

> >> @@ -100,7 +100,12 @@ static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)

> >>  static bool sugov_update_next_freq(struct sugov_policy *sg_policy, u64 time,

> >>                                    unsigned int next_freq)

> >>  {

> >> -       if (sg_policy->next_freq == next_freq)

> >> +       /*

> >> +        * Always force an update if the flag is set, regardless.

> >> +        * In some implementations (intel_cpufreq) the frequency is clamped

> >> +        * further downstream, and might not actually be different here.

> >> +        */

> >> +       if (sg_policy->next_freq == next_freq && !sg_policy->need_freq_update)

> >>                 return false;

> >

> > This is not correct because this is an optimization we have in place

> > to make things more efficient. And it was working by luck earlier and

> > my patch broke it for good :)

> 

> Disagree.

> All I did was use a flag where it used to be set to UNIT_MAX, to basically

> implement the same thing.


And the earlier code wasn't fully correct as well, that's why we tried
to fix it earlier. So introducing the UINT_MAX thing again would be
wrong, even if it fixes the problem for you.

Also this won't fix the issue for rest of the governors but just
schedutil. Because this is a driver only problem and there is no point
trying to fix that in a governor.

> > Things need to get a bit more synchronized and something like this may

> > help (completely untested):

> >

> > diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c

> > index cc27d4c59dca..2d84361fbebc 100644

> > --- a/drivers/cpufreq/intel_pstate.c

> > +++ b/drivers/cpufreq/intel_pstate.c

> > @@ -2314,6 +2314,18 @@ static int intel_cpufreq_target(struct cpufreq_policy *policy,

> >        return 0;

> > }

> > 

> > +static unsigned int intel_cpufreq_resolve_freq(struct cpufreq_policy *policy,

> > +                                              unsigned int target_freq)

> > +{

> > +       struct cpudata *cpu = all_cpu_data[policy->cpu];

> > +       int target_pstate;

> > +

> > +       target_pstate = DIV_ROUND_UP(target_freq, cpu->pstate.scaling);

> > +       target_pstate = intel_pstate_prepare_request(cpu, target_pstate);

> > +

> > +       return target_pstate * cpu->pstate.scaling;

> > +}

> > +

> >  static unsigned int intel_cpufreq_fast_switch(struct cpufreq_policy *policy,

> >                                               unsigned int target_freq)

> >  {

> > @@ -2350,6 +2362,7 @@ static struct cpufreq_driver intel_cpufreq = {

> >         .verify         = intel_cpufreq_verify_policy,

> >         .target         = intel_cpufreq_target,

> >         .fast_switch    = intel_cpufreq_fast_switch,

> > +       .resolve_freq   = intel_cpufreq_resolve_freq,

> >         .init           = intel_cpufreq_cpu_init,

> >         .exit           = intel_pstate_cpu_exit,

> >         .stop_cpu       = intel_cpufreq_stop_cpu,

> > 

> > -------------------------8<-------------------------

> >

> > Please try this with my patch 2.

> 

> O.K.

> 

> > We need patch 2 instead of 1 because

> > of another race condition Rafael noticed.

> 

> Disagree.

> Notice that my modifications to your patch1 addresses

> that condition by moving the clearing of "need_freq_update"

> to sometime later.


As I said above, that isn't the right way of fixing a driver issue in
governor.
 
> > cpufreq_schedutil calls driver specific resolve_freq() to find the new

> > target frequency and this is where the limits should get applied IMO.

> 

> Oh! I didn't know. But yes, that makes sense.


The thing is that the governors, schedutil or others, need to get the
frequency limits from cpufreq core and the core tries to get it using
resolve_freq() in this particular case. If you don't have that
implemented, all the governors may end up having this issue.

> > Rafael can help with reviewing this diff but it would be great if you

> > can give this a try Doug.

> 

> Anyway, I added the above code (I am calling it patch3) to patch2, as

> you asked, and it does work. I also added it to my modified patch1,

> additionally removing the extra condition check that I added

> (i.e. all that remains of my patch1 modifications is the moved

> clearing of "need_freq_update") That kernel also worked for both

> intel_cpufreq/schedutil and acpi-cpufreq/schedutil.


Great, thanks.

> Again, I do not know how to test the original issue that led

> to the change away from UINT_MAX in the first place,

> ecd2884291261e3fddbc7651ee11a20d596bb514, which should be

> tested in case of some introduced regression.


The problem then was with overriding next_freq with UINT_MAX and since
we aren't going that path again, there is no need to test it again.

I will send the two patches to be applied now. Your tested-by would be
helpful to get them merged.

-- 
viresh
Rafael J. Wysocki Aug. 2, 2019, 9:19 a.m. | #7
On Fri, Aug 2, 2019 at 11:11 AM Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
>

> On Friday, August 2, 2019 5:48:19 AM CEST Viresh Kumar wrote:

> > On 01-08-19, 10:57, Doug Smythies wrote:

> > > On 2019.07.31 23:17 Viresh Kumar wrote:

> > > > On 31-07-19, 17:20, Doug Smythies wrote:

> > > >> Summary:

> > > >>

> > > >> The old way, using UINT_MAX had two purposes: first,

> > > >> as a "need to do a frequency update" flag; but also second, to

> > > >> force any subsequent old/new frequency comparison to NOT be "the same,

> > > >> so why bother actually updating" (see: sugov_update_next_freq). All

> > > >> patches so far have been dealing with the flag, but only partially

> > > >> the comparisons. In a busy system, and when schedutil.c doesn't actually

> > > >> know the currently set system limits, the new frequency is dominated by

> > > >> values the same as the old frequency. So, when sugov_fast_switch calls

> > > >> sugov_update_next_freq, false is usually returned.

> > > >

> > > > And finally we know "Why" :)

> > > >

> > > > Good work Doug. Thanks for taking it to the end.

> > > >

> > > >> However, if we move the resetting of the flag and add another condition

> > > >> to the "no need to actually update" decision, then perhaps this patch

> > > >> version 1 will be O.K. It seems to be. (see way later in this e-mail).

> > > >

> > > >> With all this new knowledge, how about going back to

> > > >> version 1 of this patch, and then adding this:

> > > >>

> > > >> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c

> > > >> index 808d32b..f9156db 100644

> > > >> --- a/kernel/sched/cpufreq_schedutil.c

> > > >> +++ b/kernel/sched/cpufreq_schedutil.c

> > > >> @@ -100,7 +100,12 @@ static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)

> > > >>  static bool sugov_update_next_freq(struct sugov_policy *sg_policy, u64 time,

> > > >>                                    unsigned int next_freq)

> > > >>  {

> > > >> -       if (sg_policy->next_freq == next_freq)

> > > >> +       /*

> > > >> +        * Always force an update if the flag is set, regardless.

> > > >> +        * In some implementations (intel_cpufreq) the frequency is clamped

> > > >> +        * further downstream, and might not actually be different here.

> > > >> +        */

> > > >> +       if (sg_policy->next_freq == next_freq && !sg_policy->need_freq_update)

> > > >>                 return false;

> > > >

> > > > This is not correct because this is an optimization we have in place

> > > > to make things more efficient. And it was working by luck earlier and

> > > > my patch broke it for good :)

> > >

> > > Disagree.

> > > All I did was use a flag where it used to be set to UNIT_MAX, to basically

> > > implement the same thing.

> >

> > And the earlier code wasn't fully correct as well, that's why we tried

> > to fix it earlier.

>

> Your argument seems to be "There was an earlier problem related to this, which

> was fixed, so it is fragile and I'd rather avoid it".  Still, you are claiming that the

> code was in fact incorrect and you are not giving convincing arguments to

> support that.

>

> > So introducing the UINT_MAX thing again would be

> > wrong, even if it fixes the problem for you.

>

> Would it be wrong, because it would reintroduce the fragile code, or would it

> be wrong, because it would re-introduce a bug?  What bug if so?

>

> > Also this won't fix the issue for rest of the governors but just

> > schedutil. Because this is a driver only problem and there is no point

> > trying to fix that in a governor.

>

> Well, I'm not convinced that this is a driver problem yet.


I stand corrected, this is a driver problem, but IMO it needs to be
addressed differently.
Viresh Kumar Aug. 6, 2019, 4 a.m. | #8
On 02-08-19, 11:11, Rafael J. Wysocki wrote:
> On Friday, August 2, 2019 5:48:19 AM CEST Viresh Kumar wrote:

> > On 01-08-19, 10:57, Doug Smythies wrote:

> > > Disagree.

> > > All I did was use a flag where it used to be set to UNIT_MAX, to basically

> > > implement the same thing.

> > 

> > And the earlier code wasn't fully correct as well, that's why we tried

> > to fix it earlier.

> 

> Your argument seems to be "There was an earlier problem related to this, which

> was fixed, so it is fragile and I'd rather avoid it".  Still, you are claiming that the

> code was in fact incorrect and you are not giving convincing arguments to

> support that.

> 

> > So introducing the UINT_MAX thing again would be

> > wrong, even if it fixes the problem for you.

> 

> Would it be wrong, because it would reintroduce the fragile code, or would it

> be wrong, because it would re-introduce a bug?  What bug if so?


There will be two issues here if that patch is reintroduced:

- It will cause the BUG to reappear, which was fixed by the earlier
  commit. The commit log of ecd28842912 explains the bug in detail.

- And overriding next_freq as a flag will make the code fragile and we
  may have similar bugs coming up.

But yeah, lets continue discussion on the intel-pstate patch now.

-- 
viresh

Patch

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 636ca6f88c8e..b53c4f02b0f1 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -447,7 +447,7 @@  static void sugov_update_single(struct update_util_data *hook, u64 time,
 	struct sugov_policy *sg_policy = sg_cpu->sg_policy;
 	unsigned long util, max;
 	unsigned int next_f;
-	bool busy;
+	bool busy = false;
 
 	sugov_iowait_boost(sg_cpu, time, flags);
 	sg_cpu->last_update = time;
@@ -457,7 +457,9 @@  static void sugov_update_single(struct update_util_data *hook, u64 time,
 	if (!sugov_should_update_freq(sg_policy, time))
 		return;
 
-	busy = sugov_cpu_is_busy(sg_cpu);
+	/* Limits may have changed, don't skip frequency update */
+	if (!sg_policy->need_freq_update)
+		busy = sugov_cpu_is_busy(sg_cpu);
 
 	util = sugov_get_util(sg_cpu);
 	max = sg_cpu->max;