
cpufreq: schedutil: govern how frequently we change frequency with rate_limit

Message ID 4c9afe0a2cda5016b342f777f244c89c03cdc524.1487178939.git.viresh.kumar@linaro.org
State New

Commit Message

Viresh Kumar Feb. 15, 2017, 5:15 p.m. UTC
For an ideal system (where a frequency change doesn't incur any penalty)
we would like to change the frequency as soon as the load changes for a
CPU. But the systems we have to work with are far from ideal and it
takes time to change the frequency of a CPU. For many ARM platforms
especially, it is at least 1 ms. In order not to spend too much time
changing frequencies, we earlier introduced a sysfs-controlled tunable
for the schedutil governor: rate_limit_us.

Currently, rate_limit_us controls how frequently we reevaluate the
frequency for a set of CPUs controlled by a cpufreq policy. But that may
not be the behavior we ideally want.

Consider for example the following scenario. The rate_limit_us tunable
is set to 10 ms. The CPU has a constant load X and that requires the
frequency to be set to Y. The schedutil governor changes the frequency
to Y, updates last_freq_update_time and we wait for 10 ms to reevaluate
the frequency again. After 10 ms, the schedutil governor reevaluates the
load and finds it to be the same. And so it doesn't update the
frequency, but updates last_freq_update_time before returning. Right
after this point, the scheduler puts more load on the CPU and the CPU
needs to go to a higher frequency Z. Because last_freq_update_time was
updated just now, the schedutil governor waits for an additional 10 ms
before reevaluating the load again.
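
To make the scenario concrete, here is a condensed, illustrative sketch of the
pre-patch flow (simplified from kernel/sched/cpufreq_schedutil.c; the
fast-switch and need_freq_update details are left out). The key point is that
last_freq_update_time is refreshed on every evaluation, whether or not the
frequency is actually changed:

/* Condensed illustration of the pre-patch behavior; not the full kernel code. */
static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
{
	s64 delta_ns = time - sg_policy->last_freq_update_time;

	return delta_ns >= sg_policy->freq_update_delay_ns;
}

static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
				unsigned int next_freq)
{
	/*
	 * The timestamp is refreshed unconditionally, so even a "no change"
	 * evaluation at the 10 ms mark restarts the rate_limit_us window and
	 * delays the switch to Z by another 10 ms.
	 */
	sg_policy->last_freq_update_time = time;

	if (sg_policy->next_freq == next_freq)
		return;

	sg_policy->next_freq = next_freq;
	/* ... ask the cpufreq driver to switch to next_freq ... */
}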

Normally, the time it takes to reevaluate the frequency is negligible
compared to the time it takes to change the frequency. And considering
that in the above scenario, as we haven't updated the frequency for over
10ms, we should have changed the frequency as soon as the load changed.

This patch changes the way rate_limit_us is used, i.e. it now governs
"how frequently we change the frequency" instead of "how frequently we
reevaluate the frequency".
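
Condensed from the patch below for illustration, the commit path now records
last_freq_update_time only when a frequency switch is actually requested (the
fast-switch and kthread paths both behave this way):

static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
				unsigned int next_freq)
{
	if (sg_policy->next_freq == next_freq)
		return;	/* nothing to change: the rate_limit_us window is not restarted */

	sg_policy->next_freq = next_freq;
	sg_policy->last_freq_update_time = time;	/* only on an actual change */
	/* ... ask the cpufreq driver to switch to next_freq ... */
}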

One may think that this change increases the number of times we
reevaluate the frequency after a period of rate_limit_us has expired
since the last change, if the load isn't changing. But the scheduler
protects against that, as it normally doesn't call into the schedutil
governor until at least 1 ms (hint: "decayed" in
update_cfs_rq_load_avg()) has passed since the last call.
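
To make the hint concrete, the gating works roughly as follows (paraphrased
and abbreviated from kernel/sched/fair.c around v4.10, not a verbatim copy):
the cpufreq hook is only invoked when the PELT signal has actually decayed,
and the signal only decays in whole 1024 us periods, which is what bounds the
callback rate to about once per millisecond in the steady state.

/* Paraphrased and abbreviated from kernel/sched/fair.c; for illustration only. */
static inline int
update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq)
{
	int decayed, removed_util = 0;

	/* ... fold in utilization removed by tasks that migrated away ... */

	/*
	 * Advances the signal in whole 1024 us PELT periods, so "decayed"
	 * becomes nonzero at most roughly once per millisecond.
	 */
	decayed = __update_load_avg(now, cpu_of(rq_of(cfs_rq)), &cfs_rq->avg, ...);

	if (update_freq && (decayed || removed_util))
		cfs_rq_util_change(cfs_rq);	/* -> cpufreq_update_util() -> schedutil */

	return decayed;
}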

Tests were performed with this patch on a dual-cluster (same frequency
domain), octa-core ARM64 platform (Hikey). Hackbench (Debian) and
Vellamo/Galleryfling (Android) didn't show much difference in
performance with or without this patch.

It's difficult to create a test case (rt-app was tried as well) where
this patch will show a lot of improvement, as the target of this patch
is a real corner case: the current load is X (resulting in a frequency
change), the load after rate_limit_us is also X, but right after that
the load becomes Y. Undoubtedly this patch would improve the
responsiveness in such cases.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>

---
 kernel/sched/cpufreq_schedutil.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

-- 
2.7.1.410.g6faf27b

Comments

Rafael J. Wysocki Feb. 15, 2017, 10:35 p.m. UTC | #1
On Wednesday, February 15, 2017 10:45:47 PM Viresh Kumar wrote:

First of all, [RFC] pretty please on things like this.

> For an ideal system (where frequency change doesn't incur any penalty)

> we would like to change the frequency as soon as the load changes for a

> CPU. But the systems we have to work with are far from ideal and it

> takes time to change the frequency of a CPU. For many ARM platforms

> specially, it is at least 1 ms. In order to not spend too much time

> changing frequency, we have earlier introduced a sysfs controlled

> tunable for the schedutil governor: rate_limit_us.

> 

> Currently, rate_limit_us controls how frequently we reevaluate frequency

> for a set of CPUs controlled by a cpufreq policy. But that may not be

> the ideal behavior we want.

> 

> Consider for example the following scenario. The rate_limit_us tunable

> is set to 10 ms. The CPU has a constant load X and that requires the

> frequency to be set to Y. The schedutil governor changes the frequency

> to Y, updates last_freq_update_time and we wait for 10 ms to reevaluate

> the frequency again. After 10 ms, the schedutil governor reevaluates the

> load and finds it to be the same. And so it doesn't update the

> frequency, but updates last_freq_update_time before returning. Right

> after this point, the scheduler puts more load on the CPU and the CPU

> needs to go to a higher frequency Z. Because last_freq_update_time was

> updated just now, the schedutil governor waits for additional 10ms

> before reevaluating the load again.

> 

> Normally, the time it takes to reevaluate the frequency is negligible

> compared to the time it takes to change the frequency.


This should be "the time it takes to reevaluate the load is negligible
relative to the time it takes to change the frequency" I suppose?

Specifically, the "to reevaluate the frequency" phrase is ambiguous.

> And considering

> that in the above scenario, as we haven't updated the frequency for over

> 10ms, we should have changed the frequency as soon as the load changed.


Why should we?

> This patch changes the way rate_limit_us is used, i.e. It now governs

> "How frequently we change the frequency" instead of "How frequently we

> reevaluate the frequency".


That's questionable IMO.

> One may think that this change may have increased the number of times we

> reevaluate the frequency after a period of rate_limit_us has expired

> since the last change, if the load isn't changing. But that is protected

> by the scheduler as normally it doesn't call into the schedutil governor

> before 1 ms (Hint: "decayed" in update_cfs_rq_load_avg()) since the

> last call.

> 

> Tests were performed with this patch on a Dual cluster (same frequency

> domain), octa-core ARM64 platform (Hikey). Hackbench (Debian) and

> Vellamo/Galleryfling (Android) didn't had much difference in

> performance w/ or w/o this patch.

> 

> Its difficult to create a test case (tried rt-app as well) where this

> patch will show a lot of improvements as the target of this patch is a

> real corner case. I.e. Current load is X (resulting in freq change),

> load after rate_limit_us is also X, but right after that load becomes Y.

> Undoubtedly this patch would improve the responsiveness in such cases.


But can that be demonstrated somehow?

Otherwise it is rather not "undoubtedly", but "theoretically".

Thanks,
Rafael
Viresh Kumar Feb. 16, 2017, 6:27 a.m. UTC | #2
On 15-02-17, 23:35, Rafael J. Wysocki wrote:
> On Wednesday, February 15, 2017 10:45:47 PM Viresh Kumar wrote:

> 

> First of all, [RFC] pretty please on things like this.


Sure.

> > Normally, the time it takes to reevaluate the frequency is negligible

> > compared to the time it takes to change the frequency.

> 

> This should be "the time it takes to reevaluate the load is negligible

> relative to the time it takes to change the frequency" I suppose?

> 

> Specifically, the "to reevaluate the frequency" phrase is ambiguous.


Actually we reevaluate both load and frequency, but I am fine with what you have
suggested here.

> > And considering

> > that in the above scenario, as we haven't updated the frequency for over

> > 10ms, we should have changed the frequency as soon as the load changed.

> 

> Why should we?


I will modify the above as:

... we should have changed the frequency as soon as the load changed in order to
be more responsive to the load on the CPUs.

Is that the missing part you are pointing at?

> > Its difficult to create a test case (tried rt-app as well) where this

> > patch will show a lot of improvements as the target of this patch is a

> > real corner case. I.e. Current load is X (resulting in freq change),

> > load after rate_limit_us is also X, but right after that load becomes Y.

> > Undoubtedly this patch would improve the responsiveness in such cases.

> 

> But can that be demonstrated somehow?


I hope you are talking about demonstrating a performance enhancement
here? I am not sure we can actually have a test case to show that. Can
you or others suggest some test cases where we can hit a similar
situation again and again?

I can surely try to get some traces which show that such cases do exist
and that we are able to change the frequency without waiting for another
10 ms. But such an outcome is quite obvious here.

> Otherwise it is rather not "undoubtedly", but "theoretically".


Do you want to see the traces to confirm that we actually changed the frequency
earlier due to this change?

-- 
viresh
Viresh Kumar Feb. 16, 2017, 10:12 a.m. UTC | #3
On 16-02-17, 01:02, Rafael J. Wysocki wrote:
> More precisely, while the governor computations are less costly than updating

> the CPU state, they are not zero-cost, so do we really want to run them on

> every governor callback invocation until the CPU state is updated?

> 

> We may end up running them very often in some cases after the change you are

> proposing.


I was worried about that initially and had prepared an additional patch:

 {
 	struct cpufreq_policy *policy = sg_policy->policy;
 
+	sg_policy->last_freq_eval_time = time;
+
 	if (policy->fast_switch_enabled) {
 		if (sg_policy->next_freq == next_freq) {
 			trace_cpu_frequency(policy->cur, smp_processor_id());
@@ -576,6 +586,7 @@ static int sugov_start(struct cpufreq_policy *policy)
 
 	sg_policy->freq_update_delay_ns = sg_policy->tunables->rate_limit_us * NSEC_PER_USEC;
 	sg_policy->last_freq_update_time = 0;
+	sg_policy->last_freq_eval_time = 0;
 	sg_policy->next_freq = UINT_MAX;
 	sg_policy->work_in_progress = false;
 	sg_policy->need_freq_update = false;


But when I discussed this with Vincent, he suggested that it may not be
required at all, as the scheduler (with the help of "decayed") doesn't call
into schedutil too often, i.e. not more often than roughly every 1 ms. And if
the CPUs are stable enough (i.e. no interruptions to the running task), we
wouldn't reevaluate before the next tick.

And so with this change, in the worst case we will reevaluate the load every
1 ms after an interval of rate_limit_us has passed since the frequency was
last changed. And as soon as the frequency is changed, we will wait for
rate_limit_us again.
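
A hypothetical timeline (illustrative numbers only, assuming rate_limit_us =
10000 and evaluations roughly every 1 ms once the rate limit has expired) may
make this clearer:

/*
 *   t =  0 ms        frequency switched       -> last_freq_update_time = 0
 *   t =  1..9 ms     callbacks rate-limited, only the cheap time check runs
 *   t = 10, 11, ...  load reevaluated roughly every 1 ms, frequency unchanged
 *   t = 14 ms        load rises, frequency switched immediately
 *                    -> last_freq_update_time = 14 ms, window restarts
 *
 * With the old behavior, the evaluation at t = 10 ms alone would have pushed
 * the earliest possible switch out to t = 20 ms.
 */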

As I remember, one of the motivations behind moving the cpufreq
decision-making logic close to the scheduler core was to get early responses
instead of waiting for a period (like the sampling rate) before making a
frequency change. That's how we can improve the performance of tasks: by
giving them the frequency bump when they need it the most.

Yes, this change will surely put some more burden on the CPUs, but that is
required to make sure we don't penalize tasks when we don't have to.

-- 
viresh

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 306d97e7b57c..03d9dc1d1661 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -19,6 +19,7 @@
 #include "sched.h"
 
 #define SUGOV_KTHREAD_PRIORITY	50
+#define SUGOV_MIN_REEVALUATION_DELAY_NS	(1000 * NSEC_PER_USEC)
 
 struct sugov_tunables {
 	struct gov_attr_set attr_set;
@@ -33,6 +34,7 @@ struct sugov_policy {
 
 	raw_spinlock_t update_lock;  /* For shared policies */
 	u64 last_freq_update_time;
+	u64 last_freq_eval_time;
 	s64 freq_update_delay_ns;
 	unsigned int next_freq;
 
@@ -83,8 +85,14 @@ static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
 		return true;
 	}
 
 	delta_ns = time - sg_policy->last_freq_update_time;
-	return delta_ns >= sg_policy->freq_update_delay_ns;
+	if (delta_ns < sg_policy->freq_update_delay_ns)
+		return false;
+
+	/* Don't reevaluate frequency too frequently */
+	delta_ns = time - sg_policy->last_freq_eval_time;
+	return delta_ns >= SUGOV_MIN_REEVALUATION_DELAY_NS;
 }
 
 static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
@@ -92,6 +100,8 @@ static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,

Rafael J. Wysocki Feb. 17, 2017, 12:15 p.m. UTC | #4
On Thursday, February 16, 2017 01:36:05 PM Peter Zijlstra wrote:
> On Thu, Feb 16, 2017 at 03:42:10PM +0530, Viresh Kumar wrote:

> > But when I discussed this with Vincent, he suggested that it may not be required

> > at all as the scheduler (with the helped of "decayed") doesn't call into

> > schedutil too often, i.e. at least 1 ms. And if the CPUs are stable enough (i.e.

> > no interruptions to the running task), we wouldn't reevaluate before the next

> > tick.

> 

> There are still the attach/detach callers to cfs_rq_util_change() that

> kick in for fork/exit and migration.

> 

> But yes, barring those we shouldn't end up calling it at silly rates.


OK

Does this mean that running governor computations every time its callback
is invoked by the scheduler would be fine?

Thanks,
Rafael
Peter Zijlstra Feb. 17, 2017, 12:48 p.m. UTC | #5
On Fri, Feb 17, 2017 at 01:15:56PM +0100, Rafael J. Wysocki wrote:
> On Thursday, February 16, 2017 01:36:05 PM Peter Zijlstra wrote:

> > On Thu, Feb 16, 2017 at 03:42:10PM +0530, Viresh Kumar wrote:

> > > But when I discussed this with Vincent, he suggested that it may not be required

> > > at all as the scheduler (with the helped of "decayed") doesn't call into

> > > schedutil too often, i.e. at least 1 ms. And if the CPUs are stable enough (i.e.

> > > no interruptions to the running task), we wouldn't reevaluate before the next

> > > tick.

> > 

> > There are still the attach/detach callers to cfs_rq_util_change() that

> > kick in for fork/exit and migration.

> > 

> > But yes, barring those we shouldn't end up calling it at silly rates.

> 

> OK

> 

> Does this mean that running governor computations every time its callback

> is invoked by the scheduler would be fine?


I'd say yes right up till the point someone reports a regression ;-)
Rafael J. Wysocki Feb. 20, 2017, 1:49 p.m. UTC | #6
On Monday, February 20, 2017 03:28:03 PM Viresh Kumar wrote:
> On 17-02-17, 13:48, Peter Zijlstra wrote:

> > On Fri, Feb 17, 2017 at 01:15:56PM +0100, Rafael J. Wysocki wrote:

> > > On Thursday, February 16, 2017 01:36:05 PM Peter Zijlstra wrote:

> > > > On Thu, Feb 16, 2017 at 03:42:10PM +0530, Viresh Kumar wrote:

> > > > > But when I discussed this with Vincent, he suggested that it may not be required

> > > > > at all as the scheduler (with the helped of "decayed") doesn't call into

> > > > > schedutil too often, i.e. at least 1 ms. And if the CPUs are stable enough (i.e.

> > > > > no interruptions to the running task), we wouldn't reevaluate before the next

> > > > > tick.

> > > > 

> > > > There are still the attach/detach callers to cfs_rq_util_change() that

> > > > kick in for fork/exit and migration.

> > > > 

> > > > But yes, barring those we shouldn't end up calling it at silly rates.

> > > 

> > > OK

> > > 

> > > Does this mean that running governor computations every time its callback

> > > is invoked by the scheduler would be fine?

> > 

> > I'd say yes right up till the point someone reports a regression ;-)

> 

> @Rafael: Do you want me to send a V2 with the changes you suggested in

> commit log?


Yes, in general, but I have more suggestions regarding that. :-)

I'll send them shortly.

Thanks,
Rafael
Rafael J. Wysocki Feb. 20, 2017, 2:53 p.m. UTC | #7
On Monday, February 20, 2017 02:49:18 PM Rafael J. Wysocki wrote:
> On Monday, February 20, 2017 03:28:03 PM Viresh Kumar wrote:

> > On 17-02-17, 13:48, Peter Zijlstra wrote:

> > > On Fri, Feb 17, 2017 at 01:15:56PM +0100, Rafael J. Wysocki wrote:

> > > > On Thursday, February 16, 2017 01:36:05 PM Peter Zijlstra wrote:

> > > > > On Thu, Feb 16, 2017 at 03:42:10PM +0530, Viresh Kumar wrote:

> > > > > > But when I discussed this with Vincent, he suggested that it may not be required

> > > > > > at all as the scheduler (with the helped of "decayed") doesn't call into

> > > > > > schedutil too often, i.e. at least 1 ms. And if the CPUs are stable enough (i.e.

> > > > > > no interruptions to the running task), we wouldn't reevaluate before the next

> > > > > > tick.

> > > > > 

> > > > > There are still the attach/detach callers to cfs_rq_util_change() that

> > > > > kick in for fork/exit and migration.

> > > > > 

> > > > > But yes, barring those we shouldn't end up calling it at silly rates.

> > > > 

> > > > OK

> > > > 

> > > > Does this mean that running governor computations every time its callback

> > > > is invoked by the scheduler would be fine?

> > > 

> > > I'd say yes right up till the point someone reports a regression ;-)

> > 

> > @Rafael: Do you want me to send a V2 with the changes you suggested in

> > commit log?

> 

> Yes, in general, but I have more suggestions regarding that. :-)

> 

> I'll send them shortly.


OK, so I think that the patch subject should be something like "Redefine the
rate_limit_us tunable", because that's what is really going on here.

Next, IMO the changelog should say something like this:

"The rate_limit_ns tunable is intended to reduce the possible overhead from
running the schedutil governor.  However, that overhead can be divided into two
separate parts: the governor computations and the invocation of the scaling
driver to set the CPU frequency.  The former is much less expensive in terms of
execution time and running it every time the governor callback is invoked by
the scheduler would not be a problem, so the latter is where the real overhead
comes from.

For this reason, redefine the rate_limit_us tunable so that it means the
minimum time that has to pass between two consecutive invocations of the
scaling driver by the schedutil governor (to set the CPU frequency)."

Thanks,
Rafael

Patch

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index fd4659313640..306d97e7b57c 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -92,14 +92,13 @@  static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
 {
 	struct cpufreq_policy *policy = sg_policy->policy;
 
-	sg_policy->last_freq_update_time = time;
-
 	if (policy->fast_switch_enabled) {
 		if (sg_policy->next_freq == next_freq) {
 			trace_cpu_frequency(policy->cur, smp_processor_id());
 			return;
 		}
 		sg_policy->next_freq = next_freq;
+		sg_policy->last_freq_update_time = time;
 		next_freq = cpufreq_driver_fast_switch(policy, next_freq);
 		if (next_freq == CPUFREQ_ENTRY_INVALID)
 			return;
@@ -108,6 +107,7 @@  static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
 		trace_cpu_frequency(next_freq, smp_processor_id());
 	} else if (sg_policy->next_freq != next_freq) {
 		sg_policy->next_freq = next_freq;
+		sg_policy->last_freq_update_time = time;
 		sg_policy->work_in_progress = true;
 		irq_work_queue(&sg_policy->irq_work);
 	}