
[RFC,5/7] sched/schedutil: Add a new tunable to dictate response time

Message ID 20230827233203.1315953-6-qyousef@layalina.io
State New
Series sched: cpufreq: Remove magic margins

Commit Message

Qais Yousef Aug. 27, 2023, 11:32 p.m. UTC
The new tunable, response_time_ms, allows us to speed up or slow down
the response time of the policy to meet the perf, power and thermal
characteristics desired by the user/sysadmin. There's no single
universal trade-off that we can apply to all systems, even if they use
the same SoC. The form factor of the system, the dominant use case, and
in the case of battery powered systems, the size of the battery and the
presence or absence of active cooling can all play a big role in what
would be best to use.

The new tunable provides a sensible default, yet gives the user/sysadmin
the power to control the response time if they wish to.

This tunable is applied when we map the util into frequency.
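
As a rough standalone illustration of the effect (example numbers only;
the 26ms default below is just the little-CPU value discussed in the
review, and scale_util() is a hypothetical helper, not part of this
patch):

#include <stdio.h>

#define SCHED_CAPACITY_SCALE	1024
#define SCHED_CAPACITY_SHIFT	10

/*
 * Scale util by default_ms/user_ms, mirroring how the tunable biases the
 * util -> frequency mapping. Example values only.
 */
static unsigned long scale_util(unsigned long util, unsigned int default_ms,
				unsigned int user_ms)
{
	unsigned long mult = default_ms * SCHED_CAPACITY_SCALE / user_ms;

	return (util * mult) >> SCHED_CAPACITY_SHIFT;
}

int main(void)
{
	/* Assume a policy whose default response time is 26ms. */
	printf("%lu\n", scale_util(1024, 26, 13)); /* 2048: max perf reached in half the time */
	printf("%lu\n", scale_util(1024, 26, 52)); /*  512: top frequencies become unreachable */
	return 0;
}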

TODO: to retain previous behavior, we must multiply the default time by
80%.

Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
---
 Documentation/admin-guide/pm/cpufreq.rst | 19 ++++++-
 kernel/sched/cpufreq_schedutil.c         | 70 +++++++++++++++++++++++-
 2 files changed, 87 insertions(+), 2 deletions(-)

Comments

Dietmar Eggemann Sept. 6, 2023, 9:13 p.m. UTC | #1
On 28/08/2023 01:32, Qais Yousef wrote:

[...]

> @@ -427,6 +427,23 @@ This governor exposes only one tunable:
>  	The purpose of this tunable is to reduce the scheduler context overhead
>  	of the governor which might be excessive without it.
>  
> +``response_time_ms``
> +	Amount of time (in milliseconds) required to ramp the policy from
> +	lowest to highest frequency. Can be decreased to speed up the
> +	responsiveness of the system, or increased to slow the system down in
> +	the hope of saving power. The best perf/watt will depend on the system
> +	characteristics and the dominant workload you expect to run. For
> +	userspace that has smart context on the type of workload running (like
> +	in Android), one can tune this to suit the demand of that workload.
> +
> +	Note that when slowing the response down, you can end up effectively
> +	chopping off the top frequencies for that policy as the util is capped
> +	to 1024. On HMP systems where some CPUs have a capacity less than 1024,

HMP isn't used in mainline AFAIK. IMHO, the term `asymmetric CPU
capacity` systems is used.

[...]

> @@ -59,6 +61,45 @@ static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu);
>  
>  /************************ Governor internals ***********************/
>  
> +static inline u64 sugov_calc_freq_response_ms(struct sugov_policy *sg_policy)
> +{
> +	int cpu = cpumask_first(sg_policy->policy->cpus);
> +	unsigned long cap = capacity_orig_of(cpu);
> +
> +	return approximate_runtime(cap);
> +}

I can see the potential issue of schedutil being initialized earlier
than the `max frequency scaling of cpu_capacity_orig` happens in
drivers/base/arch_topology.c.

So the response_time_ms setup for a little CPU on Juno-r0 wouldn't
happen on cpu_capacity_orig = 446 -> 26ms but on the raw capacity
value from dt:

    capacity-dmips-mhz = <578>

So I would expect to see t = 32ms * ln(1 - 578/1024)/ln(0.5) = 38ms instead.
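
For reference, a quick userspace check of that arithmetic (assuming
approximate_runtime() follows the same ln-based form with a 32ms PELT
halflife):

#include <math.h>
#include <stdio.h>

/*
 * Time (in ms) for PELT to ramp util from 0 to 'cap', using the ln-based
 * form from the example above with a 32ms halflife.
 */
static double ramp_time_ms(double cap)
{
	return 32.0 * log(1.0 - cap / 1024.0) / log(0.5);
}

int main(void)
{
	printf("%.0f\n", ramp_time_ms(446.0));	/* ~26ms: scaled cpu_capacity_orig */
	printf("%.0f\n", ramp_time_ms(578.0));	/* ~38ms: raw capacity-dmips-mhz   */
	return 0;
}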

We have a similar dependency between `max frequency scaled
cpu_capacity_orig` and the EM setup code.

[...]
Qais Yousef Sept. 6, 2023, 9:52 p.m. UTC | #2
On 09/06/23 23:13, Dietmar Eggemann wrote:
> On 28/08/2023 01:32, Qais Yousef wrote:
> 
> [...]
> 
> > @@ -427,6 +427,23 @@ This governor exposes only one tunable:
> >  	The purpose of this tunable is to reduce the scheduler context overhead
> >  	of the governor which might be excessive without it.
> >  
> > +``response_time_ms``
> > +	Amount of time (in milliseconds) required to ramp the policy from
> > +	lowest to highest frequency. Can be decreased to speed up the
> > +	responsiveness of the system, or increased to slow the system down in
> > +	the hope of saving power. The best perf/watt will depend on the system
> > +	characteristics and the dominant workload you expect to run. For
> > +	userspace that has smart context on the type of workload running (like
> > +	in Android), one can tune this to suit the demand of that workload.
> > +
> > +	Note that when slowing the response down, you can end up effectively
> > +	chopping off the top frequencies for that policy as the util is capped
> > +	to 1024. On HMP systems where some CPUs have a capacity less than 1024,
> 
> HMP isn't used in mainline AFAIK. IMHO, the term `asymmetric CPU
> capacity` systems is used.

It's a shorter name and less of a mouthful to type; I think we should
start using it :)

> 
> [...]
> 
> > @@ -59,6 +61,45 @@ static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu);
> >  
> >  /************************ Governor internals ***********************/
> >  
> > +static inline u64 sugov_calc_freq_response_ms(struct sugov_policy *sg_policy)
> > +{
> > +	int cpu = cpumask_first(sg_policy->policy->cpus);
> > +	unsigned long cap = capacity_orig_of(cpu);
> > +
> > +	return approximate_runtime(cap);
> > +}
> 
> I can see the potential issue of schedutil being initialized earlier
> than the `max frequency scaling of cpu_capacity_orig` happens in
> drivers/base/arch_topology.c.
> 
> So the response_time_ms setup for a little CPU on Juno-r0 wouldn't
> happen on cpu_capacity_orig = 446 -> 26ms but on the raw capacity
> value from dt:
> 
>     capacity-dmips-mhz = <578>
> 
> So I would expect to see t = 32ms * ln(1 - 578/1024)/ln(0.5) = 38ms instead.
> 
> We have a similar dependency between `max frequency scaled
> cpu_capacity_orig` and the EM setup code.

Hmm, thanks for the pointer! That might help explain why I see wrong
values for the big core in my setup.

Would using arch_scale_cpu_capacity() help instead? Or do I need to find
a way to plug the race?


Thanks!

--
Qais Yousef
Peter Zijlstra Sept. 7, 2023, 11:44 a.m. UTC | #3
On Mon, Aug 28, 2023 at 12:32:01AM +0100, Qais Yousef wrote:
> +static inline unsigned long
> +sugov_apply_response_time(struct sugov_policy *sg_policy, unsigned long util)
> +{
> +	unsigned long mult;
> +
> +	if (sg_policy->freq_response_time_ms == sg_policy->tunables->response_time_ms)
> +		return util;
> +
> +	mult = sg_policy->freq_response_time_ms * SCHED_CAPACITY_SCALE;
> +	mult /=	sg_policy->tunables->response_time_ms;
> +	mult *= util;
> +
> +	return mult >> SCHED_CAPACITY_SHIFT;
> +}
> +
>  static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
>  {
>  	s64 delta_ns;
> @@ -143,6 +184,7 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
>  	unsigned int freq = arch_scale_freq_invariant() ?
>  				policy->cpuinfo.max_freq : policy->cur;
>  
> +	util = sugov_apply_response_time(sg_policy, util);
>  	freq = map_util_freq(util, freq, max);
>  
>  	if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update)

Urgh, so instead of caching the multiplier you keep computing what is
essentially a constant over and over and over and over again :/

That is, compute the whole 'freq_response_time_ms * SCHED_CAPACITY_SCALE
/ response_time_ms' thing *once*, when that file is written to, and then
reduce the whole thing to:

	return (freq_response_mult * util) >> SCHED_CAPACITY_SHIFT;

No need for that special case, no need for divisions, just go.
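
Something like the following, a minimal sketch of that suggestion
(illustrative only: the cached sg_policy->freq_response_mult field is
new here, and the list walk over the attr_set mirrors what
rate_limit_us_store() already does):

static ssize_t
response_time_ms_store(struct gov_attr_set *attr_set, const char *buf, size_t count)
{
	struct sugov_tunables *tunables = to_sugov_tunables(attr_set);
	struct sugov_policy *sg_policy;
	unsigned int response_time_ms;

	if (kstrtouint(buf, 10, &response_time_ms) || !response_time_ms)
		return -EINVAL;

	tunables->response_time_ms = response_time_ms;

	/* Recompute the constant once, for every policy sharing the tunables. */
	list_for_each_entry(sg_policy, &attr_set->policy_list, tunables_hook)
		sg_policy->freq_response_mult =
			sg_policy->freq_response_time_ms * SCHED_CAPACITY_SCALE /
			response_time_ms;

	return count;
}

static inline unsigned long
sugov_apply_response_time(struct sugov_policy *sg_policy, unsigned long util)
{
	/* Hot path: one multiply and one shift, no division. */
	return (sg_policy->freq_response_mult * util) >> SCHED_CAPACITY_SHIFT;
}

with freq_response_mult initialized to SCHED_CAPACITY_SCALE (or
recomputed in sugov_start()) so the default stays a no-op.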
Qais Yousef Sept. 10, 2023, 7:25 p.m. UTC | #4
On 09/07/23 13:44, Peter Zijlstra wrote:
> On Mon, Aug 28, 2023 at 12:32:01AM +0100, Qais Yousef wrote:
> > +static inline unsigned long
> > +sugov_apply_response_time(struct sugov_policy *sg_policy, unsigned long util)
> > +{
> > +	unsigned long mult;
> > +
> > +	if (sg_policy->freq_response_time_ms == sg_policy->tunables->response_time_ms)
> > +		return util;
> > +
> > +	mult = sg_policy->freq_response_time_ms * SCHED_CAPACITY_SCALE;
> > +	mult /=	sg_policy->tunables->response_time_ms;
> > +	mult *= util;
> > +
> > +	return mult >> SCHED_CAPACITY_SHIFT;
> > +}
> > +
> >  static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
> >  {
> >  	s64 delta_ns;
> > @@ -143,6 +184,7 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
> >  	unsigned int freq = arch_scale_freq_invariant() ?
> >  				policy->cpuinfo.max_freq : policy->cur;
> >  
> > +	util = sugov_apply_response_time(sg_policy, util);
> >  	freq = map_util_freq(util, freq, max);
> >  
> >  	if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update)
> 
> Urgh, so instead of caching the multiplier you keep computing what is
> essentially a constant over and over and over and over again :/
> 
> That is, compute the whole 'freq_response_time_ms * SCHED_CAPACITY_SCALE
> / response_time_ms' thing *once*, when that file is written to, and then
> reduce the whole thing to:
> 
> 	return (freq_response_mult * util) >> SCHED_CAPACITY_SHIFT;
> 
> No need for that special case, no need for divisions, just go.

Yes! I was too focused on whether I was doing the right thing; I didn't
stop to think that this is actually a constant and can be computed once.
I will fix it if this knob ends up hanging around.


Thanks!

--
Qais Yousef

Patch

diff --git a/Documentation/admin-guide/pm/cpufreq.rst b/Documentation/admin-guide/pm/cpufreq.rst
index 6adb7988e0eb..c43df0e716a7 100644
--- a/Documentation/admin-guide/pm/cpufreq.rst
+++ b/Documentation/admin-guide/pm/cpufreq.rst
@@ -417,7 +417,7 @@  is passed by the scheduler to the governor callback which causes the frequency
 to go up to the allowed maximum immediately and then draw back to the value
 returned by the above formula over time.
 
-This governor exposes only one tunable:
+This governor exposes two tunables:
 
 ``rate_limit_us``
 	Minimum time (in microseconds) that has to pass between two consecutive
@@ -427,6 +427,23 @@  This governor exposes only one tunable:
 	The purpose of this tunable is to reduce the scheduler context overhead
 	of the governor which might be excessive without it.
 
+``response_time_ms``
+	Amount of time (in milliseconds) required to ramp the policy from
+	lowest to highest frequency. Can be decreased to speed up the
+	responsiveness of the system, or increased to slow the system down in
+	the hope of saving power. The best perf/watt will depend on the system
+	characteristics and the dominant workload you expect to run. For
+	userspace that has smart context on the type of workload running (like
+	in Android), one can tune this to suit the demand of that workload.
+
+	Note that when slowing the response down, you can end up effectively
+	chopping off the top frequencies for that policy as the util is capped
+	to 1024. On HMP systems where some CPUs have a capacity less than 1024,
+	unless affinity is used, a task will probably have migrated to a bigger
+	core before reaching the max performance of the policy. If it is locked
+	to that policy, then it should reach the max performance after the
+	specified time.
+
 This governor generally is regarded as a replacement for the older `ondemand`_
 and `conservative`_ governors (described below), as it is simpler and more
 tightly integrated with the CPU scheduler, its overhead in terms of CPU context
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 04aa06846f31..42f4c4100902 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -11,6 +11,7 @@ 
 struct sugov_tunables {
 	struct gov_attr_set	attr_set;
 	unsigned int		rate_limit_us;
+	unsigned int		response_time_ms;
 };
 
 struct sugov_policy {
@@ -22,6 +23,7 @@  struct sugov_policy {
 	raw_spinlock_t		update_lock;
 	u64			last_freq_update_time;
 	s64			freq_update_delay_ns;
+	unsigned int		freq_response_time_ms;
 	unsigned int		next_freq;
 	unsigned int		cached_raw_freq;
 
@@ -59,6 +61,45 @@  static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu);
 
 /************************ Governor internals ***********************/
 
+static inline u64 sugov_calc_freq_response_ms(struct sugov_policy *sg_policy)
+{
+	int cpu = cpumask_first(sg_policy->policy->cpus);
+	unsigned long cap = capacity_orig_of(cpu);
+
+	return approximate_runtime(cap);
+}
+
+/*
+ * Shrink or expand how long it takes to reach the maximum performance of the
+ * policy.
+ *
+ * sg_policy->freq_response_time_ms is a constant value defined by PELT
+ * HALFLIFE and the capacity of the policy (assuming HMP systems).
+ *
+ * sg_policy->tunables->response_time_ms is a user defined response time. By
+ * setting it lower than sg_policy->freq_response_time_ms, the system will
+ * respond faster to changes in util, which will result in reaching maximum
+ * performance point quicker. By setting it higher, it'll slow down the amount
+ * of time required to reach the maximum OPP.
+ *
+ * This should be applied when selecting the frequency. By default no
+ * conversion is done and we should return util as-is.
+ */
+static inline unsigned long
+sugov_apply_response_time(struct sugov_policy *sg_policy, unsigned long util)
+{
+	unsigned long mult;
+
+	if (sg_policy->freq_response_time_ms == sg_policy->tunables->response_time_ms)
+		return util;
+
+	mult = sg_policy->freq_response_time_ms * SCHED_CAPACITY_SCALE;
+	mult /=	sg_policy->tunables->response_time_ms;
+	mult *= util;
+
+	return mult >> SCHED_CAPACITY_SHIFT;
+}
+
 static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
 {
 	s64 delta_ns;
@@ -143,6 +184,7 @@  static unsigned int get_next_freq(struct sugov_policy *sg_policy,
 	unsigned int freq = arch_scale_freq_invariant() ?
 				policy->cpuinfo.max_freq : policy->cur;
 
+	util = sugov_apply_response_time(sg_policy, util);
 	freq = map_util_freq(util, freq, max);
 
 	if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update)
@@ -539,8 +581,32 @@  rate_limit_us_store(struct gov_attr_set *attr_set, const char *buf, size_t count
 
 static struct governor_attr rate_limit_us = __ATTR_RW(rate_limit_us);
 
+static ssize_t response_time_ms_show(struct gov_attr_set *attr_set, char *buf)
+{
+	struct sugov_tunables *tunables = to_sugov_tunables(attr_set);
+
+	return sprintf(buf, "%u\n", tunables->response_time_ms);
+}
+
+static ssize_t
+response_time_ms_store(struct gov_attr_set *attr_set, const char *buf, size_t count)
+{
+	struct sugov_tunables *tunables = to_sugov_tunables(attr_set);
+	unsigned int response_time_ms;
+
+	if (kstrtouint(buf, 10, &response_time_ms))
+		return -EINVAL;
+
+	tunables->response_time_ms = response_time_ms;
+
+	return count;
+}
+
+static struct governor_attr response_time_ms = __ATTR_RW(response_time_ms);
+
 static struct attribute *sugov_attrs[] = {
 	&rate_limit_us.attr,
+	&response_time_ms.attr,
 	NULL
 };
 ATTRIBUTE_GROUPS(sugov);
@@ -704,6 +770,7 @@  static int sugov_init(struct cpufreq_policy *policy)
 	}
 
 	tunables->rate_limit_us = cpufreq_policy_transition_delay_us(policy);
+	tunables->response_time_ms = sugov_calc_freq_response_ms(sg_policy);
 
 	policy->governor_data = sg_policy;
 	sg_policy->tunables = tunables;
@@ -763,7 +830,8 @@  static int sugov_start(struct cpufreq_policy *policy)
 	void (*uu)(struct update_util_data *data, u64 time, unsigned int flags);
 	unsigned int cpu;
 
-	sg_policy->freq_update_delay_ns	= sg_policy->tunables->rate_limit_us * NSEC_PER_USEC;
+	sg_policy->freq_update_delay_ns		= sg_policy->tunables->rate_limit_us * NSEC_PER_USEC;
+	sg_policy->freq_response_time_ms	= sugov_calc_freq_response_ms(sg_policy);
 	sg_policy->last_freq_update_time	= 0;
 	sg_policy->next_freq			= 0;
 	sg_policy->work_in_progress		= false;