diff mbox series

[v2,02/10] cpufreq: provide data for frequency-invariant load-tracking support

Message ID 20170706094948.8779-3-dietmar.eggemann@arm.com
State New
Headers show
Series arm, arm64: frequency- and cpu-invariant accounting support for task scheduler | expand

Commit Message

Dietmar Eggemann July 6, 2017, 9:49 a.m. UTC
A frequency-invariant load-tracking solution based on cpufreq transition
notifier will not work for future fast frequency switching policies.
That is why a different solution is presented with this patch.

Let cpufreq call the function arch_set_freq_scale() to pass the current
frequency, the max supported frequency and the cpumask of the related
cpus to a consumer (an arch) which defines arch_set_freq_scale().

The consumer has to associate arch_set_freq_scale with the name of its
own implementation foo_set_freq_scale() to overwrite the empty standard
definition in drivers/cpufreq/cpufreq.c.
An arch could do this in one of its arch-specific header files
(e.g. arch/$ARCH/include/asm/topology.h) which gets included in
drivers/cpufreq/cpufreq.c.

In case arch_set_freq_scale() is not defined (and because of the
pr_debug() drivers/cpufreq/cpufreq.c is not compiled with -DDEBUG) the
function cpufreq_set_freq_scale() gets compiled out.

Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>

---
 drivers/cpufreq/cpufreq.c | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

-- 
2.11.0

Comments

Viresh Kumar July 6, 2017, 10:40 a.m. UTC | #1
On 06-07-17, 10:49, Dietmar Eggemann wrote:
> A frequency-invariant load-tracking solution based on cpufreq transition

> notifier will not work for future fast frequency switching policies.

> That is why a different solution is presented with this patch.

> 

> Let cpufreq call the function arch_set_freq_scale() to pass the current

> frequency, the max supported frequency and the cpumask of the related

> cpus to a consumer (an arch) which defines arch_set_freq_scale().

> 

> The consumer has to associate arch_set_freq_scale with the name of its

> own implementation foo_set_freq_scale() to overwrite the empty standard

> definition in drivers/cpufreq/cpufreq.c.

> An arch could do this in one of its arch-specific header files

> (e.g. arch/$ARCH/include/asm/topology.h) which gets included in

> drivers/cpufreq/cpufreq.c.

> 

> In case arch_set_freq_scale() is not defined (and because of the

> pr_debug() drivers/cpufreq/cpufreq.c is not compiled with -DDEBUG)


The line within () needs to be improved to convey a clear message.

> the

> function cpufreq_set_freq_scale() gets compiled out.

> 

> Cc: Rafael J. Wysocki <rjw@rjwysocki.net>

> Cc: Viresh Kumar <viresh.kumar@linaro.org>

> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>

> ---

>  drivers/cpufreq/cpufreq.c | 26 ++++++++++++++++++++++++++

>  1 file changed, 26 insertions(+)

> 

> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c

> index 9bf97a366029..a04c5886a5ce 100644

> --- a/drivers/cpufreq/cpufreq.c

> +++ b/drivers/cpufreq/cpufreq.c

> @@ -347,6 +347,28 @@ static void __cpufreq_notify_transition(struct cpufreq_policy *policy,

>  	}

>  }

>  

> +/*********************************************************************

> + *           FREQUENCY INVARIANT CPU CAPACITY SUPPORT                *

> + *********************************************************************/

> +

> +#ifndef arch_set_freq_scale

> +static void arch_set_freq_scale(struct cpumask *cpus, unsigned long cur_freq,

> +				unsigned long max_freq)

> +{}

> +#endif

> +

> +static void cpufreq_set_freq_scale(struct cpufreq_policy *policy,

> +				   struct cpufreq_freqs *freqs)

> +{

> +	unsigned long cur_freq = freqs ? freqs->new : policy->cur;

> +	unsigned long max_freq = policy->cpuinfo.max_freq;

> +

> +	pr_debug("cpus %*pbl cur/cur max freq %lu/%lu kHz\n",

> +		 cpumask_pr_args(policy->related_cpus), cur_freq, max_freq);

> +

> +	arch_set_freq_scale(policy->related_cpus, cur_freq, max_freq);


I am not sure why all these are required to be sent here and will come back to
it later on after going through other patches.

> +}

> +

>  /**

>   * cpufreq_notify_transition - call notifier chain and adjust_jiffies

>   * on frequency transition.

> @@ -405,6 +427,8 @@ void cpufreq_freq_transition_begin(struct cpufreq_policy *policy,

>  

>  	spin_unlock(&policy->transition_lock);

>  

> +	cpufreq_set_freq_scale(policy, freqs);

> +


Why do this before even changing the frequency ? We may fail while changing it.

IMHO, you should call this routine whenever we update policy->cur and that
happens regularly in __cpufreq_notify_transition() and few other places..

>  	cpufreq_notify_transition(policy, freqs, CPUFREQ_PRECHANGE);

>  }

>  EXPORT_SYMBOL_GPL(cpufreq_freq_transition_begin);

> @@ -2203,6 +2227,8 @@ static int cpufreq_set_policy(struct cpufreq_policy *policy,

>  	blocking_notifier_call_chain(&cpufreq_policy_notifier_list,

>  			CPUFREQ_NOTIFY, new_policy);

>  

> +	cpufreq_set_freq_scale(new_policy, NULL);


Why added it here ? To get it initialized ? If yes, then we should do that in
cpufreq_online() where we first initialize policy->cur.

Apart from this, you also need to update this in the schedutil governor (if you
haven't done that in this series later) as that also updates policy->cur in the
fast path.

-- 
viresh
Rafael J. Wysocki July 7, 2017, 4:18 p.m. UTC | #2
On Fri, Jul 7, 2017 at 6:01 PM, Dietmar Eggemann
<dietmar.eggemann@arm.com> wrote:
> On 06/07/17 11:40, Viresh Kumar wrote:

>> On 06-07-17, 10:49, Dietmar Eggemann wrote:

>

> [...]

>

>>> In case arch_set_freq_scale() is not defined (and because of the

>>> pr_debug() drivers/cpufreq/cpufreq.c is not compiled with -DDEBUG)

>>

>> The line within () needs to be improved to convey a clear message.

>

> Probably not needed anymore. See below.

>

> [...]

>

>>> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c

>>> index 9bf97a366029..a04c5886a5ce 100644

>>> --- a/drivers/cpufreq/cpufreq.c

>>> +++ b/drivers/cpufreq/cpufreq.c

>>> @@ -347,6 +347,28 @@ static void __cpufreq_notify_transition(struct cpufreq_policy *policy,

>>>      }

>>>  }

>>>

>>> +/*********************************************************************

>>> + *           FREQUENCY INVARIANT CPU CAPACITY SUPPORT                *

>>> + *********************************************************************/

>>> +

>>> +#ifndef arch_set_freq_scale

>>> +static void arch_set_freq_scale(struct cpumask *cpus, unsigned long cur_freq,

>>> +                            unsigned long max_freq)

>>> +{}

>>> +#endif

>>> +

>>> +static void cpufreq_set_freq_scale(struct cpufreq_policy *policy,

>>> +                               struct cpufreq_freqs *freqs)

>>> +{

>>> +    unsigned long cur_freq = freqs ? freqs->new : policy->cur;

>>> +    unsigned long max_freq = policy->cpuinfo.max_freq;

>>> +

>>> +    pr_debug("cpus %*pbl cur/cur max freq %lu/%lu kHz\n",

>>> +             cpumask_pr_args(policy->related_cpus), cur_freq, max_freq);

>>> +

>>> +    arch_set_freq_scale(policy->related_cpus, cur_freq, max_freq);

>>

>> I am not sure why all these are required to be sent here and will come back to

>> it later on after going through other patches.

>

> See below.

>

>>> +}

>>> +

>>>  /**

>>>   * cpufreq_notify_transition - call notifier chain and adjust_jiffies

>>>   * on frequency transition.

>>> @@ -405,6 +427,8 @@ void cpufreq_freq_transition_begin(struct cpufreq_policy *policy,

>>>

>>>      spin_unlock(&policy->transition_lock);

>>>

>>> +    cpufreq_set_freq_scale(policy, freqs);

>>> +

>>

>> Why do this before even changing the frequency ? We may fail while changing it.

>>

>> IMHO, you should call this routine whenever we update policy->cur and that

>> happens regularly in __cpufreq_notify_transition() and few other places..

>

> See below.

>

>>>      cpufreq_notify_transition(policy, freqs, CPUFREQ_PRECHANGE);

>>>  }

>>>  EXPORT_SYMBOL_GPL(cpufreq_freq_transition_begin);

>>> @@ -2203,6 +2227,8 @@ static int cpufreq_set_policy(struct cpufreq_policy *policy,

>>>      blocking_notifier_call_chain(&cpufreq_policy_notifier_list,

>>>                      CPUFREQ_NOTIFY, new_policy);

>>>

>>> +    cpufreq_set_freq_scale(new_policy, NULL);

>>

>> Why added it here ? To get it initialized ? If yes, then we should do that in

>> cpufreq_online() where we first initialize policy->cur.

>

> I agree. This can go away. Initialization is not really needed here. We initialize

> the scale values to SCHED_CAPACITY_SCALE at boot-time.

>

>> Apart from this, you also need to update this in the schedutil governor (if you

>> haven't done that in this series later) as that also updates policy->cur in the

>> fast path.

>

> So what about I call arch_set_freq_scale() in __cpufreq_notify_transition() in the

> CPUFREQ_POSTCHANGE case for slow-switching and in cpufreq_driver_fast_switch() for

> fast-switching?


Why don't you do this in drivers instead of in the core?

Ultimately, the driver knows what frequency it has requested, so why
can't it call arch_set_freq_scale()?

Thanks,
Rafael
Viresh Kumar July 11, 2017, 6:01 a.m. UTC | #3
On 10-07-17, 13:02, Dietmar Eggemann wrote:
> Yes, I will change this. The #define approach is not really necessary

> here since we're not in the scheduler hot-path and inlining is not

> really required here.


It would be part of scheduler hot-path for the fast-switching case, isn't it ?
(I am not arguing against using weak functions, just wanted to correct above
statement).

-- 
viresh
Rafael J. Wysocki July 11, 2017, 2:59 p.m. UTC | #4
On Tuesday, July 11, 2017 04:06:01 PM Dietmar Eggemann wrote:
> On 11/07/17 07:01, Viresh Kumar wrote:

> > On 10-07-17, 13:02, Dietmar Eggemann wrote:

> >> Yes, I will change this. The #define approach is not really necessary

> >> here since we're not in the scheduler hot-path and inlining is not

> >> really required here.

> > 

> > It would be part of scheduler hot-path for the fast-switching case, isn't it ?

> > (I am not arguing against using weak functions, just wanted to correct above

> > statement).

> 

> Yes you're right here.

> 

> But in the meantime we're convinced that cpufreq_driver_fast_switch() is

> not the right place to call arch_set_freq_scale() since for (future)

> arm/arm64 fast-switch driver, the return value of

> cpufreq_driver->fast_switch() does not give us the information that the

> frequency value did actually change.

> 

> So we probably have to do this soemwhere in the cpufreq driver(s) to

> support fast-switching until we have aperf/mperf like counters.


If that's the case, I'd say call arch_set_freq_scale() from drivers in all
cases or it will get *really* confusing.

Thanks,
Rafael
Dietmar Eggemann July 11, 2017, 3:06 p.m. UTC | #5
On 11/07/17 07:01, Viresh Kumar wrote:
> On 10-07-17, 13:02, Dietmar Eggemann wrote:

>> Yes, I will change this. The #define approach is not really necessary

>> here since we're not in the scheduler hot-path and inlining is not

>> really required here.

> 

> It would be part of scheduler hot-path for the fast-switching case, isn't it ?

> (I am not arguing against using weak functions, just wanted to correct above

> statement).


Yes you're right here.

But in the meantime we're convinced that cpufreq_driver_fast_switch() is
not the right place to call arch_set_freq_scale() since for (future)
arm/arm64 fast-switch driver, the return value of
cpufreq_driver->fast_switch() does not give us the information that the
frequency value did actually change.

So we probably have to do this soemwhere in the cpufreq driver(s) to
support fast-switching until we have aperf/mperf like counters.
Dietmar Eggemann July 11, 2017, 3:12 p.m. UTC | #6
On 11/07/17 15:59, Rafael J. Wysocki wrote:
> On Tuesday, July 11, 2017 04:06:01 PM Dietmar Eggemann wrote:

>> On 11/07/17 07:01, Viresh Kumar wrote:

>>> On 10-07-17, 13:02, Dietmar Eggemann wrote:

>>>> Yes, I will change this. The #define approach is not really necessary

>>>> here since we're not in the scheduler hot-path and inlining is not

>>>> really required here.

>>>

>>> It would be part of scheduler hot-path for the fast-switching case, isn't it ?

>>> (I am not arguing against using weak functions, just wanted to correct above

>>> statement).

>>

>> Yes you're right here.

>>

>> But in the meantime we're convinced that cpufreq_driver_fast_switch() is

>> not the right place to call arch_set_freq_scale() since for (future)

>> arm/arm64 fast-switch driver, the return value of

>> cpufreq_driver->fast_switch() does not give us the information that the

>> frequency value did actually change.

>>

>> So we probably have to do this soemwhere in the cpufreq driver(s) to

>> support fast-switching until we have aperf/mperf like counters.

> 

> If that's the case, I'd say call arch_set_freq_scale() from drivers in all

> cases or it will get *really* confusing.


Agreed, we should do it for slow-switching drivers from within the
driver as well in this case.
Viresh Kumar July 12, 2017, 4:09 a.m. UTC | #7
On 11-07-17, 16:06, Dietmar Eggemann wrote:
> But in the meantime we're convinced that cpufreq_driver_fast_switch() is

> not the right place to call arch_set_freq_scale() since for (future)

> arm/arm64 fast-switch driver, the return value of

> cpufreq_driver->fast_switch() does not give us the information that the

> frequency value did actually change.


Yeah, I saw your discussion with Peter on #linux-rt IRC and TBH I wasn't aware
that we are going to do fast switching that way. Just trying to get
understanding of that idea a bit..

So we will do fast switching from scheduler's point of view, i.e. we wouldn't
schedule a kthread to change the frequency. But the real hardware still can't do
that without sleeping, like if we have I2C somewhere in between. AFAIU, we will
still have some kind of *software* bottom half to do that work, isn't it? And it
wouldn't be that we have pushed some instructions to the hardware, which it can
do a bit later.

For example, the regulator may be accessed via I2C and we need to program that
before changing the clock. So, it will be done by some software code only.

And now I am wondering on why that would be any better than the kthread in
schedutil. Sorry, I haven't understood the idea completely yet :(

-- 
viresh
Peter Zijlstra July 12, 2017, 8:31 a.m. UTC | #8
On Wed, Jul 12, 2017 at 09:39:17AM +0530, Viresh Kumar wrote:
> Yeah, I saw your discussion with Peter on #linux-rt IRC and TBH I wasn't aware

> that we are going to do fast switching that way. Just trying to get

> understanding of that idea a bit..

> 

> So we will do fast switching from scheduler's point of view, i.e. we wouldn't

> schedule a kthread to change the frequency. But the real hardware still can't do

> that without sleeping, like if we have I2C somewhere in between. AFAIU, we will

> still have some kind of *software* bottom half to do that work, isn't it? And it

> wouldn't be that we have pushed some instructions to the hardware, which it can

> do a bit later.

> 

> For example, the regulator may be accessed via I2C and we need to program that

> before changing the clock. So, it will be done by some software code only.

> 

> And now I am wondering on why that would be any better than the kthread in

> schedutil. Sorry, I haven't understood the idea completely yet :(


So the problem with the thread is two-fold; one the one hand we like the
scheduler to directly set frequency, but then we need to schedule a task
to change the frequency, which will change the frequency and around we
go.

On the other hand, there's very nasty issues with PI. This thread would
have very high priority (otherwise the SCHED_DEADLINE stuff won't work)
but that then means this thread needs to boost the owner of the i2c
mutex. And that then creates a massive bandwidth accounting hole.


The advantage of using an interrupt driven state machine is that all
those issues go away.

But yes, whichever way around you turn things, its crap. But given the
hardware its the best we can do.
Sudeep Holla July 13, 2017, 2:04 p.m. UTC | #9
On 12/07/17 12:14, Peter Zijlstra wrote:
> On Wed, Jul 12, 2017 at 02:57:55PM +0530, Viresh Kumar wrote:

>> On 12-07-17, 10:31, Peter Zijlstra wrote:

>>> So the problem with the thread is two-fold; one the one hand we like the

>>> scheduler to directly set frequency, but then we need to schedule a task

>>> to change the frequency, which will change the frequency and around we

>>> go.

>>>

>>> On the other hand, there's very nasty issues with PI. This thread would

>>> have very high priority (otherwise the SCHED_DEADLINE stuff won't work)

>>> but that then means this thread needs to boost the owner of the i2c

>>> mutex. And that then creates a massive bandwidth accounting hole.

>>>

>>>

>>> The advantage of using an interrupt driven state machine is that all

>>> those issues go away.

>>>

>>> But yes, whichever way around you turn things, its crap. But given the

>>> hardware its the best we can do.

>>

>> Thanks for the explanation Peter.

>>

>> IIUC, it will take more time to change the frequency eventually with

>> the interrupt-driven state machine as there may be multiple bottom

>> halves involved here, for supply, clk, etc, which would run at normal

>> priorities now. And those were boosted currently due to the high

>> priority sugov thread. And we are fine with that (from performance

>> point of view) ?

> 

> I'm not sure what you mean; bottom halves as in softirq? From what I can

> tell an i2c bus does clk_prepare_enable() on registration and from that

> point on clk_enable() is usable from atomic contexts. But afaict clk

> stuff doesn't do interrupts at all.

> 

> (with a note that I absolutely hate the clk locking)

> 


Agreed. Juri pointed out this as a blocker a while ago and when we
started implementing the new and shiny ARM SCMI specification, I dropped
the whole clock layer interaction for the CPUFreq driver. However, I
still have to deal with some mailbox locking(still experimenting currently)

> I think the interrupt driven thing can actually be faster than the

> 'regular' task waiting on the mutex. The regulator message can be

> locklessly queued (it only ever makes sense to have 1 such message

> pending, any later one will invalidate a prior one).

> 


Ah OK, I just asked the same in the other thread, you have already
answered me. Good we can ignore.

> Then the i2c interrupt can detect the availability of this pending

> message and splice it into the transfer queue at an opportune moment.

> 

> (of course, the current i2c bits don't support any of that)

> 

>> Coming back to where we started from (where should we call

>> arch_set_freq_scale() from ?).

> 

> The drivers.. the core cpufreq doesn't know when (if any) transition is

> completed.

> 

>> I think we would still need some kind of synchronization between

>> cpufreq core and the cpufreq drivers to make sure we don't start

>> another freq change before the previous one is complete. Otherwise

>> the cpufreq drivers would be required to have similar support with

>> proper locking in place.

> 

> Not sure what you mean; also not sure why. On x86 we never know, cannot

> know. So why would this stuff be any different.

> 


Good, I was under the same assumption that it's okay to override the old
request with new.

>> And if the core is going to get notified about successful freq changes

>> (which it should IMHO), then it may still be better to call

>> arch_set_freq_scale() from the core itself and not from individual

>> drivers.

> 

> I would not involve the core. All we want from the core is a unified

> interface towards requesting DVFS changes. Everything that happens after

> is not its business.

> 


The question is whether we *need* to know the completion of frequency
transition. What is the impact of absence of it ? I am considering
platforms which may take up to a ms or more to do the actual transition
in the firmware.

-- 
Regards,
Sudeep
Peter Zijlstra July 13, 2017, 2:42 p.m. UTC | #10
On Thu, Jul 13, 2017 at 03:04:09PM +0100, Sudeep Holla wrote:

> The question is whether we *need* to know the completion of frequency

> transition. What is the impact of absence of it ? I am considering

> platforms which may take up to a ms or more to do the actual transition

> in the firmware.


So on x86 we can recover from not knowing by means of the APERF/MPERF
thing, which gives us the average effective frequency over the last
period.

If you lack that you need something to update the actual effective
frequency.

Changing the effective frequency at request time might confuse things --
esp. if the request might not be honoured at all or can take a
significant time to complete. Not to mention that _IF_ you rely on the
effective frequency to set other clocks things can come unstuck.

So unless you go the whole distance and do APERF/MPERF like things, I
think it would be very good to have a notification of completion (and
possibly a read-back of the effective frequency that is now set).
diff mbox series

Patch

diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 9bf97a366029..a04c5886a5ce 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -347,6 +347,28 @@  static void __cpufreq_notify_transition(struct cpufreq_policy *policy,
 	}
 }
 
+/*********************************************************************
+ *           FREQUENCY INVARIANT CPU CAPACITY SUPPORT                *
+ *********************************************************************/
+
+#ifndef arch_set_freq_scale
+static void arch_set_freq_scale(struct cpumask *cpus, unsigned long cur_freq,
+				unsigned long max_freq)
+{}
+#endif
+
+static void cpufreq_set_freq_scale(struct cpufreq_policy *policy,
+				   struct cpufreq_freqs *freqs)
+{
+	unsigned long cur_freq = freqs ? freqs->new : policy->cur;
+	unsigned long max_freq = policy->cpuinfo.max_freq;
+
+	pr_debug("cpus %*pbl cur/cur max freq %lu/%lu kHz\n",
+		 cpumask_pr_args(policy->related_cpus), cur_freq, max_freq);
+
+	arch_set_freq_scale(policy->related_cpus, cur_freq, max_freq);
+}
+
 /**
  * cpufreq_notify_transition - call notifier chain and adjust_jiffies
  * on frequency transition.
@@ -405,6 +427,8 @@  void cpufreq_freq_transition_begin(struct cpufreq_policy *policy,
 
 	spin_unlock(&policy->transition_lock);
 
+	cpufreq_set_freq_scale(policy, freqs);
+
 	cpufreq_notify_transition(policy, freqs, CPUFREQ_PRECHANGE);
 }
 EXPORT_SYMBOL_GPL(cpufreq_freq_transition_begin);
@@ -2203,6 +2227,8 @@  static int cpufreq_set_policy(struct cpufreq_policy *policy,
 	blocking_notifier_call_chain(&cpufreq_policy_notifier_list,
 			CPUFREQ_NOTIFY, new_policy);
 
+	cpufreq_set_freq_scale(new_policy, NULL);
+
 	policy->min = new_policy->min;
 	policy->max = new_policy->max;