[v3,3/6] cpufreq: Add an interface to mark inefficient frequencies

Message ID 1622804761-126737-4-git-send-email-vincent.donnefort@arm.com
State Superseded
Headers show
Series
  • EM / PM: Inefficient OPPs
Related show

Commit Message

Vincent Donnefort June 4, 2021, 11:05 a.m.
Some SoCs such as the sd855 have OPPs within the same policy whose cost is
higher than others with a higher frequency. Those OPPs are inefficients and
it might be interesting for a governor to not use them.

Adding a flag, CPUFREQ_INEFFICIENT_FREQ, to mark such OPPs into the
frequency table, as well as a new cpufreq_frequency_table member
"efficient". This new member will allow a governor to quickly resolve an
inefficient frequency to an efficient one.

Efficient OPPs point to themselves. Governors must also ensure that the
efficiency resolution does not break the policy maximum.

Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>

Comments

Viresh Kumar June 7, 2021, 5:02 a.m. | #1
On 04-06-21, 12:05, Vincent Donnefort wrote:
> Some SoCs such as the sd855 have OPPs within the same policy whose cost is

> higher than others with a higher frequency. Those OPPs are inefficients and

> it might be interesting for a governor to not use them.

> 

> Adding a flag, CPUFREQ_INEFFICIENT_FREQ, to mark such OPPs into the

> frequency table, as well as a new cpufreq_frequency_table member

> "efficient". This new member will allow a governor to quickly resolve an

> inefficient frequency to an efficient one.

> 

> Efficient OPPs point to themselves. Governors must also ensure that the

> efficiency resolution does not break the policy maximum.

> 

> Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>


I was thinking about a different approach when I suggested it.

- The cpufreq table instead of an index, will have "efficient" as bool.

- EM will set an OPP as efficient or inefficient, based on the OPP
  table, there will be a flag for this in the OPP table.

- The cpufreq table is then created from OPP table and this
  information is automatically retrieved.

-- 
viresh
Lukasz Luba June 7, 2021, 10:14 a.m. | #2
On 6/7/21 6:02 AM, Viresh Kumar wrote:
> On 04-06-21, 12:05, Vincent Donnefort wrote:

>> Some SoCs such as the sd855 have OPPs within the same policy whose cost is

>> higher than others with a higher frequency. Those OPPs are inefficients and

>> it might be interesting for a governor to not use them.

>>

>> Adding a flag, CPUFREQ_INEFFICIENT_FREQ, to mark such OPPs into the

>> frequency table, as well as a new cpufreq_frequency_table member

>> "efficient". This new member will allow a governor to quickly resolve an

>> inefficient frequency to an efficient one.

>>

>> Efficient OPPs point to themselves. Governors must also ensure that the

>> efficiency resolution does not break the policy maximum.

>>

>> Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>

> 

> I was thinking about a different approach when I suggested it.

> 

> - The cpufreq table instead of an index, will have "efficient" as bool.

> 

> - EM will set an OPP as efficient or inefficient, based on the OPP

>    table, there will be a flag for this in the OPP table.

> 

> - The cpufreq table is then created from OPP table and this

>    information is automatically retrieved.

> 


There are some issues with your proposed approach:
The EM doesn't depend on OPP framework (even no header from opp.h)
and we don't want to add this dependency in EM.

The Vincent's proposed implementation doesn't introduce this
dependency.

Regards,
Lukasz
Viresh Kumar June 14, 2021, 7:28 a.m. | #3
On 04-06-21, 12:05, Vincent Donnefort wrote:
>  int cpufreq_table_validate_and_sort(struct cpufreq_policy *policy)

>  {

>  	int ret;

> @@ -362,7 +409,13 @@ int cpufreq_table_validate_and_sort(struct cpufreq_policy *policy)

>  	if (ret)

>  		return ret;

>  

> -	return set_freq_table_sorted(policy);

> +	ret = set_freq_table_sorted(policy);

> +	if (ret)

> +		return ret;

> +

> +	set_freq_table_efficiencies(policy);

> +

> +	return ret;

>  }


Lets provide a single callback from the cpufreq core to do all
settings related to efficient frequencies and call it from
em_dev_register_perf_domain() ?

So we only do them for the cpufreq drivers that register themselves
with EM.

-- 
viresh
Vincent Donnefort June 14, 2021, 1:35 p.m. | #4
On Mon, Jun 14, 2021 at 12:58:35PM +0530, Viresh Kumar wrote:
> On 04-06-21, 12:05, Vincent Donnefort wrote:

> >  int cpufreq_table_validate_and_sort(struct cpufreq_policy *policy)

> >  {

> >  	int ret;

> > @@ -362,7 +409,13 @@ int cpufreq_table_validate_and_sort(struct cpufreq_policy *policy)

> >  	if (ret)

> >  		return ret;

> >  

> > -	return set_freq_table_sorted(policy);

> > +	ret = set_freq_table_sorted(policy);

> > +	if (ret)

> > +		return ret;

> > +

> > +	set_freq_table_efficiencies(policy);

> > +

> > +	return ret;

> >  }

> 

> Lets provide a single callback from the cpufreq core to do all

> settings related to efficient frequencies and call it from

> em_dev_register_perf_domain() ?


Sadly we cannot do it in em_dev_register_perf_domain(). This function is called
from the cpufreq driver init, when the table isn't available yet.

> 

> So we only do them for the cpufreq drivers that register themselves

> with EM.


Currently, only the EM would set those inefficient OPPs. But it also gives the
possibility for individual drivers to set them, even if they do not use the EM.

> 

> -- 

> viresh
Viresh Kumar June 15, 2021, 5:02 a.m. | #5
On 14-06-21, 14:35, Vincent Donnefort wrote:
> On Mon, Jun 14, 2021 at 12:58:35PM +0530, Viresh Kumar wrote:

> > On 04-06-21, 12:05, Vincent Donnefort wrote:

> > >  int cpufreq_table_validate_and_sort(struct cpufreq_policy *policy)

> > >  {

> > >  	int ret;

> > > @@ -362,7 +409,13 @@ int cpufreq_table_validate_and_sort(struct cpufreq_policy *policy)

> > >  	if (ret)

> > >  		return ret;

> > >  

> > > -	return set_freq_table_sorted(policy);

> > > +	ret = set_freq_table_sorted(policy);

> > > +	if (ret)

> > > +		return ret;

> > > +

> > > +	set_freq_table_efficiencies(policy);

> > > +

> > > +	return ret;

> > >  }

> > 

> > Lets provide a single callback from the cpufreq core to do all

> > settings related to efficient frequencies and call it from

> > em_dev_register_perf_domain() ?

> 

> Sadly we cannot do it in em_dev_register_perf_domain(). This function is called

> from the cpufreq driver init, when the table isn't available yet.


I looked at all the users of em_dev_register_perf_domain() and each
one of them have the freq table ready at that point of time.

> > 

> > So we only do them for the cpufreq drivers that register themselves

> > with EM.

> 

> Currently, only the EM would set those inefficient OPPs. But it also gives the

> possibility for individual drivers to set them, even if they do not use the EM.


This is exactly why I want those parts of the kernel to call a
specific API to get this done. This shouldn't be done automatically by
the cpufreq core.

-- 
viresh
Vincent Donnefort June 15, 2021, 8:44 a.m. | #6
On Tue, Jun 15, 2021 at 10:32:11AM +0530, Viresh Kumar wrote:
> On 14-06-21, 14:35, Vincent Donnefort wrote:

> > On Mon, Jun 14, 2021 at 12:58:35PM +0530, Viresh Kumar wrote:

> > > On 04-06-21, 12:05, Vincent Donnefort wrote:

> > > >  int cpufreq_table_validate_and_sort(struct cpufreq_policy *policy)

> > > >  {

> > > >  	int ret;

> > > > @@ -362,7 +409,13 @@ int cpufreq_table_validate_and_sort(struct cpufreq_policy *policy)

> > > >  	if (ret)

> > > >  		return ret;

> > > >  

> > > > -	return set_freq_table_sorted(policy);

> > > > +	ret = set_freq_table_sorted(policy);

> > > > +	if (ret)

> > > > +		return ret;

> > > > +

> > > > +	set_freq_table_efficiencies(policy);

> > > > +

> > > > +	return ret;

> > > >  }

> > > 

> > > Lets provide a single callback from the cpufreq core to do all

> > > settings related to efficient frequencies and call it from

> > > em_dev_register_perf_domain() ?

> > 

> > Sadly we cannot do it in em_dev_register_perf_domain(). This function is called

> > from the cpufreq driver init, when the table isn't available yet.

> 

> I looked at all the users of em_dev_register_perf_domain() and each

> one of them have the freq table ready at that point of time.


The cpufreq_policy is accessed through percpu data. I originally tried to get it
with cpufreq_cpu_get_raw(). But when we init the cpufreq driver (and by
extension when we call em_dev_register_perf_domain()), the percpu data isn't
updated yet.

I guess the only way at that moment would be to propagate the cpufreq_policy
pointer through the functions parameters, which is a bit cumbersome, especially
as the EM can be used with devfreq as well and that em_dev_register_perf_domain()
can be called from dev_pm_opp_of_register_em()

Would we be ok with that?

I could use em_data_callback as well ... but that doesn't change the fact some
registration is going through dev_pm_opp_of_register_em() which has no knowledge
about the cpufreq_policy. Doesn't look a better approach.

> 

> > > 

> > > So we only do them for the cpufreq drivers that register themselves

> > > with EM.

> > 

> > Currently, only the EM would set those inefficient OPPs. But it also gives the

> > possibility for individual drivers to set them, even if they do not use the EM.

> 

> This is exactly why I want those parts of the kernel to call a

> specific API to get this done. This shouldn't be done automatically by

> the cpufreq core.

> 

> -- 

> viresh
Viresh Kumar June 15, 2021, 10:17 a.m. | #7
On 15-06-21, 09:44, Vincent Donnefort wrote:
> The cpufreq_policy is accessed through percpu data. I originally tried to get it

> with cpufreq_cpu_get_raw(). But when we init the cpufreq driver (and by

> extension when we call em_dev_register_perf_domain()), the percpu data isn't

> updated yet.


Right.

> I guess the only way at that moment would be to propagate the cpufreq_policy

> pointer through the functions parameters, which is a bit cumbersome, especially

> as the EM can be used with devfreq as well and that em_dev_register_perf_domain()

> can be called from dev_pm_opp_of_register_em()

> 

> Would we be ok with that?


No.

You only talked about freq_table earlier and that's all I checked :)

> I could use em_data_callback as well ... but that doesn't change the fact some

> registration is going through dev_pm_opp_of_register_em() which has no knowledge

> about the cpufreq_policy. Doesn't look a better approach.


The point is that I don't want cpufreq to carry this for users, we
have EM today, tomorrow we may want to mark a frequency as inefficient
from somewhere else. The call need to initiate from EM core.

And this isn't a cpufreq only thing, but is going to be generic along
with other device types.

This is exactly why I asked you earlier to play with OPP core for
this. That is the central place for data for all such users.

If this information is present at the OPP table (somehow), then we can
just fix dev_pm_opp_init_cpufreq_table() to set this for cpufreq core
as well.

This is the sequence that is followed in cpufreq drivers today:

dev_pm_opp_of_cpumask_add_table();
dev_pm_opp_init_cpufreq_table();
dev_pm_opp_of_register_em();

What about changing this to:

dev_pm_opp_of_cpumask_add_table();

/* Mark OPPs are inefficient here */
dev_pm_opp_of_register_em();

/* This should automatically pick the right set */
dev_pm_opp_init_cpufreq_table();

Will this break anything ?

-- 
viresh
Vincent Donnefort June 15, 2021, 5:15 p.m. | #8
On Tue, Jun 15, 2021 at 03:47:06PM +0530, Viresh Kumar wrote:
> On 15-06-21, 09:44, Vincent Donnefort wrote:

> > The cpufreq_policy is accessed through percpu data. I originally tried to get it

> > with cpufreq_cpu_get_raw(). But when we init the cpufreq driver (and by

> > extension when we call em_dev_register_perf_domain()), the percpu data isn't

> > updated yet.

> 

> Right.

> 

> > I guess the only way at that moment would be to propagate the cpufreq_policy

> > pointer through the functions parameters, which is a bit cumbersome, especially

> > as the EM can be used with devfreq as well and that em_dev_register_perf_domain()

> > can be called from dev_pm_opp_of_register_em()

> > 

> > Would we be ok with that?

> 

> No.

> 

> You only talked about freq_table earlier and that's all I checked :)

> 

> > I could use em_data_callback as well ... but that doesn't change the fact some

> > registration is going through dev_pm_opp_of_register_em() which has no knowledge

> > about the cpufreq_policy. Doesn't look a better approach.

> 

> The point is that I don't want cpufreq to carry this for users, we

> have EM today, tomorrow we may want to mark a frequency as inefficient

> from somewhere else. The call need to initiate from EM core.


In the current version of this patchset, any driver can mark inefficiencies
without relying on the EM, just by adding the flag CPUFREQ_INEFFICIENT_FREQ in
cpufreq_frequency_table.

Populating cpufreq_frequency_table from the EM in cpufreq was just an attempt to
a less intrusive set of changes.

> 

> And this isn't a cpufreq only thing, but is going to be generic along

> with other device types.

> 

> This is exactly why I asked you earlier to play with OPP core for

> this. That is the central place for data for all such users.

> 

> If this information is present at the OPP table (somehow), then we can

> just fix dev_pm_opp_init_cpufreq_table() to set this for cpufreq core

> as well.

> 

> This is the sequence that is followed in cpufreq drivers today:

> 

> dev_pm_opp_of_cpumask_add_table();

> dev_pm_opp_init_cpufreq_table();

> dev_pm_opp_of_register_em();

> 

> What about changing this to:

> 

> dev_pm_opp_of_cpumask_add_table();

> 

> /* Mark OPPs are inefficient here */

> dev_pm_opp_of_register_em();

> 

> /* This should automatically pick the right set */

> dev_pm_opp_init_cpufreq_table();

> 

> Will this break anything ?


Probably not, but with this approach I'll have to modify all the cpufreq drivers
that are registering the EM, which I tried to avoid as much as possible so far.

But if we sum-up:

1. em_dev_register_perf_domain() find inefficiencies

2. dev_pm_opp_of_register_em() apply EM inefficiencies into the OPP structures

   Note: scmi-cpufreq would need special treatment, as it doesn't rely
   dev_pm_opp_of_register_em().

3. dev_pm_opp_init_cpufreq_table() marks the cpufreq table with the OPP
   inefficiencies 

Guess it would ease the adoption by other OPP clients. However this patchset
will clearly get bigger.

Would you agree with that?
Viresh Kumar June 16, 2021, 7:35 a.m. | #9
On 15-06-21, 18:15, Vincent Donnefort wrote:
> On Tue, Jun 15, 2021 at 03:47:06PM +0530, Viresh Kumar wrote:

> > The point is that I don't want cpufreq to carry this for users, we

> > have EM today, tomorrow we may want to mark a frequency as inefficient

> > from somewhere else. The call need to initiate from EM core.

> 

> In the current version of this patchset, any driver can mark inefficiencies

> without relying on the EM, just by adding the flag CPUFREQ_INEFFICIENT_FREQ in

> cpufreq_frequency_table.


Yeah, I wasn't really talking about cpufreq drivers but external
entities, like EM.

> Populating cpufreq_frequency_table from the EM in cpufreq was just an attempt to

> a less intrusive set of changes.

 
> > And this isn't a cpufreq only thing, but is going to be generic along

> > with other device types.

> > 

> > This is exactly why I asked you earlier to play with OPP core for

> > this. That is the central place for data for all such users.

> > 

> > If this information is present at the OPP table (somehow), then we can

> > just fix dev_pm_opp_init_cpufreq_table() to set this for cpufreq core

> > as well.

> > 

> > This is the sequence that is followed in cpufreq drivers today:

> > 

> > dev_pm_opp_of_cpumask_add_table();

> > dev_pm_opp_init_cpufreq_table();

> > dev_pm_opp_of_register_em();

> > 

> > What about changing this to:

> > 

> > dev_pm_opp_of_cpumask_add_table();

> > 

> > /* Mark OPPs are inefficient here */

> > dev_pm_opp_of_register_em();

> > 

> > /* This should automatically pick the right set */

> > dev_pm_opp_init_cpufreq_table();

> > 

> > Will this break anything ?

> 

> Probably not, but with this approach I'll have to modify all the cpufreq drivers

> that are registering the EM, which I tried to avoid as much as possible so far.


Hmm. You are right as well, but I just want to get the right API in
place which lives a longer life :)

> But if we sum-up:

> 

> 1. em_dev_register_perf_domain() find inefficiencies

> 

> 2. dev_pm_opp_of_register_em() apply EM inefficiencies into the OPP structures


I was looking to add a new API to the OPP core
(dev_pm_opp_mark_inefficient()) to mark an OPP inefficient. And then
get it called from em_create_perf_table().

But I now see that EM core rather has callbacks to call into and with
that I think you should rather add another callback
(.mark_inefficient()) in struct em_data_callback, to set inefficient
frequencies.

>    Note: scmi-cpufreq would need special treatment, as it doesn't rely

>    dev_pm_opp_of_register_em().


For both dev_pm_opp_of_register_em() and scmi case, you can then set
this callback to dev_pm_opp_mark_inefficient().

> 3. dev_pm_opp_init_cpufreq_table() marks the cpufreq table with the OPP

>    inefficiencies 


Yes.

> Guess it would ease the adoption by other OPP clients. However this patchset

> will clearly get bigger.

> 

> Would you agree with that?


Yes, it is fine.

-- 
viresh
Lukasz Luba June 16, 2021, 9:03 a.m. | #10
On 6/16/21 8:35 AM, Viresh Kumar wrote:
> On 15-06-21, 18:15, Vincent Donnefort wrote:

>> On Tue, Jun 15, 2021 at 03:47:06PM +0530, Viresh Kumar wrote:

>>> The point is that I don't want cpufreq to carry this for users, we

>>> have EM today, tomorrow we may want to mark a frequency as inefficient

>>> from somewhere else. The call need to initiate from EM core.

>>

>> In the current version of this patchset, any driver can mark inefficiencies

>> without relying on the EM, just by adding the flag CPUFREQ_INEFFICIENT_FREQ in

>> cpufreq_frequency_table.

> 

> Yeah, I wasn't really talking about cpufreq drivers but external

> entities, like EM.

> 

>> Populating cpufreq_frequency_table from the EM in cpufreq was just an attempt to

>> a less intrusive set of changes.

>   

>>> And this isn't a cpufreq only thing, but is going to be generic along

>>> with other device types.

>>>

>>> This is exactly why I asked you earlier to play with OPP core for

>>> this. That is the central place for data for all such users.

>>>

>>> If this information is present at the OPP table (somehow), then we can

>>> just fix dev_pm_opp_init_cpufreq_table() to set this for cpufreq core

>>> as well.

>>>

>>> This is the sequence that is followed in cpufreq drivers today:

>>>

>>> dev_pm_opp_of_cpumask_add_table();

>>> dev_pm_opp_init_cpufreq_table();

>>> dev_pm_opp_of_register_em();

>>>

>>> What about changing this to:

>>>

>>> dev_pm_opp_of_cpumask_add_table();

>>>

>>> /* Mark OPPs are inefficient here */

>>> dev_pm_opp_of_register_em();

>>>

>>> /* This should automatically pick the right set */

>>> dev_pm_opp_init_cpufreq_table();

>>>

>>> Will this break anything ?

>>

>> Probably not, but with this approach I'll have to modify all the cpufreq drivers

>> that are registering the EM, which I tried to avoid as much as possible so far.

> 

> Hmm. You are right as well, but I just want to get the right API in

> place which lives a longer life :)

> 

>> But if we sum-up:

>>

>> 1. em_dev_register_perf_domain() find inefficiencies

>>

>> 2. dev_pm_opp_of_register_em() apply EM inefficiencies into the OPP structures

> 

> I was looking to add a new API to the OPP core

> (dev_pm_opp_mark_inefficient()) to mark an OPP inefficient. And then

> get it called from em_create_perf_table().

> 

> But I now see that EM core rather has callbacks to call into and with


Exactly, that's what I was trying to stress.

> that I think you should rather add another callback

> (.mark_inefficient()) in struct em_data_callback, to set inefficient

> frequencies.


I disagree. That's why I prefer Vincent's approach in this patch set.
This proposal would cause more mess.

Vincent proposed a small and clean modification. This modification
can be done easily in this cpufreq place because we have EM in
device cpu struct.

Let's don't over-engineering. The inefficient information is only valid
for schedutil, thus IMHO it can live like this patch set made - in the
cpufreq table.

Compare the v1 (I still don't understand why it was blocked), this v3
and your proposal.
Viresh Kumar June 16, 2021, 9:31 a.m. | #11
On 16-06-21, 10:03, Lukasz Luba wrote:
> On 6/16/21 8:35 AM, Viresh Kumar wrote:

> > On 15-06-21, 18:15, Vincent Donnefort wrote:

> > > But if we sum-up:

> > > 

> > > 1. em_dev_register_perf_domain() find inefficiencies

> > > 

> > > 2. dev_pm_opp_of_register_em() apply EM inefficiencies into the OPP structures

> > 

> > I was looking to add a new API to the OPP core

> > (dev_pm_opp_mark_inefficient()) to mark an OPP inefficient. And then

> > get it called from em_create_perf_table().

> > 

> > But I now see that EM core rather has callbacks to call into and with

> 

> Exactly, that's what I was trying to stress.

> 

> > that I think you should rather add another callback

> > (.mark_inefficient()) in struct em_data_callback, to set inefficient

> > frequencies.

> 

> I disagree. That's why I prefer Vincent's approach in this patch set.

> This proposal would cause more mess.

> 

> Vincent proposed a small and clean modification. This modification

> can be done easily in this cpufreq place because we have EM in

> device cpu struct.


This may look clean to you, but not to me, sorry about that.

Clean is not lesser number of lines for me, but rather having the
right ownership of such things.

For example this patch:

https://lore.kernel.org/linux-pm/1622804761-126737-6-git-send-email-vincent.donnefort@arm.com/

tries to add EM stuff in cpufreq core. Cpufreq core doesn't care about
EM and it shouldn't. And this piece of code doesn't belong here.

Would you guys like to add a cpufreq specific call into the EM core? I
won't, that's not a place for cpufreq stuff. It is the EM core. I was
fine with not including OPP core into this, and I gave up that
argument earlier, but then we realized that the cpufreq core isn't
ready at the time we register with EM core.

Honestly, OPP core looks to be a better place holder for such stuff.
This is exactly the purpose of the OPP core. Moreover, we can apply
the same logic to devfreq or other devices later, with or without EM
core. Again, OPP core fits better.

The cpufreq core already has the relevant APIs in place to the OPP
core and this won't require a new API there.

> Let's don't over-engineering. The inefficient information is only valid

> for schedutil, thus IMHO it can live like this patch set made - in the

> cpufreq table.


For now, yes. There is no guarantee though that we won't have more in
future.

> Compare the v1 (I still don't understand why it was blocked),


IIRC, it required more work to be done in the hotpath, i.e. traversing
the freq list twice.

> this v3 and your proposal.


IMHO, adding such callbacks to the EM core, like .mark_efficient(),
will only make this easier to handle for all different frameworks, and
not otherwise. The code will look much cleaner everywhere..

-- 
viresh
Lukasz Luba June 16, 2021, 10:33 a.m. | #12
On 6/16/21 10:31 AM, Viresh Kumar wrote:
> On 16-06-21, 10:03, Lukasz Luba wrote:

>> On 6/16/21 8:35 AM, Viresh Kumar wrote:

>>> On 15-06-21, 18:15, Vincent Donnefort wrote:

>>>> But if we sum-up:

>>>>

>>>> 1. em_dev_register_perf_domain() find inefficiencies

>>>>

>>>> 2. dev_pm_opp_of_register_em() apply EM inefficiencies into the OPP structures

>>>

>>> I was looking to add a new API to the OPP core

>>> (dev_pm_opp_mark_inefficient()) to mark an OPP inefficient. And then

>>> get it called from em_create_perf_table().

>>>

>>> But I now see that EM core rather has callbacks to call into and with

>>

>> Exactly, that's what I was trying to stress.

>>

>>> that I think you should rather add another callback

>>> (.mark_inefficient()) in struct em_data_callback, to set inefficient

>>> frequencies.

>>

>> I disagree. That's why I prefer Vincent's approach in this patch set.

>> This proposal would cause more mess.

>>

>> Vincent proposed a small and clean modification. This modification

>> can be done easily in this cpufreq place because we have EM in

>> device cpu struct.

> 

> This may look clean to you, but not to me, sorry about that.

> 

> Clean is not lesser number of lines for me, but rather having the

> right ownership of such things.


Some developers do like patches which removes more lines then adds ;)

> 

> For example this patch:

> 

> https://lore.kernel.org/linux-pm/1622804761-126737-6-git-send-email-vincent.donnefort@arm.com/

> 

> tries to add EM stuff in cpufreq core. Cpufreq core doesn't care about

> EM and it shouldn't. And this piece of code doesn't belong here.

> 

> Would you guys like to add a cpufreq specific call into the EM core? I

> won't, that's not a place for cpufreq stuff. It is the EM core. I was

> fine with not including OPP core into this, and I gave up that

> argument earlier, but then we realized that the cpufreq core isn't

> ready at the time we register with EM core.

> 

> Honestly, OPP core looks to be a better place holder for such stuff.

> This is exactly the purpose of the OPP core. Moreover, we can apply

> the same logic to devfreq or other devices later, with or without EM

> core. Again, OPP core fits better.

> 

> The cpufreq core already has the relevant APIs in place to the OPP

> core and this won't require a new API there.


I don't see an API function in the OPP framework or a field in the
OPP struct which gives information that this freq is inefficient.
Thus, it will require new API changes: cpufreq --> OPP.

> 

>> Let's don't over-engineering. The inefficient information is only valid

>> for schedutil, thus IMHO it can live like this patch set made - in the

>> cpufreq table.

> 

> For now, yes. There is no guarantee though that we won't have more in

> future.


And there won't be in near future. We don't build massive interfaces
because there *might* be potential *oneday*.
Even for this idea, it was a massive work to do the research and prove
it that this is worth to put mainline so all vendors will get it.

The GPUs are slightly different beasts and they have different
workloads (not util + SchedUtil driven AFAIK).

> 

>> Compare the v1 (I still don't understand why it was blocked),

> 

> IIRC, it required more work to be done in the hotpath, i.e. traversing

> the freq list twice.


In v1 there was LUT. IMHO we have too easily gave and said:
'Remove the Look-up table as the numbers weren't strong enough to 
justify the implementation.'
But it had other benefits, which are now pointed.

There was different issue, which we could fix now.
With this patch set [1] EAS could have the freq_max limit, which
SchedUtil has in the hotpath.

What could be the modified v1 [2]:
- LUT which holds two IDs: efficient, inefficient, take one
   according to the clamp f_max
- add new argument 'policy->max' to em_pd_get_efficient_freq()

freq = em_pd_get_efficient_freq(em_cpu_get(policy->cpu), freq, policy->max);

The problem was that EAS couldn't know the clamp freq_max,
which shouldn't be the blocker now.

> 

>> this v3 and your proposal.

> 

> IMHO, adding such callbacks to the EM core, like .mark_efficient(),

> will only make this easier to handle for all different frameworks, and

> not otherwise. The code will look much cleaner everywhere..

> 


What about coming back to the slightly modified v1 idea?
That was really self-contained modification for this
inefficient opps heuristic.


[1] https://lore.kernel.org/lkml/20210614185815.15136-1-lukasz.luba@arm.com/
[2] 
https://lore.kernel.org/lkml/1617901829-381963-2-git-send-email-vincent.donnefort@arm.com/
Viresh Kumar June 16, 2021, 10:53 a.m. | #13
On 16-06-21, 11:33, Lukasz Luba wrote:
> On 6/16/21 10:31 AM, Viresh Kumar wrote:

> > On 16-06-21, 10:03, Lukasz Luba wrote:

> > Clean is not lesser number of lines for me, but rather having the

> > right ownership of such things.

> 

> Some developers do like patches which removes more lines then adds ;)


:)

> > 

> > For example this patch:

> > 

> > https://lore.kernel.org/linux-pm/1622804761-126737-6-git-send-email-vincent.donnefort@arm.com/

> > 

> > tries to add EM stuff in cpufreq core. Cpufreq core doesn't care about

> > EM and it shouldn't. And this piece of code doesn't belong here.

> > 

> > Would you guys like to add a cpufreq specific call into the EM core? I

> > won't, that's not a place for cpufreq stuff. It is the EM core. I was

> > fine with not including OPP core into this, and I gave up that

> > argument earlier, but then we realized that the cpufreq core isn't

> > ready at the time we register with EM core.

> > 

> > Honestly, OPP core looks to be a better place holder for such stuff.

> > This is exactly the purpose of the OPP core. Moreover, we can apply

> > the same logic to devfreq or other devices later, with or without EM

> > core. Again, OPP core fits better.

> > 

> > The cpufreq core already has the relevant APIs in place to the OPP

> > core and this won't require a new API there.

> 

> I don't see an API function in the OPP framework or a field in the

> OPP struct which gives information that this freq is inefficient.

> Thus, it will require new API changes: cpufreq --> OPP.


dev_pm_opp_init_cpufreq_table() is all we need here, we just need to
change it to update one more field in the cpufreq table's entries.

> > 

> > > Let's don't over-engineering. The inefficient information is only valid

> > > for schedutil, thus IMHO it can live like this patch set made - in the

> > > cpufreq table.

> > 

> > For now, yes. There is no guarantee though that we won't have more in

> > future.

> 

> And there won't be in near future. We don't build massive interfaces

> because there *might* be potential *oneday*.


Yes, true.

> Even for this idea, it was a massive work to do the research and prove

> it that this is worth to put mainline so all vendors will get it.

> 

> The GPUs are slightly different beasts and they have different

> workloads (not util + SchedUtil driven AFAIK).


Right, but even if there is a single user for this, I think getting
this through the right layers is a more cleaner solution.

> In v1 there was LUT.


Oops, yes, I started looking from V2 and not V1.

> IMHO we have too easily gave and said:

> 'Remove the Look-up table as the numbers weren't strong enough to justify

> the implementation.'

> But it had other benefits, which are now pointed.

> 

> There was different issue, which we could fix now.

> With this patch set [1] EAS could have the freq_max limit, which

> SchedUtil has in the hotpath.

> 

> What could be the modified v1 [2]:

> - LUT which holds two IDs: efficient, inefficient, take one

>   according to the clamp f_max

> - add new argument 'policy->max' to em_pd_get_efficient_freq()

> 

> freq = em_pd_get_efficient_freq(em_cpu_get(policy->cpu), freq, policy->max);

> 

> The problem was that EAS couldn't know the clamp freq_max,

> which shouldn't be the blocker now.


If you can do that without adding any EM specific stuff in the cpufreq
core, I will mostly be fine.

But honestly speaking, creating more data structures to keep related
information doesn't scale well.

We already have so many tables for keeping freq/voltage pairs, OPP,
cpufreq, EM. You tried to add one more in EM I think V1, not sure.

It is always better to consolidate and we almost reached to a point
where that could have been done very easily. I understand that you
didn't want to touch so many different parts, but anyway..
 
> > > this v3 and your proposal.

> > 

> > IMHO, adding such callbacks to the EM core, like .mark_efficient(),

> > will only make this easier to handle for all different frameworks, and

> > not otherwise. The code will look much cleaner everywhere..

> > 

> 

> What about coming back to the slightly modified v1 idea?

> That was really self-contained modification for this

> inefficient opps heuristic.


I am not sure if I really understand what that would be, but again
adding another table is going to create more problems then it should.

Anyway, that's my view, which can be wrong as well.

Rafael: You have any suggestions here ?

-- 
viresh
Lukasz Luba June 16, 2021, 12:45 p.m. | #14
On 6/16/21 11:53 AM, Viresh Kumar wrote:
> On 16-06-21, 11:33, Lukasz Luba wrote:

>> On 6/16/21 10:31 AM, Viresh Kumar wrote:

>>> On 16-06-21, 10:03, Lukasz Luba wrote:

>>> Clean is not lesser number of lines for me, but rather having the

>>> right ownership of such things.

>>

>> Some developers do like patches which removes more lines then adds ;)

> 

> :)

> 


[snip]

>>

>> What could be the modified v1 [2]:

>> - LUT which holds two IDs: efficient, inefficient, take one

>>    according to the clamp f_max

>> - add new argument 'policy->max' to em_pd_get_efficient_freq()

>>

>> freq = em_pd_get_efficient_freq(em_cpu_get(policy->cpu), freq, policy->max);

>>

>> The problem was that EAS couldn't know the clamp freq_max,

>> which shouldn't be the blocker now.

> 

> If you can do that without adding any EM specific stuff in the cpufreq

> core, I will mostly be fine.

> 

> But honestly speaking, creating more data structures to keep related

> information doesn't scale well.

> 

> We already have so many tables for keeping freq/voltage pairs, OPP,

> cpufreq, EM. You tried to add one more in EM I think V1, not sure.

> 

> It is always better to consolidate and we almost reached to a point

> where that could have been done very easily. I understand that you

> didn't want to touch so many different parts, but anyway..


Yes, I don't want to touch some many generic parts because they
are intended to be generic. This heuristic is only for EM platforms,
which are arm, arm64 battery powered (not servers or other).
Thus, I wanted to keep it locally. The cost of EM extra structures
(the LUT) will only be for platforms for which EM discovers that
they have inefficient performance levels.
The code would even not be compiled in for x86, ppc, etc, in hot path.

>   

>>>> this v3 and your proposal.

>>>

>>> IMHO, adding such callbacks to the EM core, like .mark_efficient(),

>>> will only make this easier to handle for all different frameworks, and

>>> not otherwise. The code will look much cleaner everywhere..

>>>

>>

>> What about coming back to the slightly modified v1 idea?

>> That was really self-contained modification for this

>> inefficient opps heuristic.

> 

> I am not sure if I really understand what that would be, but again

> adding another table is going to create more problems then it should.


IMHO your proposals are more invasive for the generic code, while
not always being even used. Plenty of architectures don't even set EM,
even arm64 for servers doesn't use it. You and Rafael would have to
maintain these modifications in generic code. It might be hard to remove
it. While I recommend to keep this heuristic feature inside the EM and
we will maintain it. If we decide after a few years that new arm64
platforms use some smarter FW for performance level, we might just
disable this heuristic.

> 

> Anyway, that's my view, which can be wrong as well.

> 

> Rafael: You have any suggestions here ?

> 


I would also like to hear Rafael's opinion :)
Quentin Perret June 22, 2021, 9:01 a.m. | #15
On Tuesday 15 Jun 2021 at 15:47:06 (+0530), Viresh Kumar wrote:
> On 15-06-21, 09:44, Vincent Donnefort wrote:

> > The cpufreq_policy is accessed through percpu data. I originally tried to get it

> > with cpufreq_cpu_get_raw(). But when we init the cpufreq driver (and by

> > extension when we call em_dev_register_perf_domain()), the percpu data isn't

> > updated yet.

> 

> Right.

> 

> > I guess the only way at that moment would be to propagate the cpufreq_policy

> > pointer through the functions parameters, which is a bit cumbersome, especially

> > as the EM can be used with devfreq as well and that em_dev_register_perf_domain()

> > can be called from dev_pm_opp_of_register_em()

> > 

> > Would we be ok with that?

> 

> No.

> 

> You only talked about freq_table earlier and that's all I checked :)

> 

> > I could use em_data_callback as well ... but that doesn't change the fact some

> > registration is going through dev_pm_opp_of_register_em() which has no knowledge

> > about the cpufreq_policy. Doesn't look a better approach.

> 

> The point is that I don't want cpufreq to carry this for users, we

> have EM today, tomorrow we may want to mark a frequency as inefficient

> from somewhere else. The call need to initiate from EM core.

> 

> And this isn't a cpufreq only thing, but is going to be generic along

> with other device types.

>

> This is exactly why I asked you earlier to play with OPP core for

> this. That is the central place for data for all such users.


The thing is, the very reason for the existence of PM_EM in the first
place is to NOT depend on PM_OPP which was deemed too Arm-specific. So
let's not create those dependencies now please.

> If this information is present at the OPP table (somehow), then we can

> just fix dev_pm_opp_init_cpufreq_table() to set this for cpufreq core

> as well.


Honnestly, I don't think PM_OPP should have anything to do with this,
for the reason mentionned above.

I don't really see the problem with cpufreq core querying the EM data
directly, we already have core subsystems doing that (the scheduler) so
why would that be a problem? If the only concern is for non-CPU devices,
all we'd need is a matching feature in devfreq core and everything
should be good no?

Thanks,
Quentin
Viresh Kumar June 22, 2021, 9:25 a.m. | #16
On 22-06-21, 09:01, Quentin Perret wrote:
> On Tuesday 15 Jun 2021 at 15:47:06 (+0530), Viresh Kumar wrote:

> > The point is that I don't want cpufreq to carry this for users, we

> > have EM today, tomorrow we may want to mark a frequency as inefficient

> > from somewhere else. The call need to initiate from EM core.

> > 

> > And this isn't a cpufreq only thing, but is going to be generic along

> > with other device types.

> >

> > This is exactly why I asked you earlier to play with OPP core for

> > this. That is the central place for data for all such users.

> 

> The thing is, the very reason for the existence of PM_EM in the first

> place is to NOT depend on PM_OPP which was deemed too Arm-specific. So

> let's not create those dependencies now please.


I am not asking to create one here.

What I am saying is that users of EM, which eventually call
em_dev_register_perf_domain() have the necessary details about about
the performance states (like voltage/freq pairs, or power) and the EM
core should talk to those users directly.

The struct em_data_callback is already there for this reason so we can
keep EM core independent of its users. I suggested to add another
callback there to mark a frequency inefficient, that's all.

In the case of ARM platforms, that would automatically mean OPP core,
and for others it can be anything that has this knowledge.

> > If this information is present at the OPP table (somehow), then we can

> > just fix dev_pm_opp_init_cpufreq_table() to set this for cpufreq core

> > as well.

> 

> Honnestly, I don't think PM_OPP should have anything to do with this,

> for the reason mentionned above.

> 

> I don't really see the problem with cpufreq core querying the EM data

> directly, we already have core subsystems doing that (the scheduler) so

> why would that be a problem?


If EM core is going to be the only entity ever which can provide this
information, then maybe we can get add some core support for this. But
if it is just another driver/layer which can add this information,
then it is always better to provide APIs which such users can call.
And so I was looking for cpufreq to provide an API, which can be
called from such users.

The problem here happened because we registered with the EM core from
policy's ->init() and we couldn't get back to cpufreq core from there
as the policy is only half initialized.

> If the only concern is for non-CPU devices,

> all we'd need is a matching feature in devfreq core and everything

> should be good no?


-- 
viresh
Lukasz Luba July 2, 2021, 2:21 p.m. | #17
Hi Rafael,

This is a gentle ping.
You have probably not seen this discussion thread.

On 6/16/21 1:45 PM, Lukasz Luba wrote:
> 

> 

> On 6/16/21 11:53 AM, Viresh Kumar wrote:

>> On 16-06-21, 11:33, Lukasz Luba wrote:

>>> On 6/16/21 10:31 AM, Viresh Kumar wrote:

>>>> On 16-06-21, 10:03, Lukasz Luba wrote:

>>>> Clean is not lesser number of lines for me, but rather having the

>>>> right ownership of such things.

>>>

>>> Some developers do like patches which removes more lines then adds ;)

>>

>> :)

>>

> 

> [snip]

> 

>>>

>>> What could be the modified v1 [2]:

>>> - LUT which holds two IDs: efficient, inefficient, take one

>>>    according to the clamp f_max

>>> - add new argument 'policy->max' to em_pd_get_efficient_freq()

>>>

>>> freq = em_pd_get_efficient_freq(em_cpu_get(policy->cpu), freq, 

>>> policy->max);

>>>

>>> The problem was that EAS couldn't know the clamp freq_max,

>>> which shouldn't be the blocker now.

>>

>> If you can do that without adding any EM specific stuff in the cpufreq

>> core, I will mostly be fine.

>>

>> But honestly speaking, creating more data structures to keep related

>> information doesn't scale well.

>>

>> We already have so many tables for keeping freq/voltage pairs, OPP,

>> cpufreq, EM. You tried to add one more in EM I think V1, not sure.

>>

>> It is always better to consolidate and we almost reached to a point

>> where that could have been done very easily. I understand that you

>> didn't want to touch so many different parts, but anyway..

> 

> Yes, I don't want to touch some many generic parts because they

> are intended to be generic. This heuristic is only for EM platforms,

> which are arm, arm64 battery powered (not servers or other).

> Thus, I wanted to keep it locally. The cost of EM extra structures

> (the LUT) will only be for platforms for which EM discovers that

> they have inefficient performance levels.

> The code would even not be compiled in for x86, ppc, etc, in hot path.

> 

>>>>> this v3 and your proposal.

>>>>

>>>> IMHO, adding such callbacks to the EM core, like .mark_efficient(),

>>>> will only make this easier to handle for all different frameworks, and

>>>> not otherwise. The code will look much cleaner everywhere..

>>>>

>>>

>>> What about coming back to the slightly modified v1 idea?

>>> That was really self-contained modification for this

>>> inefficient opps heuristic.

>>

>> I am not sure if I really understand what that would be, but again

>> adding another table is going to create more problems then it should.

> 

> IMHO your proposals are more invasive for the generic code, while

> not always being even used. Plenty of architectures don't even set EM,

> even arm64 for servers doesn't use it. You and Rafael would have to

> maintain these modifications in generic code. It might be hard to remove

> it. While I recommend to keep this heuristic feature inside the EM and

> we will maintain it. If we decide after a few years that new arm64

> platforms use some smarter FW for performance level, we might just

> disable this heuristic.

> 

>>

>> Anyway, that's my view, which can be wrong as well.

>>

>> Rafael: You have any suggestions here ?

>>

> 

> I would also like to hear Rafael's opinion :)


It would be great if you could have a look.
I will be grateful for your time spend on it and opinion.

Regards,
Lukasz
Rafael J. Wysocki July 2, 2021, 3:46 p.m. | #18
On Fri, Jul 2, 2021 at 4:21 PM Lukasz Luba <lukasz.luba@arm.com> wrote:
>

> Hi Rafael,

>

> This is a gentle ping.

> You have probably not seen this discussion thread.


I have looked at it briefly for a few times, but not too much into detail.

> On 6/16/21 1:45 PM, Lukasz Luba wrote:

> >

> >

> > On 6/16/21 11:53 AM, Viresh Kumar wrote:

> >> On 16-06-21, 11:33, Lukasz Luba wrote:

> >>> On 6/16/21 10:31 AM, Viresh Kumar wrote:

> >>>> On 16-06-21, 10:03, Lukasz Luba wrote:

> >>>> Clean is not lesser number of lines for me, but rather having the

> >>>> right ownership of such things.

> >>>

> >>> Some developers do like patches which removes more lines then adds ;)

> >>

> >> :)

> >>

> >

> > [snip]

> >

> >>>

> >>> What could be the modified v1 [2]:

> >>> - LUT which holds two IDs: efficient, inefficient, take one

> >>>    according to the clamp f_max

> >>> - add new argument 'policy->max' to em_pd_get_efficient_freq()

> >>>

> >>> freq = em_pd_get_efficient_freq(em_cpu_get(policy->cpu), freq,

> >>> policy->max);

> >>>

> >>> The problem was that EAS couldn't know the clamp freq_max,

> >>> which shouldn't be the blocker now.

> >>

> >> If you can do that without adding any EM specific stuff in the cpufreq

> >> core, I will mostly be fine.

> >>

> >> But honestly speaking, creating more data structures to keep related

> >> information doesn't scale well.

> >>

> >> We already have so many tables for keeping freq/voltage pairs, OPP,

> >> cpufreq, EM. You tried to add one more in EM I think V1, not sure.

> >>

> >> It is always better to consolidate and we almost reached to a point

> >> where that could have been done very easily. I understand that you

> >> didn't want to touch so many different parts, but anyway..

> >

> > Yes, I don't want to touch some many generic parts because they

> > are intended to be generic. This heuristic is only for EM platforms,

> > which are arm, arm64 battery powered (not servers or other).

> > Thus, I wanted to keep it locally. The cost of EM extra structures

> > (the LUT) will only be for platforms for which EM discovers that

> > they have inefficient performance levels.

> > The code would even not be compiled in for x86, ppc, etc, in hot path.

> >

> >>>>> this v3 and your proposal.

> >>>>

> >>>> IMHO, adding such callbacks to the EM core, like .mark_efficient(),

> >>>> will only make this easier to handle for all different frameworks, and

> >>>> not otherwise. The code will look much cleaner everywhere..

> >>>>

> >>>

> >>> What about coming back to the slightly modified v1 idea?

> >>> That was really self-contained modification for this

> >>> inefficient opps heuristic.

> >>

> >> I am not sure if I really understand what that would be, but again

> >> adding another table is going to create more problems then it should.

> >

> > IMHO your proposals are more invasive for the generic code, while

> > not always being even used. Plenty of architectures don't even set EM,

> > even arm64 for servers doesn't use it. You and Rafael would have to

> > maintain these modifications in generic code. It might be hard to remove

> > it. While I recommend to keep this heuristic feature inside the EM and

> > we will maintain it. If we decide after a few years that new arm64

> > platforms use some smarter FW for performance level, we might just

> > disable this heuristic.

> >

> >>

> >> Anyway, that's my view, which can be wrong as well.

> >>

> >> Rafael: You have any suggestions here ?

> >>

> >

> > I would also like to hear Rafael's opinion :)

>

> It would be great if you could have a look.

> I will be grateful for your time spend on it and opinion.


First of all, IMO checking whether or not a given frequency is
"efficient" doesn't belong to cpufreq governors.  The governor knows
what the min and max supported frequencies are and selects one from
that range and note that it doesn't even check whether or not the
selected frequency is present in the frequency table.  That part
belongs to the driver or the general frequency table handling in the
cpufreq core.

So the governor doesn't care and it shouldn't care IMO.

Besides, in the cpufreq_driver_fast_switch() case the driver's
->fast_switch() callback is entirely responsible for applying the
selected frequency, so I'm not sure how this "efficient" thing is
going to work then?

Anyway, since we are talking about frequency tables, that would be the
__cpufreq_driver_target() case only and specifically in the
__target_index() case only (note how far away this is from the
governor).

Now, you may think about modifying cpufreq_frequency_table_target() to
skip "inefficient" frequencies, but then the question is why those
frequencies need to be there in the frequency table in the first
place?
Rafael J. Wysocki July 2, 2021, 4:04 p.m. | #19
On Fri, Jul 2, 2021 at 5:46 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
>

> On Fri, Jul 2, 2021 at 4:21 PM Lukasz Luba <lukasz.luba@arm.com> wrote:

> >

> > Hi Rafael,

> >

> > This is a gentle ping.

> > You have probably not seen this discussion thread.

>

> I have looked at it briefly for a few times, but not too much into detail.

>

> > On 6/16/21 1:45 PM, Lukasz Luba wrote:

> > >

> > >

> > > On 6/16/21 11:53 AM, Viresh Kumar wrote:

> > >> On 16-06-21, 11:33, Lukasz Luba wrote:

> > >>> On 6/16/21 10:31 AM, Viresh Kumar wrote:

> > >>>> On 16-06-21, 10:03, Lukasz Luba wrote:

> > >>>> Clean is not lesser number of lines for me, but rather having the

> > >>>> right ownership of such things.

> > >>>

> > >>> Some developers do like patches which removes more lines then adds ;)

> > >>

> > >> :)

> > >>

> > >

> > > [snip]

> > >

> > >>>

> > >>> What could be the modified v1 [2]:

> > >>> - LUT which holds two IDs: efficient, inefficient, take one

> > >>>    according to the clamp f_max

> > >>> - add new argument 'policy->max' to em_pd_get_efficient_freq()

> > >>>

> > >>> freq = em_pd_get_efficient_freq(em_cpu_get(policy->cpu), freq,

> > >>> policy->max);

> > >>>

> > >>> The problem was that EAS couldn't know the clamp freq_max,

> > >>> which shouldn't be the blocker now.

> > >>

> > >> If you can do that without adding any EM specific stuff in the cpufreq

> > >> core, I will mostly be fine.

> > >>

> > >> But honestly speaking, creating more data structures to keep related

> > >> information doesn't scale well.

> > >>

> > >> We already have so many tables for keeping freq/voltage pairs, OPP,

> > >> cpufreq, EM. You tried to add one more in EM I think V1, not sure.

> > >>

> > >> It is always better to consolidate and we almost reached to a point

> > >> where that could have been done very easily. I understand that you

> > >> didn't want to touch so many different parts, but anyway..

> > >

> > > Yes, I don't want to touch some many generic parts because they

> > > are intended to be generic. This heuristic is only for EM platforms,

> > > which are arm, arm64 battery powered (not servers or other).

> > > Thus, I wanted to keep it locally. The cost of EM extra structures

> > > (the LUT) will only be for platforms for which EM discovers that

> > > they have inefficient performance levels.

> > > The code would even not be compiled in for x86, ppc, etc, in hot path.

> > >

> > >>>>> this v3 and your proposal.

> > >>>>

> > >>>> IMHO, adding such callbacks to the EM core, like .mark_efficient(),

> > >>>> will only make this easier to handle for all different frameworks, and

> > >>>> not otherwise. The code will look much cleaner everywhere..

> > >>>>

> > >>>

> > >>> What about coming back to the slightly modified v1 idea?

> > >>> That was really self-contained modification for this

> > >>> inefficient opps heuristic.

> > >>

> > >> I am not sure if I really understand what that would be, but again

> > >> adding another table is going to create more problems then it should.

> > >

> > > IMHO your proposals are more invasive for the generic code, while

> > > not always being even used. Plenty of architectures don't even set EM,

> > > even arm64 for servers doesn't use it. You and Rafael would have to

> > > maintain these modifications in generic code. It might be hard to remove

> > > it. While I recommend to keep this heuristic feature inside the EM and

> > > we will maintain it. If we decide after a few years that new arm64

> > > platforms use some smarter FW for performance level, we might just

> > > disable this heuristic.

> > >

> > >>

> > >> Anyway, that's my view, which can be wrong as well.

> > >>

> > >> Rafael: You have any suggestions here ?

> > >>

> > >

> > > I would also like to hear Rafael's opinion :)

> >

> > It would be great if you could have a look.

> > I will be grateful for your time spend on it and opinion.

>

> First of all, IMO checking whether or not a given frequency is

> "efficient" doesn't belong to cpufreq governors.  The governor knows

> what the min and max supported frequencies are and selects one from

> that range and note that it doesn't even check whether or not the

> selected frequency is present in the frequency table.  That part

> belongs to the driver or the general frequency table handling in the

> cpufreq core.

>

> So the governor doesn't care and it shouldn't care IMO.

>

> Besides, in the cpufreq_driver_fast_switch() case the driver's

> ->fast_switch() callback is entirely responsible for applying the

> selected frequency, so I'm not sure how this "efficient" thing is

> going to work then?

>

> Anyway, since we are talking about frequency tables, that would be the

> __cpufreq_driver_target() case only and specifically in the

> __target_index() case only (note how far away this is from the

> governor).

>

> Now, you may think about modifying cpufreq_frequency_table_target() to

> skip "inefficient" frequencies, but then the question is why those

> frequencies need to be there in the frequency table in the first

> place?


I'm guessing that the problem is that cpufreq_cooling works by using
freq_qos_update_request() to update the max frequency limit and if
that is in effect you'd rather use the inefficient frequencies,
whereas when the governor selects an inefficient frequency  below the
policy limit, you'd rather use a higher-but-efficient frequency
instead (within the policy limit).

Am I guessing correctly?
Lukasz Luba July 2, 2021, 4:08 p.m. | #20
On 7/2/21 5:04 PM, Rafael J. Wysocki wrote:
> On Fri, Jul 2, 2021 at 5:46 PM Rafael J. Wysocki <rafael@kernel.org> wrote:

>>

>> On Fri, Jul 2, 2021 at 4:21 PM Lukasz Luba <lukasz.luba@arm.com> wrote:

>>>

>>> Hi Rafael,

>>>

>>> This is a gentle ping.

>>> You have probably not seen this discussion thread.

>>

>> I have looked at it briefly for a few times, but not too much into detail.

>>

>>> On 6/16/21 1:45 PM, Lukasz Luba wrote:

>>>>

>>>>

>>>> On 6/16/21 11:53 AM, Viresh Kumar wrote:

>>>>> On 16-06-21, 11:33, Lukasz Luba wrote:

>>>>>> On 6/16/21 10:31 AM, Viresh Kumar wrote:

>>>>>>> On 16-06-21, 10:03, Lukasz Luba wrote:

>>>>>>> Clean is not lesser number of lines for me, but rather having the

>>>>>>> right ownership of such things.

>>>>>>

>>>>>> Some developers do like patches which removes more lines then adds ;)

>>>>>

>>>>> :)

>>>>>

>>>>

>>>> [snip]

>>>>

>>>>>>

>>>>>> What could be the modified v1 [2]:

>>>>>> - LUT which holds two IDs: efficient, inefficient, take one

>>>>>>     according to the clamp f_max

>>>>>> - add new argument 'policy->max' to em_pd_get_efficient_freq()

>>>>>>

>>>>>> freq = em_pd_get_efficient_freq(em_cpu_get(policy->cpu), freq,

>>>>>> policy->max);

>>>>>>

>>>>>> The problem was that EAS couldn't know the clamp freq_max,

>>>>>> which shouldn't be the blocker now.

>>>>>

>>>>> If you can do that without adding any EM specific stuff in the cpufreq

>>>>> core, I will mostly be fine.

>>>>>

>>>>> But honestly speaking, creating more data structures to keep related

>>>>> information doesn't scale well.

>>>>>

>>>>> We already have so many tables for keeping freq/voltage pairs, OPP,

>>>>> cpufreq, EM. You tried to add one more in EM I think V1, not sure.

>>>>>

>>>>> It is always better to consolidate and we almost reached to a point

>>>>> where that could have been done very easily. I understand that you

>>>>> didn't want to touch so many different parts, but anyway..

>>>>

>>>> Yes, I don't want to touch some many generic parts because they

>>>> are intended to be generic. This heuristic is only for EM platforms,

>>>> which are arm, arm64 battery powered (not servers or other).

>>>> Thus, I wanted to keep it locally. The cost of EM extra structures

>>>> (the LUT) will only be for platforms for which EM discovers that

>>>> they have inefficient performance levels.

>>>> The code would even not be compiled in for x86, ppc, etc, in hot path.

>>>>

>>>>>>>> this v3 and your proposal.

>>>>>>>

>>>>>>> IMHO, adding such callbacks to the EM core, like .mark_efficient(),

>>>>>>> will only make this easier to handle for all different frameworks, and

>>>>>>> not otherwise. The code will look much cleaner everywhere..

>>>>>>>

>>>>>>

>>>>>> What about coming back to the slightly modified v1 idea?

>>>>>> That was really self-contained modification for this

>>>>>> inefficient opps heuristic.

>>>>>

>>>>> I am not sure if I really understand what that would be, but again

>>>>> adding another table is going to create more problems then it should.

>>>>

>>>> IMHO your proposals are more invasive for the generic code, while

>>>> not always being even used. Plenty of architectures don't even set EM,

>>>> even arm64 for servers doesn't use it. You and Rafael would have to

>>>> maintain these modifications in generic code. It might be hard to remove

>>>> it. While I recommend to keep this heuristic feature inside the EM and

>>>> we will maintain it. If we decide after a few years that new arm64

>>>> platforms use some smarter FW for performance level, we might just

>>>> disable this heuristic.

>>>>

>>>>>

>>>>> Anyway, that's my view, which can be wrong as well.

>>>>>

>>>>> Rafael: You have any suggestions here ?

>>>>>

>>>>

>>>> I would also like to hear Rafael's opinion :)

>>>

>>> It would be great if you could have a look.

>>> I will be grateful for your time spend on it and opinion.

>>

>> First of all, IMO checking whether or not a given frequency is

>> "efficient" doesn't belong to cpufreq governors.  The governor knows

>> what the min and max supported frequencies are and selects one from

>> that range and note that it doesn't even check whether or not the

>> selected frequency is present in the frequency table.  That part

>> belongs to the driver or the general frequency table handling in the

>> cpufreq core.

>>

>> So the governor doesn't care and it shouldn't care IMO.

>>

>> Besides, in the cpufreq_driver_fast_switch() case the driver's

>> ->fast_switch() callback is entirely responsible for applying the

>> selected frequency, so I'm not sure how this "efficient" thing is

>> going to work then?

>>

>> Anyway, since we are talking about frequency tables, that would be the

>> __cpufreq_driver_target() case only and specifically in the

>> __target_index() case only (note how far away this is from the

>> governor).

>>

>> Now, you may think about modifying cpufreq_frequency_table_target() to

>> skip "inefficient" frequencies, but then the question is why those

>> frequencies need to be there in the frequency table in the first

>> place?

> 

> I'm guessing that the problem is that cpufreq_cooling works by using

> freq_qos_update_request() to update the max frequency limit and if

> that is in effect you'd rather use the inefficient frequencies,

> whereas when the governor selects an inefficient frequency  below the

> policy limit, you'd rather use a higher-but-efficient frequency

> instead (within the policy limit).

> 

> Am I guessing correctly?

> 


Yes, correct. Thermal would use all (efficient + inefficient), but
we in cpufreq governor would like to pick if possible the efficient
one (below the thermal limit).
Vincent Donnefort July 2, 2021, 4:13 p.m. | #21
[...]

Hi Rafael,

> >

> > It would be great if you could have a look.

> > I will be grateful for your time spend on it and opinion.

> 

> First of all, IMO checking whether or not a given frequency is

> "efficient" doesn't belong to cpufreq governors.  The governor knows

> what the min and max supported frequencies are and selects one from

> that range and note that it doesn't even check whether or not the

> selected frequency is present in the frequency table.  That part

> belongs to the driver or the general frequency table handling in the

> cpufreq core.

> 

> So the governor doesn't care and it shouldn't care IMO.


A governor such as userspace should probably be able select any frequency
with no regard to efficiency.

> 

> Besides, in the cpufreq_driver_fast_switch() case the driver's

> ->fast_switch() callback is entirely responsible for applying the

> selected frequency, so I'm not sure how this "efficient" thing is

> going to work then?


This shouldn't be a problem if a governor that leverages inefficient OPPs
resolves an inefficient one to an efficient one. The target_freq would simply
point to a frequency known as efficient.

> Anyway, since we are talking about frequency tables, that would be the

> __cpufreq_driver_target() case only and specifically in the

> __target_index() case only (note how far away this is from the

> governor).

> 

> Now, you may think about modifying cpufreq_frequency_table_target() to

> skip "inefficient" frequencies, but then the question is why those

> frequencies need to be there in the frequency table in the first

> place?


Cpufreq_cooling needs those frequencies, even inefficients, also there's
probably no reason to hide them from the userspace.

-- 
Vincent
Rafael J. Wysocki July 2, 2021, 5:38 p.m. | #22
On Fri, Jul 2, 2021 at 6:14 PM Vincent Donnefort
<vincent.donnefort@arm.com> wrote:
>

> [...]

>

> Hi Rafael,

>

> > >

> > > It would be great if you could have a look.

> > > I will be grateful for your time spend on it and opinion.

> >

> > First of all, IMO checking whether or not a given frequency is

> > "efficient" doesn't belong to cpufreq governors.  The governor knows

> > what the min and max supported frequencies are and selects one from

> > that range and note that it doesn't even check whether or not the

> > selected frequency is present in the frequency table.  That part

> > belongs to the driver or the general frequency table handling in the

> > cpufreq core.

> >

> > So the governor doesn't care and it shouldn't care IMO.

>

> A governor such as userspace should probably be able select any frequency

> with no regard to efficiency.


You seem to be arguing that the decision whether or not to use
inefficient frequencies needs to be made at the governor level,
because the driver must use whatever frequency is selected by the
userspace governor, since the user process driving it may be assuming
that to be the case.

That's why you want to hack schedutil to filter out the inefficient
frequencies, so you don't need to worry about them below that level.

That would be putting driver knowledge into the governor which cpufreq
was specifically designed to avoid.

There is a way to do that, though, which is to put a governor into the
driver and use ->setoplicy(), but the good news is that you don't even
need to do that.

So there are those CPUFREQ_RELATION_* symbols used by governors to
tell drivers what to do next (I wonder what they would be good for
after your changes for that matter) and you can add more of them.  For
example, you can define a CPUFREQ_RELATION_EFFICIENT bit to tell the
driver to avoid inefficient frequencies and pass that from schedutil.

Besides, I'm not sure why you still want to use inefficient
frequencies if they are selected by the ondemand or conservative
governors.

> >

> > Besides, in the cpufreq_driver_fast_switch() case the driver's

> > ->fast_switch() callback is entirely responsible for applying the

> > selected frequency, so I'm not sure how this "efficient" thing is

> > going to work then?

>

> This shouldn't be a problem if a governor that leverages inefficient OPPs

> resolves an inefficient one to an efficient one. The target_freq would simply

> point to a frequency known as efficient.


No, that doesn't help.  Moreover, it is harmful, because it changes
the interface between this particular governor and drivers in a subtle
and potentially confusing way.

Namely, drivers now can expect that the frequency passed to
->fast_switch() is the one corresponding to the current utilization
demand as seen by the governor, subject to the policy limits, not
the-closest-efficient-one-present-in-the-frequency-table.

> > Anyway, since we are talking about frequency tables, that would be the

> > __cpufreq_driver_target() case only and specifically in the

> > __target_index() case only (note how far away this is from the

> > governor).

> >

> > Now, you may think about modifying cpufreq_frequency_table_target() to

> > skip "inefficient" frequencies, but then the question is why those

> > frequencies need to be there in the frequency table in the first

> > place?

>

> Cpufreq_cooling needs those frequencies, even inefficients,


I've already figured that out.

> also there's probably no reason to hide them from the userspace.
Rafael J. Wysocki July 2, 2021, 5:53 p.m. | #23
On Fri, Jul 2, 2021 at 6:08 PM Lukasz Luba <lukasz.luba@arm.com> wrote:
>

>

>

> On 7/2/21 5:04 PM, Rafael J. Wysocki wrote:

> > On Fri, Jul 2, 2021 at 5:46 PM Rafael J. Wysocki <rafael@kernel.org> wrote:

> >>

> >> On Fri, Jul 2, 2021 at 4:21 PM Lukasz Luba <lukasz.luba@arm.com> wrote:

> >>>

> >>> Hi Rafael,

> >>>

> >>> This is a gentle ping.

> >>> You have probably not seen this discussion thread.

> >>

> >> I have looked at it briefly for a few times, but not too much into detail.

> >>

> >>> On 6/16/21 1:45 PM, Lukasz Luba wrote:

> >>>>

> >>>>

> >>>> On 6/16/21 11:53 AM, Viresh Kumar wrote:

> >>>>> On 16-06-21, 11:33, Lukasz Luba wrote:

> >>>>>> On 6/16/21 10:31 AM, Viresh Kumar wrote:

> >>>>>>> On 16-06-21, 10:03, Lukasz Luba wrote:

> >>>>>>> Clean is not lesser number of lines for me, but rather having the

> >>>>>>> right ownership of such things.

> >>>>>>

> >>>>>> Some developers do like patches which removes more lines then adds ;)

> >>>>>

> >>>>> :)

> >>>>>

> >>>>

> >>>> [snip]

> >>>>

> >>>>>>

> >>>>>> What could be the modified v1 [2]:

> >>>>>> - LUT which holds two IDs: efficient, inefficient, take one

> >>>>>>     according to the clamp f_max

> >>>>>> - add new argument 'policy->max' to em_pd_get_efficient_freq()

> >>>>>>

> >>>>>> freq = em_pd_get_efficient_freq(em_cpu_get(policy->cpu), freq,

> >>>>>> policy->max);

> >>>>>>

> >>>>>> The problem was that EAS couldn't know the clamp freq_max,

> >>>>>> which shouldn't be the blocker now.

> >>>>>

> >>>>> If you can do that without adding any EM specific stuff in the cpufreq

> >>>>> core, I will mostly be fine.

> >>>>>

> >>>>> But honestly speaking, creating more data structures to keep related

> >>>>> information doesn't scale well.

> >>>>>

> >>>>> We already have so many tables for keeping freq/voltage pairs, OPP,

> >>>>> cpufreq, EM. You tried to add one more in EM I think V1, not sure.

> >>>>>

> >>>>> It is always better to consolidate and we almost reached to a point

> >>>>> where that could have been done very easily. I understand that you

> >>>>> didn't want to touch so many different parts, but anyway..

> >>>>

> >>>> Yes, I don't want to touch some many generic parts because they

> >>>> are intended to be generic. This heuristic is only for EM platforms,

> >>>> which are arm, arm64 battery powered (not servers or other).

> >>>> Thus, I wanted to keep it locally. The cost of EM extra structures

> >>>> (the LUT) will only be for platforms for which EM discovers that

> >>>> they have inefficient performance levels.

> >>>> The code would even not be compiled in for x86, ppc, etc, in hot path.

> >>>>

> >>>>>>>> this v3 and your proposal.

> >>>>>>>

> >>>>>>> IMHO, adding such callbacks to the EM core, like .mark_efficient(),

> >>>>>>> will only make this easier to handle for all different frameworks, and

> >>>>>>> not otherwise. The code will look much cleaner everywhere..

> >>>>>>>

> >>>>>>

> >>>>>> What about coming back to the slightly modified v1 idea?

> >>>>>> That was really self-contained modification for this

> >>>>>> inefficient opps heuristic.

> >>>>>

> >>>>> I am not sure if I really understand what that would be, but again

> >>>>> adding another table is going to create more problems then it should.

> >>>>

> >>>> IMHO your proposals are more invasive for the generic code, while

> >>>> not always being even used. Plenty of architectures don't even set EM,

> >>>> even arm64 for servers doesn't use it. You and Rafael would have to

> >>>> maintain these modifications in generic code. It might be hard to remove

> >>>> it. While I recommend to keep this heuristic feature inside the EM and

> >>>> we will maintain it. If we decide after a few years that new arm64

> >>>> platforms use some smarter FW for performance level, we might just

> >>>> disable this heuristic.

> >>>>

> >>>>>

> >>>>> Anyway, that's my view, which can be wrong as well.

> >>>>>

> >>>>> Rafael: You have any suggestions here ?

> >>>>>

> >>>>

> >>>> I would also like to hear Rafael's opinion :)

> >>>

> >>> It would be great if you could have a look.

> >>> I will be grateful for your time spend on it and opinion.

> >>

> >> First of all, IMO checking whether or not a given frequency is

> >> "efficient" doesn't belong to cpufreq governors.  The governor knows

> >> what the min and max supported frequencies are and selects one from

> >> that range and note that it doesn't even check whether or not the

> >> selected frequency is present in the frequency table.  That part

> >> belongs to the driver or the general frequency table handling in the

> >> cpufreq core.

> >>

> >> So the governor doesn't care and it shouldn't care IMO.

> >>

> >> Besides, in the cpufreq_driver_fast_switch() case the driver's

> >> ->fast_switch() callback is entirely responsible for applying the

> >> selected frequency, so I'm not sure how this "efficient" thing is

> >> going to work then?

> >>

> >> Anyway, since we are talking about frequency tables, that would be the

> >> __cpufreq_driver_target() case only and specifically in the

> >> __target_index() case only (note how far away this is from the

> >> governor).

> >>

> >> Now, you may think about modifying cpufreq_frequency_table_target() to

> >> skip "inefficient" frequencies, but then the question is why those

> >> frequencies need to be there in the frequency table in the first

> >> place?

> >

> > I'm guessing that the problem is that cpufreq_cooling works by using

> > freq_qos_update_request() to update the max frequency limit and if

> > that is in effect you'd rather use the inefficient frequencies,

> > whereas when the governor selects an inefficient frequency  below the

> > policy limit, you'd rather use a higher-but-efficient frequency

> > instead (within the policy limit).

> >

> > Am I guessing correctly?

> >

>

> Yes, correct. Thermal would use all (efficient + inefficient), but

> we in cpufreq governor would like to pick if possible the efficient

> one (below the thermal limit).


To address that, you need to pass more information from schedutil to
__cpufreq_driver_target() that down the road can be used by
cpufreq_frequency_table_target() to decide whether or not to skip the
inefficient frequencies.

For example, you can define CPUFREQ_RELATION_EFFICIENT and pass it
from schedutil to __cpufreq_driver_target() in the "relation"
argument, and clear it if the target frequency is above the max policy
limit, or if ->target() is to be called.
Lukasz Luba July 2, 2021, 7:04 p.m. | #24
On 7/2/21 6:53 PM, Rafael J. Wysocki wrote:
> On Fri, Jul 2, 2021 at 6:08 PM Lukasz Luba <lukasz.luba@arm.com> wrote:

>>

>>

>>

>> On 7/2/21 5:04 PM, Rafael J. Wysocki wrote:

>>> On Fri, Jul 2, 2021 at 5:46 PM Rafael J. Wysocki <rafael@kernel.org> wrote:

>>>>

>>>> On Fri, Jul 2, 2021 at 4:21 PM Lukasz Luba <lukasz.luba@arm.com> wrote:

>>>>>

>>>>> Hi Rafael,

>>>>>

>>>>> This is a gentle ping.

>>>>> You have probably not seen this discussion thread.

>>>>

>>>> I have looked at it briefly for a few times, but not too much into detail.

>>>>

>>>>> On 6/16/21 1:45 PM, Lukasz Luba wrote:

>>>>>>

>>>>>>

>>>>>> On 6/16/21 11:53 AM, Viresh Kumar wrote:

>>>>>>> On 16-06-21, 11:33, Lukasz Luba wrote:

>>>>>>>> On 6/16/21 10:31 AM, Viresh Kumar wrote:

>>>>>>>>> On 16-06-21, 10:03, Lukasz Luba wrote:

>>>>>>>>> Clean is not lesser number of lines for me, but rather having the

>>>>>>>>> right ownership of such things.

>>>>>>>>

>>>>>>>> Some developers do like patches which removes more lines then adds ;)

>>>>>>>

>>>>>>> :)

>>>>>>>

>>>>>>

>>>>>> [snip]

>>>>>>

>>>>>>>>

>>>>>>>> What could be the modified v1 [2]:

>>>>>>>> - LUT which holds two IDs: efficient, inefficient, take one

>>>>>>>>      according to the clamp f_max

>>>>>>>> - add new argument 'policy->max' to em_pd_get_efficient_freq()

>>>>>>>>

>>>>>>>> freq = em_pd_get_efficient_freq(em_cpu_get(policy->cpu), freq,

>>>>>>>> policy->max);

>>>>>>>>

>>>>>>>> The problem was that EAS couldn't know the clamp freq_max,

>>>>>>>> which shouldn't be the blocker now.

>>>>>>>

>>>>>>> If you can do that without adding any EM specific stuff in the cpufreq

>>>>>>> core, I will mostly be fine.

>>>>>>>

>>>>>>> But honestly speaking, creating more data structures to keep related

>>>>>>> information doesn't scale well.

>>>>>>>

>>>>>>> We already have so many tables for keeping freq/voltage pairs, OPP,

>>>>>>> cpufreq, EM. You tried to add one more in EM I think V1, not sure.

>>>>>>>

>>>>>>> It is always better to consolidate and we almost reached to a point

>>>>>>> where that could have been done very easily. I understand that you

>>>>>>> didn't want to touch so many different parts, but anyway..

>>>>>>

>>>>>> Yes, I don't want to touch some many generic parts because they

>>>>>> are intended to be generic. This heuristic is only for EM platforms,

>>>>>> which are arm, arm64 battery powered (not servers or other).

>>>>>> Thus, I wanted to keep it locally. The cost of EM extra structures

>>>>>> (the LUT) will only be for platforms for which EM discovers that

>>>>>> they have inefficient performance levels.

>>>>>> The code would even not be compiled in for x86, ppc, etc, in hot path.

>>>>>>

>>>>>>>>>> this v3 and your proposal.

>>>>>>>>>

>>>>>>>>> IMHO, adding such callbacks to the EM core, like .mark_efficient(),

>>>>>>>>> will only make this easier to handle for all different frameworks, and

>>>>>>>>> not otherwise. The code will look much cleaner everywhere..

>>>>>>>>>

>>>>>>>>

>>>>>>>> What about coming back to the slightly modified v1 idea?

>>>>>>>> That was really self-contained modification for this

>>>>>>>> inefficient opps heuristic.

>>>>>>>

>>>>>>> I am not sure if I really understand what that would be, but again

>>>>>>> adding another table is going to create more problems then it should.

>>>>>>

>>>>>> IMHO your proposals are more invasive for the generic code, while

>>>>>> not always being even used. Plenty of architectures don't even set EM,

>>>>>> even arm64 for servers doesn't use it. You and Rafael would have to

>>>>>> maintain these modifications in generic code. It might be hard to remove

>>>>>> it. While I recommend to keep this heuristic feature inside the EM and

>>>>>> we will maintain it. If we decide after a few years that new arm64

>>>>>> platforms use some smarter FW for performance level, we might just

>>>>>> disable this heuristic.

>>>>>>

>>>>>>>

>>>>>>> Anyway, that's my view, which can be wrong as well.

>>>>>>>

>>>>>>> Rafael: You have any suggestions here ?

>>>>>>>

>>>>>>

>>>>>> I would also like to hear Rafael's opinion :)

>>>>>

>>>>> It would be great if you could have a look.

>>>>> I will be grateful for your time spend on it and opinion.

>>>>

>>>> First of all, IMO checking whether or not a given frequency is

>>>> "efficient" doesn't belong to cpufreq governors.  The governor knows

>>>> what the min and max supported frequencies are and selects one from

>>>> that range and note that it doesn't even check whether or not the

>>>> selected frequency is present in the frequency table.  That part

>>>> belongs to the driver or the general frequency table handling in the

>>>> cpufreq core.

>>>>

>>>> So the governor doesn't care and it shouldn't care IMO.

>>>>

>>>> Besides, in the cpufreq_driver_fast_switch() case the driver's

>>>> ->fast_switch() callback is entirely responsible for applying the

>>>> selected frequency, so I'm not sure how this "efficient" thing is

>>>> going to work then?

>>>>

>>>> Anyway, since we are talking about frequency tables, that would be the

>>>> __cpufreq_driver_target() case only and specifically in the

>>>> __target_index() case only (note how far away this is from the

>>>> governor).

>>>>

>>>> Now, you may think about modifying cpufreq_frequency_table_target() to

>>>> skip "inefficient" frequencies, but then the question is why those

>>>> frequencies need to be there in the frequency table in the first

>>>> place?

>>>

>>> I'm guessing that the problem is that cpufreq_cooling works by using

>>> freq_qos_update_request() to update the max frequency limit and if

>>> that is in effect you'd rather use the inefficient frequencies,

>>> whereas when the governor selects an inefficient frequency  below the

>>> policy limit, you'd rather use a higher-but-efficient frequency

>>> instead (within the policy limit).

>>>

>>> Am I guessing correctly?

>>>

>>

>> Yes, correct. Thermal would use all (efficient + inefficient), but

>> we in cpufreq governor would like to pick if possible the efficient

>> one (below the thermal limit).

> 

> To address that, you need to pass more information from schedutil to

> __cpufreq_driver_target() that down the road can be used by

> cpufreq_frequency_table_target() to decide whether or not to skip the

> inefficient frequencies.

> 

> For example, you can define CPUFREQ_RELATION_EFFICIENT and pass it

> from schedutil to __cpufreq_driver_target() in the "relation"

> argument, and clear it if the target frequency is above the max policy

> limit, or if ->target() is to be called.

> 


Thank you Rafael for valuable comments. We will have to experiment
with that option and come back with implementation based on it.

Regards,
Lukasz
Vincent Donnefort July 2, 2021, 7:17 p.m. | #25
[...]

> > >

> > > I'm guessing that the problem is that cpufreq_cooling works by using

> > > freq_qos_update_request() to update the max frequency limit and if

> > > that is in effect you'd rather use the inefficient frequencies,

> > > whereas when the governor selects an inefficient frequency  below the

> > > policy limit, you'd rather use a higher-but-efficient frequency

> > > instead (within the policy limit).

> > >

> > > Am I guessing correctly?

> > >

> >

> > Yes, correct. Thermal would use all (efficient + inefficient), but

> > we in cpufreq governor would like to pick if possible the efficient

> > one (below the thermal limit).

> 

> To address that, you need to pass more information from schedutil to

> __cpufreq_driver_target() that down the road can be used by

> cpufreq_frequency_table_target() to decide whether or not to skip the

> inefficient frequencies.

> 

> For example, you can define CPUFREQ_RELATION_EFFICIENT and pass it

> from schedutil to __cpufreq_driver_target() in the "relation"

> argument, and clear it if the target frequency is above the max policy

> limit, or if ->target() is to be called.


What about a cpufreq_policy option that if sets would make
cpufreq_frequency_table_target() skip inefficient OPPs while staying within
the limit of max policy? Each governor could decide to set it or not, but
it would hide the efficiency resolution to the governor and allow drivers
that implements ->target() to also implements support for inefficient OPPs.

That flag could be set according to a new cpufreq_governor flag
CPUFREQ_GOV_SKIP_INEFFICIENCIES?

That could though modify behaviors like powersave_bias from ondemand. But if
a frequency is inefficient, there's probably no power saving anyway.

-- 
Vincent
Rafael J. Wysocki July 5, 2021, 2:09 p.m. | #26
On Fri, Jul 2, 2021 at 9:17 PM Vincent Donnefort
<vincent.donnefort@arm.com> wrote:
>

> [...]

>

> > > >

> > > > I'm guessing that the problem is that cpufreq_cooling works by using

> > > > freq_qos_update_request() to update the max frequency limit and if

> > > > that is in effect you'd rather use the inefficient frequencies,

> > > > whereas when the governor selects an inefficient frequency  below the

> > > > policy limit, you'd rather use a higher-but-efficient frequency

> > > > instead (within the policy limit).

> > > >

> > > > Am I guessing correctly?

> > > >

> > >

> > > Yes, correct. Thermal would use all (efficient + inefficient), but

> > > we in cpufreq governor would like to pick if possible the efficient

> > > one (below the thermal limit).

> >

> > To address that, you need to pass more information from schedutil to

> > __cpufreq_driver_target() that down the road can be used by

> > cpufreq_frequency_table_target() to decide whether or not to skip the

> > inefficient frequencies.

> >

> > For example, you can define CPUFREQ_RELATION_EFFICIENT and pass it

> > from schedutil to __cpufreq_driver_target() in the "relation"

> > argument, and clear it if the target frequency is above the max policy

> > limit, or if ->target() is to be called.

>

> What about a cpufreq_policy option that if sets would make

> cpufreq_frequency_table_target() skip inefficient OPPs while staying within

> the limit of max policy?


That would work too, ->

> Each governor could decide to set it or not, but

> it would hide the efficiency resolution to the governor and allow drivers

> that implements ->target() to also implements support for inefficient OPPs.


-> but alternatively there could be an additional cpufreq driver flag
to be set by the drivers implementing ->target() and wanting to deal
with CPUFREQ_RELATION_EFFICIENT themselves (an opt-in of sorts).

So the governors that want it may pass CPUFREQ_RELATION_EFFICIENT to
__cpufreq_driver_target() and then it will be passed to ->target()
depending on whether or not the new driver flag is set.

> That flag could be set according to a new cpufreq_governor flag

> CPUFREQ_GOV_SKIP_INEFFICIENCIES?

>

> That could though modify behaviors like powersave_bias from ondemand. But if

> a frequency is inefficient, there's probably no power saving anyway.


AFAICS, the userspace governor aside, using inefficient frequencies
only works with the powersave governor.  In the other cases,
RELATION_L (say) can be interpreted as "the closest efficient
frequency equal to or above the target" with the max policy limit
possibly causing inefficient frequencies to be used if they are closer
to the limit than the next efficient one.

As a rule, the governors don't assume that there are any inefficient
frequencies in the table.  In fact, they don't make any assumptions
regarding the contents of the frequency table at all.  They don't even
assume that the driver uses a frequency table in the first place.
Vincent Donnefort July 6, 2021, 8:12 a.m. | #27
[...]

> >

> > What about a cpufreq_policy option that if sets would make

> > cpufreq_frequency_table_target() skip inefficient OPPs while staying within

> > the limit of max policy?

> 

> That would work too, ->

> 

> > Each governor could decide to set it or not, but

> > it would hide the efficiency resolution to the governor and allow drivers

> > that implements ->target() to also implements support for inefficient OPPs.

> 

> -> but alternatively there could be an additional cpufreq driver flag

> to be set by the drivers implementing ->target() and wanting to deal

> with CPUFREQ_RELATION_EFFICIENT themselves (an opt-in of sorts).

> 

> So the governors that want it may pass CPUFREQ_RELATION_EFFICIENT to

> __cpufreq_driver_target() and then it will be passed to ->target()

> depending on whether or not the new driver flag is set.


Of course, I can implement this instead of a cpufreq_policy flag in v4.
I suppose then right fallback for CPUFREQ_RELATION_EFFICIENT in case the
driver doesn't opt-in is CPUFREQ_RELATION_L.

> 

> > That flag could be set according to a new cpufreq_governor flag

> > CPUFREQ_GOV_SKIP_INEFFICIENCIES?

> >

> > That could though modify behaviors like powersave_bias from ondemand. But if

> > a frequency is inefficient, there's probably no power saving anyway.

> 

> AFAICS, the userspace governor aside, using inefficient frequencies

> only works with the powersave governor.  In the other cases,

> RELATION_L (say) can be interpreted as "the closest efficient

> frequency equal to or above the target" with the max policy limit

> possibly causing inefficient frequencies to be used if they are closer

> to the limit than the next efficient one.

> 

> As a rule, the governors don't assume that there are any inefficient

> frequencies in the table.  In fact, they don't make any assumptions

> regarding the contents of the frequency table at all.  They don't even

> assume that the driver uses a frequency table in the first place.


So all the governors, beside powersave and userspace would replace their
RELATION_L with RELATION_EFFICIENT. I'll add the changes in v4.

So if I sum-up: new RELATION_EFFICIENT that resolves RELATION_L to an higher
efficient frequency (if necessary) within the limits of policy->max.
CPUfreq drivers can opt-in by setting an appropriate flag. If they do not,
RELATION_EFFICIENT will be rewritten in RELATION_L. All governors but userspace
and powersave would use RELATION_EFFICIENT instead of RELATION_L.

If that works for you, I'll implement this in a v4, as well as some
improvements for the CPUfreq/EM registration following the discussion with
Viresh.

-- 
Vincent
Viresh Kumar July 6, 2021, 8:37 a.m. | #28
On 06-07-21, 09:12, Vincent Donnefort wrote:
> Of course, I can implement this instead of a cpufreq_policy flag in v4.


Why should this ever be a policy flag and not governor's flag ?

-- 
viresh
Vincent Donnefort July 6, 2021, 8:43 a.m. | #29
On Tue, Jul 06, 2021 at 02:07:41PM +0530, Viresh Kumar wrote:
> On 06-07-21, 09:12, Vincent Donnefort wrote:

> > Of course, I can implement this instead of a cpufreq_policy flag in v4.

> 

> Why should this ever be a policy flag and not governor's flag ?


I was referring to how letting know a driver which registers ->target() that we
want or not inefficient frequencies. My proposal was to rely on a
cpufreq_policy's flag that the driver can read. 

> 

> -- 

> viresh
Viresh Kumar July 6, 2021, 8:50 a.m. | #30
On 06-07-21, 09:43, Vincent Donnefort wrote:
> I was referring to how letting know a driver which registers ->target() that we

> want or not inefficient frequencies. My proposal was to rely on a

> cpufreq_policy's flag that the driver can read. 


I am bit confused at this point, lets see how it looks eventually as patches :)

-- 
viresh
Rafael J. Wysocki July 6, 2021, 12:11 p.m. | #31
On Tue, Jul 6, 2021 at 10:13 AM Vincent Donnefort
<vincent.donnefort@arm.com> wrote:
>

> [...]

>

> > >

> > > What about a cpufreq_policy option that if sets would make

> > > cpufreq_frequency_table_target() skip inefficient OPPs while staying within

> > > the limit of max policy?

> >

> > That would work too, ->

> >

> > > Each governor could decide to set it or not, but

> > > it would hide the efficiency resolution to the governor and allow drivers

> > > that implements ->target() to also implements support for inefficient OPPs.

> >

> > -> but alternatively there could be an additional cpufreq driver flag

> > to be set by the drivers implementing ->target() and wanting to deal

> > with CPUFREQ_RELATION_EFFICIENT themselves (an opt-in of sorts).

> >

> > So the governors that want it may pass CPUFREQ_RELATION_EFFICIENT to

> > __cpufreq_driver_target() and then it will be passed to ->target()

> > depending on whether or not the new driver flag is set.

>

> Of course, I can implement this instead of a cpufreq_policy flag in v4.

> I suppose then right fallback for CPUFREQ_RELATION_EFFICIENT in case the

> driver doesn't opt-in is CPUFREQ_RELATION_L.

>

> >

> > > That flag could be set according to a new cpufreq_governor flag

> > > CPUFREQ_GOV_SKIP_INEFFICIENCIES?

> > >

> > > That could though modify behaviors like powersave_bias from ondemand. But if

> > > a frequency is inefficient, there's probably no power saving anyway.

> >

> > AFAICS, the userspace governor aside, using inefficient frequencies

> > only works with the powersave governor.  In the other cases,

> > RELATION_L (say) can be interpreted as "the closest efficient

> > frequency equal to or above the target" with the max policy limit

> > possibly causing inefficient frequencies to be used if they are closer

> > to the limit than the next efficient one.

> >

> > As a rule, the governors don't assume that there are any inefficient

> > frequencies in the table.  In fact, they don't make any assumptions

> > regarding the contents of the frequency table at all.  They don't even

> > assume that the driver uses a frequency table in the first place.

>

> So all the governors, beside powersave and userspace would replace their

> RELATION_L with RELATION_EFFICIENT. I'll add the changes in v4.

>

> So if I sum-up: new RELATION_EFFICIENT that resolves RELATION_L to an higher

> efficient frequency (if necessary) within the limits of policy->max.


Yes.

It can be called RELATION_E for brevity.

> CPUfreq drivers can opt-in by setting an appropriate flag. If they do not,

> RELATION_EFFICIENT will be rewritten in RELATION_L.


Yes, and cpufreq_frequency_table_target() will take RELATION_E into
account if set.

> All governors but userspace and powersave would use RELATION_EFFICIENT instead of RELATION_L.


Yes.

> If that works for you, I'll implement this in a v4, as well as some

> improvements for the CPUfreq/EM registration following the discussion with

> Viresh.


Sounds good, thanks!

Patch

diff --git a/drivers/cpufreq/freq_table.c b/drivers/cpufreq/freq_table.c
index 67e56cf..0756d7d6 100644
--- a/drivers/cpufreq/freq_table.c
+++ b/drivers/cpufreq/freq_table.c
@@ -351,6 +351,53 @@  static int set_freq_table_sorted(struct cpufreq_policy *policy)
 	return 0;
 }
 
+static void set_freq_table_efficiencies(struct cpufreq_policy *policy)
+{
+	struct cpufreq_frequency_table *pos, *table = policy->freq_table;
+	enum cpufreq_table_sorting sort = policy->freq_table_sorted;
+	int efficient, idx;
+
+	/* Not supported */
+	if (sort == CPUFREQ_TABLE_UNSORTED) {
+		cpufreq_for_each_entry_idx(pos, table, idx)
+			pos->efficient = idx;
+		return;
+	}
+
+	/* The highest frequency is always efficient */
+	cpufreq_for_each_entry_idx(pos, table, idx) {
+		if (pos->frequency == CPUFREQ_ENTRY_INVALID)
+			continue;
+
+		efficient = idx;
+
+		if (sort == CPUFREQ_TABLE_SORTED_DESCENDING)
+			break;
+	}
+
+	for (;;) {
+		pos = &table[idx];
+
+		if (pos->frequency == CPUFREQ_ENTRY_INVALID)
+			continue;
+
+		if (pos->flags & CPUFREQ_INEFFICIENT_FREQ) {
+			pos->efficient = efficient;
+		} else {
+			pos->efficient = idx;
+			efficient = idx;
+		}
+
+		if (sort == CPUFREQ_TABLE_SORTED_ASCENDING) {
+			if (--idx < 0)
+				break;
+		} else {
+			if (table[++idx].frequency == CPUFREQ_TABLE_END)
+				break;
+		}
+	}
+}
+
 int cpufreq_table_validate_and_sort(struct cpufreq_policy *policy)
 {
 	int ret;
@@ -362,7 +409,13 @@  int cpufreq_table_validate_and_sort(struct cpufreq_policy *policy)
 	if (ret)
 		return ret;
 
-	return set_freq_table_sorted(policy);
+	ret = set_freq_table_sorted(policy);
+	if (ret)
+		return ret;
+
+	set_freq_table_efficiencies(policy);
+
+	return ret;
 }
 
 MODULE_AUTHOR("Dominik Brodowski <linux@brodo.de>");
diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
index 353969c..d10784c 100644
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -666,13 +666,15 @@  struct governor_attr {
 #define CPUFREQ_ENTRY_INVALID	~0u
 #define CPUFREQ_TABLE_END	~1u
 /* Special Values of .flags field */
-#define CPUFREQ_BOOST_FREQ	(1 << 0)
+#define CPUFREQ_BOOST_FREQ	 (1 << 0)
+#define CPUFREQ_INEFFICIENT_FREQ (1 << 1)
 
 struct cpufreq_frequency_table {
 	unsigned int	flags;
 	unsigned int	driver_data; /* driver specific data, not used by core */
 	unsigned int	frequency; /* kHz - doesn't need to be in ascending
 				    * order */
+	unsigned int	efficient; /* idx of an efficient frequency */
 };
 
 #if defined(CONFIG_CPU_FREQ) && defined(CONFIG_PM_OPP)