
[v4,10/18] PM: EM: Add RCU mechanism which safely cleans the old data

Message ID 20230925081139.1305766-11-lukasz.luba@arm.com
State New
Series Introduce runtime modifiable Energy Model

Commit Message

Lukasz Luba Sept. 25, 2023, 8:11 a.m. UTC
The EM is going to support runtime modifications of the power data.
Introduce RCU safe mechanism to clean up the old allocated EM data.
It also adds a mutex for the EM structure to serialize the modifiers.

Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
---
 kernel/power/energy_model.c | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

Comments

kernel test robot Sept. 26, 2023, 10:28 a.m. UTC | #1
Hi Lukasz,

kernel test robot noticed the following build warnings:

[auto build test WARNING on rafael-pm/linux-next]
[also build test WARNING on rafael-pm/thermal linus/master v6.6-rc3 next-20230926]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Lukasz-Luba/PM-EM-Add-missing-newline-for-the-message-log/20230925-181243
base:   https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git linux-next
patch link:    https://lore.kernel.org/r/20230925081139.1305766-11-lukasz.luba%40arm.com
patch subject: [PATCH v4 10/18] PM: EM: Add RCU mechanism which safely cleans the old data
config: i386-randconfig-063-20230926 (https://download.01.org/0day-ci/archive/20230926/202309261850.jrucSbN8-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20230926/202309261850.jrucSbN8-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202309261850.jrucSbN8-lkp@intel.com/

sparse warnings: (new ones prefixed by >>)
>> kernel/power/energy_model.c:125:13: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct em_perf_table *tmp @@     got struct em_perf_table [noderef] __rcu *runtime_table @@
   kernel/power/energy_model.c:125:13: sparse:     expected struct em_perf_table *tmp
   kernel/power/energy_model.c:125:13: sparse:     got struct em_perf_table [noderef] __rcu *runtime_table

vim +125 kernel/power/energy_model.c

   118	
   119	static void em_perf_runtime_table_set(struct device *dev,
   120					      struct em_perf_table *runtime_table)
   121	{
   122		struct em_perf_domain *pd = dev->em_pd;
   123		struct em_perf_table *tmp;
   124	
 > 125		tmp = pd->runtime_table;
   126	
   127		rcu_assign_pointer(pd->runtime_table, runtime_table);
   128	
   129		em_cpufreq_update_efficiencies(dev, runtime_table->state);
   130	
   131		/* Don't free default table since it's used by other frameworks. */
   132		if (tmp != pd->default_table)
   133			call_rcu(&tmp->rcu, em_destroy_rt_table_rcu);
   134	}
   135
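
One common way to address this class of sparse warning is to read the
__rcu-annotated pointer through an RCU accessor instead of a plain load.
The sketch below is a userspace model, not the fix that was applied to
this series: the macros are simplified stand-ins for the kernel's
rcu_dereference_protected() and rcu_assign_pointer(), the flag
em_pd_mutex_held stands in for lockdep_is_held(&em_pd_mutex), and the
helper name em_runtime_table_swap() is hypothetical. It assumes the
update path runs with em_pd_mutex held.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Userspace stand-ins (assumptions). In the kernel, __rcu is a sparse
 * address-space annotation and the two accessors come from
 * <linux/rcupdate.h>.
 */
#define __rcu
#define rcu_dereference_protected(p, cond) (assert(cond), (p))
#define rcu_assign_pointer(p, v) ((p) = (v))

struct em_perf_table {
	int id;
};

struct em_perf_domain {
	struct em_perf_table *default_table;
	struct em_perf_table __rcu *runtime_table;
};

/* Stand-in for lockdep_is_held(&em_pd_mutex). */
static int em_pd_mutex_held;

/*
 * Publish new_table and return the previous table. The old pointer is
 * fetched through the accessor, so sparse sees a deliberate, documented
 * crossing from the __rcu address space to a plain pointer.
 */
static struct em_perf_table *
em_runtime_table_swap(struct em_perf_domain *pd,
		      struct em_perf_table *new_table)
{
	struct em_perf_table *old;

	old = rcu_dereference_protected(pd->runtime_table, em_pd_mutex_held);
	rcu_assign_pointer(pd->runtime_table, new_table);

	return old;
}
```

The caller can then compare the returned pointer with pd->default_table
and hand it to call_rcu() only when the two differ, as the patch does.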
Lukasz Luba Sept. 29, 2023, 9:36 a.m. UTC | #2
On 9/26/23 20:26, Rafael J. Wysocki wrote:
> On Mon, Sep 25, 2023 at 10:11 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
>>
>> The EM is going to support runtime modifications of the power data.
>> Introduce RCU safe mechanism to clean up the old allocated EM data.
> 
> "RCU-based" probably and "to clean up the old EM data safely".

Yes, thanks

> 
>> It also adds a mutex for the EM structure to serialize the modifiers.
> 
> This part doesn't match the code changes in the patch.

Good catch. It was left over from an older version. We use the
existing em_pd_mutex.

> 
>> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
>> ---
>>   kernel/power/energy_model.c | 29 +++++++++++++++++++++++++++++
>>   1 file changed, 29 insertions(+)
>>
>> diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
>> index 5b40db38b745..2345837bfd2c 100644
>> --- a/kernel/power/energy_model.c
>> +++ b/kernel/power/energy_model.c
>> @@ -23,6 +23,9 @@
>>    */
>>   static DEFINE_MUTEX(em_pd_mutex);
>>
>> +static void em_cpufreq_update_efficiencies(struct device *dev,
>> +                                          struct em_perf_state *table);
>> +
>>   static bool _is_cpu_device(struct device *dev)
>>   {
>>          return (dev->bus == &cpu_subsys);
>> @@ -104,6 +107,32 @@ static void em_debug_create_pd(struct device *dev) {}
>>   static void em_debug_remove_pd(struct device *dev) {}
>>   #endif
>>
>> +static void em_destroy_rt_table_rcu(struct rcu_head *rp)
> 
> Adding static functions without callers will obviously cause the
> compiler to complain, which is one of the reasons to avoid doing that.
> The other is that it is hard to say how these functions are going to
> be used without reviewing multiple patches simultaneously, which is a
> pain as far as I'm concerned.

It is used in this patch, but inside the call_rcu() as 2nd arg.
I have marked that below. The compiler didn't complain IIRC.

> 
>> +{
>> +       struct em_perf_table *runtime_table;
>> +
>> +       runtime_table = container_of(rp, struct em_perf_table, rcu);
>> +       kfree(runtime_table->state);
>> +       kfree(runtime_table);
> 
> If runtime_table and its state were allocated in one go, it would be
> possible to free them in one go either.
> 
> For some reason, you don't seem to want to do that, but why?

We had a few internal reviews and there were voices saying that
it's better to have 2 identical tables: 'default_table' and
'runtime_table' to make sure it's visible everywhere when it's used.
That created the need to also have the 'state' table inside.
I don't see it as a big problem, though.

> 
>> +}
>> +
>> +static void em_perf_runtime_table_set(struct device *dev,
>> +                                     struct em_perf_table *runtime_table)
>> +{
>> +       struct em_perf_domain *pd = dev->em_pd;
>> +       struct em_perf_table *tmp;
>> +
>> +       tmp = pd->runtime_table;
>> +
>> +       rcu_assign_pointer(pd->runtime_table, runtime_table);
>> +
>> +       em_cpufreq_update_efficiencies(dev, runtime_table->state);
>> +
>> +       /* Don't free default table since it's used by other frameworks. */
> 
> Apparently, some frameworks are only going to use the default table
> while the runtime-updatable table will be used somewhere else at the
> same time.
> 
> I'm not really sure if this is a good idea.

Runtime table is only for driving the task placement in the EAS.

The thermal gov IPA won't make better decisions because it already
has the mechanism to accumulate the error that it made.

The same applies to DTPM, which works in a more 'configurable' way,
rather than as a hard optimization mechanism (like EAS).

> 
>> +       if (tmp != pd->default_table)
>> +               call_rcu(&tmp->rcu, em_destroy_rt_table_rcu);

The em_destroy_rt_table_rcu() is used here ^^^^^^
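
The point above can be illustrated with a small userspace model. Passing
the address of a static function as a callback counts as a use, so the
compiler has no reason to emit an unused-function warning, and
container_of() lets the callback recover the enclosing table from the
embedded rcu_head. This is only a sketch: struct rcu_head and
container_of() are simplified stand-ins for the kernel versions, and
call_rcu_stub() and run_callback_stub() are hypothetical helpers that
merely record and later invoke the callback. The real call_rcu() defers
the callback until a grace period has elapsed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Userspace stand-ins (assumptions) for kernel types and helpers. */
struct rcu_head {
	void (*func)(struct rcu_head *head);
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct em_perf_state {
	unsigned long frequency;
	unsigned long power;
};

struct em_perf_table {
	struct em_perf_state *state;
	struct rcu_head rcu;
};

/* Counts destroyed tables so the behaviour can be observed. */
static int tables_freed;

/* Same shape as the callback in the patch: two objects, two frees. */
static void em_destroy_rt_table_rcu(struct rcu_head *rp)
{
	struct em_perf_table *runtime_table;

	runtime_table = container_of(rp, struct em_perf_table, rcu);
	free(runtime_table->state);
	free(runtime_table);
	tables_freed++;
}

/* Hypothetical stub: only records the callback, nothing runs yet. */
static void call_rcu_stub(struct rcu_head *head,
			  void (*func)(struct rcu_head *head))
{
	head->func = func;
}

/* Hypothetical stub: models the end of the grace period. */
static void run_callback_stub(struct rcu_head *head)
{
	head->func(head);
}
```

Nothing is freed at the time of the call_rcu_stub() call; the table and
its state array go away only when the recorded callback runs.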
Rafael J. Wysocki Sept. 29, 2023, 12:59 p.m. UTC | #3
On Fri, Sep 29, 2023 at 11:36 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
>
>
>
> On 9/26/23 20:26, Rafael J. Wysocki wrote:
> > On Mon, Sep 25, 2023 at 10:11 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
> >>
> >> The EM is going to support runtime modifications of the power data.
> >> Introduce RCU safe mechanism to clean up the old allocated EM data.
> >
> > "RCU-based" probably and "to clean up the old EM data safely".
>
> Yes, thanks
>
> >
> >> It also adds a mutex for the EM structure to serialize the modifiers.
> >
> > This part doesn't match the code changes in the patch.
>
> Good catch. It left from some older version. We use the existing
> em_pd_mutex.
>
> >
> >> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
> >> ---
> >>   kernel/power/energy_model.c | 29 +++++++++++++++++++++++++++++
> >>   1 file changed, 29 insertions(+)
> >>
> >> diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
> >> index 5b40db38b745..2345837bfd2c 100644
> >> --- a/kernel/power/energy_model.c
> >> +++ b/kernel/power/energy_model.c
> >> @@ -23,6 +23,9 @@
> >>    */
> >>   static DEFINE_MUTEX(em_pd_mutex);
> >>
> >> +static void em_cpufreq_update_efficiencies(struct device *dev,
> >> +                                          struct em_perf_state *table);
> >> +
> >>   static bool _is_cpu_device(struct device *dev)
> >>   {
> >>          return (dev->bus == &cpu_subsys);
> >> @@ -104,6 +107,32 @@ static void em_debug_create_pd(struct device *dev) {}
> >>   static void em_debug_remove_pd(struct device *dev) {}
> >>   #endif
> >>
> >> +static void em_destroy_rt_table_rcu(struct rcu_head *rp)
> >
> > Adding static functions without callers will obviously cause the
> > compiler to complain, which is one of the reasons to avoid doing that.
> > The other is that it is hard to say how these functions are going to
> > be used without reviewing multiple patches simultaneously, which is a
> > pain as far as I'm concerned.
>
> It is used in this patch, but inside the call_rcu() as 2nd arg.

I missed that, sorry for the noise.

> I have marked that below. The compiler didn't complain IIRC.
>
> >
> >> +{
> >> +       struct em_perf_table *runtime_table;
> >> +
> >> +       runtime_table = container_of(rp, struct em_perf_table, rcu);
> >> +       kfree(runtime_table->state);
> >> +       kfree(runtime_table);
> >
> > If runtime_table and its state were allocated in one go, it would be
> > possible to free them in one go either.
> >
> > For some reason, you don't seem to want to do that, but why?
>
> We had a few internal reviews and there were voices where saying that
> it's better to have 2 identical tables: 'default_table' and
> 'runtime_table' to make sure it's visible everywhere when it's used.
> That made the need to actually have also the 'state' table inside.
> I don't see it as a big problem, though.

What I'm trying to say is that you can allocate runtime_table along
with the table pointed to by its state field in one invocation of
kzalloc() (say).

Having just one memory region to free eventually instead of two of
them would help to avoid some complexity, especially in the next
patch.

> >
> >> +}
> >> +
> >> +static void em_perf_runtime_table_set(struct device *dev,
> >> +                                     struct em_perf_table *runtime_table)
> >> +{
> >> +       struct em_perf_domain *pd = dev->em_pd;
> >> +       struct em_perf_table *tmp;
> >> +
> >> +       tmp = pd->runtime_table;
> >> +
> >> +       rcu_assign_pointer(pd->runtime_table, runtime_table);
> >> +
> >> +       em_cpufreq_update_efficiencies(dev, runtime_table->state);
> >> +
> >> +       /* Don't free default table since it's used by other frameworks. */
> >
> > Apparently, some frameworks are only going to use the default table
> > while the runtime-updatable table will be used somewhere else at the
> > same time.
> >
> > I'm not really sure if this is a good idea.
>
> Runtime table is only for driving the task placement in the EAS.
>
> The thermal gov IPA won't make better decisions because it already
> has the mechanism to accumulate the error that it made.
>
> The same applies to DTPM, which works in a more 'configurable' way,
> rather that hard optimization mechanism (like EAS).

My understanding of the above is that the other EM users don't really
care that much so they can get away with using the default table all
the time, but EAS needs more accuracy, so the table used by it needs
to be adjusted in certain situations.

Fair enough, I'm assuming that you've done some research around it.
Still, this is rather confusing.

> >
> >> +       if (tmp != pd->default_table)
> >> +               call_rcu(&tmp->rcu, em_destroy_rt_table_rcu);
>
> The em_destroy_rt_table_rcu() is used here ^^^^^^
Lukasz Luba Oct. 2, 2023, 1:44 p.m. UTC | #4
On 9/29/23 13:59, Rafael J. Wysocki wrote:
> On Fri, Sep 29, 2023 at 11:36 AM Lukasz Luba <lukasz.luba@arm.com> wrote:

[snip]

>> We had a few internal reviews and there were voices where saying that
>> it's better to have 2 identical tables: 'default_table' and
>> 'runtime_table' to make sure it's visible everywhere when it's used.
>> That made the need to actually have also the 'state' table inside.
>> I don't see it as a big problem, though.
> 
> What I'm trying to say is that you can allocate runtime_table along
> with the table pointed to by its state field in one invocation of
> kzalloc() (say).
> 
> Having just one memory region to free eventually instead of two of
> them would help to avoid some complexity, especially in the next
> patch.

I think I know what you mean, basically:

------------------------------
struct em_perf_table {
	struct rcu_head rcu;
	struct em_perf_state state[];
};

kzalloc(sizeof(struct em_perf_table) + N * sizeof(struct em_perf_state))

------

IMO that should also be OK in the rest of places.
I agree the alloc/free code would be smaller.

Let me do that then.
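
The single-allocation layout discussed above can be sketched in
userspace C as follows. This is an illustration under stated
assumptions, not the code from a later revision of the series: calloc()
stands in for kzalloc(), the rcu_head and em_perf_state definitions are
simplified stand-ins, and em_table_alloc() and em_table_free() are
hypothetical names. In kernel code the size computation would normally
be written with struct_size() from <linux/overflow.h>, which performs
the overflow check shown here by hand.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Simplified stand-ins (assumptions) for the kernel definitions. */
struct rcu_head {
	struct rcu_head *next;
	void (*func)(struct rcu_head *head);
};

struct em_perf_state {
	unsigned long frequency;
	unsigned long power;
	unsigned long cost;
};

struct em_perf_table {
	struct rcu_head rcu;
	struct em_perf_state state[];	/* C99 flexible array member */
};

/*
 * Allocate the header and nr_states entries as one zeroed region.
 * Returns NULL if the size would overflow or the allocation fails.
 */
static struct em_perf_table *em_table_alloc(size_t nr_states)
{
	if (nr_states > (SIZE_MAX - sizeof(struct em_perf_table)) /
			sizeof(struct em_perf_state))
		return NULL;

	return calloc(1, sizeof(struct em_perf_table) +
			 nr_states * sizeof(struct em_perf_state));
}

/* One region was allocated, so one free releases everything. */
static void em_table_free(struct em_perf_table *table)
{
	free(table);
}
```

With this layout the RCU callback needs a single free of the table,
since the state array no longer has a separate lifetime.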

> 
>>>
>>>> +}
>>>> +
>>>> +static void em_perf_runtime_table_set(struct device *dev,
>>>> +                                     struct em_perf_table *runtime_table)
>>>> +{
>>>> +       struct em_perf_domain *pd = dev->em_pd;
>>>> +       struct em_perf_table *tmp;
>>>> +
>>>> +       tmp = pd->runtime_table;
>>>> +
>>>> +       rcu_assign_pointer(pd->runtime_table, runtime_table);
>>>> +
>>>> +       em_cpufreq_update_efficiencies(dev, runtime_table->state);
>>>> +
>>>> +       /* Don't free default table since it's used by other frameworks. */
>>>
>>> Apparently, some frameworks are only going to use the default table
>>> while the runtime-updatable table will be used somewhere else at the
>>> same time.
>>>
>>> I'm not really sure if this is a good idea.
>>
>> Runtime table is only for driving the task placement in the EAS.
>>
>> The thermal gov IPA won't make better decisions because it already
>> has the mechanism to accumulate the error that it made.
>>
>> The same applies to DTPM, which works in a more 'configurable' way,
>> rather that hard optimization mechanism (like EAS).
> 
> My understanding of the above is that the other EM users don't really
> care that much so they can get away with using the default table all
> the time, but EAS needs more accuracy, so the table used by it needs
> to be adjusted in certain situations.

Yes

> 
> Fair enough, I'm assuming that you've done some research around it.
> Still, this is rather confusing.

Yes, I have presented those ~2y ago in Android Gerrit world
(got feedback from a few vendors) and in a few Linux conferences.

For now we don't plan to have this feature for the thermal
governor or something similar.

> 
>>>
>>>> +       if (tmp != pd->default_table)
>>>> +               call_rcu(&tmp->rcu, em_destroy_rt_table_rcu);
>>
>> The em_destroy_rt_table_rcu() is used here ^^^^^^
Lukasz Luba Oct. 6, 2023, 8:46 a.m. UTC | #5
Hi Rafael,

A change of direction here, regarding your comment below.

On 10/2/23 14:44, Lukasz Luba wrote:
> 
> 
> On 9/29/23 13:59, Rafael J. Wysocki wrote:
>> On Fri, Sep 29, 2023 at 11:36 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
> 
> [snip]
> 

[snip]

>>>> Apparently, some frameworks are only going to use the default table
>>>> while the runtime-updatable table will be used somewhere else at the
>>>> same time.
>>>>
>>>> I'm not really sure if this is a good idea.
>>>
>>> Runtime table is only for driving the task placement in the EAS.
>>>
>>> The thermal gov IPA won't make better decisions because it already
>>> has the mechanism to accumulate the error that it made.
>>>
>>> The same applies to DTPM, which works in a more 'configurable' way,
>>> rather that hard optimization mechanism (like EAS).
>>
>> My understanding of the above is that the other EM users don't really
>> care that much so they can get away with using the default table all
>> the time, but EAS needs more accuracy, so the table used by it needs
>> to be adjusted in certain situations.
> 
> Yes
> 
>>
>> Fair enough, I'm assuming that you've done some research around it.
>> Still, this is rather confusing.
> 
> Yes, I have presented those ~2y ago in Android Gerrit world
> (got feedback from a few vendors) and in a few Linux conferences.
> 
> For now we don't plan to have this feature for the thermal
> governor or something similar.
> 

I have discussed with one of our partners your comment about 2 tables.
They would like to have this runtime modified EM in other places
as well: DTPM and thermal governor. So you had good gut feeling.

In the past in our IPA (thermal gov ~2016 and kernel v4.14) we
had two callbacks:
- get_static_power() [1]
- get_dynamic_power() [2]

Later ~2017/2018 v4.16 the static power mechanism was removed
completely by this commit 84fe2cab48590e4373978e4e.
The way it was designed, implemented and used justified that
decision. We later used EM in the cpu cooling which also only
had dynamic power information.

The PID mechanism in IPA tries to compensate for that
missing information (about static power changing over time or chip
binning) and adjusts the 'error'. How good and fast that is in all
situations - it's a different story (out of this scope).
So, IPA should not be worse with the runtime table.

The static power was on the chips and probably will be still.
You might remember my slide 13 from OSPM2023 showing two power
usage plots for the same Big CPU and 1.4GHz fixed (50% of fmax):
- w/ GPU working in the background using 1-1.5W
- w/o GPU in the background

The same workload runs on the Big CPU, but power is ~15% higher
after ~1min.

The static power (leakage) is the issue that this patch tries
to address for EAS. Although, it is not only about the leakage.
It's about the whole 'profile', which can differ from what
could be built from the default information available at boot.

So we would want to go for one single table in EM, which
is runtime modifiable.

That is something you might be more comfortable with, and we would
have less diversity (2 tables) in the kernel.

Regards,
Lukasz


[1] 
https://elixir.bootlin.com/linux/v4.14/source/drivers/thermal/cpu_cooling.c#L336
[2] 
https://elixir.bootlin.com/linux/v4.14/source/drivers/thermal/cpu_cooling.c#L383
Wei Wang Oct. 11, 2023, 4:02 p.m. UTC | #6
On Fri, Oct 6, 2023 at 1:45 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
>
> Hi Rafael,
>
> A change of direction here, regarding your comment below.
>
> On 10/2/23 14:44, Lukasz Luba wrote:
> >
> >
> > On 9/29/23 13:59, Rafael J. Wysocki wrote:
> >> On Fri, Sep 29, 2023 at 11:36 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
> >
> > [snip]
> >
>
> [snip]
>
> >>>> Apparently, some frameworks are only going to use the default table
> >>>> while the runtime-updatable table will be used somewhere else at the
> >>>> same time.
> >>>>
> >>>> I'm not really sure if this is a good idea.
> >>>
> >>> Runtime table is only for driving the task placement in the EAS.
> >>>
> >>> The thermal gov IPA won't make better decisions because it already
> >>> has the mechanism to accumulate the error that it made.
> >>>
> >>> The same applies to DTPM, which works in a more 'configurable' way,
> >>> rather that hard optimization mechanism (like EAS).
> >>
> >> My understanding of the above is that the other EM users don't really
> >> care that much so they can get away with using the default table all
> >> the time, but EAS needs more accuracy, so the table used by it needs
> >> to be adjusted in certain situations.
> >
> > Yes
> >
> >>
> >> Fair enough, I'm assuming that you've done some research around it.
> >> Still, this is rather confusing.
> >
> > Yes, I have presented those ~2y ago in Android Gerrit world
> > (got feedback from a few vendors) and in a few Linux conferences.
> >
> > For now we don't plan to have this feature for the thermal
> > governor or something similar.
> >
>
> I have discussed with one of our partners your comment about 2 tables.
> They would like to have this runtime modified EM in other places
> as well: DTPM and thermal governor. So you had good gut feeling.
>
> In the past in our IPA (thermal gov ~2016 and kernel v4.14) we
> had two callbacks:
> - get_static_power() [1]
> - get_dynamic_power() [2]
>
> Later ~2017/2018 v4.16 the static power mechanism was removed
> completely by this commit 84fe2cab48590e4373978e4e.
> The way how it was design, implemented and used justified that
> decision. We later used EM in the cpu cooling which also only
> had dynamic power information.
>
> The PID mechanism in IPA tries to compensate that
> missing information (about changed static power in time or a chip
> binning) and adjusts the 'error'. How good and fast that is in all
> situations - it's a different story (out of this scope).
> So, IPA should not be worse with the runtime table.
>
> The static power was on the chips and probably will be still.
> You might remember my slide 13 from OSPM2024 showing two power
> usage plots for the same Big CPU and 1.4GHz fixed (50% of fmax):
> - w/ GPU working in the background using 1-1.5W
> - w/o GPU in the background
>
> The same workload run on Big, but power bigger is ~15% higher
> after ~1min.
>
> The static power (leakage) is the issue that this patch tries
> to address for EAS. Although, there is not only the leakage.
> It's about the whole 'profile', which can be different than what
> could be built during boot default information.
>
> So we would want to go for one single table in EM, which
> is runtime modifiable.
>
> That is something that you might be more confident and we would
> have less diversity (2 tables) in the kernel.
>
> Regards,
> Lukasz
>
>

Indeed, we had a conversation about this with Lukasz recently. The key
idea is that there is no compelling reason to introduce diversity in
the mathematics involved. If we have confidence in the superior
accuracy of our model, it should be universally implemented. While the
governors are designed with some error tolerance, they can benefit
from enhanced accuracy in their operation.

Thanks!
-Wei

> [1]
> https://elixir.bootlin.com/linux/v4.14/source/drivers/thermal/cpu_cooling.c#L336
> [2]
> https://elixir.bootlin.com/linux/v4.14/source/drivers/thermal/cpu_cooling.c#L383
Rafael J. Wysocki Oct. 11, 2023, 4:07 p.m. UTC | #7
On Wed, Oct 11, 2023 at 6:03 PM Wei Wang <wvw@google.com> wrote:
>
> On Fri, Oct 6, 2023 at 1:45 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
> >
> > Hi Rafael,
> >
> > A change of direction here, regarding your comment below.
> >
> > On 10/2/23 14:44, Lukasz Luba wrote:
> > >
> > >
> > > On 9/29/23 13:59, Rafael J. Wysocki wrote:
> > >> On Fri, Sep 29, 2023 at 11:36 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
> > >
> > > [snip]
> > >
> >
> > [snip]
> >
> > >>>> Apparently, some frameworks are only going to use the default table
> > >>>> while the runtime-updatable table will be used somewhere else at the
> > >>>> same time.
> > >>>>
> > >>>> I'm not really sure if this is a good idea.
> > >>>
> > >>> Runtime table is only for driving the task placement in the EAS.
> > >>>
> > >>> The thermal gov IPA won't make better decisions because it already
> > >>> has the mechanism to accumulate the error that it made.
> > >>>
> > >>> The same applies to DTPM, which works in a more 'configurable' way,
> > >>> rather that hard optimization mechanism (like EAS).
> > >>
> > >> My understanding of the above is that the other EM users don't really
> > >> care that much so they can get away with using the default table all
> > >> the time, but EAS needs more accuracy, so the table used by it needs
> > >> to be adjusted in certain situations.
> > >
> > > Yes
> > >
> > >>
> > >> Fair enough, I'm assuming that you've done some research around it.
> > >> Still, this is rather confusing.
> > >
> > > Yes, I have presented those ~2y ago in Android Gerrit world
> > > (got feedback from a few vendors) and in a few Linux conferences.
> > >
> > > For now we don't plan to have this feature for the thermal
> > > governor or something similar.
> > >
> >
> > I have discussed with one of our partners your comment about 2 tables.
> > They would like to have this runtime modified EM in other places
> > as well: DTPM and thermal governor. So you had good gut feeling.
> >
> > In the past in our IPA (thermal gov ~2016 and kernel v4.14) we
> > had two callbacks:
> > - get_static_power() [1]
> > - get_dynamic_power() [2]
> >
> > Later ~2017/2018 v4.16 the static power mechanism was removed
> > completely by this commit 84fe2cab48590e4373978e4e.
> > The way how it was design, implemented and used justified that
> > decision. We later used EM in the cpu cooling which also only
> > had dynamic power information.
> >
> > The PID mechanism in IPA tries to compensate that
> > missing information (about changed static power in time or a chip
> > binning) and adjusts the 'error'. How good and fast that is in all
> > situations - it's a different story (out of this scope).
> > So, IPA should not be worse with the runtime table.
> >
> > The static power was on the chips and probably will be still.
> > You might remember my slide 13 from OSPM2024 showing two power
> > usage plots for the same Big CPU and 1.4GHz fixed (50% of fmax):
> > - w/ GPU working in the background using 1-1.5W
> > - w/o GPU in the background
> >
> > The same workload run on Big, but power bigger is ~15% higher
> > after ~1min.
> >
> > The static power (leakage) is the issue that this patch tries
> > to address for EAS. Although, there is not only the leakage.
> > It's about the whole 'profile', which can be different than what
> > could be built during boot default information.
> >
> > So we would want to go for one single table in EM, which
> > is runtime modifiable.
> >
> > That is something that you might be more confident and we would
> > have less diversity (2 tables) in the kernel.
> >
> > Regards,
> > Lukasz
> >
> >
>
> Indeed, we had a conversation about this with Lukasz recently. The key
> idea is that there is no compelling reason to introduce diversity in
> the mathematics involved. If we have confidence in the superior
> accuracy of our model, it should be universally implemented. While the
> governors are designed with some error tolerance, they can benefit
> from enhanced accuracy in their operation.

I agree, thanks!

> > [1]
> > https://elixir.bootlin.com/linux/v4.14/source/drivers/thermal/cpu_cooling.c#L336
> > [2]
> > https://elixir.bootlin.com/linux/v4.14/source/drivers/thermal/cpu_cooling.c#L383
Lukasz Luba Oct. 12, 2023, 1:16 p.m. UTC | #8
On 10/11/23 17:07, Rafael J. Wysocki wrote:
> On Wed, Oct 11, 2023 at 6:03 PM Wei Wang <wvw@google.com> wrote:
>>
>> On Fri, Oct 6, 2023 at 1:45 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
>>>
>>> Hi Rafael,
>>>
>>> A change of direction here, regarding your comment below.
>>>
>>> On 10/2/23 14:44, Lukasz Luba wrote:
>>>>
>>>>
>>>> On 9/29/23 13:59, Rafael J. Wysocki wrote:
>>>>> On Fri, Sep 29, 2023 at 11:36 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
>>>>
>>>> [snip]
>>>>
>>>
>>> [snip]
>>>
>>>>>>> Apparently, some frameworks are only going to use the default table
>>>>>>> while the runtime-updatable table will be used somewhere else at the
>>>>>>> same time.
>>>>>>>
>>>>>>> I'm not really sure if this is a good idea.
>>>>>>
>>>>>> Runtime table is only for driving the task placement in the EAS.
>>>>>>
>>>>>> The thermal gov IPA won't make better decisions because it already
>>>>>> has the mechanism to accumulate the error that it made.
>>>>>>
>>>>>> The same applies to DTPM, which works in a more 'configurable' way,
>>>>>> rather that hard optimization mechanism (like EAS).
>>>>>
>>>>> My understanding of the above is that the other EM users don't really
>>>>> care that much so they can get away with using the default table all
>>>>> the time, but EAS needs more accuracy, so the table used by it needs
>>>>> to be adjusted in certain situations.
>>>>
>>>> Yes
>>>>
>>>>>
>>>>> Fair enough, I'm assuming that you've done some research around it.
>>>>> Still, this is rather confusing.
>>>>
>>>> Yes, I have presented those ~2y ago in Android Gerrit world
>>>> (got feedback from a few vendors) and in a few Linux conferences.
>>>>
>>>> For now we don't plan to have this feature for the thermal
>>>> governor or something similar.
>>>>
>>>
>>> I have discussed with one of our partners your comment about 2 tables.
>>> They would like to have this runtime modified EM in other places
>>> as well: DTPM and thermal governor. So you had good gut feeling.
>>>
>>> In the past in our IPA (thermal gov ~2016 and kernel v4.14) we
>>> had two callbacks:
>>> - get_static_power() [1]
>>> - get_dynamic_power() [2]
>>>
>>> Later ~2017/2018 v4.16 the static power mechanism was removed
>>> completely by this commit 84fe2cab48590e4373978e4e.
>>> The way how it was design, implemented and used justified that
>>> decision. We later used EM in the cpu cooling which also only
>>> had dynamic power information.
>>>
>>> The PID mechanism in IPA tries to compensate that
>>> missing information (about changed static power in time or a chip
>>> binning) and adjusts the 'error'. How good and fast that is in all
>>> situations - it's a different story (out of this scope).
>>> So, IPA should not be worse with the runtime table.
>>>
>>> The static power was on the chips and probably will be still.
>>> You might remember my slide 13 from OSPM2024 showing two power
>>> usage plots for the same Big CPU and 1.4GHz fixed (50% of fmax):
>>> - w/ GPU working in the background using 1-1.5W
>>> - w/o GPU in the background
>>>
>>> The same workload run on Big, but power bigger is ~15% higher
>>> after ~1min.
>>>
>>> The static power (leakage) is the issue that this patch tries
>>> to address for EAS. Although, there is not only the leakage.
>>> It's about the whole 'profile', which can be different than what
>>> could be built during boot default information.
>>>
>>> So we would want to go for one single table in EM, which
>>> is runtime modifiable.
>>>
>>> That is something that you might be more confident and we would
>>> have less diversity (2 tables) in the kernel.
>>>
>>> Regards,
>>> Lukasz
>>>
>>>
>>
>> Indeed, we had a conversation about this with Lukasz recently. The key
>> idea is that there is no compelling reason to introduce diversity in
>> the mathematics involved. If we have confidence in the superior
>> accuracy of our model, it should be universally implemented. While the
>> governors are designed with some error tolerance, they can benefit
>> from enhanced accuracy in their operation.
> 
> I agree, thanks!
> 

Thank you Wei and Rafael. I'm working on that implementation and
it will be in v5 soon.

Patch

diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
index 5b40db38b745..2345837bfd2c 100644
--- a/kernel/power/energy_model.c
+++ b/kernel/power/energy_model.c
@@ -23,6 +23,9 @@ 
  */
 static DEFINE_MUTEX(em_pd_mutex);
 
+static void em_cpufreq_update_efficiencies(struct device *dev,
+					   struct em_perf_state *table);
+
 static bool _is_cpu_device(struct device *dev)
 {
 	return (dev->bus == &cpu_subsys);
@@ -104,6 +107,32 @@  static void em_debug_create_pd(struct device *dev) {}
 static void em_debug_remove_pd(struct device *dev) {}
 #endif
 
+static void em_destroy_rt_table_rcu(struct rcu_head *rp)
+{
+	struct em_perf_table *runtime_table;
+
+	runtime_table = container_of(rp, struct em_perf_table, rcu);
+	kfree(runtime_table->state);
+	kfree(runtime_table);
+}
+
+static void em_perf_runtime_table_set(struct device *dev,
+				      struct em_perf_table *runtime_table)
+{
+	struct em_perf_domain *pd = dev->em_pd;
+	struct em_perf_table *tmp;
+
+	tmp = pd->runtime_table;
+
+	rcu_assign_pointer(pd->runtime_table, runtime_table);
+
+	em_cpufreq_update_efficiencies(dev, runtime_table->state);
+
+	/* Don't free default table since it's used by other frameworks. */
+	if (tmp != pd->default_table)
+		call_rcu(&tmp->rcu, em_destroy_rt_table_rcu);
+}
+
 static int em_compute_costs(struct device *dev, struct em_perf_state *table,
 			    struct em_data_callback *cb, int nr_states,
 			    unsigned long flags)