[v5,00/23] Introduce runtime modifiable Energy Model

Message ID 20231129110853.94344-1-lukasz.luba@arm.com

Message

Lukasz Luba Nov. 29, 2023, 11:08 a.m. UTC
Hi all,

This patch set adds a new feature which allows the Energy Model (EM)
power values to be modified at runtime. It makes it possible to better
reflect the power model of recent SoCs and silicon. Different characteristics
of the power usage can be leveraged and thus better decisions made during
task placement in EAS.

It's part of a feature set known as Dynamic Energy Model. It was presented
and discussed recently at OSPM2023 [3]. This patch set implements the 1st
improvement for the EM.

The concepts:
1. The CPU power usage can vary due to the workload that it's running or due
to the temperature of the SoC. The same workload can use more power when the
temperature of the silicon has increased (e.g. due to a hot GPU or ISP).
In such a situation the EM can be adjusted to reflect the increased
power usage. That power increase is due to static power
(sometimes called simply: leakage). The CPUs in recent SoCs are different.
We have heterogeneous SoCs with 3 (or even 4) different microarchitectures.
They are also built differently, with High Performance (HP) cells or
Low Power (LP) cells. They are affected by a temperature increase
differently: HP cells have bigger leakage. The SW model can leverage that
knowledge (a rough illustration follows after point 3).

2. It is also possible to change the EM to better reflect the currently
running workload. Usually the EM is derived from average power values
taken from experiments with a benchmark (e.g. Dhrystone). The model derived
from such a scenario might not properly represent the workloads usually
running on the device. Therefore, runtime modification of the EM allows
switching to a different model when there is a need.

3. The EM can be adjusted after boot, when all the modules are loaded and
more information about the SoC is available, e.g. chip binning. This helps
to better reflect the silicon characteristics. The new EM modification
API makes that possible; it wasn't possible in the past and the EM had to be
'set in stone'.
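
As a rough illustration of point 1 (not taken from the series), the power of
a CPU at a given OPP can be thought of as:

  P_total(f, V, T) ~= C_eff * V^2 * f  +  P_static(V, T)

where the dynamic term depends on the effective capacitance C_eff, the
voltage V and the frequency f, while the static (leakage) term grows with the
temperature T. A runtime-modifiable EM lets the stored power values be
refreshed when that static term changes.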

More detailed explanation and background can be found in presentations
during LPC2022 [1][2] or in the documentation patches.

Some test results:
The EM can be updated to better fit the workload type. In the case below the
EM has been updated for the Jankbench test on a Pixel6 (running v5.18 w/
mainline backports for the scheduler bits). Jankbench was run 10 times for
each of the two configurations to get more reliable data.

1. Janky frames percentage
+--------+-----------------+---------------------+-------+-----------+
| metric |    variable     |       kernel        | value | perc_diff |
+--------+-----------------+---------------------+-------+-----------+
| gmean  | jank_percentage | EM_default          |  2.0  |   0.0%    |
| gmean  | jank_percentage | EM_modified_runtime |  1.3  |  -35.33%  |
+--------+-----------------+---------------------+-------+-----------+

2. Avg frame render time duration
+--------+---------------------+---------------------+-------+-----------+
| metric |      variable       |       kernel        | value | perc_diff |
+--------+---------------------+---------------------+-------+-----------+
| gmean  | mean_frame_duration | EM_default          | 10.5  |   0.0%    |
| gmean  | mean_frame_duration | EM_modified_runtime |  9.6  |  -8.52%   |
+--------+---------------------+---------------------+-------+-----------+

3. Max frame render time duration
+--------+--------------------+---------------------+-------+-----------+
| metric |      variable      |       kernel        | value | perc_diff |
+--------+--------------------+---------------------+-------+-----------+
| gmean  | max_frame_duration | EM_default          | 251.6 |   0.0%    |
| gmean  | max_frame_duration | EM_modified_runtime | 115.5 |  -54.09%  |
+--------+--------------------+---------------------+-------+-----------+

4. OS overutilized state percentage (when EAS is not working)
+--------------+---------------------+------+------------+------------+
|    metric    |       wa_path       | time | total_time | percentage |
+--------------+---------------------+------+------------+------------+
| overutilized | EM_default          | 1.65 |   253.38   |    0.65    |
| overutilized | EM_modified_runtime | 1.4  |   277.5    |    0.51    |
+--------------+---------------------+------+------------+------------+

5. All CPUs (Little+Mid+Big) power values in mW
+------------+--------+---------------------+-------+-----------+
|  channel   | metric |       kernel        | value | perc_diff |
+------------+--------+---------------------+-------+-----------+
|    CPU     | gmean  | EM_default          | 142.1 |   0.0%    |
|    CPU     | gmean  | EM_modified_runtime | 131.8 |  -7.27%   |
+------------+--------+---------------------+-------+-----------+

The time cost to update the EM decreased from v4 to v5:
big: 5us -> 2us (2.6x faster)
mid: 9us -> 3us (3x faster)
little: 16us -> 16us (no change)

We still have to update the inefficiency in the cpufreq framework, thus
a bit of overhead will be there.

Changelog:
v5:
- removed 2 tables design
- have only one table (runtime_table) used also in thermal (Wei, Rafael)
- refactored update function and removed callback call for each opp
- added faster EM table swap, using only the RCU pointer update
- added memory allocation API and tracking with kref
- avoided the overhead of computing 'cost' for each OPP during the update; it
  can be pre-computed earlier in the device driver's EM
- added support for device drivers providing an EM table (a usage sketch
  follows the changelog)
- added API for computing 'cost' values in EM for EAS
- added API for thermal/powercap to use EM (using RCU wrappers) 
- switched to single allocation and 'state[]' array (Rafael)
- changed documentation to align with current design
- added helper API for computing cost values
- simplified EM free in unregister path (thanks to kref)
- split the patch updating EM clients and changed them separately
- added a separate patch removing the old static EM table
- added EM debugfs change patch to dump the runtime_table
- addressed comments in v4 for spelling/comments/headers
- added review tags
v4 changes are here [4]
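
For context, here is a rough sketch of how a client driver might use the new
API listed above. The function names are taken from the patch titles, so the
exact signatures and details may differ in the patches, and foo_get_power()
is a made-up driver-specific helper:

#include <linux/energy_model.h>

/*
 * Hypothetical helper which refreshes the EM power values at runtime.
 * This is only a sketch of the intended flow, not code from the series.
 */
static int foo_update_em(struct device *dev, struct em_perf_domain *pd)
{
        struct em_perf_table __rcu *new_table;
        struct em_perf_state *table;
        int i, ret;

        /* Allocate a new table (kref-tracked, as per the changelog above) */
        new_table = em_table_alloc(pd);
        if (!new_table)
                return -ENOMEM;

        /* Copy the current states under the RCU read lock, then adjust power */
        rcu_read_lock();
        table = em_perf_state_from_pd(pd);
        for (i = 0; i < pd->nr_perf_states; i++) {
                new_table->state[i] = table[i];
                new_table->state[i].power = foo_get_power(dev, i);
        }
        rcu_read_unlock();

        /* Pre-compute the 'cost' values used by EAS for the new table */
        ret = em_dev_compute_costs(dev, new_table->state, pd->nr_perf_states);
        if (ret)
                goto out_free;

        /* Publish the new table; clients switch via an RCU pointer update */
        ret = em_dev_update_perf_domain(dev, new_table);

out_free:
        /* Drop the local reference; the EM framework holds its own */
        em_table_free(new_table);
        return ret;
}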

Regards,
Lukasz Luba

[1] https://lpc.events/event/16/contributions/1341/attachments/955/1873/Dynamic_Energy_Model_to_handle_leakage_power.pdf
[2] https://lpc.events/event/16/contributions/1194/attachments/1114/2139/LPC2022_Energy_model_accuracy.pdf
[3] https://www.youtube.com/watch?v=2C-5uikSbtM&list=PL0fKordpLTjKsBOUcZqnzlHShri4YBL1H
[4] https://lore.kernel.org/lkml/20230925081139.1305766-1-lukasz.luba@arm.com/


Lukasz Luba (23):
  PM: EM: Add missing newline for the message log
  PM: EM: Refactor em_cpufreq_update_efficiencies() arguments
  PM: EM: Find first CPU active while updating OPP efficiency
  PM: EM: Refactor em_pd_get_efficient_state() to be more flexible
  PM: EM: Refactor a new function em_compute_costs()
  PM: EM: Check if the get_cost() callback is present in
    em_compute_costs()
  PM: EM: Refactor how the EM table is allocated and populated
  PM: EM: Introduce runtime modifiable table
  PM: EM: Use runtime modified EM for CPUs energy estimation in EAS
  PM: EM: Add API for memory allocations for new tables
  PM: EM: Add API for updating the runtime modifiable EM
  PM: EM: Add helpers to read under RCU lock the EM table
  PM: EM: Add performance field to struct em_perf_state
  PM: EM: Support late CPUs booting and capacity adjustment
  PM: EM: Optimize em_cpu_energy() and remove division
  powercap/dtpm_cpu: Use new Energy Model interface to get table
  powercap/dtpm_devfreq: Use new Energy Model interface to get table
  drivers/thermal/cpufreq_cooling: Use new Energy Model interface
  drivers/thermal/devfreq_cooling: Use new Energy Model interface
  PM: EM: Change debugfs configuration to use runtime EM table data
  PM: EM: Remove old table
  PM: EM: Add em_dev_compute_costs() as API for device drivers
  Documentation: EM: Update with runtime modification design

 Documentation/power/energy-model.rst | 206 +++++++++++-
 drivers/powercap/dtpm_cpu.c          |  35 +-
 drivers/powercap/dtpm_devfreq.c      |  31 +-
 drivers/thermal/cpufreq_cooling.c    |  40 ++-
 drivers/thermal/devfreq_cooling.c    |  43 ++-
 include/linux/energy_model.h         | 163 +++++----
 kernel/power/energy_model.c          | 479 +++++++++++++++++++++++----
 7 files changed, 813 insertions(+), 184 deletions(-)

Comments

Dietmar Eggemann Dec. 12, 2023, 6:48 p.m. UTC | #1
On 29/11/2023 12:08, Lukasz Luba wrote:
> [...]
> 
> The time cost to update the EM decreased in this v5 vs v4:
> big: 5us vs 2us -> 2.6x faster
> mid: 9us vs 3us -> 3x faster
> little: 16us vs 16us -> no change

I guess this is entirely due to the changes in
em_dev_update_perf_domain()? Moving from per-OPP em_update_callback to
switching the entire table (pd->runtime_table) inside
em_dev_update_perf_domain()?

> We still have to update the inefficiency in the cpufreq framework, thus
> a bit of overhead will be there.
> 
> Changelog:
> v5:
> - removed 2 tables design
> - have only one table (runtime_table) used also in thermal (Wei, Rafael)

Until v4 you had 2 EMs, the static and the modifiable (runtime) one. Now in
v5 this changed to only have one, the modifiable. IMHO it would be
better to change the existing table to be modifiable rather than starting
with two EMs and then removing the static one. I assume you would end up with
far fewer code changes and the patch-set would become easier to digest for
reviewers.

I would mention that 14/23 "PM: EM: Support late CPUs booting and
capacity adjustment" is a test case for the modifiable EM built into
the code changes. This relies on the table being modifiable.

[...]
Rafael J. Wysocki Dec. 12, 2023, 6:49 p.m. UTC | #2
Hi Lukasz,

On Wed, Nov 29, 2023 at 12:08 PM Lukasz Luba <lukasz.luba@arm.com> wrote:
>
> [...]

I like this one more than the previous one and thanks for taking my
feedback into account.

I would still like other people having a vested interest in the EM to
look at it and give feedback (or just tags), so I'm not inclined to
apply it just yet.  However, I don't have any specific comments on it.
Lukasz Luba Dec. 13, 2023, 9:23 a.m. UTC | #3
Hi Dietmar,

Thank you for the review. I will go one by one and respond to
your comments in the patches as well. First comments are below.

On 12/12/23 18:48, Dietmar Eggemann wrote:
> On 29/11/2023 12:08, Lukasz Luba wrote:
>> [...]
>>
>> The time cost to update the EM decreased in this v5 vs v4:
>> big: 5us vs 2us -> 2.6x faster
>> mid: 9us vs 3us -> 3x faster
>> little: 16us vs 16us -> no change
> 
> I guess this is entirely due to the changes in
> em_dev_update_perf_domain()? Moving from per-OPP em_update_callback to
> switching the entire table (pd->runtime_table) inside
> em_dev_update_perf_domain()?

Yes, correct. It's due to that design change.
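
Conceptually the update now boils down to publishing the new table pointer,
roughly like below (a simplified sketch, not the exact code from the patch;
em_table_free() stands in for whatever helper drops the kref on the old
table):

        mutex_lock(&em_pd_mutex);
        old_table = pd->runtime_table;
        rcu_assign_pointer(pd->runtime_table, new_table);
        em_cpufreq_update_efficiencies(dev, new_table->state);
        mutex_unlock(&em_pd_mutex);
        em_table_free(old_table); /* kref_put(); freeing deferred past RCU readers */

instead of calling the update callback once per OPP.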

> 
>> We still have to update the inefficiency in the cpufreq framework, thus
>> a bit of overhead will be there.
>>
>> Changelog:
>> v5:
>> - removed 2 tables design
>> - have only one table (runtime_table) used also in thermal (Wei, Rafael)
> 
> Until v4 you had 2 EM's, the static and the modifiable (runtime). Now in
> v5 this changed to only have one, the modifiable. IMHO it would be
> better to change the existing table to be modifiable rather than staring
> with two EM's and then removing the static one. I assume you end up with
> way less code changes and the patch-set will become easier to digest for
> reviewers.

The patches are structured in this way following Daniel's recommendation
I got when I was adding similarly big changes to the EM in 2020 (support
for all devices in the kernel). The approach is as follows:
0. Do some basic clean-up/refactoring if needed for the new feature, to
    re-use some code if possible in the future
1. Introduce the new feature next to the existing one
2. Add the API and all needed infrastructure (structures, fields) for
    drivers
3. Re-wire the existing drivers/frameworks to the new feature via the new
    API; ideally keep 1 patch per driver so the maintainer can easily
    grasp the changes and ACK it, because it will go via a different tree
    (Rafael's tree); in case of a code clash in the driver's code
    during the merge it will be a single driver, so easier to handle
4. When all drivers and frameworks are wired up with the new feature,
    remove the old feature (structures, fields, APIs, etc.)
5. Update the documentation to the latest state of the design

In this approach the patches are less convoluted. If I removed the old
feature and added the new one in a single patch (e.g. the main structure),
that patch would have to modify all the drivers so they still compile. It
would be a big messy patch for this re-design.

I can see in a later comment from Rafael that he is OK with the current
patch set structure.

> 
> I would mention that 14/23 "PM: EM: Support late CPUs booting and
> capacity adjustment" is a testcase for the modifiable EM build-in into
> the code changes. This relies on the table being modifiable.
> 

Correct, that's the 1st user of the runtime modifiable EM, which is actually
also built-in. I could add that to the cover letter.

Regards,
Lukasz
Lukasz Luba Dec. 13, 2023, 9:32 a.m. UTC | #4
Hi Rafael,

Thank you for having a look at the series.

On 12/12/23 18:49, Rafael J. Wysocki wrote:
> Hi Lukasz,
> 
> On Wed, Nov 29, 2023 at 12:08 PM Lukasz Luba <lukasz.luba@arm.com> wrote:
>>
>> [...]
> 
> I like this one more than the previous one and thanks for taking my
> feedback into account.
> 
> I would still like other people having a vested interest in the EM to
> look at it and give feedback (or just tags), so I'm not inclined to
> apply it just yet.  However, I don't have any specific comments on it.

Let me contact offline some of the partners who were keen to have this
in mainline (when I presented a first implementation in 2021 on the
Android kernel review system, Gerrit).

Regards,
Lukasz
Dietmar Eggemann Dec. 13, 2023, 11:34 a.m. UTC | #5
On 13/12/2023 10:23, Lukasz Luba wrote:
> Hi Dietmar,
> 
> Thank you for the review, I will go one-by-one to respond
> your comments in patches as well. First comments are below.
> 
> On 12/12/23 18:48, Dietmar Eggemann wrote:
>> On 29/11/2023 12:08, Lukasz Luba wrote:

[...]

>>> Changelog:
>>> v5:
>>> - removed 2 tables design
>>> - have only one table (runtime_table) used also in thermal (Wei, Rafael)
>>
>> Until v4 you had 2 EM's, the static and the modifiable (runtime). Now in
>> v5 this changed to only have one, the modifiable. IMHO it would be
>> better to change the existing table to be modifiable rather than staring
>> with two EM's and then removing the static one. I assume you end up with
>> way less code changes and the patch-set will become easier to digest for
>> reviewers.
> 
> The patches are structured in this way following Daniel's recommendation
> I got when I was adding similar big changes to EM in 2020 (support all
> devices in kernel). The approach is as follows:
> 0. Do some basic clean-up/refactoring if needed for a new feature, to
>    re-use some code if possible in future
> 1. Introduce new feature next to the existing one
> 2. Add API and all needed infrastructure (structures, fields) for
>    drivers
> 3. Re-wire the existing drivers/frameworks to the new feature via new
>    API; ideally keep 1 patch per driver so the maintainer can easily
>    grasp the changes and ACK it, because it will go via different tree
>    (Rafael's tree); in case of some code clash in the driver's code
>    during merge - it will be a single driver so easier to handle
> 4. when all drivers and frameworks are wired up with the new feature
>    remove the old feature (structures, fields, APIs, etc)
> 5. Update the documentation with new latest state of desing
> 
> In this approach the patches are less convoluted. Because if I remove
> the old feature and add new in a single patch (e.g. the main structure)
> that patch will have to modify all drivers to still compile. It
> would be a big messy patch for this re-design.
> 
> I can see in some later comment from Rafael that he is OK with current
> patch set structure.

OK, in case Rafael and Daniel prefer this, then it's fine.

I just find it weird that we now have

70 struct em_perf_domain {
71         struct em_perf_table __rcu *runtime_table;
                                       ^^^^^^^^^^^^^

as the only EM table.
Rafael J. Wysocki Dec. 13, 2023, 11:45 a.m. UTC | #6
On Wed, Dec 13, 2023 at 12:34 PM Dietmar Eggemann
<dietmar.eggemann@arm.com> wrote:
>
> On 13/12/2023 10:23, Lukasz Luba wrote:
> > [...]
>
> OK, in case Rafael and Daniel prefer this, then it's fine.
>
> I just find it weird that we now have
>
> 70 struct em_perf_domain {
> 71         struct em_perf_table __rcu *runtime_table;
>                                        ^^^^^^^^^^^^^
>
> as the only EM table.

I agree that it would be better to call it something like em_table.
Lukasz Luba Dec. 13, 2023, 12:20 p.m. UTC | #7
On 12/13/23 11:45, Rafael J. Wysocki wrote:
> On Wed, Dec 13, 2023 at 12:34 PM Dietmar Eggemann
> <dietmar.eggemann@arm.com> wrote:
>>
>> On 13/12/2023 10:23, Lukasz Luba wrote:
>>> [...]
>>
>> OK, in case Rafael and Daniel prefer this, then it's fine.
>>
>> I just find it weird that we now have
>>
>> 70 struct em_perf_domain {
>> 71         struct em_perf_table __rcu *runtime_table;
>>                                         ^^^^^^^^^^^^^
>>
>> as the only EM table.
> 
> I agree that it would be better to call it something like em_table.
> 

OK, I'll change that. Thanks Rafael and Dietmar!
Lukasz Luba Dec. 13, 2023, 1:16 p.m. UTC | #8
Hi Abhijeet,

It's been a while since we discussed an EM feature presented on the
Android common kernel Gerrit (Nov 2021).

On 11/29/23 11:08, Lukasz Luba wrote:
> [...]

You were interested in this feature back then.

I have a gentle ask, if you are still interested. It would be nice if
you (or another Qcom engineer) could leave a feedback comment
(similar to the one you made for the original Gerrit series). I would be
really grateful.

In this cover letter there are some power saving numbers from
a real phone, along with performance metrics (janky frames). You might
be interested in those scenarios as well.

Regards,
Lukasz
Hongyan Xia Dec. 13, 2023, 1:40 p.m. UTC | #9
Hi Rafael,

On 12/12/2023 18:49, Rafael J. Wysocki wrote:
> Hi Lukasz,
> 
> On Wed, Nov 29, 2023 at 12:08 PM Lukasz Luba <lukasz.luba@arm.com> wrote:
>>
>> [...]
> 
> I like this one more than the previous one and thanks for taking my
> feedback into account.
> 
> I would still like other people having a vested interest in the EM to
> look at it and give feedback (or just tags), so I'm not inclined to
> apply it just yet.  However, I don't have any specific comments on it.

I do have a keen interest in this series, but mostly from the point of
view of uclamp. Currently uclamp is able to send hints to the scheduler to
bias task placement. Some CPU cores are known to have very different
energy efficiency depending on the task. We know these tasks beforehand
and can use uclamp to bias them to certain CPUs which we know are more
efficient for them.

Personally, I've always been wondering if this could just be reflected in
the EM itself, without emphasizing the task placement aspect of
uclamp. The idea of this series LGTM and I'll take a deeper look.

Hongyan
Qais Yousef Dec. 17, 2023, 6:22 p.m. UTC | #10
Hi Lukasz

On 11/29/23 11:08, Lukasz Luba wrote:
> Hi all,
> 
> This patch set adds a new feature which allows to modify Energy Model (EM)
> power values at runtime. It will allow to better reflect power model of
> a recent SoCs and silicon. Different characteristics of the power usage
> can be leveraged and thus better decisions made during task placement in EAS.
> 
> It's part of feature set know as Dynamic Energy Model. It has been presented
> and discussed recently at OSPM2023 [3]. This patch set implements the 1st
> improvement for the EM.

Thanks. The problem of EM accuracy has been observed in the field and it
would be nice to have a mainline solution for it. We carry our own
out-of-tree change to enable modifying the EM.

> 
> The concepts:
> 1. The CPU power usage can vary due to the workload that it's running or due
> to the temperature of the SoC. The same workload can use more power when the
> temperature of the silicon has increased (e.g. due to hot GPU or ISP).
> In such situation the EM can be adjusted and reflect the fact of increased
> power usage. That power increase is due to static power
> (sometimes called simply: leakage). The CPUs in recent SoCs are different.
> We have heterogeneous SoCs with 3 (or even 4) different microarchitectures.
> They are also built differently with High Performance (HP) cells or
> Low Power (LP) cells. They are affected by the temperature increase
> differently: HP cells have bigger leakage. The SW model can leverage that
> knowledge.

One thing I'm not sure about is that in practice the temperature of the SoC can
vary a lot in a short period of time. What is the expectation here? I can see
this being useful in practice only if we average it over a window of time.
Following it closely will be really hard; big variations can happen on a scale
of a few ms.

The driver interface for this part makes sense, as the thermal framework will
likely know how to feed things back into the EM table, if necessary.

> 
> 2. It is also possible to change the EM to better reflect the currently
> running workload. Usually the EM is derived from some average power values
> taken from experiments with benchmark (e.g. Dhrystone). The model derived
> from such scenario might not represent properly the workloads usually running
> on the device. Therefore, runtime modification of the EM allows to switch to
> a different model, when there is a need.

I didn't get how the new performance field is supposed to be controlled and
modified by users. A driver interface doesn't seem suitable, as there's no
subsystem that knows the characteristics of the workload except userspace. In
Android we do have contextual info about what the current top-app is, which
enables modifying the capacities to match its characteristics.

> 
> 3. The EM can be adjusted after boot, when all the modules are loaded and
> more information about the SoC is available e.g. chip binning. This would help
> to better reflect the silicon characteristics. Thus, this EM modification
> API allows it now. It wasn't possible in the past and the EM had to be
> 'set in stone'.
> 
> More detailed explanation and background can be found in presentations
> during LPC2022 [1][2] or in the documentation patches.
> 
> Some test results.
> The EM can be updated to fit better the workload type. In the case below the EM
> has been updated for the Jankbench test on Pixel6 (running v5.18 w/ mainline backports
> for the scheduler bits). The Jankbench was run 10 times for those two configurations,
> to get more reliable data.
> 
> 1. Janky frames percentage
> +--------+-----------------+---------------------+-------+-----------+
> | metric |    variable     |       kernel        | value | perc_diff |
> +--------+-----------------+---------------------+-------+-----------+
> | gmean  | jank_percentage | EM_default          |  2.0  |   0.0%    |
> | gmean  | jank_percentage | EM_modified_runtime |  1.3  |  -35.33%  |
> +--------+-----------------+---------------------+-------+-----------+
> 
> 2. Avg frame render time duration
> +--------+---------------------+---------------------+-------+-----------+
> | metric |      variable       |       kernel        | value | perc_diff |
> +--------+---------------------+---------------------+-------+-----------+
> | gmean  | mean_frame_duration | EM_default          | 10.5  |   0.0%    |
> | gmean  | mean_frame_duration | EM_modified_runtime |  9.6  |  -8.52%   |
> +--------+---------------------+---------------------+-------+-----------+
> 
> 3. Max frame render time duration
> +--------+--------------------+---------------------+-------+-----------+
> | metric |      variable      |       kernel        | value | perc_diff |
> +--------+--------------------+---------------------+-------+-----------+
> | gmean  | max_frame_duration | EM_default          | 251.6 |   0.0%    |
> | gmean  | max_frame_duration | EM_modified_runtime | 115.5 |  -54.09%  |
> +--------+--------------------+---------------------+-------+-----------+
> 
> 4. OS overutilized state percentage (when EAS is not working)
> +--------------+---------------------+------+------------+------------+
> |    metric    |       wa_path       | time | total_time | percentage |
> +--------------+---------------------+------+------------+------------+
> | overutilized | EM_default          | 1.65 |   253.38   |    0.65    |
> | overutilized | EM_modified_runtime | 1.4  |   277.5    |    0.51    |
> +--------------+---------------------+------+------------+------------+
> 
> 5. All CPUs (Little+Mid+Big) power values in mW
> +------------+--------+---------------------+-------+-----------+
> |  channel   | metric |       kernel        | value | perc_diff |
> +------------+--------+---------------------+-------+-----------+
> |    CPU     | gmean  | EM_default          | 142.1 |   0.0%    |
> |    CPU     | gmean  | EM_modified_runtime | 131.8 |  -7.27%   |
> +------------+--------+---------------------+-------+-----------+

How did you modify the EM here? Did you change both performance and power
fields? How did you calculate the new ones?

Did you try to simulate any heating effect during the run if you're taking
temperature into account to modify the power? What was the variation like and
at what rate was the EM being updated in this case? I think Jankbench in
general wouldn't stress the SoC enough.

It'd be insightful to look at frequency residencies between the two runs and
power breakdown for each cluster if you have access to them. No worries if not!

My brain started to fail me somewhere around patch 15. I'll have another look
some time later in the week, but generally it looks good to me. If I have any
worries, they are about how it can be used with the provided interfaces,
especially the expectations about managing fast thermal changes at the level
you're targeting.


Thanks!

--
Qais Yousef
Lukasz Luba Dec. 19, 2023, 10:22 a.m. UTC | #11
Hi Qais,

On 12/17/23 18:22, Qais Yousef wrote:
> Hi Lukasz
> 
> On 11/29/23 11:08, Lukasz Luba wrote:
>> Hi all,
>>
>> This patch set adds a new feature which allows to modify Energy Model (EM)
>> power values at runtime. It will allow to better reflect power model of
>> a recent SoCs and silicon. Different characteristics of the power usage
>> can be leveraged and thus better decisions made during task placement in EAS.
>>
>> It's part of feature set know as Dynamic Energy Model. It has been presented
>> and discussed recently at OSPM2023 [3]. This patch set implements the 1st
>> improvement for the EM.
> 
> Thanks. The problem of EM accuracy has been observed in the field and would be
> nice to have a mainline solution for it. We carry our own out-of-tree change to
> enable modifying the EM.

Thanks for that statement here.

> 
>>
>> The concepts:
>> 1. The CPU power usage can vary due to the workload that it's running or due
>> to the temperature of the SoC. The same workload can use more power when the
>> temperature of the silicon has increased (e.g. due to hot GPU or ISP).
>> In such situation the EM can be adjusted and reflect the fact of increased
>> power usage. That power increase is due to static power
>> (sometimes called simply: leakage). The CPUs in recent SoCs are different.
>> We have heterogeneous SoCs with 3 (or even 4) different microarchitectures.
>> They are also built differently with High Performance (HP) cells or
>> Low Power (LP) cells. They are affected by the temperature increase
>> differently: HP cells have bigger leakage. The SW model can leverage that
>> knowledge.
> 
> One thing I'm not sure about is that in practice temperature of the SoC can
> vary a lot in a short period of time. What is the expectation here? I can see
> this useful in practice only if we average it over a window of time. Following
> it will be really hard. Big variations can happen in few ms scales.

It's mostly for long-running heavy workloads which involve devices other
than the CPUs, e.g. the GPU or ISP (Image Signal Processor). Those devices
can heat up the SoC. In our game DrArm running on Pixel6 the GPU uses
75-77% of the total power budget (starting from ~2.5W for the GPU + 1.3W
for all CPUs). That 2.5W from the GPU heats up the CPUs and mostly impacts
the Big cores, which are made from High-Performance cells (thus leaking
more). Overutilization in the first 4-5min of gaming is ~4-9%, so EAS can
work and save some power, if it has a good model. Later we have thermal
throttling and OU goes to ~50%, but EAS can still work. If the model is
more precise - thus adjusted for the rising leakage due to the temperature
increase (generated by the GPU power) - then we can still make better use
of that power budget and not waste it on leakage at higher OPPs.

> 
> Driver interface for this part makes sense; as thermal framework will likely to
> know how feed things back to EM table, if necessary.

The thermal framework, or I would rather say a smarter dedicated thermal
driver which has a built-in power model and access to the sensor data. In
this way it can provide an adjusted power model to the EM dynamically.
It will also calculate the efficiency (the 'cost' field).
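Roughly, such a driver could follow the flow sketched below. This is only
an illustration, not a tested driver: leakage_mw() is a made-up placeholder
for the driver's own power model, and the helpers used here (em_pd_get(),
em_table_alloc(), em_perf_state_from_pd(), em_dev_compute_costs(),
em_dev_update_perf_domain(), em_table_free()) are assumed to keep the shape
they have in this series, so please check the patches for the exact
signatures:

#include <linux/energy_model.h>
#include <linux/rcupdate.h>

static int foo_thermal_update_em(struct device *dev, int temp_mc)
{
	struct em_perf_domain *pd = em_pd_get(dev);
	struct em_perf_table __rcu *new_table;
	struct em_perf_state *new_ps, *old_ps;
	int i, ret;

	if (!pd)
		return -EINVAL;

	new_table = em_table_alloc(pd);
	if (!new_table)
		return -ENOMEM;

	new_ps = new_table->state;

	/* Copy the current table and adjust the power values. */
	rcu_read_lock();
	old_ps = em_perf_state_from_pd(pd);
	for (i = 0; i < pd->nr_perf_states; i++) {
		new_ps[i] = old_ps[i];
		/* leakage_mw() stands in for the driver's power model */
		new_ps[i].power += leakage_mw(temp_mc, new_ps[i].frequency);
	}
	rcu_read_unlock();

	/* Recompute the efficiency ('cost') for the new power values. */
	ret = em_dev_compute_costs(dev, new_ps, pd->nr_perf_states);
	if (ret) {
		em_table_free(new_table);
		return ret;
	}

	/* Publish the new table; EM users pick it up via RCU. */
	ret = em_dev_update_perf_domain(dev, new_table);

	/* Drop the local reference, the EM framework holds its own. */
	em_table_free(new_table);

	return ret;
}

The same flow would also cover a one-off adjustment after boot (e.g.
applying chip-binning scaling factors), just without the temperature
input.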

> 
>>
>> 2. It is also possible to change the EM to better reflect the currently
>> running workload. Usually the EM is derived from some average power values
>> taken from experiments with benchmark (e.g. Dhrystone). The model derived
>> from such scenario might not represent properly the workloads usually running
>> on the device. Therefore, runtime modification of the EM allows to switch to
>> a different model, when there is a need.
> 
> I didn't get how the new performance field is supposed to be controlled and
> modified by users. A driver interface doesn't seem suitable as there's no
> subsystem that knows the characteristic of the workload except userspace. In
> Android we do have contextual info about what the current top-app to enable
> modifying the capacities to match its characteristics.

Well, the latest public documentation (May 2023) for Cortex-X4 describes
new features of Arm cores: PDP and MPMM, which can change the
'performance' of the core in FW. Our SCMI kernel subsystem will get an
interrupt, so the drivers can know about it. It could be used for
recalculating the efficiency of the CPUs in the EM. When there is no
hotplug and the long-running app is still running, that FW policy would
be reflected in the EM. It's just not done all in one step. Those patches
will come later.

Second, I have used that 'performance' field to finally get rid of
the runtime division in the em_cpu_energy() hot path - which has been
annoying me for a very long time. It wasn't possible to optimize that
last operation there, because not all CPUs have booted and the final CPU
capacity is not known when we register EMs. With this feature I can
finally remove that heavy operation. You can see more in patch 15/23.
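To sketch the idea (illustrative pseudo-code only; fmax, freq[], power[],
scale_cpu, performance[], cost[] and the constant K are made-up names,
not the identifiers used in patch 15/23):

	/*
	 * Old scheme: 'cost' only folds in the frequency ratio, so the
	 * hot path still has to divide by the CPU capacity, which is
	 * not known when the EM is registered:
	 */
	cost[i] = fmax * power[i] / freq[i];          /* at registration */
	energy  = cost[i] * sum_util / scale_cpu;     /* hot path: division */

	/*
	 * New scheme: 'performance' is the capacity at each OPP and is
	 * (re)computed once the capacities are finally known, so the
	 * capacity can be folded into 'cost' up front:
	 */
	performance[i] = scale_cpu * freq[i] / fmax;  /* at table update */
	cost[i] = power[i] * K / performance[i];      /* at table update */
	energy  = cost[i] * sum_util;                 /* hot path: no division */

The result is the same up to the scaling constant K, so as long as the
same constant is used for every perf domain the energy comparison between
CPUs is not affected, but the per-call division is gone.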

> 
>>
>> 3. The EM can be adjusted after boot, when all the modules are loaded and
>> more information about the SoC is available e.g. chip binning. This would help
>> to better reflect the silicon characteristics. Thus, this EM modification
>> API allows it now. It wasn't possible in the past and the EM had to be
>> 'set in stone'.
>>
>> More detailed explanation and background can be found in presentations
>> during LPC2022 [1][2] or in the documentation patches.
>>
>> Some test results.
>> The EM can be updated to fit better the workload type. In the case below the EM
>> has been updated for the Jankbench test on Pixel6 (running v5.18 w/ mainline backports
>> for the scheduler bits). The Jankbench was run 10 times for those two configurations,
>> to get more reliable data.
>>
>> 1. Janky frames percentage
>> +--------+-----------------+---------------------+-------+-----------+
>> | metric |    variable     |       kernel        | value | perc_diff |
>> +--------+-----------------+---------------------+-------+-----------+
>> | gmean  | jank_percentage | EM_default          |  2.0  |   0.0%    |
>> | gmean  | jank_percentage | EM_modified_runtime |  1.3  |  -35.33%  |
>> +--------+-----------------+---------------------+-------+-----------+
>>
>> 2. Avg frame render time duration
>> +--------+---------------------+---------------------+-------+-----------+
>> | metric |      variable       |       kernel        | value | perc_diff |
>> +--------+---------------------+---------------------+-------+-----------+
>> | gmean  | mean_frame_duration | EM_default          | 10.5  |   0.0%    |
>> | gmean  | mean_frame_duration | EM_modified_runtime |  9.6  |  -8.52%   |
>> +--------+---------------------+---------------------+-------+-----------+
>>
>> 3. Max frame render time duration
>> +--------+--------------------+---------------------+-------+-----------+
>> | metric |      variable      |       kernel        | value | perc_diff |
>> +--------+--------------------+---------------------+-------+-----------+
>> | gmean  | max_frame_duration | EM_default          | 251.6 |   0.0%    |
>> | gmean  | max_frame_duration | EM_modified_runtime | 115.5 |  -54.09%  |
>> +--------+--------------------+---------------------+-------+-----------+
>>
>> 4. OS overutilized state percentage (when EAS is not working)
>> +--------------+---------------------+------+------------+------------+
>> |    metric    |       wa_path       | time | total_time | percentage |
>> +--------------+---------------------+------+------------+------------+
>> | overutilized | EM_default          | 1.65 |   253.38   |    0.65    |
>> | overutilized | EM_modified_runtime | 1.4  |   277.5    |    0.51    |
>> +--------------+---------------------+------+------------+------------+
>>
>> 5. All CPUs (Little+Mid+Big) power values in mW
>> +------------+--------+---------------------+-------+-----------+
>> |  channel   | metric |       kernel        | value | perc_diff |
>> +------------+--------+---------------------+-------+-----------+
>> |    CPU     | gmean  | EM_default          | 142.1 |   0.0%    |
>> |    CPU     | gmean  | EM_modified_runtime | 131.8 |  -7.27%   |
>> +------------+--------+---------------------+-------+-----------+
> 
> How did you modify the EM here? Did you change both performance and power
> fields? How did you calculate the new ones?

It was just the power values that were modified on my Pixel6:
for the Littles 1.6x, Mids 0.8x, Bigs 1.3x of their boot power.
TBH I don't know the chip binning of that SoC, but I suspect it
could be due to this fact. You can find more about the possible error
range of chip binning power values in my comment to patch 22/23.

> 
> Did you try to simulate any heating effect during the run if you're taking
> temperature into account to modify the power? What was the variation like and

Yes, I did that experiment and presented it on OSPM 2023, slide 13. There
is a big change in the CPU power plot over time, due to GPU heat. All the
detailed data is there. The Big CPUs' power is ~18-20% higher when the
1-1.5W GPU is heating up the whole SoC.

> at what rate was the EM being updated in this case? I think Jankbench in

In this experiment the EM was only set once, w/ the values mentioned above.
It could be due to the chip lottery; I cannot be 100% sure for this phone.

> general wouldn't stress the SoC enough.

True, this test is not power heavy, as can be seen. It's more
to show that the default EM after boot might not be the optimal one.

> 
> It'd be insightful to look at frequency residencies between the two runs and
> power breakdown for each cluster if you have access to them. No worries if not!

I'm afraid you're asking for too much ;)

> 
> My brain started to fail me somewhere around patch 15. I'll have another look
> some time later in the week but generally looks good to me. If I have any
> worries it is about how it can be used with the provided interfaces. Especially
> expectations about managing fast thermal changes at the level you're targeting.

No worries, thanks for the review! The fast thermal changes which are
linked to the CPU's own workload are not an issue here and I'm not worried
about those. The side effect of the heat from other devices is the issue.
Thus, the thermal driver which modifies the EM should be aware of the
'whole SoC' situation (like mainline IPA is, when it manages all
devices in a single thermal zone).

Regards,
Lukasz
Qais Yousef Dec. 28, 2023, 6:41 p.m. UTC | #12
On 12/19/23 10:22, Lukasz Luba wrote:

> > One thing I'm not sure about is that in practice temperature of the SoC can
> > vary a lot in a short period of time. What is the expectation here? I can see
> > this useful in practice only if we average it over a window of time. Following
> > it will be really hard. Big variations can happen in few ms scales.
> 
> It's mostly for long running heavy workloads, which involve other device
> than CPUs, e.g. GPU or ISP (Image Signal Processor). Those devices can
> heat up the SoC. In our game DrArm running on pixel6 the GPU uses 75-77%
> of total power budget (starting from ~2.5W for GPU + 1.3W for all CPUs).
> That 2.5W from the GPU is heating up the CPUs and mostly impact the Big
> cores, which are made from High-Performance cells (thus leaking more).
> OverUtilization in the first 4-5min of gaming is ~4-9%, so EAS can work
> and save some power, if it has a good model. Later we have thermal
> throttling and OU goes to ~50% but EAS still can work. If the model is
> more precised - thus adjusted for the raising leakage due to temperature
> increase (generated due to GPU power), than we still can use better that
> power budget and not waist on the leakage at higher OPPs.

I can understand the need. But looking at one specific case vs the generalized
form is different.

So IIUC the expectation is to track temperature variations over minutes, caused
by sources external to the CPU.

> > I didn't get how the new performance field is supposed to be controlled and
> > modified by users. A driver interface doesn't seem suitable as there's no
> > subsystem that knows the characteristic of the workload except userspace. In
> > Android we do have contextual info about what the current top-app to enable
> > modifying the capacities to match its characteristics.
> 
> Well in latest public documentation (May2023) for Cortex-X4 there are
> described new features of Arm cores: PDP, MPMM, which can change the
> 'performance' of the core in FW. Our SCMI kernel subsystem will get an
> interrupt, so the drivers can know about it. It could be used for
> recalculating the efficiency of the CPUs in the EM. When there is no
> hotplug and the long running app is still running, that FW policy would
> be reflected in EM. It's just not done all-in-one-step. Those patches
> will be later.

I think these features are some form of thermal throttling, IIUC.

I was asking about handling the EM accuracy issue using the runtime model. I
was expecting some sysfs knobs. Do you see this also requiring a
vendor-specific driver to try to account for the EM inaccuracy issues we're
seeing?

> Second, I have used that 'performance' field to finally get rid of
> this runtime division in em_cpu_energy() hot path - which was annoying
> me for very long time. It wasn't possible to optimize that last
> operation there, because the not all CPUs boot and final CPU capacity
> is not known when we register EMs. With this feature finally I can
> remove that heavy operation. You can see more in that patch 15/23.

Yep, it's a good addition :)

> > > 5. All CPUs (Little+Mid+Big) power values in mW
> > > +------------+--------+---------------------+-------+-----------+
> > > |  channel   | metric |       kernel        | value | perc_diff |
> > > +------------+--------+---------------------+-------+-----------+
> > > |    CPU     | gmean  | EM_default          | 142.1 |   0.0%    |
> > > |    CPU     | gmean  | EM_modified_runtime | 131.8 |  -7.27%   |
> > > +------------+--------+---------------------+-------+-----------+
> > 
> > How did you modify the EM here? Did you change both performance and power
> > fields? How did you calculate the new ones?
> 
> It was just the power values modified on my pixel6:
> for Littles 1.6x, Mid 0.8x, Big 1.3x of their boot power.
> TBH I don't know the chip binning of that SoC, but I suspect it
> could be due to this fact. More about possible error range in chip
> binning power values you can find in my comment to the patch 22/23

Strange that just modifying the power had this impact. It could be related to
a similar impact I've seen when increasing the migration margin for the
littles. By making the cost higher there, it'd move residency to other cores
and potentially reduce running at higher frequencies on the littles.

> > Did you try to simulate any heating effect during the run if you're taking
> > temperature into account to modify the power? What was the variation like and
> 
> Yes, I did that experiment and presented on OSPM 2023 slide 13. There is
> big CPU power plot change in time, due to GPU heat. All detailed data is
> there. The big CPU power is ~18-20% higher when 1-1.5W GPU is heating up
> the whole SoC.

I meant during your experiment above.

> > at what rate was the EM being updated in this case? I think Jankbench in
> 
> In this experiment EM was only set once w/ the values mentioned above.
> It could be due to the chip lottery. I cannot say on 100% this phone.
> 
> > general wouldn't stress the SoC enough.
> 
> True, this test is not power heavy as it can be seen. It's more
> to show that the default EM after boot might not be the optimal one.

I wouldn't reach that conclusion for this particular case. But the underlying
issues exist for sure.

> > It'd be insightful to look at frequency residencies between the two runs and
> > power breakdown for each cluster if you have access to them. No worries if not!
> 
> I'm afraid you're asking for too much ;)

It should be easy to get them. It's hard to know where the benefit is coming
from otherwise. But as I said, no worries if not. If you have perfetto traces
I can help take a look.

> > My brain started to fail me somewhere around patch 15. I'll have another look
> > some time later in the week but generally looks good to me. If I have any
> > worries it is about how it can be used with the provided interfaces. Especially
> > expectations about managing fast thermal changes at the level you're targeting.
> 
> No worries, thanks for the review! The fast thermal changes, which are
> linked to the CPU's workload are not an issue here and I'm not worried
> about those. The side effect of the heat from other device is the issue.
> Thus, that thermal driver which modifies the EM should be aware of the
> 'whole SoC' situation (like mainline IPA does, when it manages all
> devices in a single thermal zone).

I think in practice there will be challenges in generalizing the thermal
impact. But overall, from an EM accuracy point of view (for all the various
reasons mentioned), we need this ability to help handle them in practice.
Booting with a single hardcoded EM doesn't work.


Cheers

--
Qais Yousef