[v4,00/18] Introduce runtime modifiable Energy Model

Message ID	20230925081139.1305766-1-lukasz.luba@arm.com
Headers	show Return-Path: <linux-pm-owner@vger.kernel.org> From: Lukasz Luba <lukasz.luba@arm.com> To: linux-kernel@vger.kernel.org, linux-pm@vger.kernel.org, rafael@kernel.org Cc: lukasz.luba@arm.com, dietmar.eggemann@arm.com, rui.zhang@intel.com, amit.kucheria@verdurent.com, amit.kachhap@gmail.com, daniel.lezcano@linaro.org, viresh.kumar@linaro.org, len.brown@intel.com, pavel@ucw.cz, mhiramat@kernel.org, qyousef@layalina.io, wvw@google.com Subject: [PATCH v4 00/18] Introduce runtime modifiable Energy Model Date: Mon, 25 Sep 2023 09:11:21 +0100 Message-Id: <20230925081139.1305766-1-lukasz.luba@arm.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	Introduce runtime modifiable Energy Model \| expand [v4,00/18] Introduce runtime modifiable Energy Model [v4,02/18] PM: EM: Refactor em_cpufreq_update_efficiencies() arguments [v4,03/18] PM: EM: Find first CPU online while updating OPP efficiency [v4,06/18] PM: EM: Check if the get_cost() callback is present in em_compute_costs() [v4,07/18] PM: EM: Refactor struct em_perf_domain and add default_table [v4,10/18] PM: EM: Add RCU mechanism which safely cleans the old data [v4,12/18] PM: EM: Use runtime modified EM for CPUs energy estimation in EAS [v4,14/18] PM: EM: Add performance field to struct em_perf_state [v4,16/18] PM: EM: Support late CPUs booting and capacity adjustment [v4,18/18] Documentation: EM: Update information about performance field

Lukasz Luba Sept. 25, 2023, 8:11 a.m. UTC

Hi all,

This patch set adds a new feature which allows to modify Energy Model (EM)
power values at runtime. It will allow to better reflect power model of
a recent SoCs and silicon. Different characteristics of the power usage
can be leveraged and thus better decisions made during task placement in EAS.

It's part of feature set know as Dynamic Energy Model. It has been presented
and discussed recently at OSPM2023 [3]. This patch set implements the 1st
improvement for the EM.

The concepts:
1. The CPU power usage can vary due to the workload that it's running or due
to the temperature of the SoC. The same workload can use more power when the
temperature of the silicon has increased (e.g. due to hot GPU or ISP).
In such situation the EM can be adjusted and reflect the fact of increased
power usage. That power increase is due to static power
(sometimes called simply: leakage). The CPUs in recent SoCs are different.
We have heterogeneous SoCs with 3 (or even 4) different microarchitectures.
They are also built differently with High Performance (HP) cells or
Low Power (LP) cells. They are affected by the temperature increase
differently: HP cells have bigger leakage. The SW model can leverage that
knowledge.

2. It is also possible to change the EM to better reflect the currently
running workload. Usually the EM is derived from some average power values
taken from experiments with benchmark (e.g. Dhrystone). The model derived
from such scenario might not represent properly the workloads usually running
on the device. Therefore, runtime modification of the EM allows to switch to
a different model, when there is a need.

3. The EM can be adjusted after boot, when all the modules are loaded and
more information about the SoC is available e.g. chip binning. This would help
to better reflect the silicon characteristics. Thus, this EM modification
API allows it now. It wasn't possible in the past and the EM had to be
'set in stone'.

Some design details:
The internal mechanisms for the memory allocation are handled internally in the 
EM. Kernel modules can just call the new API to update the EM data and the 
new memory would be provided and owned by the EM. The EM memory is used by
EAS, which impacts those design decisions. The EM writers are protected by
a mutex. This new runtime modified EM table is protected using RCU mechanism,
which fits the current EAS hot path (which already uses RCU read lock).
The unregister API handles only non-CPU (e.g. GPU, ISP) devices and uses the
same mutex as EM modifiers to make sure the memory is safely freed.

More detailed explanation and background can be found in presentations
during LPC2022 [1][2] or in the documentation patches.

The time cost to update EM for 11 OPPs is listed below. It's roughly
1.5us per 1 OPP while doing this on Little CPU at max frequency (1.8GHz).
More detailed results:

(The 4 CPUs from top are the little (1.8MHz), than 2 Mid (2.2GHz) and
then 2 big (2.8GHz) (while cpu6 didn't run that code)
------------------------------------
  Function                               Hit    Time            Avg         s^2
  --------                               ---    ----            ---         ---
  em_dev_update_perf_domain             3104    51236.39 us     16.506 us       75.344 us
  Function                               Hit    Time            Avg         s^2
  --------                               ---    ----            ---         ---
  em_dev_update_perf_domain             1264    20768.15 us     16.430 us       62.257 us
  Function                               Hit    Time            Avg         s^2
  --------                               ---    ----            ---         ---
  em_dev_update_perf_domain             1166    18632.95 us     15.980 us       70.707 us
  Function                               Hit    Time            Avg         s^2
  --------                               ---    ----            ---         ---
  em_dev_update_perf_domain              770    12334.43 us     16.018 us       66.337 us
  Function                               Hit    Time            Avg         s^2
  --------                               ---    ----            ---         ---
  em_dev_update_perf_domain              101    920.613 us      9.114 us        21.380 us
  Function                               Hit    Time            Avg         s^2
  --------                               ---    ----            ---         ---
  em_dev_update_perf_domain               20    211.830 us      10.591 us       23.998 us
  Function                               Hit    Time            Avg         s^2
  --------                               ---    ----            ---         ---
  Function                               Hit    Time            Avg         s^2
  --------                               ---    ----            ---         ---
  em_dev_update_perf_domain               15    78.085 us       5.205 us        7.444 us


Some test results.
The EM can be updated to fit better the workload type. In the case below the EM
has been updated for the Jankbench test on Pixel6 (running v5.18 w/ mainline backports
for the scheduler bits). The Jankbench was run 10 times for those two configurations,
to get more reliable data.

1. Janky frames percentage
+--------+-----------------+---------------------+-------+-----------+
| metric |    variable     |       kernel        | value | perc_diff |
+--------+-----------------+---------------------+-------+-----------+
| gmean  | jank_percentage | EM_default          |  2.0  |   0.0%    |
| gmean  | jank_percentage | EM_modified_runtime |  1.3  |  -35.33%  |
+--------+-----------------+---------------------+-------+-----------+

2. Avg frame render time duration
+--------+---------------------+---------------------+-------+-----------+
| metric |      variable       |       kernel        | value | perc_diff |
+--------+---------------------+---------------------+-------+-----------+
| gmean  | mean_frame_duration | EM_default          | 10.5  |   0.0%    |
| gmean  | mean_frame_duration | EM_modified_runtime |  9.6  |  -8.52%   |
+--------+---------------------+---------------------+-------+-----------+

3. Max frame render time duration
+--------+--------------------+---------------------+-------+-----------+
| metric |      variable      |       kernel        | value | perc_diff |
+--------+--------------------+---------------------+-------+-----------+
| gmean  | max_frame_duration | EM_default          | 251.6 |   0.0%    |
| gmean  | max_frame_duration | EM_modified_runtime | 115.5 |  -54.09%  |
+--------+--------------------+---------------------+-------+-----------+

4. OS overutilized state percentage (when EAS is not working)
+--------------+---------------------+------+------------+------------+
|    metric    |       wa_path       | time | total_time | percentage |
+--------------+---------------------+------+------------+------------+
| overutilized | EM_default          | 1.65 |   253.38   |    0.65    |
| overutilized | EM_modified_runtime | 1.4  |   277.5    |    0.51    |
+--------------+---------------------+------+------------+------------+

5. All CPUs (Little+Mid+Big) power values in mW
+------------+--------+---------------------+-------+-----------+
|  channel   | metric |       kernel        | value | perc_diff |
+------------+--------+---------------------+-------+-----------+
|    CPU     | gmean  | EM_default          | 142.1 |   0.0%    |
|    CPU     | gmean  | EM_modified_runtime | 131.8 |  -7.27%   |
+------------+--------+---------------------+-------+-----------+



Changelog:
v4:
- refactored 2 rcu callbacks into 1 (Dietmar)
- fixed documentation (Dietmar)
- fixed 'run-time' into 'runtime' in all comments (Dietmar)
- fixed comments in patch headers (Diermar)
- fixed build issue in one patch (Dietmar)
- added cost of updating EM in the cover letter (Dietmar)
- added 'performance' field, which alled to optilize further
  the pre-calculated 'cost' field and remove division operation
  from runtime
- added update function during the boot after new EM is registered
  to modify older EMs if needed due to potential CPU capacity change
v3 [6]:
- adjusted inline comments for structure doc (Dietmar)
- extended patch header with infromation that only EAS will use the feature
  and it was driving the design (Dietmar)
- changed patch header and put shorter comment (Dietmar)
- moved the 'refactoring' patch that adds 'default_table' before the
  introduction of runtime modifiable feature as sugested by Dietmar in 
  numerous patches v2
- merged documentation patches into one
- added more explenation about the 2 tables design into the Doc (Dietmar)
- removed the CPPC+EM patch for runtime modification
- removed the trace patch, since the trace point would be added after a while
- renamed 'tmp' to 'runtime_table' variable in the unregister function,
  to better highlight the memory pointer checks (it is possible after
  moving the 'default_table' earlier)
- and added '__rcu' in the unregister function which should calm down
  the test bot warning
- renamed 'create' to 'refactor' in the patch header (Dietmar)
v2 [5]:
- solved build warning of unused variable in patch 13/17 when EM is
  not compiled in, e.g. on Intel platform for this cpufreq_cooling
- re-based on top of v6.4-rc1
v1:
- implementation can be found here [4]

Regards,
Lukasz Luba

[1] https://lpc.events/event/16/contributions/1341/attachments/955/1873/Dynamic_Energy_Model_to_handle_leakage_power.pdf
[2] https://lpc.events/event/16/contributions/1194/attachments/1114/2139/LPC2022_Energy_model_accuracy.pdf
[3] https://www.youtube.com/watch?v=2C-5uikSbtM&list=PL0fKordpLTjKsBOUcZqnzlHShri4YBL1H
[4] https://lore.kernel.org/lkml/20230314103357.26010-1-lukasz.luba@arm.com/
[5] https://lore.kernel.org/lkml/20230512095743.3393563-1-lukasz.luba@arm.com/
[6] https://lore.kernel.org/lkml/20230721155022.2339982-1-lukasz.luba@arm.com/


Lukasz Luba (18):
  PM: EM: Add missing newline for the message log
  PM: EM: Refactor em_cpufreq_update_efficiencies() arguments
  PM: EM: Find first CPU online while updating OPP efficiency
  PM: EM: Refactor em_pd_get_efficient_state() to be more flexible
  PM: EM: Refactor a new function em_compute_costs()
  PM: EM: Check if the get_cost() callback is present in
    em_compute_costs()
  PM: EM: Refactor struct em_perf_domain and add default_table
  PM: EM: Add update_power() callback for runtime modifications
  PM: EM: Introduce runtime modifiable table
  PM: EM: Add RCU mechanism which safely cleans the old data
  PM: EM: Add runtime update interface to modify EM power
  PM: EM: Use runtime modified EM for CPUs energy estimation in EAS
  Documentation: EM: Update with runtime modification design
  PM: EM: Add performance field to struct em_perf_state
  PM: EM: Adjust performance with runtime modification callback
  PM: EM: Support late CPUs booting and capacity adjustment
  PM: EM: Optimize em_cpu_energy() and remove division
  Documentation: EM: Update information about performance field

 Documentation/power/energy-model.rst | 149 +++++++++-
 drivers/powercap/dtpm_cpu.c          |  27 +-
 drivers/powercap/dtpm_devfreq.c      |  23 +-
 drivers/thermal/cpufreq_cooling.c    |  24 +-
 drivers/thermal/devfreq_cooling.c    |  23 +-
 include/linux/energy_model.h         | 155 ++++++-----
 kernel/power/energy_model.c          | 400 ++++++++++++++++++++++++---
 7 files changed, 651 insertions(+), 150 deletions(-)

Rafael J. Wysocki Sept. 26, 2023, 6:59 p.m. UTC | #1

On Mon, Sep 25, 2023 at 10:11 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
>
> The Energy Model (EM) is going to support runtime modifications. This
> new callback would be used in the upcoming EM changes. The drivers
> or frameworks which want to modify the EM have to implement the
> update_power() callback.
>
> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
> ---
>  include/linux/energy_model.h | 22 ++++++++++++++++++++++
>  1 file changed, 22 insertions(+)
>
> diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
> index d236e08e80dc..546dee90f716 100644
> --- a/include/linux/energy_model.h
> +++ b/include/linux/energy_model.h
> @@ -168,6 +168,26 @@ struct em_data_callback {
>          */
>         int (*get_cost)(struct device *dev, unsigned long freq,
>                         unsigned long *cost);
> +
> +       /**
> +        * update_power() - Provide new power at the given performance state of
> +        *              a device

The meaning of the above is unclear to me.

> +        * @dev         : Device for which we do this operation (can be a CPU)

It is unclear what "we" means in this context.  Maybe say "Target
device (can be a CPU)"?

> +        * @freq        : Frequency at the performance state in kHz

What performance state does this refer to?  And the frequency of what?

> +        * @power       : New power value at the performance state
> +        *              (modified)

Similarly unclear as the above.

> +        * @priv        : Pointer to private data useful for tracking context
> +        *              during runtime modifications of EM.

Who's going to set this pointer and use this data?

> +        *
> +        * The update_power() is used by runtime modifiable EM. It aims to

I would drop "The" from the above.

> +        * provide updated power value for a given frequency, which is stored
> +        * in the performance state.

A given frequency of what and the performance state of what does this refer to?

> + The power value provided by this callback
> +        * should fit in the [0, EM_MAX_POWER] range.
> +        *
> +        * Return 0 on success, or appropriate error value in case of failure.
> +        */
> +       int (*update_power)(struct device *dev, unsigned long freq,
> +                           unsigned long *power, void *priv);
>  };
>  #define EM_SET_ACTIVE_POWER_CB(em_cb, cb) ((em_cb).active_power = cb)
>  #define EM_ADV_DATA_CB(_active_power_cb, _cost_cb)     \
> @@ -175,6 +195,7 @@ struct em_data_callback {
>           .get_cost = _cost_cb }
>  #define EM_DATA_CB(_active_power_cb)                   \
>                 EM_ADV_DATA_CB(_active_power_cb, NULL)
> +#define EM_UPDATE_CB(_update_power_cb) { .update_power = &_update_power_cb }
>
>  struct em_perf_domain *em_cpu_get(int cpu);
>  struct em_perf_domain *em_pd_get(struct device *dev);
> @@ -331,6 +352,7 @@ struct em_data_callback {};
>  #define EM_ADV_DATA_CB(_active_power_cb, _cost_cb) { }
>  #define EM_DATA_CB(_active_power_cb) { }
>  #define EM_SET_ACTIVE_POWER_CB(em_cb, cb) do { } while (0)
> +#define EM_UPDATE_CB(_update_cb) { }
>
>  static inline
>  int em_dev_register_perf_domain(struct device *dev, unsigned int nr_states,
> --

Rafael J. Wysocki Sept. 26, 2023, 7:12 p.m. UTC | #2

On Mon, Sep 25, 2023 at 10:11 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
>
> The new runtime table would be populated with a new power data to better
> reflect the actual power. The power can vary over time e.g. due to the
> SoC temperature change. Higher temperature can increase power values.
> For longer running scenarios, such as game or camera, when also other
> devices are used (e.g. GPU, ISP) the CPU power can change. The new
> EM framework is able to addresses this issue and change the data
> at runtime safely.
>
> The runtime modifiable EM data is used by the Energy Aware Scheduler (EAS)
> for the task placement. All the other users (thermal, etc.) are still
> using the default (basic) EM. This fact drove the design of this feature.
>
> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
> ---
>  include/linux/energy_model.h |  4 +++-
>  kernel/power/energy_model.c  | 12 +++++++++++-
>  2 files changed, 14 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
> index 546dee90f716..740e7c25cfff 100644
> --- a/include/linux/energy_model.h
> +++ b/include/linux/energy_model.h
> @@ -39,7 +39,7 @@ struct em_perf_state {
>  /**
>   * struct em_perf_table - Performance states table
>   * @state:     List of performance states, in ascending order
> - * @rcu:       RCU used for safe access and destruction
> + * @rcu:       RCU used only for runtime modifiable table

This still doesn't appear to be used anywhere, so why change it here?

>   */
>  struct em_perf_table {
>         struct em_perf_state *state;
> @@ -49,6 +49,7 @@ struct em_perf_table {
>  /**
>   * struct em_perf_domain - Performance domain
>   * @default_table:     Pointer to the default em_perf_table
> + * @runtime_table:     Pointer to the runtime modifiable em_perf_table

"Pointer to em_perf_table that can be dynamically updated"

>   * @nr_perf_states:    Number of performance states
>   * @flags:             See "em_perf_domain flags"
>   * @cpus:              Cpumask covering the CPUs of the domain. It's here
> @@ -64,6 +65,7 @@ struct em_perf_table {
>   */
>  struct em_perf_domain {
>         struct em_perf_table *default_table;
> +       struct em_perf_table __rcu *runtime_table;
>         int nr_perf_states;
>         unsigned long flags;
>         unsigned long cpus[];
> diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
> index 797141638b29..5b40db38b745 100644
> --- a/kernel/power/energy_model.c
> +++ b/kernel/power/energy_model.c
> @@ -251,6 +251,9 @@ static int em_create_pd(struct device *dev, int nr_states,
>                 return ret;
>         }
>
> +       /* Initialize runtime table as default table. */

Redundant comment.

> +       rcu_assign_pointer(pd->runtime_table, default_table);
> +
>         if (_is_cpu_device(dev))
>                 for_each_cpu(cpu, cpus) {
>                         cpu_dev = get_cpu_device(cpu);
> @@ -448,6 +451,7 @@ EXPORT_SYMBOL_GPL(em_dev_register_perf_domain);
>   */
>  void em_dev_unregister_perf_domain(struct device *dev)
>  {
> +       struct em_perf_table __rcu *runtime_table;
>         struct em_perf_domain *pd;
>
>         if (IS_ERR_OR_NULL(dev) || !dev->em_pd)
> @@ -457,18 +461,24 @@ void em_dev_unregister_perf_domain(struct device *dev)
>                 return;
>
>         pd = dev->em_pd;
> -

Unrelated change.

>         /*
>          * The mutex separates all register/unregister requests and protects
>          * from potential clean-up/setup issues in the debugfs directories.
>          * The debugfs directory name is the same as device's name.
>          */
>         mutex_lock(&em_pd_mutex);
> +

Same here.

>         em_debug_remove_pd(dev);
>
> +       runtime_table = pd->runtime_table;
> +
> +       rcu_assign_pointer(pd->runtime_table, NULL);
> +       synchronize_rcu();

Is it really a good idea to call this under a mutex?

> +
>         kfree(pd->default_table->state);
>         kfree(pd->default_table);
>         kfree(dev->em_pd);
> +

Unrelated change.

>         dev->em_pd = NULL;
>         mutex_unlock(&em_pd_mutex);
>  }
> --

So this really adds a pointer to a table that can be dynamically
updated to struct em_perf_domain without any users so far.  It is not
used anywhere as of this patch AFAICS, which is not what the changelog
is saying.

Rafael J. Wysocki Sept. 26, 2023, 7:48 p.m. UTC | #3

On Mon, Sep 25, 2023 at 10:11 AM Lukasz Luba <lukasz.luba@arm.com> wrote:

First off, I would merge this with the previous patch, as the changes
would be much clearer then IMO.

> Add an interface which allows to modify EM power data at runtime.
> The new power information is populated by the provided callback, which
> is called for each performance state.

But it all starts with copying the frequencies from the default table.

> The CPU frequencies' efficiency is
> re-calculated since that might be affected as well. The old EM memory
> is going to be freed later using RCU mechanism.

Not all of it, but the old runtime table that is not going to be used any more.

> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
> ---
>  include/linux/energy_model.h |   8 +++
>  kernel/power/energy_model.c  | 111 +++++++++++++++++++++++++++++++++++
>  2 files changed, 119 insertions(+)
>
> diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
> index 740e7c25cfff..8f055ab356ed 100644
> --- a/include/linux/energy_model.h
> +++ b/include/linux/energy_model.h
> @@ -201,6 +201,8 @@ struct em_data_callback {
>
>  struct em_perf_domain *em_cpu_get(int cpu);
>  struct em_perf_domain *em_pd_get(struct device *dev);
> +int em_dev_update_perf_domain(struct device *dev, struct em_data_callback *cb,
> +                             void *priv);
>  int em_dev_register_perf_domain(struct device *dev, unsigned int nr_states,
>                                 struct em_data_callback *cb, cpumask_t *span,
>                                 bool microwatts);
> @@ -384,6 +386,12 @@ static inline int em_pd_nr_perf_states(struct em_perf_domain *pd)
>  {
>         return 0;
>  }
> +static inline
> +int em_dev_update_perf_domain(struct device *dev, struct em_data_callback *cb,
> +                             void *priv)
> +{
> +       return -EINVAL;
> +}
>  #endif
>
>  #endif
> diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
> index 2345837bfd2c..78e1495dc87e 100644
> --- a/kernel/power/energy_model.c
> +++ b/kernel/power/energy_model.c
> @@ -172,6 +172,101 @@ static int em_compute_costs(struct device *dev, struct em_perf_state *table,
>         return 0;
>  }
>
> +/**
> + * em_dev_update_perf_domain() - Update runtime EM table for a device
> + * @dev                : Device for which the EM is to be updated
> + * @cb         : Callback function providing the power data for the EM
> + * @priv       : Pointer to private data useful for passing context
> + *             which might be required while calling @cb

It is still unclear to me who is going to use this priv pointer and how.

> + *
> + * Update EM runtime modifiable table for a @dev using the callback
> + * defined in @cb. The EM new power values are then used for calculating
> + * the em_perf_state::cost for associated performance state.

It actually allocates a new runtime table and populates it from
scratch, using the frequencies from the default table and the
callback.

> + *
> + * This function uses mutex to serialize writers, so it must not be called

"a mutex"

> + * from non-sleeping context.
> + *
> + * Return 0 on success or a proper error in case of failure.
> + */
> +int em_dev_update_perf_domain(struct device *dev, struct em_data_callback *cb,
> +                             void *priv)
> +{
> +       struct em_perf_table *runtime_table;
> +       unsigned long power, freq;
> +       struct em_perf_domain *pd;
> +       int ret, i;
> +
> +       if (!cb || !cb->update_power)
> +               return -EINVAL;
> +
> +       /*
> +        * The lock serializes update and unregister code paths. When the
> +        * EM has been unregistered in the meantime, we should capture that
> +        * when entering this critical section. It also makes sure that
> +        * two concurrent updates will be serialized.
> +        */
> +       mutex_lock(&em_pd_mutex);
> +
> +       if (!dev || !dev->em_pd) {

Checking dev against NULL under the mutex is pointless (either it is
NULL or it isn't, so check it earlier).

> +               ret = -EINVAL;
> +               goto unlock_em;
> +       }
> +
> +       pd = dev->em_pd;

And I would check pd against NULL here.

> +
> +       runtime_table = kzalloc(sizeof(*runtime_table), GFP_KERNEL);
> +       if (!runtime_table) {
> +               ret = -ENOMEM;
> +               goto unlock_em;
> +       }
> +
> +       runtime_table->state = kcalloc(pd->nr_perf_states,
> +                                      sizeof(struct em_perf_state),
> +                                      GFP_KERNEL);
> +       if (!runtime_table->state) {
> +               ret = -ENOMEM;
> +               goto free_runtime_table;
> +       }

The above allocations can be merged into one and allocating memory
under the mutex is questionable.

> +
> +       /* Populate runtime table with updated values using driver callback */
> +       for (i = 0; i < pd->nr_perf_states; i++) {
> +               freq = pd->default_table->state[i].frequency;
> +               runtime_table->state[i].frequency = freq;
> +
> +               /*
> +                * Call driver callback to get a new power value for
> +                * a given frequency.
> +                */
> +               ret = cb->update_power(dev, freq, &power, priv);
> +               if (ret) {
> +                       dev_dbg(dev, "EM: runtime update error: %d\n", ret);
> +                       goto free_runtime_state_table;
> +               }
> +
> +               runtime_table->state[i].power = power;
> +       }
> +
> +       ret = em_compute_costs(dev, runtime_table->state, cb,
> +                              pd->nr_perf_states, pd->flags);
> +       if (ret)
> +               goto free_runtime_state_table;
> +
> +       em_perf_runtime_table_set(dev, runtime_table);
> +
> +       mutex_unlock(&em_pd_mutex);
> +       return 0;
> +
> +free_runtime_state_table:
> +       kfree(runtime_table->state);
> +free_runtime_table:
> +       kfree(runtime_table);
> +unlock_em:
> +       mutex_unlock(&em_pd_mutex);
> +
> +       return -EINVAL;
> +}
> +EXPORT_SYMBOL_GPL(em_dev_update_perf_domain);
> +
>  static int em_create_perf_table(struct device *dev, struct em_perf_domain *pd,
>                                 int nr_states, struct em_data_callback *cb,
>                                 unsigned long flags)
> @@ -494,6 +589,8 @@ void em_dev_unregister_perf_domain(struct device *dev)
>          * The mutex separates all register/unregister requests and protects
>          * from potential clean-up/setup issues in the debugfs directories.
>          * The debugfs directory name is the same as device's name.
> +        * The lock also protects the updater of the runtime modifiable
> +        * EM and this remover.
>          */
>         mutex_lock(&em_pd_mutex);
>
> @@ -501,9 +598,23 @@ void em_dev_unregister_perf_domain(struct device *dev)
>
>         runtime_table = pd->runtime_table;
>
> +       /*
> +        * Safely destroy runtime modifiable EM. By using the call
> +        * synchronize_rcu() we make sure we don't progress till last user
> +        * finished the RCU section and our update got applied.
> +        */
>         rcu_assign_pointer(pd->runtime_table, NULL);
>         synchronize_rcu();
>
> +       /*
> +        * After the sync no updates will be in-flight, so free the
> +        * memory allocated for runtime table (if there was such).
> +        */
> +       if (runtime_table != pd->default_table) {
> +               kfree(runtime_table->state);
> +               kfree(runtime_table);
> +       }

Can't this race with the RCU callback freeing the runtime table?

> +
>         kfree(pd->default_table->state);
>         kfree(pd->default_table);
>         kfree(dev->em_pd);
> --

Qais Yousef Sept. 28, 2023, 9:56 p.m. UTC | #4

Hi Lukasz

On 09/25/23 09:11, Lukasz Luba wrote:
> Hi all,
> 
> This patch set adds a new feature which allows to modify Energy Model (EM)
> power values at runtime. It will allow to better reflect power model of
> a recent SoCs and silicon. Different characteristics of the power usage
> can be leveraged and thus better decisions made during task placement in EAS.
> 
> It's part of feature set know as Dynamic Energy Model. It has been presented
> and discussed recently at OSPM2023 [3]. This patch set implements the 1st
> improvement for the EM.

Thanks for the series! I'm on CC this time :-) Unfortunately I have a planned
holiday starting tomorrow, so won't get a chance to do proper review till I'm
back which would be few weeks from now.

I just want to iterate that this is a real problem being seen in practice where
it's hard to provide a single average EM for all workloads now. So we certainly
need something like this.

Hopefully I'll get a chance to help with review when I'm back from holidays.

Thanks!

--
Qais Yousef

Lukasz Luba Sept. 29, 2023, 8:38 a.m. UTC | #5

On 9/26/23 19:39, Rafael J. Wysocki wrote:
> On Mon, Sep 25, 2023 at 10:11 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
>>
>> Refactor a dedicated function which will be easier to maintain and re-use
>> in future. The upcoming changes for the modifiable EM perf_state table
>> will use it (instead of duplicating the code).
>>
>> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
> 
> If I'm not mistaken, this patch by itself is not going to change the
> observable functionality in any way and it would be good to say that
> in the changelog.
> 
> This also applies to some other patches in this series.
> 

Correct, no functional changes. I will add that information to the
patch header in next version.

Lukasz Luba Sept. 29, 2023, 9 a.m. UTC | #6

On 9/26/23 19:59, Rafael J. Wysocki wrote:
> On Mon, Sep 25, 2023 at 10:11 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
>>
>> The Energy Model (EM) is going to support runtime modifications. This
>> new callback would be used in the upcoming EM changes. The drivers
>> or frameworks which want to modify the EM have to implement the
>> update_power() callback.
>>
>> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
>> ---
>>   include/linux/energy_model.h | 22 ++++++++++++++++++++++
>>   1 file changed, 22 insertions(+)
>>
>> diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
>> index d236e08e80dc..546dee90f716 100644
>> --- a/include/linux/energy_model.h
>> +++ b/include/linux/energy_model.h
>> @@ -168,6 +168,26 @@ struct em_data_callback {
>>           */
>>          int (*get_cost)(struct device *dev, unsigned long freq,
>>                          unsigned long *cost);
>> +
>> +       /**
>> +        * update_power() - Provide new power at the given performance state of
>> +        *              a device
> 
> The meaning of the above is unclear to me.

I can try to rephrase this a bit:
' Provide a new power value for the device at the given frequency. This
allows to reflect changed power profile in runtime.'

> 
>> +        * @dev         : Device for which we do this operation (can be a CPU)
> 
> It is unclear what "we" means in this context.  Maybe say "Target
> device (can be a CPU)"?

That sounds better indeed.

> 
>> +        * @freq        : Frequency at the performance state in kHz
> 
> What performance state does this refer to?  And the frequency of what?

We just call the entry in EM the 'performance state' (so frequency and
power). I will rephrase this as well:
'Frequency of the @dev expressed in kHz'

> 
>> +        * @power       : New power value at the performance state
>> +        *              (modified)
> 
> Similarly unclear as the above.

OK, it can be re-written as:
'Power value of the @dev at the given @freq (modified)'

> 
>> +        * @priv        : Pointer to private data useful for tracking context
>> +        *              during runtime modifications of EM.
> 
> Who's going to set this pointer and use this data?

The driver or kernel module, which is aware about thermal events. Those
events might be coming from FW to kernel, but we need to maintain
the same 'context' for all OPPs when we start the EM update.

This 'priv' field is used for passing the 'context' back to the
caller, so the caller can consistently the update. The update might
be with some math formula which multiplies the power by some factor
value (based on thermal model and current temperature).

> 
>> +        *
>> +        * The update_power() is used by runtime modifiable EM. It aims to
> 
> I would drop "The" from the above.

OK

> 
>> +        * provide updated power value for a given frequency, which is stored
>> +        * in the performance state.
> 
> A given frequency of what and the performance state of what does this refer to?

I will change that and add the '@dev' word to make this more precised.

'update_power() is used by runtime modifiable EM. It aims to update the
@dev EM power values for all registered frequencies.'

Lukasz Luba Sept. 29, 2023, 9:16 a.m. UTC | #7

On 9/26/23 20:12, Rafael J. Wysocki wrote:
> On Mon, Sep 25, 2023 at 10:11 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
>>
>> The new runtime table would be populated with a new power data to better
>> reflect the actual power. The power can vary over time e.g. due to the
>> SoC temperature change. Higher temperature can increase power values.
>> For longer running scenarios, such as game or camera, when also other
>> devices are used (e.g. GPU, ISP) the CPU power can change. The new
>> EM framework is able to addresses this issue and change the data
>> at runtime safely.
>>
>> The runtime modifiable EM data is used by the Energy Aware Scheduler (EAS)
>> for the task placement. All the other users (thermal, etc.) are still
>> using the default (basic) EM. This fact drove the design of this feature.
>>
>> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
>> ---
>>   include/linux/energy_model.h |  4 +++-
>>   kernel/power/energy_model.c  | 12 +++++++++++-
>>   2 files changed, 14 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
>> index 546dee90f716..740e7c25cfff 100644
>> --- a/include/linux/energy_model.h
>> +++ b/include/linux/energy_model.h
>> @@ -39,7 +39,7 @@ struct em_perf_state {
>>   /**
>>    * struct em_perf_table - Performance states table
>>    * @state:     List of performance states, in ascending order
>> - * @rcu:       RCU used for safe access and destruction
>> + * @rcu:       RCU used only for runtime modifiable table
> 
> This still doesn't appear to be used anywhere, so why change it here?

I will try to move this later in the series.

> 
>>    */
>>   struct em_perf_table {
>>          struct em_perf_state *state;
>> @@ -49,6 +49,7 @@ struct em_perf_table {
>>   /**
>>    * struct em_perf_domain - Performance domain
>>    * @default_table:     Pointer to the default em_perf_table
>> + * @runtime_table:     Pointer to the runtime modifiable em_perf_table
> 
> "Pointer to em_perf_table that can be dynamically updated"

OK

> 
>>    * @nr_perf_states:    Number of performance states
>>    * @flags:             See "em_perf_domain flags"
>>    * @cpus:              Cpumask covering the CPUs of the domain. It's here
>> @@ -64,6 +65,7 @@ struct em_perf_table {
>>    */
>>   struct em_perf_domain {
>>          struct em_perf_table *default_table;
>> +       struct em_perf_table __rcu *runtime_table;
>>          int nr_perf_states;
>>          unsigned long flags;
>>          unsigned long cpus[];
>> diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
>> index 797141638b29..5b40db38b745 100644
>> --- a/kernel/power/energy_model.c
>> +++ b/kernel/power/energy_model.c
>> @@ -251,6 +251,9 @@ static int em_create_pd(struct device *dev, int nr_states,
>>                  return ret;
>>          }
>>
>> +       /* Initialize runtime table as default table. */
> 
> Redundant comment.

I'll drop it.

> 
>> +       rcu_assign_pointer(pd->runtime_table, default_table);
>> +
>>          if (_is_cpu_device(dev))
>>                  for_each_cpu(cpu, cpus) {
>>                          cpu_dev = get_cpu_device(cpu);
>> @@ -448,6 +451,7 @@ EXPORT_SYMBOL_GPL(em_dev_register_perf_domain);
>>    */
>>   void em_dev_unregister_perf_domain(struct device *dev)
>>   {
>> +       struct em_perf_table __rcu *runtime_table;
>>          struct em_perf_domain *pd;
>>
>>          if (IS_ERR_OR_NULL(dev) || !dev->em_pd)
>> @@ -457,18 +461,24 @@ void em_dev_unregister_perf_domain(struct device *dev)
>>                  return;
>>
>>          pd = dev->em_pd;
>> -
> 
> Unrelated change.

ACK

> 
>>          /*
>>           * The mutex separates all register/unregister requests and protects
>>           * from potential clean-up/setup issues in the debugfs directories.
>>           * The debugfs directory name is the same as device's name.
>>           */
>>          mutex_lock(&em_pd_mutex);
>> +
> 
> Same here.

ACK

> 
>>          em_debug_remove_pd(dev);
>>
>> +       runtime_table = pd->runtime_table;
>> +
>> +       rcu_assign_pointer(pd->runtime_table, NULL);
>> +       synchronize_rcu();
> 
> Is it really a good idea to call this under a mutex?

This is the unregistration of the EM code path, so a thermal
driver which gets some IRQs might not be aware that the EM
is going to vanish. That's why those two code paths: update
& unregister are protected with the same lock.

This synchronize_rcu() won't be long, but makes sure
that when we free(dev->em_pd) we don't leak runtime_table.

> 
>> +
>>          kfree(pd->default_table->state);
>>          kfree(pd->default_table);
>>          kfree(dev->em_pd);
>> +
> 
> Unrelated change.

ACK

> 
>>          dev->em_pd = NULL;
>>          mutex_unlock(&em_pd_mutex);
>>   }
>> --
> 
> So this really adds a pointer to a table that can be dynamically
> updated to struct em_perf_domain without any users so far.  It is not
> used anywhere as of this patch AFAICS, which is not what the changelog
> is saying.

Good catch. I will adjust the changlog in header and say:

'Add infrastructure and mechanisms for the new runtime table.
The runtime modifiable EM data is used by the Energy Aware Scheduler 
(EAS)for the task placement. All the other users (thermal, etc.) are
still using the default (basic) EM. This fact drove the design of this
feature.'

Lukasz Luba Sept. 29, 2023, 10 a.m. UTC | #8

On 9/26/23 20:48, Rafael J. Wysocki wrote:
> On Mon, Sep 25, 2023 at 10:11 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
> 
> First off, I would merge this with the previous patch, as the changes
> would be much clearer then IMO.

I was trying to avoid a big patch ~150 lines. I will do that merge.

> 
>> Add an interface which allows to modify EM power data at runtime.
>> The new power information is populated by the provided callback, which
>> is called for each performance state.
> 
> But it all starts with copying the frequencies from the default table.

Yes, I can add that to the description.

> 
>> The CPU frequencies' efficiency is
>> re-calculated since that might be affected as well. The old EM memory
>> is going to be freed later using RCU mechanism.
> 
> Not all of it, but the old runtime table that is not going to be used any more.

True, I will rephrase that, to make it more precised.

> 
>> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>

[snip]

>>
>> +/**
>> + * em_dev_update_perf_domain() - Update runtime EM table for a device
>> + * @dev                : Device for which the EM is to be updated
>> + * @cb         : Callback function providing the power data for the EM
>> + * @priv       : Pointer to private data useful for passing context
>> + *             which might be required while calling @cb
> 
> It is still unclear to me who is going to use this priv pointer and how.

I have explained that in some previous patch response. A driver or
kernel module which monitors the thermal situation and has context.

> 
>> + *
>> + * Update EM runtime modifiable table for a @dev using the callback
>> + * defined in @cb. The EM new power values are then used for calculating
>> + * the em_perf_state::cost for associated performance state.
> 
> It actually allocates a new runtime table and populates it from
> scratch, using the frequencies from the default table and the
> callback.

Yes, it allocated new table and put the updated power values there.
I can add that to the comment.

> 
>> + *
>> + * This function uses mutex to serialize writers, so it must not be called
> 
> "a mutex"

ACK

> 
>> + * from non-sleeping context.

[snip]

>> +
>> +       if (!dev || !dev->em_pd) {
> 
> Checking dev against NULL under the mutex is pointless (either it is
> NULL or it isn't, so check it earlier).

ACK

> 
>> +               ret = -EINVAL;
>> +               goto unlock_em;
>> +       }
>> +
>> +       pd = dev->em_pd;
> 
> And I would check pd against NULL here.

It's done above, next to '!dev || !dev->em_pd'

> 
>> +
>> +       runtime_table = kzalloc(sizeof(*runtime_table), GFP_KERNEL);
>> +       if (!runtime_table) {
>> +               ret = -ENOMEM;
>> +               goto unlock_em;
>> +       }
>> +
>> +       runtime_table->state = kcalloc(pd->nr_perf_states,
>> +                                      sizeof(struct em_perf_state),
>> +                                      GFP_KERNEL);
>> +       if (!runtime_table->state) {
>> +               ret = -ENOMEM;
>> +               goto free_runtime_table;
>> +       }
> 
> The above allocations can be merged into one and allocating memory
> under the mutex is questionable.

So how to make sure that there is no 2 callers trying to update the
same EM or unregistration is not in the background?

[snip]

>>
>> @@ -501,9 +598,23 @@ void em_dev_unregister_perf_domain(struct device *dev)
>>
>>          runtime_table = pd->runtime_table;
>>
>> +       /*
>> +        * Safely destroy runtime modifiable EM. By using the call
>> +        * synchronize_rcu() we make sure we don't progress till last user
>> +        * finished the RCU section and our update got applied.
>> +        */
>>          rcu_assign_pointer(pd->runtime_table, NULL);
>>          synchronize_rcu();
>>
>> +       /*
>> +        * After the sync no updates will be in-flight, so free the
>> +        * memory allocated for runtime table (if there was such).
>> +        */
>> +       if (runtime_table != pd->default_table) {
>> +               kfree(runtime_table->state);
>> +               kfree(runtime_table);
>> +       }
> 
> Can't this race with the RCU callback freeing the runtime table?

That's why there is this 'synchronize_rcu()' above and the mutex. The
updating caller if finished the update, would unlock the mutex and this
unregister code can go. Here we call the synchronize_rcu() so we assure
the callback has finished for the update path and than we explicitly
free the saved 'runtime_table' here. So all data should be freed and
code serialized in those two paths.

Rafael J. Wysocki Sept. 29, 2023, 12:18 p.m. UTC | #9

On Fri, Sep 29, 2023 at 10:59 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
>
> On 9/26/23 19:59, Rafael J. Wysocki wrote:
> > On Mon, Sep 25, 2023 at 10:11 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
> >>
> >> The Energy Model (EM) is going to support runtime modifications. This
> >> new callback would be used in the upcoming EM changes. The drivers
> >> or frameworks which want to modify the EM have to implement the
> >> update_power() callback.
> >>
> >> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
> >> ---
> >>   include/linux/energy_model.h | 22 ++++++++++++++++++++++
> >>   1 file changed, 22 insertions(+)
> >>
> >> diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
> >> index d236e08e80dc..546dee90f716 100644
> >> --- a/include/linux/energy_model.h
> >> +++ b/include/linux/energy_model.h
> >> @@ -168,6 +168,26 @@ struct em_data_callback {
> >>           */
> >>          int (*get_cost)(struct device *dev, unsigned long freq,
> >>                          unsigned long *cost);
> >> +
> >> +       /**
> >> +        * update_power() - Provide new power at the given performance state of
> >> +        *              a device
> >
> > The meaning of the above is unclear to me.
>
> I can try to rephrase this a bit:
> ' Provide a new power value for the device at the given frequency. This
> allows to reflect changed power profile in runtime.'

Maybe "Estimate power for a given device frequency"

> >
> >> +        * @dev         : Device for which we do this operation (can be a CPU)
> >
> > It is unclear what "we" means in this context.  Maybe say "Target
> > device (can be a CPU)"?
>
> That sounds better indeed.
>
> >
> >> +        * @freq        : Frequency at the performance state in kHz
> >
> > What performance state does this refer to?  And the frequency of what?
>
> We just call the entry in EM the 'performance state' (so frequency and
> power). I will rephrase this as well:
> 'Frequency of the @dev expressed in kHz'

Or "Device frequency for which to estimate power"?

> >
> >> +        * @power       : New power value at the performance state
> >> +        *              (modified)
> >
> > Similarly unclear as the above.
>
> OK, it can be re-written as:
> 'Power value of the @dev at the given @freq (modified)'

This is not a power value, but a return pointer AFAICS.  So "location
to store the return power value".

> >
> >> +        * @priv        : Pointer to private data useful for tracking context
> >> +        *              during runtime modifications of EM.
> >
> > Who's going to set this pointer and use this data?
>
> The driver or kernel module, which is aware about thermal events. Those
> events might be coming from FW to kernel, but we need to maintain
> the same 'context' for all OPPs when we start the EM update.
>
> This 'priv' field is used for passing the 'context' back to the
> caller, so the caller can consistently the update. The update might
> be with some math formula which multiplies the power by some factor
> value (based on thermal model and current temperature).

I would say "Additional data for the callback function, provided by
the entity setting the callback pointer".

> >
> >> +        *
> >> +        * The update_power() is used by runtime modifiable EM. It aims to
> >
> > I would drop "The" from the above.
>
> OK
>
> >
> >> +        * provide updated power value for a given frequency, which is stored
> >> +        * in the performance state.
> >
> > A given frequency of what and the performance state of what does this refer to?
>
> I will change that and add the '@dev' word to make this more precised.

That would help.  Overall, I would say "is used by ... to obtain a new
power value for a given frequency of @dev".

>
> 'update_power() is used by runtime modifiable EM. It aims to update the
> @dev EM power values for all registered frequencies.'

Rafael J. Wysocki Sept. 29, 2023, 12:27 p.m. UTC | #10

On Fri, Sep 29, 2023 at 11:15 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
>
>
>
> On 9/26/23 20:12, Rafael J. Wysocki wrote:
> > On Mon, Sep 25, 2023 at 10:11 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
> >>
> >> The new runtime table would be populated with a new power data to better
> >> reflect the actual power. The power can vary over time e.g. due to the
> >> SoC temperature change. Higher temperature can increase power values.
> >> For longer running scenarios, such as game or camera, when also other
> >> devices are used (e.g. GPU, ISP) the CPU power can change. The new
> >> EM framework is able to addresses this issue and change the data
> >> at runtime safely.
> >>
> >> The runtime modifiable EM data is used by the Energy Aware Scheduler (EAS)
> >> for the task placement. All the other users (thermal, etc.) are still
> >> using the default (basic) EM. This fact drove the design of this feature.
> >>
> >> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
> >> ---
> >>   include/linux/energy_model.h |  4 +++-
> >>   kernel/power/energy_model.c  | 12 +++++++++++-
> >>   2 files changed, 14 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
> >> index 546dee90f716..740e7c25cfff 100644
> >> --- a/include/linux/energy_model.h
> >> +++ b/include/linux/energy_model.h
> >> @@ -39,7 +39,7 @@ struct em_perf_state {
> >>   /**
> >>    * struct em_perf_table - Performance states table
> >>    * @state:     List of performance states, in ascending order
> >> - * @rcu:       RCU used for safe access and destruction
> >> + * @rcu:       RCU used only for runtime modifiable table
> >
> > This still doesn't appear to be used anywhere, so why change it here?
>
> I will try to move this later in the series.
>
> >
> >>    */
> >>   struct em_perf_table {
> >>          struct em_perf_state *state;
> >> @@ -49,6 +49,7 @@ struct em_perf_table {
> >>   /**
> >>    * struct em_perf_domain - Performance domain
> >>    * @default_table:     Pointer to the default em_perf_table
> >> + * @runtime_table:     Pointer to the runtime modifiable em_perf_table
> >
> > "Pointer to em_perf_table that can be dynamically updated"
>
> OK
>
> >
> >>    * @nr_perf_states:    Number of performance states
> >>    * @flags:             See "em_perf_domain flags"
> >>    * @cpus:              Cpumask covering the CPUs of the domain. It's here
> >> @@ -64,6 +65,7 @@ struct em_perf_table {
> >>    */
> >>   struct em_perf_domain {
> >>          struct em_perf_table *default_table;
> >> +       struct em_perf_table __rcu *runtime_table;
> >>          int nr_perf_states;
> >>          unsigned long flags;
> >>          unsigned long cpus[];
> >> diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
> >> index 797141638b29..5b40db38b745 100644
> >> --- a/kernel/power/energy_model.c
> >> +++ b/kernel/power/energy_model.c
> >> @@ -251,6 +251,9 @@ static int em_create_pd(struct device *dev, int nr_states,
> >>                  return ret;
> >>          }
> >>
> >> +       /* Initialize runtime table as default table. */
> >
> > Redundant comment.
>
> I'll drop it.
>
> >
> >> +       rcu_assign_pointer(pd->runtime_table, default_table);
> >> +
> >>          if (_is_cpu_device(dev))
> >>                  for_each_cpu(cpu, cpus) {
> >>                          cpu_dev = get_cpu_device(cpu);
> >> @@ -448,6 +451,7 @@ EXPORT_SYMBOL_GPL(em_dev_register_perf_domain);
> >>    */
> >>   void em_dev_unregister_perf_domain(struct device *dev)
> >>   {
> >> +       struct em_perf_table __rcu *runtime_table;
> >>          struct em_perf_domain *pd;
> >>
> >>          if (IS_ERR_OR_NULL(dev) || !dev->em_pd)
> >> @@ -457,18 +461,24 @@ void em_dev_unregister_perf_domain(struct device *dev)
> >>                  return;
> >>
> >>          pd = dev->em_pd;
> >> -
> >
> > Unrelated change.
>
> ACK
>
> >
> >>          /*
> >>           * The mutex separates all register/unregister requests and protects
> >>           * from potential clean-up/setup issues in the debugfs directories.
> >>           * The debugfs directory name is the same as device's name.
> >>           */
> >>          mutex_lock(&em_pd_mutex);
> >> +
> >
> > Same here.
>
> ACK
>
> >
> >>          em_debug_remove_pd(dev);
> >>
> >> +       runtime_table = pd->runtime_table;
> >> +
> >> +       rcu_assign_pointer(pd->runtime_table, NULL);
> >> +       synchronize_rcu();
> >
> > Is it really a good idea to call this under a mutex?
>
> This is the unregistration of the EM code path, so a thermal
> driver which gets some IRQs might not be aware that the EM
> is going to vanish. That's why those two code paths: update
> & unregister are protected with the same lock.
>
> This synchronize_rcu() won't be long,

Are you sure?  This potentially waits for all of the CPUs in the
system to go through a quiescent state which may take a while in
principle.

In any case, though, this effectively makes everyone waiting for the
mutex also wait for the grace period to elapse and they may not care
about it.

> but makes sure that when we free(dev->em_pd) we don't leak runtime_table.
>
> >
> >> +
> >>          kfree(pd->default_table->state);
> >>          kfree(pd->default_table);
> >>          kfree(dev->em_pd);
> >> +
> >
> > Unrelated change.
>
> ACK
>
> >
> >>          dev->em_pd = NULL;
> >>          mutex_unlock(&em_pd_mutex);
> >>   }
> >> --
> >
> > So this really adds a pointer to a table that can be dynamically
> > updated to struct em_perf_domain without any users so far.  It is not
> > used anywhere as of this patch AFAICS, which is not what the changelog
> > is saying.
>
> Good catch. I will adjust the changlog in header and say:
>
> 'Add infrastructure and mechanisms for the new runtime table.
> The runtime modifiable EM data is used by the Energy Aware Scheduler
> (EAS)for the task placement.

I would make it more clear that this is going to happen after some
other subsequent changes.

> All the other users (thermal, etc.) are
> still using the default (basic) EM. This fact drove the design of this
> feature.'

Rafael J. Wysocki Sept. 29, 2023, 1:18 p.m. UTC | #11

On Fri, Sep 29, 2023 at 11:59 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
>
>
>
> On 9/26/23 20:48, Rafael J. Wysocki wrote:
> > On Mon, Sep 25, 2023 at 10:11 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
> >
> > First off, I would merge this with the previous patch, as the changes
> > would be much clearer then IMO.
>
> I was trying to avoid a big patch ~150 lines. I will do that merge.
>
> >
> >> Add an interface which allows to modify EM power data at runtime.
> >> The new power information is populated by the provided callback, which
> >> is called for each performance state.
> >
> > But it all starts with copying the frequencies from the default table.
>
> Yes, I can add that to the description.
>
> >
> >> The CPU frequencies' efficiency is
> >> re-calculated since that might be affected as well. The old EM memory
> >> is going to be freed later using RCU mechanism.
> >
> > Not all of it, but the old runtime table that is not going to be used any more.
>
> True, I will rephrase that, to make it more precised.
>
> >
> >> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
>
> [snip]
>
> >>
> >> +/**
> >> + * em_dev_update_perf_domain() - Update runtime EM table for a device
> >> + * @dev                : Device for which the EM is to be updated
> >> + * @cb         : Callback function providing the power data for the EM
> >> + * @priv       : Pointer to private data useful for passing context
> >> + *             which might be required while calling @cb
> >
> > It is still unclear to me who is going to use this priv pointer and how.
>
> I have explained that in some previous patch response. A driver or
> kernel module which monitors the thermal situation and has context.
>
> >
> >> + *
> >> + * Update EM runtime modifiable table for a @dev using the callback
> >> + * defined in @cb. The EM new power values are then used for calculating
> >> + * the em_perf_state::cost for associated performance state.
> >
> > It actually allocates a new runtime table and populates it from
> > scratch, using the frequencies from the default table and the
> > callback.
>
> Yes, it allocated new table and put the updated power values there.
> I can add that to the comment.
>
> >
> >> + *
> >> + * This function uses mutex to serialize writers, so it must not be called
> >
> > "a mutex"
>
> ACK
>
> >
> >> + * from non-sleeping context.
>
> [snip]
>
> >> +
> >> +       if (!dev || !dev->em_pd) {
> >
> > Checking dev against NULL under the mutex is pointless (either it is
> > NULL or it isn't, so check it earlier).
>
> ACK
>
> >
> >> +               ret = -EINVAL;
> >> +               goto unlock_em;
> >> +       }
> >> +
> >> +       pd = dev->em_pd;
> >
> > And I would check pd against NULL here.
>
> It's done above, next to '!dev || !dev->em_pd'

Yes, it is, I meant something like this:

    if (!cb || !cb->update_power || !dev)
        return -EINVAL;

    mutex_lock(&em_pd_mutex);

    pd = dev->em_pd;
    if (!pd) {
        ret = -EINVAL; /* or perhaps -ENODATA */
        goto unlock_em;
    }


> >
> >> +
> >> +       runtime_table = kzalloc(sizeof(*runtime_table), GFP_KERNEL);
> >> +       if (!runtime_table) {
> >> +               ret = -ENOMEM;
> >> +               goto unlock_em;
> >> +       }
> >> +
> >> +       runtime_table->state = kcalloc(pd->nr_perf_states,
> >> +                                      sizeof(struct em_perf_state),
> >> +                                      GFP_KERNEL);
> >> +       if (!runtime_table->state) {
> >> +               ret = -ENOMEM;
> >> +               goto free_runtime_table;
> >> +       }
> >
> > The above allocations can be merged into one and allocating memory
> > under the mutex is questionable.
>
> So how to make sure that there is no 2 callers trying to update the
> same EM or unregistration is not in the background?

You can allocate memory upfront and take the mutex before accessing
the shared data structures.  If there's an error in the code running
under the mutex, release it and then free the memory.

Allocating memory is one operation, updating the shared data
structures to use it is another one.  The former doesn't affect the
shared state in any way, so why do it under the mutex?

> [snip]
>
> >>
> >> @@ -501,9 +598,23 @@ void em_dev_unregister_perf_domain(struct device *dev)
> >>
> >>          runtime_table = pd->runtime_table;
> >>
> >> +       /*
> >> +        * Safely destroy runtime modifiable EM. By using the call
> >> +        * synchronize_rcu() we make sure we don't progress till last user
> >> +        * finished the RCU section and our update got applied.
> >> +        */
> >>          rcu_assign_pointer(pd->runtime_table, NULL);
> >>          synchronize_rcu();
> >>
> >> +       /*
> >> +        * After the sync no updates will be in-flight, so free the
> >> +        * memory allocated for runtime table (if there was such).
> >> +        */
> >> +       if (runtime_table != pd->default_table) {
> >> +               kfree(runtime_table->state);
> >> +               kfree(runtime_table);
> >> +       }
> >
> > Can't this race with the RCU callback freeing the runtime table?
>
> That's why there is this 'synchronize_rcu()' above and the mutex. The
> updating caller if finished the update, would unlock the mutex and this
> unregister code can go. Here we call the synchronize_rcu() so we assure
> the callback has finished for the update path and than we explicitly
> free the saved 'runtime_table' here. So all data should be freed and
> code serialized in those two paths.

This doesn't quite agree with my understanding of what synchronize_rcu() does.

IIUC, RCU callbacks can run as soon as the grace period has elapsed
and they need not wait for synchronize_rcu() to return.  Conversely,
synchronize_rcu() doesn't wait for all of the RCU callbacks to
complete.

Now, em_destroy_rt_table_rcu() doesn't actually use the mutex, so how
exactly is it protected against racing with this code?

Lukasz Luba Oct. 2, 2023, 2:09 p.m. UTC | #12

On 9/29/23 14:18, Rafael J. Wysocki wrote:
> On Fri, Sep 29, 2023 at 11:59 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
>>

[snip]

>>
>> It's done above, next to '!dev || !dev->em_pd'
> 
> Yes, it is, I meant something like this:
> 
>      if (!cb || !cb->update_power || !dev)
>          return -EINVAL;
> 
>      mutex_lock(&em_pd_mutex);
> 
>      pd = dev->em_pd;
>      if (!pd) {
>          ret = -EINVAL; /* or perhaps -ENODATA */
>          goto unlock_em;
>      }
> 
> 

OK, I see what you mean. Let me change that.

>>>
>>>> +
>>>> +       runtime_table = kzalloc(sizeof(*runtime_table), GFP_KERNEL);
>>>> +       if (!runtime_table) {
>>>> +               ret = -ENOMEM;
>>>> +               goto unlock_em;
>>>> +       }
>>>> +
>>>> +       runtime_table->state = kcalloc(pd->nr_perf_states,
>>>> +                                      sizeof(struct em_perf_state),
>>>> +                                      GFP_KERNEL);
>>>> +       if (!runtime_table->state) {
>>>> +               ret = -ENOMEM;
>>>> +               goto free_runtime_table;
>>>> +       }
>>>
>>> The above allocations can be merged into one and allocating memory
>>> under the mutex is questionable.
>>
>> So how to make sure that there is no 2 callers trying to update the
>> same EM or unregistration is not in the background?
> 
> You can allocate memory upfront and take the mutex before accessing
> the shared data structures.  If there's an error in the code running
> under the mutex, release it and then free the memory.
> 
> Allocating memory is one operation, updating the shared data
> structures to use it is another one.  The former doesn't affect the
> shared state in any way, so why do it under the mutex?

Yes, make sense. I will shrink that critical section. Good catch,
thanks!

> 
>> [snip]
>>
>>>>
>>>> @@ -501,9 +598,23 @@ void em_dev_unregister_perf_domain(struct device *dev)
>>>>
>>>>           runtime_table = pd->runtime_table;
>>>>
>>>> +       /*
>>>> +        * Safely destroy runtime modifiable EM. By using the call
>>>> +        * synchronize_rcu() we make sure we don't progress till last user
>>>> +        * finished the RCU section and our update got applied.
>>>> +        */
>>>>           rcu_assign_pointer(pd->runtime_table, NULL);
>>>>           synchronize_rcu();
>>>>
>>>> +       /*
>>>> +        * After the sync no updates will be in-flight, so free the
>>>> +        * memory allocated for runtime table (if there was such).
>>>> +        */
>>>> +       if (runtime_table != pd->default_table) {
>>>> +               kfree(runtime_table->state);
>>>> +               kfree(runtime_table);
>>>> +       }
>>>
>>> Can't this race with the RCU callback freeing the runtime table?
>>
>> That's why there is this 'synchronize_rcu()' above and the mutex. The
>> updating caller if finished the update, would unlock the mutex and this
>> unregister code can go. Here we call the synchronize_rcu() so we assure
>> the callback has finished for the update path and than we explicitly
>> free the saved 'runtime_table' here. So all data should be freed and
>> code serialized in those two paths.
> 
> This doesn't quite agree with my understanding of what synchronize_rcu() does.
> 
> IIUC, RCU callbacks can run as soon as the grace period has elapsed
> and they need not wait for synchronize_rcu() to return.  Conversely,
> synchronize_rcu() doesn't wait for all of the RCU callbacks to
> complete.
> 
> Now, em_destroy_rt_table_rcu() doesn't actually use the mutex, so how
> exactly is it protected against racing with this code?


I'll try to draw in on some pictures...

(previous instance )
+---------------------+
|                     |
| runtime table    #1 |
|                     |
+---------------------+


(next instance )
+---------------------+
|                     |
| runtime table    #2 |
|                     |
+---------------------+


(not possible new instance)
+.....................+
.                     .
. runtime table    #3 .
.                     .
+.....................+



    cpu A - "updater"          |    cpu B - "remover"
                               |
------------------------------|------------------------------
    lock em_pd_mutex           |
                               |
       alloc runtime table #2  |   lock em_pd_mutex
                               |   (waiting)
       async free instance #1  |    .
                               |    .
    unlock em_pd_mutex         |    .
                               |   (enter critical section)
                               |
    lock em_pd_mutex           |   set NULL to runtime table ptr
    (waiting)                  |
    (wanted to create #3 inst) |   synchronize rcu to make it is visible
    .                          |
    .                          |   implicit free instance #2
    .                          |
    .                          |   free the rest of EM and EM
    .                          |
    .                          |   unlock em_pd_mutex
    (enter critical section)   |
    !dev->em_pd so             |
    unlock & exit              |
                               |
                               |


This should clean all involved memory and also prevent
of allocating new instance, when we unregister EM.

Lukasz Luba Oct. 6, 2023, 8:03 a.m. UTC | #13

On 9/29/23 13:27, Rafael J. Wysocki wrote:
> On Fri, Sep 29, 2023 at 11:15 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
>>
>>

[snip]

>>>>           em_debug_remove_pd(dev);
>>>>
>>>> +       runtime_table = pd->runtime_table;
>>>> +
>>>> +       rcu_assign_pointer(pd->runtime_table, NULL);
>>>> +       synchronize_rcu();
>>>
>>> Is it really a good idea to call this under a mutex?
>>
>> This is the unregistration of the EM code path, so a thermal
>> driver which gets some IRQs might not be aware that the EM
>> is going to vanish. That's why those two code paths: update
>> & unregister are protected with the same lock.
>>
>> This synchronize_rcu() won't be long,
> 
> Are you sure?  This potentially waits for all of the CPUs in the
> system to go through a quiescent state which may take a while in
> principle.
> 
> In any case, though, this effectively makes everyone waiting for the
> mutex also wait for the grace period to elapse and they may not care
> about it.

My apologies for the delay, I had to check this. Yes, should be possible
and safe to not wait here as you described on this synchronize_rcu().

What I have drawn in other response to patch 11/18 [1] should still be
true.

Thanks, I will remove this sync call from here.

> 
>> but makes sure that when we free(dev->em_pd) we don't leak runtime_table.
>>
>>>
>>>> +
>>>>           kfree(pd->default_table->state);
>>>>           kfree(pd->default_table);
>>>>           kfree(dev->em_pd);
>>>> +
>>>
>>> Unrelated change.
>>
>> ACK
>>
>>>
>>>>           dev->em_pd = NULL;
>>>>           mutex_unlock(&em_pd_mutex);
>>>>    }
>>>> --
>>>
>>> So this really adds a pointer to a table that can be dynamically
>>> updated to struct em_perf_domain without any users so far.  It is not
>>> used anywhere as of this patch AFAICS, which is not what the changelog
>>> is saying.
>>
>> Good catch. I will adjust the changlog in header and say:
>>
>> 'Add infrastructure and mechanisms for the new runtime table.
>> The runtime modifiable EM data is used by the Energy Aware Scheduler
>> (EAS)for the task placement.
> 
> I would make it more clear that this is going to happen after some
> other subsequent changes.
> 

OK, I will add that information too.

[1] 
https://lore.kernel.org/lkml/91d6e9be-d50c-d157-55a0-79134cbd01fb@arm.com/

Daniel Lezcano Oct. 23, 2023, 5:39 p.m. UTC | #14

On 25/09/2023 10:11, Lukasz Luba wrote:
> The Energy Model (EM) is going to support runtime modification. There
> are going to be 2 EM tables which store information. This patch aims
> to prepare the code to be generic and use one of the tables. The function
> will no longer get a pointer to 'struct em_perf_domain' (the EM) but
> instead a pointer to 'struct em_perf_state' (which is one of the EM's
> tables).
> 
> Prepare em_pd_get_efficient_state() for the upcoming changes and
> make it possible to re-use. Return an index for the best performance
> state for a given EM table. The function arguments that are introduced
> should allow to work on different performance state arrays. The caller of
> em_pd_get_efficient_state() should be able to use the index either
> on the default or the modifiable EM table.
> 
> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
> ---

[ ... ]

> @@ -251,7 +253,9 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
>   	 * Find the lowest performance state of the Energy Model above the
>   	 * requested frequency.
>   	 */
> -	ps = em_pd_get_efficient_state(pd, freq);
> +	i = em_pd_get_efficient_state(pd->table, pd->nr_perf_states, freq,
> +				      pd->flags);

nitpicking but s/i/state/

Other than that:

Reviewed-by: Daniel Lezcano <daniel.lezcano@linaro.org>


> +	ps = &pd->table[i];
>   
>   	/*
>   	 * The capacity of a CPU in the domain at the performance state (ps)

[v4,00/18] Introduce runtime modifiable Energy Model

Message

Comments