diff mbox series

[v5,15/23] PM: EM: Optimize em_cpu_energy() and remove division

Message ID 20231129110853.94344-16-lukasz.luba@arm.com
State New
Headers show
Series Introduce runtime modifiable Energy Model | expand

Commit Message

Lukasz Luba Nov. 29, 2023, 11:08 a.m. UTC
The Energy Model (EM) can be modified at runtime which brings new
possibilities. The em_cpu_energy() is called by the Energy Aware Scheduler
(EAS) in it's hot path. The energy calculation uses power value for
a given performance state (ps) and the CPU busy time as percentage for that
given frequency, which effectively is:

pd_nrg = ps->power * busy_time_pct                        (1)

                 cpu_util
busy_time_pct = -----------------                         (2)
                 ps->performance

The 'ps->performance' is the CPU capacity (performance) at that given ps.
Thus, in a situation when the OS is not overloaded and we have EAS
working, the busy time is lower than 'ps->performance' that the CPU is
running at. Therefore, in longer scheduling period we can treat the power
value calculated above as the energy.

We can optimize the last arithmetic operation in em_cpu_energy() and
remove the division. This can be done because em_perf_state::cost, which
is a special coefficient, can now hold the pre-calculated value including
the 'ps->performance' information for a performance state (ps):

              ps->power
ps->cost = ---------------                                (3)
           ps->performance

In the past the 'ps->performance' had to be calculated at runtime every
time the em_cpu_energy() was called. Thus there was this formula involved:

                  ps->freq
ps->performance = ------------- * scale_cpu               (4)
                  cpu_max_freq

When we inject (4) into (2) than we can have this equation:

                 cpu_util * cpu_max_freq
busy_time_pct = ------------------------                  (5)
                 ps->freq * scale_cpu

Because the right 'scale_cpu' value wasn't ready during the boot time
and EM initialization, we had to perform the division by 'scale_cpu'
at runtime. There was not safe mechanism to update EM at runtime.
It has changed thanks to EM runtime modification feature.

It is possible to avoid the division by 'scale_cpu' at runtime, because
EM is updated whenever new max capacity CPU is set in the system or after
the boot has finished and proper CPU capacity is ready.

Use that feature and do the needed division during the calculation of the
coefficient 'ps->cost'. That enhanced 'ps->cost' value can be then just
multiplied simply by utilization:

pd_nrg = ps->cost * \Sum cpu_util                         (6)

to get the needed energy for whole Performance Domain (PD).

With this optimization, the em_cpu_energy() should run faster on the Big
CPU by 1.43x and on the Little CPU by 1.69x.

Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
---
 include/linux/energy_model.h | 68 +++++-------------------------------
 kernel/power/energy_model.c  |  7 ++--
 2 files changed, 12 insertions(+), 63 deletions(-)

Comments

Dietmar Eggemann Dec. 12, 2023, 6:50 p.m. UTC | #1
On 29/11/2023 12:08, Lukasz Luba wrote:
> The Energy Model (EM) can be modified at runtime which brings new
> possibilities. The em_cpu_energy() is called by the Energy Aware Scheduler
> (EAS) in it's hot path. The energy calculation uses power value for

NIT: s/it's/its

> a given performance state (ps) and the CPU busy time as percentage for that
> given frequency, which effectively is:
> 
> pd_nrg = ps->power * busy_time_pct                        (1)
> 
>                  cpu_util
> busy_time_pct = -----------------                         (2)
>                  ps->performance
> 
> The 'ps->performance' is the CPU capacity (performance) at that given ps.
> Thus, in a situation when the OS is not overloaded and we have EAS
> working, the busy time is lower than 'ps->performance' that the CPU is
> running at. Therefore, in longer scheduling period we can treat the power
> value calculated above as the energy.

Not sure I understand what a longer 'scheduling period' has to do with
that? Is this to highlight the issue between instantaneous power and the
energy being the integral over it? And the 'scheduling period' is the
runnable time of this task?

> We can optimize the last arithmetic operation in em_cpu_energy() and
> remove the division. This can be done because em_perf_state::cost, which
> is a special coefficient, can now hold the pre-calculated value including
> the 'ps->performance' information for a performance state (ps):
> 
>               ps->power
> ps->cost = ---------------                                (3)
>            ps->performance

Ah, this is equation (2) in the existing code with s/cap/performance.

> In the past the 'ps->performance' had to be calculated at runtime every
> time the em_cpu_energy() was called. Thus there was this formula involved:
> 

>                   ps->freq
> ps->performance = ------------- * scale_cpu               (4)
>                   cpu_max_freq
> 
> When we inject (4) into (2) than we can have this equation:
> 
>                  cpu_util * cpu_max_freq
> busy_time_pct = ------------------------                  (5)
>                  ps->freq * scale_cpu
> 
> Because the right 'scale_cpu' value wasn't ready during the boot time
> and EM initialization, we had to perform the division by 'scale_cpu'
> at runtime. There was not safe mechanism to update EM at runtime.
> It has changed thanks to EM runtime modification feature.
> 
> It is possible to avoid the division by 'scale_cpu' at runtime, because
> EM is updated whenever new max capacity CPU is set in the system or after
> the boot has finished and proper CPU capacity is ready.
> 
> Use that feature and do the needed division during the calculation of the
> coefficient 'ps->cost'. That enhanced 'ps->cost' value can be then just
> multiplied simply by utilization:
> 
> pd_nrg = ps->cost * \Sum cpu_util                         (6)
> 
> to get the needed energy for whole Performance Domain (PD).
> 
> With this optimization, the em_cpu_energy() should run faster on the Big
> CPU by 1.43x and on the Little CPU by 1.69x.

Where are those precise numbers are coming from? Which platform was it?

> 
> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
> ---
>  include/linux/energy_model.h | 68 +++++-------------------------------
>  kernel/power/energy_model.c  |  7 ++--
>  2 files changed, 12 insertions(+), 63 deletions(-)
> 
> diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
> index e30750500b10..0f5621898a81 100644
> --- a/include/linux/energy_model.h
> +++ b/include/linux/energy_model.h
> @@ -115,27 +115,6 @@ struct em_perf_domain {
>  #define EM_MAX_NUM_CPUS 16
>  #endif
>  
> -/*
> - * To avoid an overflow on 32bit machines while calculating the energy
> - * use a different order in the operation. First divide by the 'cpu_scale'
> - * which would reduce big value stored in the 'cost' field, then multiply by
> - * the 'sum_util'. This would allow to handle existing platforms, which have
> - * e.g. power ~1.3 Watt at max freq, so the 'cost' value > 1mln micro-Watts.
> - * In such scenario, where there are 4 CPUs in the Perf. Domain the 'sum_util'
> - * could be 4096, then multiplication: 'cost' * 'sum_util'  would overflow.
> - * This reordering of operations has some limitations, we lose small
> - * precision in the estimation (comparing to 64bit platform w/o reordering).
> - *
> - * We are safe on 64bit machine.
> - */
> -#ifdef CONFIG_64BIT
> -#define em_estimate_energy(cost, sum_util, scale_cpu) \
> -	(((cost) * (sum_util)) / (scale_cpu))
> -#else
> -#define em_estimate_energy(cost, sum_util, scale_cpu) \
> -	(((cost) / (scale_cpu)) * (sum_util))
> -#endif
> -
>  struct em_data_callback {
>  	/**
>  	 * active_power() - Provide power at the next performance state of
> @@ -249,29 +228,16 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
>  {
>  	struct em_perf_table *runtime_table;
>  	struct em_perf_state *ps;
> -	unsigned long scale_cpu;
> -	int cpu, i;
> +	int i;
>  
>  	if (!sum_util)
>  		return 0;
>  
> -	/*
> -	 * In order to predict the performance state, map the utilization of
> -	 * the most utilized CPU of the performance domain to a requested
> -	 * frequency, like schedutil. Take also into account that the real
> -	 * frequency might be set lower (due to thermal capping). Thus, clamp
> -	 * max utilization to the allowed CPU capacity before calculating
> -	 * effective frequency.

Why do you remove this comment? IMHO, it's still valid and independent
of the changes here?

> -	 */
> -	cpu = cpumask_first(to_cpumask(pd->cpus));
> -	scale_cpu = arch_scale_cpu_capacity(cpu);
> -
>  	/*
>  	 * No rcu_read_lock() since it's already called by task scheduler.
>  	 * The runtime_table is always there for CPUs, so we don't check.
>  	 */
>  	runtime_table = rcu_dereference(pd->runtime_table);
> -
>  	ps = &runtime_table->state[pd->nr_perf_states - 1];
>  
>  	max_util = map_util_perf(max_util);
> @@ -286,35 +252,21 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
>  	ps = &runtime_table->state[i];
>  
>  	/*
> -	 * The capacity of a CPU in the domain at the performance state (ps)
> -	 * can be computed as:
> -	 *
> -	 *             ps->freq * scale_cpu
> -	 *   ps->cap = --------------------                          (1)
> -	 *                 cpu_max_freq
> -	 *
> -	 * So, ignoring the costs of idle states (which are not available in
> -	 * the EM), the energy consumed by this CPU at that performance state
> +	 * The energy consumed by the CPU at the given performance state (ps)
>  	 * is estimated as:
>  	 *
> -	 *             ps->power * cpu_util
> -	 *   cpu_nrg = --------------------                          (2)
> -	 *                   ps->cap
> +	 *                ps->power
> +	 *   cpu_nrg = --------------- * cpu_util                    (1)
> +	 *             ps->performance
>  	 *
> -	 * since 'cpu_util / ps->cap' represents its percentage of busy time.
> +	 * The 'cpu_util / ps->performance' represents its percentage of
> +	 * busy time. The idle cost is ignored (it's not available in the EM).
>  	 *
>  	 *   NOTE: Although the result of this computation actually is in
>  	 *         units of power, it can be manipulated as an energy value
>  	 *         over a scheduling period, since it is assumed to be
>  	 *         constant during that interval.
>  	 *
> -	 * By injecting (1) in (2), 'cpu_nrg' can be re-expressed as a product
> -	 * of two terms:
> -	 *
> -	 *             ps->power * cpu_max_freq   cpu_util
> -	 *   cpu_nrg = ------------------------ * ---------          (3)
> -	 *                    ps->freq            scale_cpu
> -	 *
>  	 * The first term is static, and is stored in the em_perf_state struct
>  	 * as 'ps->cost'.
>  	 *
> @@ -323,11 +275,9 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
>  	 * total energy of the domain (which is the simple sum of the energy of
>  	 * all of its CPUs) can be factorized as:
>  	 *
> -	 *            ps->cost * \Sum cpu_util
> -	 *   pd_nrg = ------------------------                       (4)
> -	 *                  scale_cpu
> +	 *   pd_nrg = ps->cost * \Sum cpu_util                       (2)
>  	 */
> -	return em_estimate_energy(ps->cost, sum_util, scale_cpu);
> +	return ps->cost * sum_util;

Can you not keep the existing comment and only change:

(a) that ps->cap id ps->performance in (2) and

(b) that:

          *             ps->power * cpu_max_freq   cpu_util
          *   cpu_nrg = ------------------------ * ---------     (3)
          *                    ps->freq            scale_cpu

                        <---- (old) ps->cost --->

    is now

                ps->power * cpu_max_freq       1
    ps-> cost = ------------------------ * ----------
                        ps->freq            scale_cpu

                <---- (old) ps->cost --->

and (c) that (4) has changed to:

         *   pd_nrg = ps->cost * \Sum cpu_util                   (4)

which avoid the division?

Less changes is always much nicer since it makes it so much easier to
detect history and review changes.

I do understand the changes from the technical viewpoint but the review
took me way too long which I partly blame to all the changes in the
comments which could have been avoided. Just want to make sure that
others done have to go through this pain too.

[...]
Lukasz Luba Dec. 20, 2023, 8:42 a.m. UTC | #2
On 12/12/23 18:50, Dietmar Eggemann wrote:
> On 29/11/2023 12:08, Lukasz Luba wrote:
>> The Energy Model (EM) can be modified at runtime which brings new
>> possibilities. The em_cpu_energy() is called by the Energy Aware Scheduler
>> (EAS) in it's hot path. The energy calculation uses power value for
> 
> NIT: s/it's/its

OK

> 
>> a given performance state (ps) and the CPU busy time as percentage for that
>> given frequency, which effectively is:
>>
>> pd_nrg = ps->power * busy_time_pct                        (1)
>>
>>                   cpu_util
>> busy_time_pct = -----------------                         (2)
>>                   ps->performance
>>
>> The 'ps->performance' is the CPU capacity (performance) at that given ps.
>> Thus, in a situation when the OS is not overloaded and we have EAS
>> working, the busy time is lower than 'ps->performance' that the CPU is
>> running at. Therefore, in longer scheduling period we can treat the power
>> value calculated above as the energy.
> 
> Not sure I understand what a longer 'scheduling period' has to do with
> that? Is this to highlight the issue between instantaneous power and the
> energy being the integral over it? And the 'scheduling period' is the
> runnable time of this task?

I can probably drop this sentence. I just wanted to describe that EAS
operates on power values, but actually assumes that it will be energy
because we know that the tasks will run longer. It's not the best
place to even try to describe this bit of EAS+EM in this patch header.

> 
>> We can optimize the last arithmetic operation in em_cpu_energy() and
>> remove the division. This can be done because em_perf_state::cost, which
>> is a special coefficient, can now hold the pre-calculated value including
>> the 'ps->performance' information for a performance state (ps):
>>
>>                ps->power
>> ps->cost = ---------------                                (3)
>>             ps->performance
> 
> Ah, this is equation (2) in the existing code with s/cap/performance.

yes

> 
>> In the past the 'ps->performance' had to be calculated at runtime every
>> time the em_cpu_energy() was called. Thus there was this formula involved:
>>
> 
>>                    ps->freq
>> ps->performance = ------------- * scale_cpu               (4)
>>                    cpu_max_freq
>>
>> When we inject (4) into (2) than we can have this equation:
>>
>>                   cpu_util * cpu_max_freq
>> busy_time_pct = ------------------------                  (5)
>>                   ps->freq * scale_cpu
>>
>> Because the right 'scale_cpu' value wasn't ready during the boot time
>> and EM initialization, we had to perform the division by 'scale_cpu'
>> at runtime. There was not safe mechanism to update EM at runtime.
>> It has changed thanks to EM runtime modification feature.
>>
>> It is possible to avoid the division by 'scale_cpu' at runtime, because
>> EM is updated whenever new max capacity CPU is set in the system or after
>> the boot has finished and proper CPU capacity is ready.
>>
>> Use that feature and do the needed division during the calculation of the
>> coefficient 'ps->cost'. That enhanced 'ps->cost' value can be then just
>> multiplied simply by utilization:
>>
>> pd_nrg = ps->cost * \Sum cpu_util                         (6)
>>
>> to get the needed energy for whole Performance Domain (PD).
>>
>> With this optimization, the em_cpu_energy() should run faster on the Big
>> CPU by 1.43x and on the Little CPU by 1.69x.
> 
> Where are those precise numbers are coming from? Which platform was it?

That was mainline big.Little board rockpi4 b w/ rockchip 3399, present
quite a few commercial devices (e.g. chromebooks or plenty other seen in
DT). The numbers are from measuring the time it takes to run this
function em_cpu_cost() in a loop for mln of times. Thus, the instruction
cache and data cache should be hot, but the operation would impact the
different score.

> 
>>
>> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
>> ---
>>   include/linux/energy_model.h | 68 +++++-------------------------------
>>   kernel/power/energy_model.c  |  7 ++--
>>   2 files changed, 12 insertions(+), 63 deletions(-)
>>
>> diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
>> index e30750500b10..0f5621898a81 100644
>> --- a/include/linux/energy_model.h
>> +++ b/include/linux/energy_model.h
>> @@ -115,27 +115,6 @@ struct em_perf_domain {
>>   #define EM_MAX_NUM_CPUS 16
>>   #endif
>>   
>> -/*
>> - * To avoid an overflow on 32bit machines while calculating the energy
>> - * use a different order in the operation. First divide by the 'cpu_scale'
>> - * which would reduce big value stored in the 'cost' field, then multiply by
>> - * the 'sum_util'. This would allow to handle existing platforms, which have
>> - * e.g. power ~1.3 Watt at max freq, so the 'cost' value > 1mln micro-Watts.
>> - * In such scenario, where there are 4 CPUs in the Perf. Domain the 'sum_util'
>> - * could be 4096, then multiplication: 'cost' * 'sum_util'  would overflow.
>> - * This reordering of operations has some limitations, we lose small
>> - * precision in the estimation (comparing to 64bit platform w/o reordering).
>> - *
>> - * We are safe on 64bit machine.
>> - */
>> -#ifdef CONFIG_64BIT
>> -#define em_estimate_energy(cost, sum_util, scale_cpu) \
>> -	(((cost) * (sum_util)) / (scale_cpu))
>> -#else
>> -#define em_estimate_energy(cost, sum_util, scale_cpu) \
>> -	(((cost) / (scale_cpu)) * (sum_util))
>> -#endif
>> -
>>   struct em_data_callback {
>>   	/**
>>   	 * active_power() - Provide power at the next performance state of
>> @@ -249,29 +228,16 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
>>   {
>>   	struct em_perf_table *runtime_table;
>>   	struct em_perf_state *ps;
>> -	unsigned long scale_cpu;
>> -	int cpu, i;
>> +	int i;
>>   
>>   	if (!sum_util)
>>   		return 0;
>>   
>> -	/*
>> -	 * In order to predict the performance state, map the utilization of
>> -	 * the most utilized CPU of the performance domain to a requested
>> -	 * frequency, like schedutil. Take also into account that the real
>> -	 * frequency might be set lower (due to thermal capping). Thus, clamp
>> -	 * max utilization to the allowed CPU capacity before calculating
>> -	 * effective frequency.
> 
> Why do you remove this comment? IMHO, it's still valid and independent
> of the changes here?

Fair enough, I thought this comment makes more confusion in the new
function, but I'll keep it.

> 
>> -	 */
>> -	cpu = cpumask_first(to_cpumask(pd->cpus));
>> -	scale_cpu = arch_scale_cpu_capacity(cpu);
>> -
>>   	/*
>>   	 * No rcu_read_lock() since it's already called by task scheduler.
>>   	 * The runtime_table is always there for CPUs, so we don't check.
>>   	 */
>>   	runtime_table = rcu_dereference(pd->runtime_table);
>> -
>>   	ps = &runtime_table->state[pd->nr_perf_states - 1];
>>   
>>   	max_util = map_util_perf(max_util);
>> @@ -286,35 +252,21 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
>>   	ps = &runtime_table->state[i];
>>   
>>   	/*
>> -	 * The capacity of a CPU in the domain at the performance state (ps)
>> -	 * can be computed as:
>> -	 *
>> -	 *             ps->freq * scale_cpu
>> -	 *   ps->cap = --------------------                          (1)
>> -	 *                 cpu_max_freq
>> -	 *
>> -	 * So, ignoring the costs of idle states (which are not available in
>> -	 * the EM), the energy consumed by this CPU at that performance state
>> +	 * The energy consumed by the CPU at the given performance state (ps)
>>   	 * is estimated as:
>>   	 *
>> -	 *             ps->power * cpu_util
>> -	 *   cpu_nrg = --------------------                          (2)
>> -	 *                   ps->cap
>> +	 *                ps->power
>> +	 *   cpu_nrg = --------------- * cpu_util                    (1)
>> +	 *             ps->performance
>>   	 *
>> -	 * since 'cpu_util / ps->cap' represents its percentage of busy time.
>> +	 * The 'cpu_util / ps->performance' represents its percentage of
>> +	 * busy time. The idle cost is ignored (it's not available in the EM).
>>   	 *
>>   	 *   NOTE: Although the result of this computation actually is in
>>   	 *         units of power, it can be manipulated as an energy value
>>   	 *         over a scheduling period, since it is assumed to be
>>   	 *         constant during that interval.
>>   	 *
>> -	 * By injecting (1) in (2), 'cpu_nrg' can be re-expressed as a product
>> -	 * of two terms:
>> -	 *
>> -	 *             ps->power * cpu_max_freq   cpu_util
>> -	 *   cpu_nrg = ------------------------ * ---------          (3)
>> -	 *                    ps->freq            scale_cpu
>> -	 *
>>   	 * The first term is static, and is stored in the em_perf_state struct
>>   	 * as 'ps->cost'.
>>   	 *
>> @@ -323,11 +275,9 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
>>   	 * total energy of the domain (which is the simple sum of the energy of
>>   	 * all of its CPUs) can be factorized as:
>>   	 *
>> -	 *            ps->cost * \Sum cpu_util
>> -	 *   pd_nrg = ------------------------                       (4)
>> -	 *                  scale_cpu
>> +	 *   pd_nrg = ps->cost * \Sum cpu_util                       (2)
>>   	 */
>> -	return em_estimate_energy(ps->cost, sum_util, scale_cpu);
>> +	return ps->cost * sum_util;
> 
> Can you not keep the existing comment and only change:
> 
> (a) that ps->cap id ps->performance in (2) and
> 
> (b) that:
> 
>            *             ps->power * cpu_max_freq   cpu_util
>            *   cpu_nrg = ------------------------ * ---------     (3)
>            *                    ps->freq            scale_cpu
> 
>                          <---- (old) ps->cost --->
> 
>      is now
> 
>                  ps->power * cpu_max_freq       1
>      ps-> cost = ------------------------ * ----------
>                          ps->freq            scale_cpu
> 
>                  <---- (old) ps->cost --->
> 
> and (c) that (4) has changed to:
> 
>           *   pd_nrg = ps->cost * \Sum cpu_util                   (4)
> 
> which avoid the division?
> 
> Less changes is always much nicer since it makes it so much easier to
> detect history and review changes.

I'm open to change that, but I will have to contact you offline
what you mean. This comment section in code is really tricky to
handle right.

> 
> I do understand the changes from the technical viewpoint but the review
> took me way too long which I partly blame to all the changes in the
> comments which could have been avoided. Just want to make sure that
> others done have to go through this pain too.
> 

I'll try to apply your comments and produce smaller diff in that
patch.
Lukasz Luba Jan. 2, 2024, 11:47 a.m. UTC | #3
On 12/28/23 18:06, Qais Yousef wrote:
> On 11/29/23 11:08, Lukasz Luba wrote:
> 
>> @@ -220,8 +218,9 @@ static int em_compute_costs(struct device *dev, struct em_perf_state *table,
>>   				return -EINVAL;
>>   			}
>>   		} else {
>> -			power_res = table[i].power;
>> -			cost = div64_u64(fmax * power_res, table[i].frequency);
>> +			/* increase resolution of 'cost' precision */
>> +			power_res = table[i].power * 10;
> 
> Power is in uW, right? You're just taking advantage here that everything will
> use this new cost field so you can add as many 0s to improve resolution without
> impact elsewhere that care to compare using the same units?

This code doesn't overwrite the 'power' field value. The 'cost' value is
only used in EAS, so yes I just want to increase resolution there.

I think you mixed 'power' and 'cost' fields. We don't compare 'cost'
anywhere. We just use 'cost' in one place em_cpu_energy() and we
multiply it (not compare it).

> 
> Did you see a problem or just being extra cautious here?

There is no problem, 'cost' is a private coefficient for EAS only.

> 
>> +			cost = power_res / table[i].performance;
>>   		}
>>   
>>   		table[i].cost = cost;
>> -- 
>> 2.25.1
>>
Lukasz Luba Jan. 4, 2024, 4:56 p.m. UTC | #4
On 1/4/24 16:30, Dietmar Eggemann wrote:
> On 20/12/2023 09:42, Lukasz Luba wrote:
>>
>>
>> On 12/12/23 18:50, Dietmar Eggemann wrote:
>>> On 29/11/2023 12:08, Lukasz Luba wrote:
> 
> [...]
> 
>>>> With this optimization, the em_cpu_energy() should run faster on the Big
>>>> CPU by 1.43x and on the Little CPU by 1.69x.
>>>
>>> Where are those precise numbers are coming from? Which platform was it?
>>
>> That was mainline big.Little board rockpi4 b w/ rockchip 3399, present
> 
> IMHO, you should mention the platform here so people don't wonder.
> 
>> quite a few commercial devices (e.g. chromebooks or plenty other seen in
>> DT). The numbers are from measuring the time it takes to run this
>> function em_cpu_cost() in a loop for mln of times. Thus, the instruction
>> cache and data cache should be hot, but the operation would impact the
>> different score.
> 
> [...]
> 
>>> Can you not keep the existing comment and only change:
>>>
>>> (a) that ps->cap id ps->performance in (2) and
>>>
>>> (b) that:
>>>
>>>             *             ps->power * cpu_max_freq   cpu_util
>>>             *   cpu_nrg = ------------------------ * ---------     (3)
>>>             *                    ps->freq            scale_cpu
>>>
>>>                           <---- (old) ps->cost --->
>>>
>>>       is now
>>>
>>>                   ps->power * cpu_max_freq       1
>>>       ps-> cost = ------------------------ * ----------
>>>                           ps->freq            scale_cpu
>>>
>>>                   <---- (old) ps->cost --->
>>>
>>> and (c) that (4) has changed to:
>>>
>>>            *   pd_nrg = ps->cost * \Sum cpu_util                   (4)
>>>
>>> which avoid the division?
>>>
>>> Less changes is always much nicer since it makes it so much easier to
>>> detect history and review changes.
>>
>> I'm open to change that, but I will have to contact you offline
>> what you mean. This comment section in code is really tricky to
>> handle right.
> 
> OK, the changes you showed me offline LGTM.
> 
> [...]
> 

All good then. Thank you for the comments. I'll send v6.
Lukasz Luba Jan. 10, 2024, 1:53 p.m. UTC | #5
On 1/4/24 19:23, Qais Yousef wrote:
> On 01/02/24 11:47, Lukasz Luba wrote:
>>> Did you see a problem or just being extra cautious here?
>>
>> There is no problem, 'cost' is a private coefficient for EAS only.
> 
> Let me  ask differently, what goes wrong if you don't increase the resolution
> here? Why is it necessary?
> 


When you have 800mW at CPU capacity 1024, then the value is small (below
1 thousand).
Example:
power = 800000 uW
cost = 800000 / 1024 = 781

While I know from past that sometimes OPPs might have close voltage
values and a rounding could occur and make some OPPs inefficient
while they aren't.

This is what would happen when we have the 1x resolution:
/sys/kernel/debug/energy_model/cpu4/ps:1008000/cost:551
/sys/kernel/debug/energy_model/cpu4/ps:1200000/cost:644
/sys/kernel/debug/energy_model/cpu4/ps:1416000/cost:744
/sys/kernel/debug/energy_model/cpu4/ps:1512000/cost:851
/sys/kernel/debug/energy_model/cpu4/ps:408000/cost:493
/sys/kernel/debug/energy_model/cpu4/ps:600000/cost:493
/sys/kernel/debug/energy_model/cpu4/ps:816000/cost:493
The bottom 3 OPPs have the same 'cost' thus 2 OPPs are in-efficient,
which is not true (see below).

This is what would happen when we have the 10x resolution:
/sys/kernel/debug/energy_model/cpu4/ps:1008000/cost:5513
/sys/kernel/debug/energy_model/cpu4/ps:1200000/cost:6443
/sys/kernel/debug/energy_model/cpu4/ps:1416000/cost:7447
/sys/kernel/debug/energy_model/cpu4/ps:1512000/cost:8514
/sys/kernel/debug/energy_model/cpu4/ps:408000/cost:4934
/sys/kernel/debug/energy_model/cpu4/ps:600000/cost:4933
/sys/kernel/debug/energy_model/cpu4/ps:816000/cost:4934
Here the OPP with 600MHz is more efficient than 408MHz,
which is true. So only 408MHz will be marked as in-efficient OPP.


This is what would happen when we have the 100x resolution:
/sys/kernel/debug/energy_model/cpu4/ps:1008000/cost:55137
/sys/kernel/debug/energy_model/cpu4/ps:1200000/cost:64433
/sys/kernel/debug/energy_model/cpu4/ps:1416000/cost:74473
/sys/kernel/debug/energy_model/cpu4/ps:1512000/cost:85140
/sys/kernel/debug/energy_model/cpu4/ps:408000/cost:49346
/sys/kernel/debug/energy_model/cpu4/ps:600000/cost:49331
/sys/kernel/debug/energy_model/cpu4/ps:816000/cost:49346
The higher (100x) resolution does not bring that much in
practice.

If you have other questions, let's continue on v6 series.
Qais Yousef Jan. 15, 2024, 12:21 p.m. UTC | #6
On 01/10/24 13:53, Lukasz Luba wrote:
> 
> 
> On 1/4/24 19:23, Qais Yousef wrote:
> > On 01/02/24 11:47, Lukasz Luba wrote:
> > > > Did you see a problem or just being extra cautious here?
> > > 
> > > There is no problem, 'cost' is a private coefficient for EAS only.
> > 
> > Let me  ask differently, what goes wrong if you don't increase the resolution
> > here? Why is it necessary?
> > 
> 
> 
> When you have 800mW at CPU capacity 1024, then the value is small (below
> 1 thousand).
> Example:
> power = 800000 uW
> cost = 800000 / 1024 = 781
> 
> While I know from past that sometimes OPPs might have close voltage
> values and a rounding could occur and make some OPPs inefficient
> while they aren't.
> 
> This is what would happen when we have the 1x resolution:
> /sys/kernel/debug/energy_model/cpu4/ps:1008000/cost:551
> /sys/kernel/debug/energy_model/cpu4/ps:1200000/cost:644
> /sys/kernel/debug/energy_model/cpu4/ps:1416000/cost:744
> /sys/kernel/debug/energy_model/cpu4/ps:1512000/cost:851
> /sys/kernel/debug/energy_model/cpu4/ps:408000/cost:493
> /sys/kernel/debug/energy_model/cpu4/ps:600000/cost:493
> /sys/kernel/debug/energy_model/cpu4/ps:816000/cost:493
> The bottom 3 OPPs have the same 'cost' thus 2 OPPs are in-efficient,
> which is not true (see below).
> 
> This is what would happen when we have the 10x resolution:
> /sys/kernel/debug/energy_model/cpu4/ps:1008000/cost:5513
> /sys/kernel/debug/energy_model/cpu4/ps:1200000/cost:6443
> /sys/kernel/debug/energy_model/cpu4/ps:1416000/cost:7447
> /sys/kernel/debug/energy_model/cpu4/ps:1512000/cost:8514
> /sys/kernel/debug/energy_model/cpu4/ps:408000/cost:4934
> /sys/kernel/debug/energy_model/cpu4/ps:600000/cost:4933
> /sys/kernel/debug/energy_model/cpu4/ps:816000/cost:4934
> Here the OPP with 600MHz is more efficient than 408MHz,
> which is true. So only 408MHz will be marked as in-efficient OPP.
> 
> 
> This is what would happen when we have the 100x resolution:
> /sys/kernel/debug/energy_model/cpu4/ps:1008000/cost:55137
> /sys/kernel/debug/energy_model/cpu4/ps:1200000/cost:64433
> /sys/kernel/debug/energy_model/cpu4/ps:1416000/cost:74473
> /sys/kernel/debug/energy_model/cpu4/ps:1512000/cost:85140
> /sys/kernel/debug/energy_model/cpu4/ps:408000/cost:49346
> /sys/kernel/debug/energy_model/cpu4/ps:600000/cost:49331
> /sys/kernel/debug/energy_model/cpu4/ps:816000/cost:49346
> The higher (100x) resolution does not bring that much in
> practice.

So it seems a uW is not sufficient enough. We moved from mW because of
resolution already. Shall we make it nW then and multiply by 1000 always? The
choice of 10 looks arbitrary IMHO

> 
> If you have other questions, let's continue on v6 series.
Lukasz Luba Jan. 15, 2024, 12:36 p.m. UTC | #7
On 1/15/24 12:21, Qais Yousef wrote:
> On 01/10/24 13:53, Lukasz Luba wrote:
>>
>>
>> On 1/4/24 19:23, Qais Yousef wrote:
>>> On 01/02/24 11:47, Lukasz Luba wrote:
>>>>> Did you see a problem or just being extra cautious here?
>>>>
>>>> There is no problem, 'cost' is a private coefficient for EAS only.
>>>
>>> Let me  ask differently, what goes wrong if you don't increase the resolution
>>> here? Why is it necessary?
>>>
>>
>>
>> When you have 800mW at CPU capacity 1024, then the value is small (below
>> 1 thousand).
>> Example:
>> power = 800000 uW
>> cost = 800000 / 1024 = 781
>>
>> While I know from past that sometimes OPPs might have close voltage
>> values and a rounding could occur and make some OPPs inefficient
>> while they aren't.
>>
>> This is what would happen when we have the 1x resolution:
>> /sys/kernel/debug/energy_model/cpu4/ps:1008000/cost:551
>> /sys/kernel/debug/energy_model/cpu4/ps:1200000/cost:644
>> /sys/kernel/debug/energy_model/cpu4/ps:1416000/cost:744
>> /sys/kernel/debug/energy_model/cpu4/ps:1512000/cost:851
>> /sys/kernel/debug/energy_model/cpu4/ps:408000/cost:493
>> /sys/kernel/debug/energy_model/cpu4/ps:600000/cost:493
>> /sys/kernel/debug/energy_model/cpu4/ps:816000/cost:493
>> The bottom 3 OPPs have the same 'cost' thus 2 OPPs are in-efficient,
>> which is not true (see below).
>>
>> This is what would happen when we have the 10x resolution:
>> /sys/kernel/debug/energy_model/cpu4/ps:1008000/cost:5513
>> /sys/kernel/debug/energy_model/cpu4/ps:1200000/cost:6443
>> /sys/kernel/debug/energy_model/cpu4/ps:1416000/cost:7447
>> /sys/kernel/debug/energy_model/cpu4/ps:1512000/cost:8514
>> /sys/kernel/debug/energy_model/cpu4/ps:408000/cost:4934
>> /sys/kernel/debug/energy_model/cpu4/ps:600000/cost:4933
>> /sys/kernel/debug/energy_model/cpu4/ps:816000/cost:4934
>> Here the OPP with 600MHz is more efficient than 408MHz,
>> which is true. So only 408MHz will be marked as in-efficient OPP.
>>
>>
>> This is what would happen when we have the 100x resolution:
>> /sys/kernel/debug/energy_model/cpu4/ps:1008000/cost:55137
>> /sys/kernel/debug/energy_model/cpu4/ps:1200000/cost:64433
>> /sys/kernel/debug/energy_model/cpu4/ps:1416000/cost:74473
>> /sys/kernel/debug/energy_model/cpu4/ps:1512000/cost:85140
>> /sys/kernel/debug/energy_model/cpu4/ps:408000/cost:49346
>> /sys/kernel/debug/energy_model/cpu4/ps:600000/cost:49331
>> /sys/kernel/debug/energy_model/cpu4/ps:816000/cost:49346
>> The higher (100x) resolution does not bring that much in
>> practice.
> 
> So it seems a uW is not sufficient enough. We moved from mW because of
> resolution already. Shall we make it nW then and multiply by 1000 always? The
> choice of 10 looks arbitrary IMHO
> 

No, there is no need of nW in the 'power' field for this.
You've missed the point.
Qais Yousef Jan. 16, 2024, 1:10 p.m. UTC | #8
On 01/15/24 12:36, Lukasz Luba wrote:
> 
> 
> On 1/15/24 12:21, Qais Yousef wrote:
> > On 01/10/24 13:53, Lukasz Luba wrote:
> > > 
> > > 
> > > On 1/4/24 19:23, Qais Yousef wrote:
> > > > On 01/02/24 11:47, Lukasz Luba wrote:
> > > > > > Did you see a problem or just being extra cautious here?
> > > > > 
> > > > > There is no problem, 'cost' is a private coefficient for EAS only.
> > > > 
> > > > Let me  ask differently, what goes wrong if you don't increase the resolution
> > > > here? Why is it necessary?
> > > > 
> > > 
> > > 
> > > When you have 800mW at CPU capacity 1024, then the value is small (below
> > > 1 thousand).
> > > Example:
> > > power = 800000 uW
> > > cost = 800000 / 1024 = 781
> > > 
> > > While I know from past that sometimes OPPs might have close voltage
> > > values and a rounding could occur and make some OPPs inefficient
> > > while they aren't.
> > > 
> > > This is what would happen when we have the 1x resolution:
> > > /sys/kernel/debug/energy_model/cpu4/ps:1008000/cost:551
> > > /sys/kernel/debug/energy_model/cpu4/ps:1200000/cost:644
> > > /sys/kernel/debug/energy_model/cpu4/ps:1416000/cost:744
> > > /sys/kernel/debug/energy_model/cpu4/ps:1512000/cost:851
> > > /sys/kernel/debug/energy_model/cpu4/ps:408000/cost:493
> > > /sys/kernel/debug/energy_model/cpu4/ps:600000/cost:493
> > > /sys/kernel/debug/energy_model/cpu4/ps:816000/cost:493
> > > The bottom 3 OPPs have the same 'cost' thus 2 OPPs are in-efficient,
> > > which is not true (see below).
> > > 
> > > This is what would happen when we have the 10x resolution:
> > > /sys/kernel/debug/energy_model/cpu4/ps:1008000/cost:5513
> > > /sys/kernel/debug/energy_model/cpu4/ps:1200000/cost:6443
> > > /sys/kernel/debug/energy_model/cpu4/ps:1416000/cost:7447
> > > /sys/kernel/debug/energy_model/cpu4/ps:1512000/cost:8514
> > > /sys/kernel/debug/energy_model/cpu4/ps:408000/cost:4934
> > > /sys/kernel/debug/energy_model/cpu4/ps:600000/cost:4933
> > > /sys/kernel/debug/energy_model/cpu4/ps:816000/cost:4934
> > > Here the OPP with 600MHz is more efficient than 408MHz,
> > > which is true. So only 408MHz will be marked as in-efficient OPP.
> > > 
> > > 
> > > This is what would happen when we have the 100x resolution:
> > > /sys/kernel/debug/energy_model/cpu4/ps:1008000/cost:55137
> > > /sys/kernel/debug/energy_model/cpu4/ps:1200000/cost:64433
> > > /sys/kernel/debug/energy_model/cpu4/ps:1416000/cost:74473
> > > /sys/kernel/debug/energy_model/cpu4/ps:1512000/cost:85140
> > > /sys/kernel/debug/energy_model/cpu4/ps:408000/cost:49346
> > > /sys/kernel/debug/energy_model/cpu4/ps:600000/cost:49331
> > > /sys/kernel/debug/energy_model/cpu4/ps:816000/cost:49346
> > > The higher (100x) resolution does not bring that much in
> > > practice.
> > 
> > So it seems a uW is not sufficient enough. We moved from mW because of
> > resolution already. Shall we make it nW then and multiply by 1000 always? The
> > choice of 10 looks arbitrary IMHO
> > 
> 
> No, there is no need of nW in the 'power' field for this.
> You've missed the point.

I think you're missing what I am saying. The multiplication by 10 looks like
magic value to increase resolution based on a single observation you noticed.

The feedback I am giving is that this needs to be better explained, in
a comment. And instead of multiplying by 10 multiply by 1000. Saying this is
enough based on a single observation is not adequate IMO.

Also the difference is tiny. Could you actually measure a difference in overall
power with and without this extra decimal point resolution? It might be better
to run at 816MHz and go back to idle faster. So the trade-off is not clear cut
to me.

So generally I am not keen on magic values based on single observations.
I think removing this or use 1000 is better.

AFAICT you decided that 0.1uW is worth caring about. But 0.19uW difference
isn't.

I can't see how much difference this makes in practice tbh. But using more
uniform conversion so that the cost is in nW (keep the power field in uW) makes
more sense at least.

It still raises the question whether this minuscule cost difference is actually
better taken into account. I think the perf/watt for 816MHz is much better so
skipping 600MHz as inefficient looks better to me.
Lukasz Luba Jan. 16, 2024, 3:34 p.m. UTC | #9
On 1/16/24 13:10, Qais Yousef wrote:
> On 01/15/24 12:36, Lukasz Luba wrote:
>>
>>
>> On 1/15/24 12:21, Qais Yousef wrote:
>>> On 01/10/24 13:53, Lukasz Luba wrote:
>>>>
>>>>
>>>> On 1/4/24 19:23, Qais Yousef wrote:
>>>>> On 01/02/24 11:47, Lukasz Luba wrote:
>>>>>>> Did you see a problem or just being extra cautious here?
>>>>>>
>>>>>> There is no problem, 'cost' is a private coefficient for EAS only.
>>>>>
>>>>> Let me  ask differently, what goes wrong if you don't increase the resolution
>>>>> here? Why is it necessary?
>>>>>
>>>>
>>>>
>>>> When you have 800mW at CPU capacity 1024, then the value is small (below
>>>> 1 thousand).
>>>> Example:
>>>> power = 800000 uW
>>>> cost = 800000 / 1024 = 781
>>>>
>>>> While I know from past that sometimes OPPs might have close voltage
>>>> values and a rounding could occur and make some OPPs inefficient
>>>> while they aren't.
>>>>
>>>> This is what would happen when we have the 1x resolution:
>>>> /sys/kernel/debug/energy_model/cpu4/ps:1008000/cost:551
>>>> /sys/kernel/debug/energy_model/cpu4/ps:1200000/cost:644
>>>> /sys/kernel/debug/energy_model/cpu4/ps:1416000/cost:744
>>>> /sys/kernel/debug/energy_model/cpu4/ps:1512000/cost:851
>>>> /sys/kernel/debug/energy_model/cpu4/ps:408000/cost:493
>>>> /sys/kernel/debug/energy_model/cpu4/ps:600000/cost:493
>>>> /sys/kernel/debug/energy_model/cpu4/ps:816000/cost:493
>>>> The bottom 3 OPPs have the same 'cost' thus 2 OPPs are in-efficient,
>>>> which is not true (see below).
>>>>
>>>> This is what would happen when we have the 10x resolution:
>>>> /sys/kernel/debug/energy_model/cpu4/ps:1008000/cost:5513
>>>> /sys/kernel/debug/energy_model/cpu4/ps:1200000/cost:6443
>>>> /sys/kernel/debug/energy_model/cpu4/ps:1416000/cost:7447
>>>> /sys/kernel/debug/energy_model/cpu4/ps:1512000/cost:8514
>>>> /sys/kernel/debug/energy_model/cpu4/ps:408000/cost:4934
>>>> /sys/kernel/debug/energy_model/cpu4/ps:600000/cost:4933
>>>> /sys/kernel/debug/energy_model/cpu4/ps:816000/cost:4934
>>>> Here the OPP with 600MHz is more efficient than 408MHz,
>>>> which is true. So only 408MHz will be marked as in-efficient OPP.
>>>>
>>>>
>>>> This is what would happen when we have the 100x resolution:
>>>> /sys/kernel/debug/energy_model/cpu4/ps:1008000/cost:55137
>>>> /sys/kernel/debug/energy_model/cpu4/ps:1200000/cost:64433
>>>> /sys/kernel/debug/energy_model/cpu4/ps:1416000/cost:74473
>>>> /sys/kernel/debug/energy_model/cpu4/ps:1512000/cost:85140
>>>> /sys/kernel/debug/energy_model/cpu4/ps:408000/cost:49346
>>>> /sys/kernel/debug/energy_model/cpu4/ps:600000/cost:49331
>>>> /sys/kernel/debug/energy_model/cpu4/ps:816000/cost:49346
>>>> The higher (100x) resolution does not bring that much in
>>>> practice.
>>>
>>> So it seems a uW is not sufficient enough. We moved from mW because of
>>> resolution already. Shall we make it nW then and multiply by 1000 always? The
>>> choice of 10 looks arbitrary IMHO
>>>
>>
>> No, there is no need of nW in the 'power' field for this.
>> You've missed the point.
> 
> I think you're missing what I am saying. The multiplication by 10 looks like
> magic value to increase resolution based on a single observation you noticed.
> 
> The feedback I am giving is that this needs to be better explained, in
> a comment. And instead of multiplying by 10 multiply by 1000. Saying this is
> enough based on a single observation is not adequate IMO.

I think you are trying to review something which you don't have full
details and previous history. I have been fighting with those rounding
issues in past and there are commits with description of issues.
You haven't analyze all edge cases, one more is below (about your
proposal with 1000x the nW).

> 
> Also the difference is tiny. Could you actually measure a difference in overall
> power with and without this extra decimal point resolution? It might be better

Yes, I had such power measurements, but for older rounding issues. Take
into account that the EM model is reflecting one CPU, but in reality we
often have 4 CPUs linked together in one frequency domain. Thus, a small
energy difference is actually multiplied.

> to run at 816MHz and go back to idle faster. So the trade-off is not clear cut
> to me.

It's not the $subject to discuss other possible design which set such
trade-offs differently. Please don't mix many topics. A "race to idle"
from OPPs which have a bit higher voltage is totally different topic,
currently not in EAS design at all. Otherwise we end up in a heuristic
issue like: how much more 'inefficient' it has to be to skip it.
Currently we are strict in 'inefficient' OPP tagging.

> 
> So generally I am not keen on magic values based on single observations.
> I think removing this or use 1000 is better.

That is your opinion. I've tried to explain to you:
1) why we cannot remove it and why we need the 10x
2) why we don't need more that 10x

> 
> AFAICT you decided that 0.1uW is worth caring about. But 0.19uW difference
> isn't.

It's not strictly related to power value, but the earlier division
operation that we perform in setup time and not in runtime (in different
order on the arguments in the math involved). That operation cuts some
important information from the integer value (as listed above in those
different configurations' dumps of 'cost' values).

> 
> I can't see how much difference this makes in practice tbh. But using more
> uniform conversion so that the cost is in nW (keep the power field in uW) makes
> more sense at least.

This is the edge case which I've mentioned at the begging that you're
missing some background. Your proposal is to have 1000x resolution so in
nano-Watts power for the 'cost'. Let's consider example power of 1.4Watt
on single CPU at mid-high-freq OPP (700 capacity), running on 32bit
kernel, so unsigned long has 32bit.

power = 1.4W = 1400000000nW

cost = 1400000000 / 700 = 2000000 (2mln)

Then in EAS we can have this simulation:
4 CPUs with util 550 voting for this OPP (700 capacity),
so the em_cpu_energy() would perform:

return cost * sum_util

2000000 * (4 * 550) = 4400000000 <--- overflow on 32bit ulong

That's why I said you haven't considered your proposal fully.

> 
> It still raises the question whether this minuscule cost difference is actually
> better taken into account. I think the perf/watt for 816MHz is much better so
> skipping 600MHz as inefficient looks better to me.
> 

This is exactly the place where we disagree. You "think the perf/watt
for 816MHz is much better so skipping 600MHz as inefficient looks
better". For me, the numbers from 3 different configuration dumps are
telling me exactly opposite. I will base the algorithms on the numbers
and not on a heuristic that I think looks better.

I'm going to send v7. Please end this discussion on v5.

Regards,
Lukasz
Qais Yousef Jan. 16, 2024, 7:33 p.m. UTC | #10
On 01/16/24 15:34, Lukasz Luba wrote:
> 
> 
> On 1/16/24 13:10, Qais Yousef wrote:
> > On 01/15/24 12:36, Lukasz Luba wrote:
> > > 
> > > 
> > > On 1/15/24 12:21, Qais Yousef wrote:
> > > > On 01/10/24 13:53, Lukasz Luba wrote:
> > > > > 
> > > > > 
> > > > > On 1/4/24 19:23, Qais Yousef wrote:
> > > > > > On 01/02/24 11:47, Lukasz Luba wrote:
> > > > > > > > Did you see a problem or just being extra cautious here?
> > > > > > > 
> > > > > > > There is no problem, 'cost' is a private coefficient for EAS only.
> > > > > > 
> > > > > > Let me  ask differently, what goes wrong if you don't increase the resolution
> > > > > > here? Why is it necessary?
> > > > > > 
> > > > > 
> > > > > 
> > > > > When you have 800mW at CPU capacity 1024, then the value is small (below
> > > > > 1 thousand).
> > > > > Example:
> > > > > power = 800000 uW
> > > > > cost = 800000 / 1024 = 781
> > > > > 
> > > > > While I know from past that sometimes OPPs might have close voltage
> > > > > values and a rounding could occur and make some OPPs inefficient
> > > > > while they aren't.
> > > > > 
> > > > > This is what would happen when we have the 1x resolution:
> > > > > /sys/kernel/debug/energy_model/cpu4/ps:1008000/cost:551
> > > > > /sys/kernel/debug/energy_model/cpu4/ps:1200000/cost:644
> > > > > /sys/kernel/debug/energy_model/cpu4/ps:1416000/cost:744
> > > > > /sys/kernel/debug/energy_model/cpu4/ps:1512000/cost:851
> > > > > /sys/kernel/debug/energy_model/cpu4/ps:408000/cost:493
> > > > > /sys/kernel/debug/energy_model/cpu4/ps:600000/cost:493
> > > > > /sys/kernel/debug/energy_model/cpu4/ps:816000/cost:493
> > > > > The bottom 3 OPPs have the same 'cost' thus 2 OPPs are in-efficient,
> > > > > which is not true (see below).
> > > > > 
> > > > > This is what would happen when we have the 10x resolution:
> > > > > /sys/kernel/debug/energy_model/cpu4/ps:1008000/cost:5513
> > > > > /sys/kernel/debug/energy_model/cpu4/ps:1200000/cost:6443
> > > > > /sys/kernel/debug/energy_model/cpu4/ps:1416000/cost:7447
> > > > > /sys/kernel/debug/energy_model/cpu4/ps:1512000/cost:8514
> > > > > /sys/kernel/debug/energy_model/cpu4/ps:408000/cost:4934
> > > > > /sys/kernel/debug/energy_model/cpu4/ps:600000/cost:4933
> > > > > /sys/kernel/debug/energy_model/cpu4/ps:816000/cost:4934
> > > > > Here the OPP with 600MHz is more efficient than 408MHz,
> > > > > which is true. So only 408MHz will be marked as in-efficient OPP.
> > > > > 
> > > > > 
> > > > > This is what would happen when we have the 100x resolution:
> > > > > /sys/kernel/debug/energy_model/cpu4/ps:1008000/cost:55137
> > > > > /sys/kernel/debug/energy_model/cpu4/ps:1200000/cost:64433
> > > > > /sys/kernel/debug/energy_model/cpu4/ps:1416000/cost:74473
> > > > > /sys/kernel/debug/energy_model/cpu4/ps:1512000/cost:85140
> > > > > /sys/kernel/debug/energy_model/cpu4/ps:408000/cost:49346
> > > > > /sys/kernel/debug/energy_model/cpu4/ps:600000/cost:49331
> > > > > /sys/kernel/debug/energy_model/cpu4/ps:816000/cost:49346
> > > > > The higher (100x) resolution does not bring that much in
> > > > > practice.
> > > > 
> > > > So it seems a uW is not sufficient enough. We moved from mW because of
> > > > resolution already. Shall we make it nW then and multiply by 1000 always? The
> > > > choice of 10 looks arbitrary IMHO
> > > > 
> > > 
> > > No, there is no need of nW in the 'power' field for this.
> > > You've missed the point.
> > 
> > I think you're missing what I am saying. The multiplication by 10 looks like
> > magic value to increase resolution based on a single observation you noticed.
> > 
> > The feedback I am giving is that this needs to be better explained, in
> > a comment. And instead of multiplying by 10 multiply by 1000. Saying this is
> > enough based on a single observation is not adequate IMO.
> 
> I think you are trying to review something which you don't have full
> details and previous history. I have been fighting with those rounding

I don't think so..

> issues in past and there are commits with description of issues.
> You haven't analyze all edge cases, one more is below (about your
> proposal with 1000x the nW).
> 
> > 
> > Also the difference is tiny. Could you actually measure a difference in overall
> > power with and without this extra decimal point resolution? It might be better
> 
> Yes, I had such power measurements, but for older rounding issues. Take

so not against this series..

> into account that the EM model is reflecting one CPU, but in reality we
> often have 4 CPUs linked together in one frequency domain. Thus, a small
> energy difference is actually multiplied.
> 
> > to run at 816MHz and go back to idle faster. So the trade-off is not clear cut
> > to me.
> 
> It's not the $subject to discuss other possible design which set such
> trade-offs differently. Please don't mix many topics. A "race to idle"

I am not mixing topics. I am questioning the claim about this addition of
resolution which looked random to me.

> from OPPs which have a bit higher voltage is totally different topic,
> currently not in EAS design at all. Otherwise we end up in a heuristic
> issue like: how much more 'inefficient' it has to be to skip it.
> Currently we are strict in 'inefficient' OPP tagging.

Then this part of this patch about the resolution better be split into its own
patch submission?

> 
> > 
> > So generally I am not keen on magic values based on single observations.
> > I think removing this or use 1000 is better.
> 
> That is your opinion. I've tried to explain to you:
> 1) why we cannot remove it and why we need the 10x
> 2) why we don't need more that 10x
> 
> > 
> > AFAICT you decided that 0.1uW is worth caring about. But 0.19uW difference
> > isn't.
> 
> It's not strictly related to power value, but the earlier division
> operation that we perform in setup time and not in runtime (in different
> order on the arguments in the math involved). That operation cuts some
> important information from the integer value (as listed above in those
> different configurations' dumps of 'cost' values).

-ENOPARSE. From what I see the cost has different resolution.

> 
> > 
> > I can't see how much difference this makes in practice tbh. But using more
> > uniform conversion so that the cost is in nW (keep the power field in uW) makes
> > more sense at least.
> 
> This is the edge case which I've mentioned at the begging that you're
> missing some background. Your proposal is to have 1000x resolution so in
> nano-Watts power for the 'cost'. Let's consider example power of 1.4Watt
> on single CPU at mid-high-freq OPP (700 capacity), running on 32bit
> kernel, so unsigned long has 32bit.
> 
> power = 1.4W = 1400000000nW
> 
> cost = 1400000000 / 700 = 2000000 (2mln)
> 
> Then in EAS we can have this simulation:
> 4 CPUs with util 550 voting for this OPP (700 capacity),
> so the em_cpu_energy() would perform:
> 
> return cost * sum_util
> 
> 2000000 * (4 * 550) = 4400000000 <--- overflow on 32bit ulong
> 
> That's why I said you haven't considered your proposal fully.

overflow was in mind, I didn't feel it was necessary to elaborate more..
overflows issues can be handled

> 
> > 
> > It still raises the question whether this minuscule cost difference is actually
> > better taken into account. I think the perf/watt for 816MHz is much better so
> > skipping 600MHz as inefficient looks better to me.
> > 
> 
> This is exactly the place where we disagree. You "think the perf/watt
> for 816MHz is much better so skipping 600MHz as inefficient looks
> better". For me, the numbers from 3 different configuration dumps are
> telling me exactly opposite. I will base the algorithms on the numbers
> and not on a heuristic that I think looks better.
> 
> I'm going to send v7. Please end this discussion on v5.

This thread is the context of the discussion..

It seems you don't want the feedback. I don't think there's mixing of topics.
But decisions made and I don't see proper explanation to them. Hence the
questions and probing proposals in attempt to understand more. We do have to
cover a wide areas of cases in general and enforcing such random numbers has
been a problem in practice as there's no sensible defaults. And I am not seeing
that this is a good generalization from what I am reading. Similar to
util_threshold which has caused power regressions, and migration margin and
dvfs headroom that are causing problems too. I see this is another random
number added. The numbers you're referring to are very limited in scope, it's
not number vs heuristics looks better the issue here. It seems you don't want
to consider the perf/watt impact rather than the pure cost of a single OPP
- which was the true intent behind the question. I might have misunderstood
something, but If you explained this in this reply, I certainly would have lost
it with this constant stop discussing and move to v6 or v7.

Anyway. I'll leave this at here.
diff mbox series

Patch

diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
index e30750500b10..0f5621898a81 100644
--- a/include/linux/energy_model.h
+++ b/include/linux/energy_model.h
@@ -115,27 +115,6 @@  struct em_perf_domain {
 #define EM_MAX_NUM_CPUS 16
 #endif
 
-/*
- * To avoid an overflow on 32bit machines while calculating the energy
- * use a different order in the operation. First divide by the 'cpu_scale'
- * which would reduce big value stored in the 'cost' field, then multiply by
- * the 'sum_util'. This would allow to handle existing platforms, which have
- * e.g. power ~1.3 Watt at max freq, so the 'cost' value > 1mln micro-Watts.
- * In such scenario, where there are 4 CPUs in the Perf. Domain the 'sum_util'
- * could be 4096, then multiplication: 'cost' * 'sum_util'  would overflow.
- * This reordering of operations has some limitations, we lose small
- * precision in the estimation (comparing to 64bit platform w/o reordering).
- *
- * We are safe on 64bit machine.
- */
-#ifdef CONFIG_64BIT
-#define em_estimate_energy(cost, sum_util, scale_cpu) \
-	(((cost) * (sum_util)) / (scale_cpu))
-#else
-#define em_estimate_energy(cost, sum_util, scale_cpu) \
-	(((cost) / (scale_cpu)) * (sum_util))
-#endif
-
 struct em_data_callback {
 	/**
 	 * active_power() - Provide power at the next performance state of
@@ -249,29 +228,16 @@  static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
 {
 	struct em_perf_table *runtime_table;
 	struct em_perf_state *ps;
-	unsigned long scale_cpu;
-	int cpu, i;
+	int i;
 
 	if (!sum_util)
 		return 0;
 
-	/*
-	 * In order to predict the performance state, map the utilization of
-	 * the most utilized CPU of the performance domain to a requested
-	 * frequency, like schedutil. Take also into account that the real
-	 * frequency might be set lower (due to thermal capping). Thus, clamp
-	 * max utilization to the allowed CPU capacity before calculating
-	 * effective frequency.
-	 */
-	cpu = cpumask_first(to_cpumask(pd->cpus));
-	scale_cpu = arch_scale_cpu_capacity(cpu);
-
 	/*
 	 * No rcu_read_lock() since it's already called by task scheduler.
 	 * The runtime_table is always there for CPUs, so we don't check.
 	 */
 	runtime_table = rcu_dereference(pd->runtime_table);
-
 	ps = &runtime_table->state[pd->nr_perf_states - 1];
 
 	max_util = map_util_perf(max_util);
@@ -286,35 +252,21 @@  static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
 	ps = &runtime_table->state[i];
 
 	/*
-	 * The capacity of a CPU in the domain at the performance state (ps)
-	 * can be computed as:
-	 *
-	 *             ps->freq * scale_cpu
-	 *   ps->cap = --------------------                          (1)
-	 *                 cpu_max_freq
-	 *
-	 * So, ignoring the costs of idle states (which are not available in
-	 * the EM), the energy consumed by this CPU at that performance state
+	 * The energy consumed by the CPU at the given performance state (ps)
 	 * is estimated as:
 	 *
-	 *             ps->power * cpu_util
-	 *   cpu_nrg = --------------------                          (2)
-	 *                   ps->cap
+	 *                ps->power
+	 *   cpu_nrg = --------------- * cpu_util                    (1)
+	 *             ps->performance
 	 *
-	 * since 'cpu_util / ps->cap' represents its percentage of busy time.
+	 * The 'cpu_util / ps->performance' represents its percentage of
+	 * busy time. The idle cost is ignored (it's not available in the EM).
 	 *
 	 *   NOTE: Although the result of this computation actually is in
 	 *         units of power, it can be manipulated as an energy value
 	 *         over a scheduling period, since it is assumed to be
 	 *         constant during that interval.
 	 *
-	 * By injecting (1) in (2), 'cpu_nrg' can be re-expressed as a product
-	 * of two terms:
-	 *
-	 *             ps->power * cpu_max_freq   cpu_util
-	 *   cpu_nrg = ------------------------ * ---------          (3)
-	 *                    ps->freq            scale_cpu
-	 *
 	 * The first term is static, and is stored in the em_perf_state struct
 	 * as 'ps->cost'.
 	 *
@@ -323,11 +275,9 @@  static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
 	 * total energy of the domain (which is the simple sum of the energy of
 	 * all of its CPUs) can be factorized as:
 	 *
-	 *            ps->cost * \Sum cpu_util
-	 *   pd_nrg = ------------------------                       (4)
-	 *                  scale_cpu
+	 *   pd_nrg = ps->cost * \Sum cpu_util                       (2)
 	 */
-	return em_estimate_energy(ps->cost, sum_util, scale_cpu);
+	return ps->cost * sum_util;
 }
 
 /**
diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
index d3fa5a77de80..c6e5f35a5129 100644
--- a/kernel/power/energy_model.c
+++ b/kernel/power/energy_model.c
@@ -204,11 +204,9 @@  static int em_compute_costs(struct device *dev, struct em_perf_state *table,
 			    unsigned long flags)
 {
 	unsigned long prev_cost = ULONG_MAX;
-	u64 fmax;
 	int i, ret;
 
 	/* Compute the cost of each performance state. */
-	fmax = (u64) table[nr_states - 1].frequency;
 	for (i = nr_states - 1; i >= 0; i--) {
 		unsigned long power_res, cost;
 
@@ -220,8 +218,9 @@  static int em_compute_costs(struct device *dev, struct em_perf_state *table,
 				return -EINVAL;
 			}
 		} else {
-			power_res = table[i].power;
-			cost = div64_u64(fmax * power_res, table[i].frequency);
+			/* increase resolution of 'cost' precision */
+			power_res = table[i].power * 10;
+			cost = power_res / table[i].performance;
 		}
 
 		table[i].cost = cost;