[v3,0/5] Rework system pressure interface to the scheduler

Message ID 20240108134843.429769-1-vincent.guittot@linaro.org

Message

Vincent Guittot Jan. 8, 2024, 1:48 p.m. UTC
Following the consolidation and cleanup of CPU capacity in [1], this series
reworks how the scheduler gets the pressures on CPUs. We need to take into
account all pressures applied by cpufreq on the compute capacity of a CPU
for dozens of ms or more, and not only the cpufreq cooling device or HW
mitigations. We split the pressure applied on the CPU's capacity in 2 parts:
- one from cpufreq and freq_qos
- one from HW high-frequency mitigation.

The next step will be to add a dedicated interface for long-standing
capping of the CPU capacity (i.e. for seconds or more), like the
scaling_max_freq of cpufreq sysfs. The latter is already taken into
account by this series, but as a temporary pressure, which is not always
the best choice when we know that it will happen for seconds or more.

[1] https://lore.kernel.org/lkml/20231211104855.558096-1-vincent.guittot@linaro.org/
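
As an illustration of how the scheduler side could consume the two pressure
sources, here is a minimal sketch. get_actual_cpu_capacity() is the helper
discussed later in this thread; cpufreq_get_pressure() and hw_load_avg() are
assumed accessors for the per-CPU cpufreq pressure and the PELT-averaged HW
pressure, and the body is only an illustration, not the literal code of the
series:

/*
 * Sketch: the usable capacity of a CPU is its original capacity minus
 * the larger of the two pressures (instantaneous cpufreq/freq_qos
 * capping vs PELT-filtered HW high-frequency mitigation).
 */
static unsigned long get_actual_cpu_capacity(int cpu)
{
        unsigned long capacity = arch_scale_cpu_capacity(cpu);

        capacity -= max(hw_load_avg(cpu_rq(cpu)),
                        cpufreq_get_pressure(cpu));

        return capacity;
}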

Changes since v2:
- Rework cpufreq_update_pressure()

Changes since v1:
- Use struct cpufreq_policy as parameter of cpufreq_update_pressure()
- Fix typos and comments
- Mark the sched_thermal_decay_shift boot param as deprecated

Vincent Guittot (5):
  cpufreq: Add a cpufreq pressure feedback for the scheduler
  sched: Take cpufreq feedback into account
  thermal/cpufreq: Remove arch_update_thermal_pressure()
  sched: Rename arch_update_thermal_pressure into
    arch_update_hw_pressure
  sched/pelt: Remove shift of thermal clock

 .../admin-guide/kernel-parameters.txt         |  1 +
 arch/arm/include/asm/topology.h               |  6 +-
 arch/arm64/include/asm/topology.h             |  6 +-
 drivers/base/arch_topology.c                  | 26 ++++----
 drivers/cpufreq/cpufreq.c                     | 36 +++++++++++
 drivers/cpufreq/qcom-cpufreq-hw.c             |  4 +-
 drivers/thermal/cpufreq_cooling.c             |  3 -
 include/linux/arch_topology.h                 |  8 +--
 include/linux/cpufreq.h                       | 10 +++
 include/linux/sched/topology.h                |  8 +--
 .../{thermal_pressure.h => hw_pressure.h}     | 14 ++---
 include/trace/events/sched.h                  |  2 +-
 init/Kconfig                                  | 12 ++--
 kernel/sched/core.c                           |  8 +--
 kernel/sched/fair.c                           | 63 +++++++++----------
 kernel/sched/pelt.c                           | 18 +++---
 kernel/sched/pelt.h                           | 16 ++---
 kernel/sched/sched.h                          | 22 +------
 18 files changed, 144 insertions(+), 119 deletions(-)
 rename include/trace/events/{thermal_pressure.h => hw_pressure.h} (55%)

Comments

Vincent Guittot Jan. 8, 2024, 4:46 p.m. UTC | #1
On Mon, 8 Jan 2024 at 17:35, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> On 08/01/2024 14:48, Vincent Guittot wrote:
> > Provide the scheduler with feedback about the temporary max available
> > capacity. Unlike arch_update_thermal_pressure(), this doesn't need to be
> > filtered as the pressure will happen for dozens of ms or more.
>
> Is this then related to the 'medium pace system pressure' you mentioned
> in your OSPM '23 talk?
>
> >
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > ---
> >  drivers/cpufreq/cpufreq.c | 36 ++++++++++++++++++++++++++++++++++++
> >  include/linux/cpufreq.h   | 10 ++++++++++
> >  2 files changed, 46 insertions(+)
> >
> > diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> > index 44db4f59c4cc..fa2e2ea26f7f 100644
> > --- a/drivers/cpufreq/cpufreq.c
> > +++ b/drivers/cpufreq/cpufreq.c
> > @@ -2563,6 +2563,40 @@ int cpufreq_get_policy(struct cpufreq_policy *policy, unsigned int cpu)
> >  }
> >  EXPORT_SYMBOL(cpufreq_get_policy);
> >
> > +DEFINE_PER_CPU(unsigned long, cpufreq_pressure);
> > +
> > +/**
> > + * cpufreq_update_pressure() - Update cpufreq pressure for CPUs
> > + * @policy: cpufreq policy of the CPUs.
> > + *
> > + * Update the value of cpufreq pressure for all @cpus in the policy.
> > + */
> > +static void cpufreq_update_pressure(struct cpufreq_policy *policy)
> > +{
> > +     unsigned long max_capacity, capped_freq, pressure;
> > +     u32 max_freq;
> > +     int cpu;
> > +
> > +     /*
> > +      * Handle properly the boost frequencies, which should simply clean
> > +      * the thermal pressure value.
>                ^^^^^^^
> IMHO, this is a copy & paste error from topology_update_thermal_pressure()?
>
> > +      */
> > +     if (max_freq <= capped_freq) {
>
> max_freq seems to be uninitialized.

argh yes, I made a mess while cleaning up;
both max_freq and capped_freq are uninitialized

>
> > +             pressure = 0;
>
> Is this x86 (turbo boost) specific? IMHO at Arm we follow the convention
> that max freq (including boost) relates to 1024 in capacity? Or haven't we
> had this discussion yet?

This is not x86 specific. We can have capped_freq > max_freq on Arm too.

Also, this bypasses all the calculation below when max_freq == capped_freq,
which is the most common case.
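
For reference, a sketch of how the function could look once the
initialization order is fixed: capped_freq and max_freq are read before the
boost check, and the comment refers to the cpufreq pressure as Dietmar
suggests. This is only an illustration of the fix being discussed, not
necessarily the exact code that will land in the next version.

static void cpufreq_update_pressure(struct cpufreq_policy *policy)
{
        unsigned long max_capacity, capped_freq, pressure;
        u32 max_freq;
        int cpu;

        /* Initialize before use in the boost check below */
        cpu = cpumask_first(policy->related_cpus);
        max_freq = arch_scale_freq_ref(cpu);
        capped_freq = policy->max;

        /*
         * Handle properly the boost frequencies, which should simply clean
         * the cpufreq pressure value.
         */
        if (max_freq <= capped_freq) {
                pressure = 0;
        } else {
                max_capacity = arch_scale_cpu_capacity(cpu);
                pressure = max_capacity -
                           mult_frac(max_capacity, capped_freq, max_freq);
        }

        for_each_cpu(cpu, policy->related_cpus)
                WRITE_ONCE(per_cpu(cpufreq_pressure, cpu), pressure);
}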

>
> > +     } else {
> > +             cpu = cpumask_first(policy->related_cpus);
> > +             max_capacity = arch_scale_cpu_capacity(cpu);
> > +             capped_freq = policy->max;
> > +             max_freq = arch_scale_freq_ref(cpu);
> > +
> > +             pressure = max_capacity -
> > +                        mult_frac(max_capacity, capped_freq, max_freq);
> > +     }
> > +
> > +     for_each_cpu(cpu, policy->related_cpus)
> > +             WRITE_ONCE(per_cpu(cpufreq_pressure, cpu), pressure);
> > +}
> > +
>
> [...]
>
Vincent Guittot Jan. 9, 2024, 11:24 a.m. UTC | #2
On Mon, 8 Jan 2024 at 17:35, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> On 08/01/2024 14:48, Vincent Guittot wrote:
> > Provide the scheduler with feedback about the temporary max available
> > capacity. Unlike arch_update_thermal_pressure(), this doesn't need to be
> > filtered as the pressure will happen for dozens of ms or more.
>
> Is this then related to the 'medium pace system pressure' you mentioned
> in your OSPM '23 talk?

Sorry I forgot to answer this question. Yes this is the medium pace
system pressure that I mentioned at OSPM'23
>
> >
Dietmar Eggemann Jan. 9, 2024, 11:33 a.m. UTC | #3
On 08/01/2024 14:48, Vincent Guittot wrote:
> Following the consolidation and cleanup of CPU capacity in [1], this series
> reworks how the scheduler gets the pressures on CPUs. We need to take into
> account all pressures applied by cpufreq on the compute capacity of a CPU
> for dozens of ms or more, and not only the cpufreq cooling device or HW
> mitigations. We split the pressure applied on the CPU's capacity in 2 parts:
> - one from cpufreq and freq_qos
> - one from HW high-frequency mitigation.
> 
> The next step will be to add a dedicated interface for long-standing
> capping of the CPU capacity (i.e. for seconds or more), like the
> scaling_max_freq of cpufreq sysfs. The latter is already taken into
> account by this series, but as a temporary pressure, which is not always
> the best choice when we know that it will happen for seconds or more.

I guess this is related to the 'user space system pressure' (*) slide of
your OSPM '23 talk.

> Where do you draw the line, in terms of time, between (*) and the
> 'medium pace system pressure' (e.g. thermal and FREQ_QOS)?

IIRC, with (*) you want to rebuild the sched domains etc.

> [...]
Vincent Guittot Jan. 9, 2024, 1:29 p.m. UTC | #4
On Tue, 9 Jan 2024 at 12:34, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> On 08/01/2024 14:48, Vincent Guittot wrote:
> > Following the consolidation and cleanup of CPU capacity in [1], this series
> > reworks how the scheduler gets the pressures on CPUs. We need to take into
> > account all pressures applied by cpufreq on the compute capacity of a CPU
> > for dozens of ms or more, and not only the cpufreq cooling device or HW
> > mitigations. We split the pressure applied on the CPU's capacity in 2 parts:
> > - one from cpufreq and freq_qos
> > - one from HW high-frequency mitigation.
> >
> > The next step will be to add a dedicated interface for long-standing
> > capping of the CPU capacity (i.e. for seconds or more), like the
> > scaling_max_freq of cpufreq sysfs. The latter is already taken into
> > account by this series, but as a temporary pressure, which is not always
> > the best choice when we know that it will happen for seconds or more.
>
> I guess this is related to the 'user space system pressure' (*) slide of
> your OSPM '23 talk.

yes

>
> Where do you draw the line when it comes to time between (*) and the
> 'medium pace system pressure' (e.g. thermal and FREQ_QOS).

My goal is to consider the /sys/../scaling_max_freq as the 'user space
system pressure'

>
> IIRC, with (*) you want to rebuild the sched domains etc.

The easiest way would be to rebuild the sched_domain, but the cost is
not small, so I would prefer to skip the rebuild and add a new signal
that keeps track of this capped capacity.

> [...]
Dietmar Eggemann Jan. 10, 2024, 6:10 p.m. UTC | #5
On 09/01/2024 14:29, Vincent Guittot wrote:
> On Tue, 9 Jan 2024 at 12:34, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>
>> On 08/01/2024 14:48, Vincent Guittot wrote:
>>> Following the consolidation and cleanup of CPU capacity in [1], this series
>>> reworks how the scheduler gets the pressures on CPUs. We need to take into
>>> account all pressures applied by cpufreq on the compute capacity of a CPU
>>> for dozens of ms or more, and not only the cpufreq cooling device or HW
>>> mitigations. We split the pressure applied on the CPU's capacity in 2 parts:
>>> - one from cpufreq and freq_qos
>>> - one from HW high-frequency mitigation.
>>>
>>> The next step will be to add a dedicated interface for long-standing
>>> capping of the CPU capacity (i.e. for seconds or more), like the
>>> scaling_max_freq of cpufreq sysfs. The latter is already taken into
>>> account by this series, but as a temporary pressure, which is not always
>>> the best choice when we know that it will happen for seconds or more.
>>
>> I guess this is related to the 'user space system pressure' (*) slide of
>> your OSPM '23 talk.
> 
> yes
> 
>>
>> Where do you draw the line when it comes to time between (*) and the
>> 'medium pace system pressure' (e.g. thermal and FREQ_QOS).
> 
> My goal is to consider the /sys/../scaling_max_freq as the 'user space
> system pressure'
> 
>>
>> IIRC, with (*) you want to rebuild the sched domains etc.
> 
> The easiest way would be to rebuild the sched_domain, but the cost is
> not small, so I would prefer to skip the rebuild and add a new signal
> that keeps track of this capped capacity.

Are you saying that you don't need to rebuild sched domains since
cpu_capacity information of the sched domain hierarchy is
independently updated via: 

update_sd_lb_stats() {

  update_group_capacity() {

    if (!child)
      update_cpu_capacity(sd, cpu) {

        capacity = scale_rt_capacity(cpu) {

          max = get_actual_cpu_capacity(cpu) <- (*)
        }

        sdg->sgc->capacity = capacity;
        sdg->sgc->min_capacity = capacity;
        sdg->sgc->max_capacity = capacity;
      }

  }

}
        
(*) influence of temporary and permanent (to be added) frequency
pressure on cpu_capacity (per-cpu and in sd data)
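
For reference, scale_rt_capacity() would look roughly like the sketch below
once get_actual_cpu_capacity() is its starting point. This is an assumption
based on the current shape of kernel/sched/fair.c, shown for illustration
rather than quoted from the series:

/*
 * Sketch: the pressured capacity returned by get_actual_cpu_capacity()
 * is the budget from which RT, DL and IRQ utilization are removed; the
 * result ends up in sdg->sgc->capacity via update_cpu_capacity().
 */
static unsigned long scale_rt_capacity(int cpu)
{
        struct rq *rq = cpu_rq(cpu);
        unsigned long max = get_actual_cpu_capacity(cpu);
        unsigned long used, free, irq;

        irq = cpu_util_irq(rq);
        if (unlikely(irq >= max))
                return 1;

        used = READ_ONCE(rq->avg_rt.util_avg);
        used += READ_ONCE(rq->avg_dl.util_avg);
        if (unlikely(used >= max))
                return 1;

        free = max - used;

        return scale_irq_capacity(free, irq, max);
}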


example: hackbench on h960 with IPA:
                                                                                  cap  min  max
...
hackbench-2284 [007] .Ns..  2170.796726: update_group_capacity: sdg !child cpu=7 1017 1017 1017
hackbench-2456 [007] ..s..  2170.920729: update_group_capacity: sdg !child cpu=7 1018 1018 1018
    <...>-2314 [007] ..s1.  2171.044724: update_group_capacity: sdg !child cpu=7 1011 1011 1011
hackbench-2541 [007] ..s..  2171.168734: update_group_capacity: sdg !child cpu=7  918  918  918
hackbench-2558 [007] .Ns..  2171.228716: update_group_capacity: sdg !child cpu=7  912  912  912
    <...>-2321 [007] ..s..  2171.352718: update_group_capacity: sdg !child cpu=7  812  812  812
hackbench-2553 [007] ..s..  2171.476721: update_group_capacity: sdg !child cpu=7  640  640  640
    <...>-2446 [007] ..s2.  2171.600743: update_group_capacity: sdg !child cpu=7  610  610  610
hackbench-2347 [007] ..s..  2171.724738: update_group_capacity: sdg !child cpu=7  406  406  406
hackbench-2331 [007] .Ns1.  2171.848768: update_group_capacity: sdg !child cpu=7  390  390  390
hackbench-2421 [007] ..s..  2171.972733: update_group_capacity: sdg !child cpu=7  388  388  388
...
Vincent Guittot Jan. 19, 2024, 5:57 p.m. UTC | #6
On Wed, 10 Jan 2024 at 19:10, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> On 09/01/2024 14:29, Vincent Guittot wrote:
> > On Tue, 9 Jan 2024 at 12:34, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> >>
> >> On 08/01/2024 14:48, Vincent Guittot wrote:
> >>> Following the consolidation and cleanup of CPU capacity in [1], this series
> >>> reworks how the scheduler gets the pressures on CPUs. We need to take into
> >>> account all pressures applied by cpufreq on the compute capacity of a CPU
> >>> for dozens of ms or more, and not only the cpufreq cooling device or HW
> >>> mitigations. We split the pressure applied on the CPU's capacity in 2 parts:
> >>> - one from cpufreq and freq_qos
> >>> - one from HW high-frequency mitigation.
> >>>
> >>> The next step will be to add a dedicated interface for long-standing
> >>> capping of the CPU capacity (i.e. for seconds or more), like the
> >>> scaling_max_freq of cpufreq sysfs. The latter is already taken into
> >>> account by this series, but as a temporary pressure, which is not always
> >>> the best choice when we know that it will happen for seconds or more.
> >>
> >> I guess this is related to the 'user space system pressure' (*) slide of
> >> your OSPM '23 talk.
> >
> > yes
> >
> >>
> >> Where do you draw the line when it comes to time between (*) and the
> >> 'medium pace system pressure' (e.g. thermal and FREQ_QOS).
> >
> > My goal is to consider the /sys/../scaling_max_freq as the 'user space
> > system pressure'
> >
> >>
> >> IIRC, with (*) you want to rebuild the sched domains etc.
> >
> > The easiest way would be to rebuild the sched_domain, but the cost is
> > not small, so I would prefer to skip the rebuild and add a new signal
> > that keeps track of this capped capacity.
>
> Are you saying that you don't need to rebuild sched domains since
> cpu_capacity information of the sched domain hierarchy is
> independently updated via:
>
> update_sd_lb_stats() {
>
>   update_group_capacity() {
>
>     if (!child)
>       update_cpu_capacity(sd, cpu) {
>
>         capacity = scale_rt_capacity(cpu) {
>
>           max = get_actual_cpu_capacity(cpu) <- (*)
>         }
>
>         sdg->sgc->capacity = capacity;
>         sdg->sgc->min_capacity = capacity;
>         sdg->sgc->max_capacity = capacity;
>       }
>
>   }
>
> }
>
> (*) influence of temporary and permanent (to be added) frequency
> pressure on cpu_capacity (per-cpu and in sd data)


I'm more concerned about rd->max_cpu_capacity, which remains at the
original capacity and triggers spurious load balancing if we take into
account the userspace max freq instead of the original max compute
capacity of a CPU. And also how to manage this in RT and DL.

> [...]