
[v2,8/8] sched/pelt: Introduce PELT multiplier

Message ID 20231208002342.367117-9-qyousef@layalina.io
State New
Series sched: cpufreq: Remove magic hardcoded numbers from margins

Commit Message

Qais Yousef Dec. 8, 2023, 12:23 a.m. UTC
From: Vincent Donnefort <vincent.donnefort@arm.com>

The new sched_pelt_multiplier boot param allows a user to set a clock
multiplier to x2 or x4 (x1 being the default). This clock multiplier
artificially speeds up PELT ramp up/down, similarly to using a faster
half-life than the default 32ms.

  - x1: 32ms half-life
  - x2: 16ms half-life
  - x4: 8ms  half-life

Internally, a new clock is created: rq->clock_task_mult. It sits in the
clock hierarchy between rq->clock_task and rq->clock_pelt.

The param is set as read-only and can only be changed at boot time via

	kernel.sched_pelt_multiplier=[1, 2, 4]

PELT has a big impact on the overall system response and reactiveness to
change. A smaller PELT HF means it takes less time to reach the maximum
performance point of the system when the system becomes fully busy, and
equally a shorter time to go back to the lowest performance point when
the system goes back to idle.

This faster reaction impacts both the dvfs response and the migration time
between clusters in HMP systems.

Smaller PELT values are expected to give better performance at the cost
of more power. Underpowered systems can particularly benefit from
smaller values. Powerful systems can still benefit from smaller values
if they want to be tuned more towards perf and power is not the major
concern for them.

This, combined with response_time_ms from schedutil, should give the user
and sysadmin a deterministic way to control the power, perf and thermals
triangle for their system. The default response_time_ms will halve as the
PELT HF halves.

Update approximate_{util_avg, runtime}() to take into account the PELT
HALFLIFE multiplier.

Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
[Converted from sysctl to boot param and updated commit message]
Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
---
 kernel/sched/core.c  |  2 +-
 kernel/sched/pelt.c  | 52 ++++++++++++++++++++++++++++++++++++++++++--
 kernel/sched/pelt.h  | 42 +++++++++++++++++++++++++++++++----
 kernel/sched/sched.h |  1 +
 4 files changed, 90 insertions(+), 7 deletions(-)
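
As a rough illustration of the half-life numbers above: PELT's util_avg for a
continuously running task can be approximated by a geometric ramp. A minimal
user-space sketch of that model (a simplification for illustration only, not
code from this patch):

	#include <math.h>
	#include <stdio.h>

	/*
	 * Rough model of PELT ramp-up from idle for an always-running task:
	 * util(t) ~= 1024 * (1 - 2^(-t / half_life)).  Ignores the 1024us
	 * period granularity, freq/capacity invariance and util_est.
	 */
	static double approx_util(double t_ms, double half_life_ms)
	{
		return 1024.0 * (1.0 - pow(2.0, -t_ms / half_life_ms));
	}

	int main(void)
	{
		const double hl[] = { 32.0, 16.0, 8.0 };	/* x1, x2, x4 */

		for (int i = 0; i < 3; i++)
			printf("half-life %2.0fms: util after 100ms ~= %4.0f\n",
			       hl[i], approx_util(100.0, hl[i]));
		return 0;
	}

In this model the time needed to reach a given utilisation scales linearly
with the half-life, which is also why the default schedutil response_time_ms
halves whenever the PELT half-life halves.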

Comments

Ashay Jaiswal Jan. 20, 2024, 7:52 a.m. UTC | #1
Hello Qais Yousef,

We ran a few benchmarks with the PELT multiplier patch on a Snapdragon 8Gen2
based internal Android device, and we are observing significant
improvements with the PELT8 configuration compared to PELT32.

Following are some of the benchmark results with the PELT32 and PELT8
configurations:

+-----------------+---------------+----------------+----------------+
| Test case                       |     PELT32     |     PELT8      |
+-----------------+---------------+----------------+----------------+
|                 |    Overall    |     711543     |     971275     |
|                 +---------------+----------------+----------------+
|                 |    CPU        |     193704     |     224378     |
|                 +---------------+----------------+----------------+
|ANTUTU V9.3.9    |    GPU        |     284650     |     424774     |
|                 +---------------+----------------+----------------+
|                 |    MEM        |     125207     |     160548     |
|                 +---------------+----------------+----------------+
|                 |    UX         |     107982     |     161575     |
+-----------------+---------------+----------------+----------------+
|                 |   Single core |     1170       |     1268       |
|GeekBench V5.4.4 +---------------+----------------+----------------+
|                 |   Multi core  |     2530       |     3797       |
+-----------------+---------------+----------------+----------------+
|                 |    Twitter    |     >50 Janks  |     0          |
|     SCROLL      +---------------+----------------+----------------+
|                 |    Contacts   |     >30 Janks  |     0          |
+-----------------+---------------+----------------+----------------+

Please let us know if you need any support with running further
workloads for the PELT32/PELT8 experiments; we can help with running
them.

Thank you,
Ashay Jaiswal

On 12/8/2023 5:53 AM, Qais Yousef wrote:
> From: Vincent Donnefort <vincent.donnefort@arm.com>
> 
> The new sched_pelt_multiplier boot param allows a user to set a clock
> multiplier to x2 or x4 (x1 being the default). This clock multiplier
> artificially speeds up PELT ramp up/down similarly to use a faster
> half-life than the default 32ms.
> 
>   - x1: 32ms half-life
>   - x2: 16ms half-life
>   - x4: 8ms  half-life
> 
> Internally, a new clock is created: rq->clock_task_mult. It sits in the
> clock hierarchy between rq->clock_task and rq->clock_pelt.
> 
> The param is set as read only and can only be changed at boot time via
> 
> 	kernel.sched_pelt_multiplier=[1, 2, 4]
> 
> PELT has a big impact on the overall system response and reactiveness to
> change. Smaller PELT HF means it'll require less time to reach the
> maximum performance point of the system when the system become fully
> busy; and equally shorter time to go back to lowest performance point
> when the system goes back to idle.
> 
> This faster reaction impacts both dvfs response and migration time
> between clusters in HMP system.
> 
> Smaller PELT values are expected to give better performance at the cost
> of more power. Under powered systems can particularly benefit from
> smaller values. Powerful systems can still benefit from smaller values
> if they want to be tuned towards perf more and power is not the major
> concern for them.
> 
> This combined with respone_time_ms from schedutil should give the user
> and sysadmin a deterministic way to control the triangular power, perf
> and thermals for their system. The default response_time_ms will half
> as PELT HF halves.
> 
> Update approximate_{util_avg, runtime}() to take into account the PELT
> HALFLIFE multiplier.
> 
> Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> [Converted from sysctl to boot param and updated commit message]
> Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
> ---
>  kernel/sched/core.c  |  2 +-
>  kernel/sched/pelt.c  | 52 ++++++++++++++++++++++++++++++++++++++++++--
>  kernel/sched/pelt.h  | 42 +++++++++++++++++++++++++++++++----
>  kernel/sched/sched.h |  1 +
>  4 files changed, 90 insertions(+), 7 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index b4a1c8ea9e12..9c8626b4ddff 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -745,7 +745,7 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
>  	if ((irq_delta + steal) && sched_feat(NONTASK_CAPACITY))
>  		update_irq_load_avg(rq, irq_delta + steal);
>  #endif
> -	update_rq_clock_pelt(rq, delta);
> +	update_rq_clock_task_mult(rq, delta);
>  }
>  
>  void update_rq_clock(struct rq *rq)
> diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
> index 00a1b9c1bf16..0a10e56f76c7 100644
> --- a/kernel/sched/pelt.c
> +++ b/kernel/sched/pelt.c
> @@ -468,6 +468,54 @@ int update_irq_load_avg(struct rq *rq, u64 running)
>  }
>  #endif /* CONFIG_HAVE_SCHED_AVG_IRQ */
>  
> +__read_mostly unsigned int sched_pelt_lshift;
> +static unsigned int sched_pelt_multiplier = 1;
> +
> +static int set_sched_pelt_multiplier(const char *val, const struct kernel_param *kp)
> +{
> +	int ret;
> +
> +	ret = param_set_int(val, kp);
> +	if (ret)
> +		goto error;
> +
> +	switch (sched_pelt_multiplier)  {
> +	case 1:
> +		fallthrough;
> +	case 2:
> +		fallthrough;
> +	case 4:
> +		WRITE_ONCE(sched_pelt_lshift,
> +			   sched_pelt_multiplier >> 1);
> +		break;
> +	default:
> +		ret = -EINVAL;
> +		goto error;
> +	}
> +
> +	return 0;
> +
> +error:
> +	sched_pelt_multiplier = 1;
> +	return ret;
> +}
> +
> +static const struct kernel_param_ops sched_pelt_multiplier_ops = {
> +	.set = set_sched_pelt_multiplier,
> +	.get = param_get_int,
> +};
> +
> +#ifdef MODULE_PARAM_PREFIX
> +#undef MODULE_PARAM_PREFIX
> +#endif
> +/* XXX: should we use sched as prefix? */
> +#define MODULE_PARAM_PREFIX "kernel."
> +module_param_cb(sched_pelt_multiplier, &sched_pelt_multiplier_ops, &sched_pelt_multiplier, 0444);
> +MODULE_PARM_DESC(sched_pelt_multiplier, "PELT HALFLIFE helps control the responsiveness of the system.");
> +MODULE_PARM_DESC(sched_pelt_multiplier, "Accepted value: 1 32ms PELT HALIFE - roughly 200ms to go from 0 to max performance point (default).");
> +MODULE_PARM_DESC(sched_pelt_multiplier, "                2 16ms PELT HALIFE - roughly 100ms to go from 0 to max performance point.");
> +MODULE_PARM_DESC(sched_pelt_multiplier, "                4  8ms PELT HALIFE - roughly  50ms to go from 0 to max performance point.");
> +
>  /*
>   * Approximate the new util_avg value assuming an entity has continued to run
>   * for @delta us.
> @@ -482,7 +530,7 @@ unsigned long approximate_util_avg(unsigned long util, u64 delta)
>  	if (unlikely(!delta))
>  		return util;
>  
> -	accumulate_sum(delta, &sa, 1, 0, 1);
> +	accumulate_sum(delta << sched_pelt_lshift, &sa, 1, 0, 1);
>  	___update_load_avg(&sa, 0);
>  
>  	return sa.util_avg;
> @@ -494,7 +542,7 @@ unsigned long approximate_util_avg(unsigned long util, u64 delta)
>  u64 approximate_runtime(unsigned long util)
>  {
>  	struct sched_avg sa = {};
> -	u64 delta = 1024; // period = 1024 = ~1ms
> +	u64 delta = 1024 << sched_pelt_lshift; // period = 1024 = ~1ms
>  	u64 runtime = 0;
>  
>  	if (unlikely(!util))
> diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
> index 3a0e0dc28721..9b35b5072bae 100644
> --- a/kernel/sched/pelt.h
> +++ b/kernel/sched/pelt.h
> @@ -61,6 +61,14 @@ static inline void cfs_se_util_change(struct sched_avg *avg)
>  	WRITE_ONCE(avg->util_est.enqueued, enqueued);
>  }
>  
> +static inline u64 rq_clock_task_mult(struct rq *rq)
> +{
> +	lockdep_assert_rq_held(rq);
> +	assert_clock_updated(rq);
> +
> +	return rq->clock_task_mult;
> +}
> +
>  static inline u64 rq_clock_pelt(struct rq *rq)
>  {
>  	lockdep_assert_rq_held(rq);
> @@ -72,7 +80,7 @@ static inline u64 rq_clock_pelt(struct rq *rq)
>  /* The rq is idle, we can sync to clock_task */
>  static inline void _update_idle_rq_clock_pelt(struct rq *rq)
>  {
> -	rq->clock_pelt  = rq_clock_task(rq);
> +	rq->clock_pelt = rq_clock_task_mult(rq);
>  
>  	u64_u32_store(rq->clock_idle, rq_clock(rq));
>  	/* Paired with smp_rmb in migrate_se_pelt_lag() */
> @@ -121,6 +129,27 @@ static inline void update_rq_clock_pelt(struct rq *rq, s64 delta)
>  	rq->clock_pelt += delta;
>  }
>  
> +extern unsigned int sched_pelt_lshift;
> +
> +/*
> + * absolute time   |1      |2      |3      |4      |5      |6      |
> + * @ mult = 1      --------****************--------****************-
> + * @ mult = 2      --------********----------------********---------
> + * @ mult = 4      --------****--------------------****-------------
> + * clock task mult
> + * @ mult = 2      |   |   |2  |3  |   |   |   |   |5  |6  |   |   |
> + * @ mult = 4      | | | | |2|3| | | | | | | | | | |5|6| | | | | | |
> + *
> + */
> +static inline void update_rq_clock_task_mult(struct rq *rq, s64 delta)
> +{
> +	delta <<= READ_ONCE(sched_pelt_lshift);
> +
> +	rq->clock_task_mult += delta;
> +
> +	update_rq_clock_pelt(rq, delta);
> +}
> +
>  /*
>   * When rq becomes idle, we have to check if it has lost idle time
>   * because it was fully busy. A rq is fully used when the /Sum util_sum
> @@ -147,7 +176,7 @@ static inline void update_idle_rq_clock_pelt(struct rq *rq)
>  	 * rq's clock_task.
>  	 */
>  	if (util_sum >= divider)
> -		rq->lost_idle_time += rq_clock_task(rq) - rq->clock_pelt;
> +		rq->lost_idle_time += rq_clock_task_mult(rq) - rq->clock_pelt;
>  
>  	_update_idle_rq_clock_pelt(rq);
>  }
> @@ -218,13 +247,18 @@ update_irq_load_avg(struct rq *rq, u64 running)
>  	return 0;
>  }
>  
> -static inline u64 rq_clock_pelt(struct rq *rq)
> +static inline u64 rq_clock_task_mult(struct rq *rq)
>  {
>  	return rq_clock_task(rq);
>  }
>  
> +static inline u64 rq_clock_pelt(struct rq *rq)
> +{
> +	return rq_clock_task_mult(rq);
> +}
> +
>  static inline void
> -update_rq_clock_pelt(struct rq *rq, s64 delta) { }
> +update_rq_clock_task_mult(struct rq *rq, s64 delta) { }
>  
>  static inline void
>  update_idle_rq_clock_pelt(struct rq *rq) { }
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index bbece0eb053a..a7c89c623250 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1029,6 +1029,7 @@ struct rq {
>  	u64			clock;
>  	/* Ensure that all clocks are in the same cache line */
>  	u64			clock_task ____cacheline_aligned;
> +	u64			clock_task_mult;
>  	u64			clock_pelt;
>  	unsigned long		lost_idle_time;
>  	u64			clock_pelt_idle;
Qais Yousef Jan. 21, 2024, 12:04 a.m. UTC | #2
Hi Ashay

On 01/20/24 13:22, Ashay Jaiswal wrote:
> Hello Qais Yousef,
> 
> We ran few benchmarks with PELT multiplier patch on a Snapdragon 8Gen2
> based internal Android device and we are observing significant
> improvements with PELT8 configuration compared to PELT32.
> 
> Following are some of the benchmark results with PELT32 and PELT8
> configuration:
> 
> +-----------------+---------------+----------------+----------------+
> | Test case                       |     PELT32     |     PELT8      |
> +-----------------+---------------+----------------+----------------+
> |                 |    Overall    |     711543     |     971275     |
> |                 +---------------+----------------+----------------+
> |                 |    CPU        |     193704     |     224378     |
> |                 +---------------+----------------+----------------+
> |ANTUTU V9.3.9    |    GPU        |     284650     |     424774     |
> |                 +---------------+----------------+----------------+
> |                 |    MEM        |     125207     |     160548     |
> |                 +---------------+----------------+----------------+
> |                 |    UX         |     107982     |     161575     |
> +-----------------+---------------+----------------+----------------+
> |                 |   Single core |     1170       |     1268       |
> |GeekBench V5.4.4 +---------------+----------------+----------------+
> |                 |   Multi core  |     2530       |     3797       |
> +-----------------+---------------+----------------+----------------+
> |                 |    Twitter    |     >50 Janks  |     0          |
> |     SCROLL      +---------------+----------------+----------------+
> |                 |    Contacts   |     >30 Janks  |     0          |
> +-----------------+---------------+----------------+----------------+
> 
> Please let us know if you need any support with running any further
> workloads for PELT32/PELT8 experiments, we can help with running the
> experiments.

Thanks a lot for the test results. Was this tried with this patch alone or
with the whole series applied?

Have you tried to tweak each policy's response_time_ms introduced in patch
7 instead? With the series applied, boot with PELT8, record the response time
values for each policy, then boot back into PELT32 and use those values.
Does this produce similar results?

You didn't share power numbers, so I assume the perf gains are more important
than the power cost for you.


Thanks!

--
Qais Yousef
Vincent Guittot Jan. 30, 2024, 5:28 p.m. UTC | #3
On Sun, 28 Jan 2024 at 17:22, Ashay Jaiswal <quic_ashayj@quicinc.com> wrote:
>
> Hello Qais Yousef,
>
> Thank you for your response.
>
> On 1/21/2024 5:34 AM, Qais Yousef wrote:
> > Hi Ashay
> >
> > On 01/20/24 13:22, Ashay Jaiswal wrote:
> >> Hello Qais Yousef,
> >>
> >> We ran few benchmarks with PELT multiplier patch on a Snapdragon 8Gen2
> >> based internal Android device and we are observing significant
> >> improvements with PELT8 configuration compared to PELT32.
> >>
> >> Following are some of the benchmark results with PELT32 and PELT8
> >> configuration:
> >>
> >> +-----------------+---------------+----------------+----------------+
> >> | Test case                       |     PELT32     |     PELT8      |
> >> +-----------------+---------------+----------------+----------------+
> >> |                 |    Overall    |     711543     |     971275     |
> >> |                 +---------------+----------------+----------------+
> >> |                 |    CPU        |     193704     |     224378     |
> >> |                 +---------------+----------------+----------------+
> >> |ANTUTU V9.3.9    |    GPU        |     284650     |     424774     |
> >> |                 +---------------+----------------+----------------+
> >> |                 |    MEM        |     125207     |     160548     |
> >> |                 +---------------+----------------+----------------+
> >> |                 |    UX         |     107982     |     161575     |
> >> +-----------------+---------------+----------------+----------------+
> >> |                 |   Single core |     1170       |     1268       |
> >> |GeekBench V5.4.4 +---------------+----------------+----------------+
> >> |                 |   Multi core  |     2530       |     3797       |
> >> +-----------------+---------------+----------------+----------------+
> >> |                 |    Twitter    |     >50 Janks  |     0          |
> >> |     SCROLL      +---------------+----------------+----------------+
> >> |                 |    Contacts   |     >30 Janks  |     0          |
> >> +-----------------+---------------+----------------+----------------+
> >>
> >> Please let us know if you need any support with running any further
> >> workloads for PELT32/PELT8 experiments, we can help with running the
> >> experiments.
> >
> > Thanks a lot for the test results. Was this tried with this patch alone or
> > the whole series applied?
> >
> I have only applied patch8(sched/pelt: Introduce PELT multiplier) for the tests.
>
> > Have you tried to tweak each policy response_time_ms introduced in patch
> > 7 instead? With the series applied, boot with PELT8, record the response time
> > values for each policy, then boot back again to PELT32 and use those values.
> > Does this produce similar results?
> >
> As the device is based on 5.15 kernel, I will try to pull all the 8 patches
> along with the dependency patches on 5.15 and try out the experiments as
> suggested.

Generally speaking, it would be better to compare with the latest
kernel, or at least one close to it that includes the new features added
since v5.15 (which is more than 2 years old now). I understand that this
is not always easy or doable, but you could be surprised by the benefit
of some features like [0] merged since v5.15.

[0] https://lore.kernel.org/lkml/249816c9-c2b5-8016-f9ce-dab7b7d384e4@arm.com/

>
> > You didn't share power numbers which I assume the perf gains are more important
> > than the power cost for you.
> >
> If possible I will try to collect the power number for future test and share the
> details.
>
> >
> > Thanks!
> >
> > --
> > Qais Yousef
Qais Yousef Feb. 1, 2024, 10:24 p.m. UTC | #4
On 01/30/24 18:38, Vincent Guittot wrote:
> Le vendredi 08 déc. 2023 à 00:23:42 (+0000), Qais Yousef a écrit :
> > From: Vincent Donnefort <vincent.donnefort@arm.com>
> > 
> > The new sched_pelt_multiplier boot param allows a user to set a clock
> > multiplier to x2 or x4 (x1 being the default). This clock multiplier
> > artificially speeds up PELT ramp up/down similarly to use a faster
> > half-life than the default 32ms.
> > 
> >   - x1: 32ms half-life
> >   - x2: 16ms half-life
> >   - x4: 8ms  half-life
> > 
> > Internally, a new clock is created: rq->clock_task_mult. It sits in the
> > clock hierarchy between rq->clock_task and rq->clock_pelt.
> > 
> > The param is set as read only and can only be changed at boot time via
> > 
> > 	kernel.sched_pelt_multiplier=[1, 2, 4]
> > 
> > PELT has a big impact on the overall system response and reactiveness to
> > change. Smaller PELT HF means it'll require less time to reach the
> > maximum performance point of the system when the system become fully
> > busy; and equally shorter time to go back to lowest performance point
> > when the system goes back to idle.
> > 
> > This faster reaction impacts both dvfs response and migration time
> > between clusters in HMP system.
> > 
> > Smaller PELT values are expected to give better performance at the cost
> > of more power. Under powered systems can particularly benefit from
> > smaller values. Powerful systems can still benefit from smaller values
> > if they want to be tuned towards perf more and power is not the major
> > concern for them.
> > 
> > This combined with respone_time_ms from schedutil should give the user
> > and sysadmin a deterministic way to control the triangular power, perf
> > and thermals for their system. The default response_time_ms will half
> > as PELT HF halves.
> > 
> > Update approximate_{util_avg, runtime}() to take into account the PELT
> > HALFLIFE multiplier.
> > 
> > Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
> > Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> > [Converted from sysctl to boot param and updated commit message]
> > Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
> > ---
> >  kernel/sched/core.c  |  2 +-
> >  kernel/sched/pelt.c  | 52 ++++++++++++++++++++++++++++++++++++++++++--
> >  kernel/sched/pelt.h  | 42 +++++++++++++++++++++++++++++++----
> >  kernel/sched/sched.h |  1 +
> >  4 files changed, 90 insertions(+), 7 deletions(-)
> > 
> 
> ...
> 
> > +__read_mostly unsigned int sched_pelt_lshift;
> > +static unsigned int sched_pelt_multiplier = 1;
> > +
> > +static int set_sched_pelt_multiplier(const char *val, const struct kernel_param *kp)
> > +{
> > +	int ret;
> > +
> > +	ret = param_set_int(val, kp);
> > +	if (ret)
> > +		goto error;
> > +
> > +	switch (sched_pelt_multiplier)  {
> > +	case 1:
> > +		fallthrough;
> > +	case 2:
> > +		fallthrough;
> > +	case 4:
> > +		WRITE_ONCE(sched_pelt_lshift,
> > +			   sched_pelt_multiplier >> 1);
> > +		break;
> > +	default:
> > +		ret = -EINVAL;
> > +		goto error;
> > +	}
> > +
> > +	return 0;
> > +
> > +error:
> > +	sched_pelt_multiplier = 1;
> > +	return ret;
> > +}
> > +
> > +static const struct kernel_param_ops sched_pelt_multiplier_ops = {
> > +	.set = set_sched_pelt_multiplier,
> > +	.get = param_get_int,
> > +};
> > +
> > +#ifdef MODULE_PARAM_PREFIX
> > +#undef MODULE_PARAM_PREFIX
> > +#endif
> > +/* XXX: should we use sched as prefix? */
> > +#define MODULE_PARAM_PREFIX "kernel."
> > +module_param_cb(sched_pelt_multiplier, &sched_pelt_multiplier_ops, &sched_pelt_multiplier, 0444);
> > +MODULE_PARM_DESC(sched_pelt_multiplier, "PELT HALFLIFE helps control the responsiveness of the system.");
> > +MODULE_PARM_DESC(sched_pelt_multiplier, "Accepted value: 1 32ms PELT HALIFE - roughly 200ms to go from 0 to max performance point (default).");
> > +MODULE_PARM_DESC(sched_pelt_multiplier, "                2 16ms PELT HALIFE - roughly 100ms to go from 0 to max performance point.");
> > +MODULE_PARM_DESC(sched_pelt_multiplier, "                4  8ms PELT HALIFE - roughly  50ms to go from 0 to max performance point.");
> > +
> >  /*
> >   * Approximate the new util_avg value assuming an entity has continued to run
> >   * for @delta us.
> 
> ...
> 
> > +
> >  static inline void
> > -update_rq_clock_pelt(struct rq *rq, s64 delta) { }
> > +update_rq_clock_task_mult(struct rq *rq, s64 delta) { }
> >  
> >  static inline void
> >  update_idle_rq_clock_pelt(struct rq *rq) { }
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index bbece0eb053a..a7c89c623250 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -1029,6 +1029,7 @@ struct rq {
> >  	u64			clock;
> >  	/* Ensure that all clocks are in the same cache line */
> >  	u64			clock_task ____cacheline_aligned;
> > +	u64			clock_task_mult;
> 
> I'm not sure that we want yet another clock and this doesn't apply for irq_avg.
> 
> What about the below is simpler and I think cover all cases ?

Looks better, yes. I'll change to this, and if no issues come up I'll add your
Signed-off-by for the next version, if that's okay with you.


Thanks!

--
Qais Yousef

> 
> diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
> index f951c44f1d52..5cdd147b7abe 100644
> --- a/kernel/sched/pelt.c
> +++ b/kernel/sched/pelt.c
> @@ -180,6 +180,7 @@ static __always_inline int
>  ___update_load_sum(u64 now, struct sched_avg *sa,
>  		  unsigned long load, unsigned long runnable, int running)
>  {
> +	int time_shift;
>  	u64 delta;
> 
>  	delta = now - sa->last_update_time;
> @@ -195,12 +196,17 @@ ___update_load_sum(u64 now, struct sched_avg *sa,
>  	/*
>  	 * Use 1024ns as the unit of measurement since it's a reasonable
>  	 * approximation of 1us and fast to compute.
> +	 * On top of this, we can change the half-time period from the default
> +	 * 32ms to a shorter value. This is equivalent to left shifting the
> +	 * time.
> +	 * Merge both right and left shifts in one single right shift
>  	 */
> -	delta >>= 10;
> +	time_shift = 10 - sched_pelt_lshift;
> +	delta >>= time_shift;
>  	if (!delta)
>  		return 0;
> 
> -	sa->last_update_time += delta << 10;
> +	sa->last_update_time += delta << time_shift;
> 
>  	/*
>  	 * running is a subset of runnable (weight) so running can't be set if
> 
> 
> 
> >  	u64			clock_pelt;
> >  	unsigned long		lost_idle_time;
> >  	u64			clock_pelt_idle;
> > -- 
> > 2.34.1
> >
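
For reference, the shift folding in the suggested hunk is exact: left-shifting
delta by sched_pelt_lshift and then right-shifting by 10 yields the same value
as a single right shift by (10 - sched_pelt_lshift), provided the intermediate
left shift does not overflow. A stand-alone sketch with made-up values (not
kernel code):

	#include <assert.h>
	#include <stdint.h>

	int main(void)
	{
		uint64_t delta = 123456;	/* ns, arbitrary example */

		for (unsigned int lshift = 0; lshift <= 2; lshift++) {
			/* scale the delta first, then convert ns to ~us */
			uint64_t a = (delta << lshift) >> 10;
			/* merged shift as in the suggested hunk */
			uint64_t b = delta >> (10 - lshift);

			assert(a == b);
		}
		return 0;
	}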
Vincent Guittot Feb. 4, 2024, 11:32 a.m. UTC | #5
On Thu, 1 Feb 2024 at 23:24, Qais Yousef <qyousef@layalina.io> wrote:
>
> On 01/30/24 18:38, Vincent Guittot wrote:
> > Le vendredi 08 déc. 2023 à 00:23:42 (+0000), Qais Yousef a écrit :
> > > From: Vincent Donnefort <vincent.donnefort@arm.com>
> > >
> > > The new sched_pelt_multiplier boot param allows a user to set a clock
> > > multiplier to x2 or x4 (x1 being the default). This clock multiplier
> > > artificially speeds up PELT ramp up/down similarly to use a faster
> > > half-life than the default 32ms.
> > >
> > >   - x1: 32ms half-life
> > >   - x2: 16ms half-life
> > >   - x4: 8ms  half-life
> > >
> > > Internally, a new clock is created: rq->clock_task_mult. It sits in the
> > > clock hierarchy between rq->clock_task and rq->clock_pelt.
> > >
> > > The param is set as read only and can only be changed at boot time via
> > >
> > >     kernel.sched_pelt_multiplier=[1, 2, 4]
> > >
> > > PELT has a big impact on the overall system response and reactiveness to
> > > change. Smaller PELT HF means it'll require less time to reach the
> > > maximum performance point of the system when the system become fully
> > > busy; and equally shorter time to go back to lowest performance point
> > > when the system goes back to idle.
> > >
> > > This faster reaction impacts both dvfs response and migration time
> > > between clusters in HMP system.
> > >
> > > Smaller PELT values are expected to give better performance at the cost
> > > of more power. Under powered systems can particularly benefit from
> > > smaller values. Powerful systems can still benefit from smaller values
> > > if they want to be tuned towards perf more and power is not the major
> > > concern for them.
> > >
> > > This combined with respone_time_ms from schedutil should give the user
> > > and sysadmin a deterministic way to control the triangular power, perf
> > > and thermals for their system. The default response_time_ms will half
> > > as PELT HF halves.
> > >
> > > Update approximate_{util_avg, runtime}() to take into account the PELT
> > > HALFLIFE multiplier.
> > >
> > > Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
> > > Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> > > [Converted from sysctl to boot param and updated commit message]
> > > Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
> > > ---
> > >  kernel/sched/core.c  |  2 +-
> > >  kernel/sched/pelt.c  | 52 ++++++++++++++++++++++++++++++++++++++++++--
> > >  kernel/sched/pelt.h  | 42 +++++++++++++++++++++++++++++++----
> > >  kernel/sched/sched.h |  1 +
> > >  4 files changed, 90 insertions(+), 7 deletions(-)
> > >
> >
> > ...
> >
> > > +__read_mostly unsigned int sched_pelt_lshift;
> > > +static unsigned int sched_pelt_multiplier = 1;
> > > +
> > > +static int set_sched_pelt_multiplier(const char *val, const struct kernel_param *kp)
> > > +{
> > > +   int ret;
> > > +
> > > +   ret = param_set_int(val, kp);
> > > +   if (ret)
> > > +           goto error;
> > > +
> > > +   switch (sched_pelt_multiplier)  {
> > > +   case 1:
> > > +           fallthrough;
> > > +   case 2:
> > > +           fallthrough;
> > > +   case 4:
> > > +           WRITE_ONCE(sched_pelt_lshift,
> > > +                      sched_pelt_multiplier >> 1);
> > > +           break;
> > > +   default:
> > > +           ret = -EINVAL;
> > > +           goto error;
> > > +   }
> > > +
> > > +   return 0;
> > > +
> > > +error:
> > > +   sched_pelt_multiplier = 1;
> > > +   return ret;
> > > +}
> > > +
> > > +static const struct kernel_param_ops sched_pelt_multiplier_ops = {
> > > +   .set = set_sched_pelt_multiplier,
> > > +   .get = param_get_int,
> > > +};
> > > +
> > > +#ifdef MODULE_PARAM_PREFIX
> > > +#undef MODULE_PARAM_PREFIX
> > > +#endif
> > > +/* XXX: should we use sched as prefix? */
> > > +#define MODULE_PARAM_PREFIX "kernel."
> > > +module_param_cb(sched_pelt_multiplier, &sched_pelt_multiplier_ops, &sched_pelt_multiplier, 0444);
> > > +MODULE_PARM_DESC(sched_pelt_multiplier, "PELT HALFLIFE helps control the responsiveness of the system.");
> > > +MODULE_PARM_DESC(sched_pelt_multiplier, "Accepted value: 1 32ms PELT HALIFE - roughly 200ms to go from 0 to max performance point (default).");
> > > +MODULE_PARM_DESC(sched_pelt_multiplier, "                2 16ms PELT HALIFE - roughly 100ms to go from 0 to max performance point.");
> > > +MODULE_PARM_DESC(sched_pelt_multiplier, "                4  8ms PELT HALIFE - roughly  50ms to go from 0 to max performance point.");
> > > +
> > >  /*
> > >   * Approximate the new util_avg value assuming an entity has continued to run
> > >   * for @delta us.
> >
> > ...
> >
> > > +
> > >  static inline void
> > > -update_rq_clock_pelt(struct rq *rq, s64 delta) { }
> > > +update_rq_clock_task_mult(struct rq *rq, s64 delta) { }
> > >
> > >  static inline void
> > >  update_idle_rq_clock_pelt(struct rq *rq) { }
> > > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > > index bbece0eb053a..a7c89c623250 100644
> > > --- a/kernel/sched/sched.h
> > > +++ b/kernel/sched/sched.h
> > > @@ -1029,6 +1029,7 @@ struct rq {
> > >     u64                     clock;
> > >     /* Ensure that all clocks are in the same cache line */
> > >     u64                     clock_task ____cacheline_aligned;
> > > +   u64                     clock_task_mult;
> >
> > I'm not sure that we want yet another clock and this doesn't apply for irq_avg.
> >
> > What about the below is simpler and I think cover all cases ?
>
> Looks better, yes. I'll change to this and if no issues come up I'll add your
> signed-off-by if that's okay with you for the next version.

Yes, that's okay

>
>
> Thanks!
>
> --
> Qais Yousef
>
> >
> > diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
> > index f951c44f1d52..5cdd147b7abe 100644
> > --- a/kernel/sched/pelt.c
> > +++ b/kernel/sched/pelt.c
> > @@ -180,6 +180,7 @@ static __always_inline int
> >  ___update_load_sum(u64 now, struct sched_avg *sa,
> >                 unsigned long load, unsigned long runnable, int running)
> >  {
> > +     int time_shift;
> >       u64 delta;
> >
> >       delta = now - sa->last_update_time;
> > @@ -195,12 +196,17 @@ ___update_load_sum(u64 now, struct sched_avg *sa,
> >       /*
> >        * Use 1024ns as the unit of measurement since it's a reasonable
> >        * approximation of 1us and fast to compute.
> > +      * On top of this, we can change the half-time period from the default
> > +      * 32ms to a shorter value. This is equivalent to left shifting the
> > +      * time.
> > +      * Merge both right and left shifts in one single right shift
> >        */
> > -     delta >>= 10;
> > +     time_shift = 10 - sched_pelt_lshift;
> > +     delta >>= time_shift;
> >       if (!delta)
> >               return 0;
> >
> > -     sa->last_update_time += delta << 10;
> > +     sa->last_update_time += delta << time_shift;
> >
> >       /*
> >        * running is a subset of runnable (weight) so running can't be set if
> >
> >
> >
> > >     u64                     clock_pelt;
> > >     unsigned long           lost_idle_time;
> > >     u64                     clock_pelt_idle;
> > > --
> > > 2.34.1
> > >
Ashay Jaiswal April 12, 2024, 10:06 a.m. UTC | #6
On 2/6/2024 10:37 PM, Ashay Jaiswal wrote:
> 
> 
> On 1/30/2024 10:58 PM, Vincent Guittot wrote:
>> On Sun, 28 Jan 2024 at 17:22, Ashay Jaiswal <quic_ashayj@quicinc.com> wrote:
>>>
>>> Hello Qais Yousef,
>>>
>>> Thank you for your response.
>>>
>>> On 1/21/2024 5:34 AM, Qais Yousef wrote:
>>>> Hi Ashay
>>>>
>>>> On 01/20/24 13:22, Ashay Jaiswal wrote:
>>>>> Hello Qais Yousef,
>>>>>
>>>>> We ran few benchmarks with PELT multiplier patch on a Snapdragon 8Gen2
>>>>> based internal Android device and we are observing significant
>>>>> improvements with PELT8 configuration compared to PELT32.
>>>>>
>>>>> Following are some of the benchmark results with PELT32 and PELT8
>>>>> configuration:
>>>>>
>>>>> +-----------------+---------------+----------------+----------------+
>>>>> | Test case                       |     PELT32     |     PELT8      |
>>>>> +-----------------+---------------+----------------+----------------+
>>>>> |                 |    Overall    |     711543     |     971275     |
>>>>> |                 +---------------+----------------+----------------+
>>>>> |                 |    CPU        |     193704     |     224378     |
>>>>> |                 +---------------+----------------+----------------+
>>>>> |ANTUTU V9.3.9    |    GPU        |     284650     |     424774     |
>>>>> |                 +---------------+----------------+----------------+
>>>>> |                 |    MEM        |     125207     |     160548     |
>>>>> |                 +---------------+----------------+----------------+
>>>>> |                 |    UX         |     107982     |     161575     |
>>>>> +-----------------+---------------+----------------+----------------+
>>>>> |                 |   Single core |     1170       |     1268       |
>>>>> |GeekBench V5.4.4 +---------------+----------------+----------------+
>>>>> |                 |   Multi core  |     2530       |     3797       |
>>>>> +-----------------+---------------+----------------+----------------+
>>>>> |                 |    Twitter    |     >50 Janks  |     0          |
>>>>> |     SCROLL      +---------------+----------------+----------------+
>>>>> |                 |    Contacts   |     >30 Janks  |     0          |
>>>>> +-----------------+---------------+----------------+----------------+
>>>>>
>>>>> Please let us know if you need any support with running any further
>>>>> workloads for PELT32/PELT8 experiments, we can help with running the
>>>>> experiments.
>>>>
>>>> Thanks a lot for the test results. Was this tried with this patch alone or
>>>> the whole series applied?
>>>>
>>> I have only applied patch8(sched/pelt: Introduce PELT multiplier) for the tests.
>>>
>>>> Have you tried to tweak each policy response_time_ms introduced in patch
>>>> 7 instead? With the series applied, boot with PELT8, record the response time
>>>> values for each policy, then boot back again to PELT32 and use those values.
>>>> Does this produce similar results?
>>>>
>>> As the device is based on 5.15 kernel, I will try to pull all the 8 patches
>>> along with the dependency patches on 5.15 and try out the experiments as
>>> suggested.
>>
>> Generally speaking, it would be better to compare with the latest
>> kernel or at least close and which includes new features added since
>> v5.15 (which is more than 2 years old now). I understand that this is
>> not always easy or doable but you could be surprised by the benefit of
>> some features like [0] merged since v5.15
>>
>> [0] https://lore.kernel.org/lkml/249816c9-c2b5-8016-f9ce-dab7b7d384e4@arm.com/
>>
> Thank you Vincent for the suggestion, I will try to get the results on device running
> with most recent kernel and update.
> 
> Thanks,
> Ashay Jaiswal

Hello Qais Yousef and Vincent,

Sorry for the delay; setting up the internal device on the latest kernel is taking more time than anticipated.
We are trying to bring up the latest kernel on the device and will complete the testing with the latest
cpufreq patches as you suggested.

Regarding the PELT multiplier patch [1], are we planning to merge it separately, or will it be merged
together with the cpufreq patches?

[1]: https://lore.kernel.org/all/20231208002342.367117-9-qyousef@layalina.io/

Thanks and Regards,
Ashay Jaiswal

>>>
>>>> You didn't share power numbers which I assume the perf gains are more important
>>>> than the power cost for you.
>>>>
>>> If possible I will try to collect the power number for future test and share the
>>> details.
>>>
>>>>
>>>> Thanks!
>>>>
>>>> --
>>>> Qais Yousef

Patch

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b4a1c8ea9e12..9c8626b4ddff 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -745,7 +745,7 @@  static void update_rq_clock_task(struct rq *rq, s64 delta)
 	if ((irq_delta + steal) && sched_feat(NONTASK_CAPACITY))
 		update_irq_load_avg(rq, irq_delta + steal);
 #endif
-	update_rq_clock_pelt(rq, delta);
+	update_rq_clock_task_mult(rq, delta);
 }
 
 void update_rq_clock(struct rq *rq)
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index 00a1b9c1bf16..0a10e56f76c7 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -468,6 +468,54 @@  int update_irq_load_avg(struct rq *rq, u64 running)
 }
 #endif /* CONFIG_HAVE_SCHED_AVG_IRQ */
 
+__read_mostly unsigned int sched_pelt_lshift;
+static unsigned int sched_pelt_multiplier = 1;
+
+static int set_sched_pelt_multiplier(const char *val, const struct kernel_param *kp)
+{
+	int ret;
+
+	ret = param_set_int(val, kp);
+	if (ret)
+		goto error;
+
+	switch (sched_pelt_multiplier)  {
+	case 1:
+		fallthrough;
+	case 2:
+		fallthrough;
+	case 4:
+		WRITE_ONCE(sched_pelt_lshift,
+			   sched_pelt_multiplier >> 1);
+		break;
+	default:
+		ret = -EINVAL;
+		goto error;
+	}
+
+	return 0;
+
+error:
+	sched_pelt_multiplier = 1;
+	return ret;
+}
+
+static const struct kernel_param_ops sched_pelt_multiplier_ops = {
+	.set = set_sched_pelt_multiplier,
+	.get = param_get_int,
+};
+
+#ifdef MODULE_PARAM_PREFIX
+#undef MODULE_PARAM_PREFIX
+#endif
+/* XXX: should we use sched as prefix? */
+#define MODULE_PARAM_PREFIX "kernel."
+module_param_cb(sched_pelt_multiplier, &sched_pelt_multiplier_ops, &sched_pelt_multiplier, 0444);
+MODULE_PARM_DESC(sched_pelt_multiplier, "PELT HALFLIFE helps control the responsiveness of the system.");
+MODULE_PARM_DESC(sched_pelt_multiplier, "Accepted value: 1 32ms PELT HALIFE - roughly 200ms to go from 0 to max performance point (default).");
+MODULE_PARM_DESC(sched_pelt_multiplier, "                2 16ms PELT HALIFE - roughly 100ms to go from 0 to max performance point.");
+MODULE_PARM_DESC(sched_pelt_multiplier, "                4  8ms PELT HALIFE - roughly  50ms to go from 0 to max performance point.");
+
 /*
  * Approximate the new util_avg value assuming an entity has continued to run
  * for @delta us.
@@ -482,7 +530,7 @@  unsigned long approximate_util_avg(unsigned long util, u64 delta)
 	if (unlikely(!delta))
 		return util;
 
-	accumulate_sum(delta, &sa, 1, 0, 1);
+	accumulate_sum(delta << sched_pelt_lshift, &sa, 1, 0, 1);
 	___update_load_avg(&sa, 0);
 
 	return sa.util_avg;
@@ -494,7 +542,7 @@  unsigned long approximate_util_avg(unsigned long util, u64 delta)
 u64 approximate_runtime(unsigned long util)
 {
 	struct sched_avg sa = {};
-	u64 delta = 1024; // period = 1024 = ~1ms
+	u64 delta = 1024 << sched_pelt_lshift; // period = 1024 = ~1ms
 	u64 runtime = 0;
 
 	if (unlikely(!util))
diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
index 3a0e0dc28721..9b35b5072bae 100644
--- a/kernel/sched/pelt.h
+++ b/kernel/sched/pelt.h
@@ -61,6 +61,14 @@  static inline void cfs_se_util_change(struct sched_avg *avg)
 	WRITE_ONCE(avg->util_est.enqueued, enqueued);
 }
 
+static inline u64 rq_clock_task_mult(struct rq *rq)
+{
+	lockdep_assert_rq_held(rq);
+	assert_clock_updated(rq);
+
+	return rq->clock_task_mult;
+}
+
 static inline u64 rq_clock_pelt(struct rq *rq)
 {
 	lockdep_assert_rq_held(rq);
@@ -72,7 +80,7 @@  static inline u64 rq_clock_pelt(struct rq *rq)
 /* The rq is idle, we can sync to clock_task */
 static inline void _update_idle_rq_clock_pelt(struct rq *rq)
 {
-	rq->clock_pelt  = rq_clock_task(rq);
+	rq->clock_pelt = rq_clock_task_mult(rq);
 
 	u64_u32_store(rq->clock_idle, rq_clock(rq));
 	/* Paired with smp_rmb in migrate_se_pelt_lag() */
@@ -121,6 +129,27 @@  static inline void update_rq_clock_pelt(struct rq *rq, s64 delta)
 	rq->clock_pelt += delta;
 }
 
+extern unsigned int sched_pelt_lshift;
+
+/*
+ * absolute time   |1      |2      |3      |4      |5      |6      |
+ * @ mult = 1      --------****************--------****************-
+ * @ mult = 2      --------********----------------********---------
+ * @ mult = 4      --------****--------------------****-------------
+ * clock task mult
+ * @ mult = 2      |   |   |2  |3  |   |   |   |   |5  |6  |   |   |
+ * @ mult = 4      | | | | |2|3| | | | | | | | | | |5|6| | | | | | |
+ *
+ */
+static inline void update_rq_clock_task_mult(struct rq *rq, s64 delta)
+{
+	delta <<= READ_ONCE(sched_pelt_lshift);
+
+	rq->clock_task_mult += delta;
+
+	update_rq_clock_pelt(rq, delta);
+}
+
 /*
  * When rq becomes idle, we have to check if it has lost idle time
  * because it was fully busy. A rq is fully used when the /Sum util_sum
@@ -147,7 +176,7 @@  static inline void update_idle_rq_clock_pelt(struct rq *rq)
 	 * rq's clock_task.
 	 */
 	if (util_sum >= divider)
-		rq->lost_idle_time += rq_clock_task(rq) - rq->clock_pelt;
+		rq->lost_idle_time += rq_clock_task_mult(rq) - rq->clock_pelt;
 
 	_update_idle_rq_clock_pelt(rq);
 }
@@ -218,13 +247,18 @@  update_irq_load_avg(struct rq *rq, u64 running)
 	return 0;
 }
 
-static inline u64 rq_clock_pelt(struct rq *rq)
+static inline u64 rq_clock_task_mult(struct rq *rq)
 {
 	return rq_clock_task(rq);
 }
 
+static inline u64 rq_clock_pelt(struct rq *rq)
+{
+	return rq_clock_task_mult(rq);
+}
+
 static inline void
-update_rq_clock_pelt(struct rq *rq, s64 delta) { }
+update_rq_clock_task_mult(struct rq *rq, s64 delta) { }
 
 static inline void
 update_idle_rq_clock_pelt(struct rq *rq) { }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index bbece0eb053a..a7c89c623250 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1029,6 +1029,7 @@  struct rq {
 	u64			clock;
 	/* Ensure that all clocks are in the same cache line */
 	u64			clock_task ____cacheline_aligned;
+	u64			clock_task_mult;
 	u64			clock_pelt;
 	unsigned long		lost_idle_time;
 	u64			clock_pelt_idle;
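
A rough user-space model of what the new update_rq_clock_task_mult() does to
the PELT clock (names mirror the patch, but the surrounding kernel machinery
is omitted; this is a sketch for illustration only):

	#include <stdint.h>
	#include <stdio.h>

	/* mirrors sched_pelt_lshift: 0 -> x1, 1 -> x2, 2 -> x4 */
	static unsigned int sched_pelt_lshift = 1;

	static uint64_t clock_task;
	static uint64_t clock_task_mult;

	/* simplified: ignores the capacity/frequency scaling of clock_pelt */
	static void update_clocks(int64_t delta_ns)
	{
		clock_task      += delta_ns;
		clock_task_mult += delta_ns << sched_pelt_lshift;
	}

	int main(void)
	{
		update_clocks(4000000);		/* a 4ms task delta */
		printf("clock_task=%llu ns, clock_task_mult=%llu ns\n",
		       (unsigned long long)clock_task,
		       (unsigned long long)clock_task_mult);
		return 0;
	}

With the x2 multiplier the multiplied clock advances 8ms for every 4ms of task
time, so the load tracking signals accumulate and decay as if twice as much
time had passed, which is what shortens the effective half-life.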