diff mbox series

[v2] sched: Consolidate cpufreq updates

Message ID 20240505233103.168766-1-qyousef@layalina.io
State New
Headers show
Series [v2] sched: Consolidate cpufreq updates | expand

Commit Message

Qais Yousef May 5, 2024, 11:31 p.m. UTC
Improve the interaction with cpufreq governors by making the
cpufreq_update_util() calls more intentional.

At the moment we send them when load is updated for CFS, bandwidth for
DL and at enqueue/dequeue for RT. But this can lead to too many updates
sent in a short period of time and potentially be ignored at a critical
moment due to the rate_limit_us in schedutil.

For example, simultaneous task enqueue on the CPU where 2nd task is
bigger and requires higher freq. The trigger to cpufreq_update_util() by
the first task will lead to dropping the 2nd request until tick. Or
another CPU in the same policy triggers a freq update shortly after.

Updates at enqueue for RT are not strictly required. Though they do help
to reduce the delay for switching the frequency and the potential
observation of lower frequency during this delay. But current logic
doesn't intentionally (at least to my understanding) try to speed up the
request.

To help reduce the amount of cpufreq updates and make them more
purposeful, consolidate them into these locations:

1. context_switch()
2. task_tick_fair()
3. {attach, detach}_entity_load_avg()
4. update_blocked_averages()

The update at context switch should help guarantee that DL and RT get
the right frequency straightaway when they're RUNNING. As mentioned
though the update will happen slightly after enqueue_task(); though in
an ideal world these tasks should be RUNNING ASAP and this additional
delay should be negligible. For fair tasks we need to make sure we send
a single update for every decay for the root cfs_rq. Any changes to the
rq will be deferred until the next task is ready to run, or we hit TICK.
But we are guaranteed the task is running at a level that meets its
requirements after enqueue.

To guarantee RT and DL tasks updates are never missed, we add a new
SCHED_CPUFREQ_FORCE_UPDATE to ignore the rate_limit_us. If we are
already running at the right freq, the governor will end up doing
nothing, but we eliminate the risk of the task ending up accidentally
running at the wrong freq due to rate_limit_us.

Similarly for iowait boost, we ignore rate limits. We also handle a case
of a boost reset prematurely by adding a guard in sugov_iowait_apply()
to reduce the boost after 1ms which seems iowait boost mechanism relied
on rate_limit_us and cfs_rq.decay preventing any updates to happen soon
after iowait boost.

The new SCHED_CPUFREQ_FORCE_UPDATE should not impact the rate limit
time stamps otherwise we can end up delaying updates for normal
requests.

As a simple optimization, we avoid sending cpufreq updates when
switching from RT to another RT as RT tasks run at max freq by default.
If CONFIG_UCLAMP_TASK is enabled, we can do a simple check to see if
uclamp_min is different to avoid unnecessary cpufreq update as most RT
tasks are likely to be running at the same performance level, so we can
avoid unnecessary overhead of forced updates when there's nothing to do.

We also also ensure to ignore cpufreq udpates for sugov workers at
context switch. It doesn't make sense for the kworker that applies the
frequency update (which is a DL task) to trigger a frequency update
itself.

The update at task_tick_fair will guarantee that the governor will
follow any updates to load for tasks/CPU or due to new enqueues/dequeues
to the rq. Since DL and RT always run at constant frequencies and have
no load tracking, this is only required for fair tasks.

The update at attach/detach_entity_load_avg() will ensure we adapt to
big changes when tasks are added/removed from cgroups.

The update at update_blocked_averages() will ensure we decay frequency
as the CPU becomes idle for long enough.

Results of

	taskset 1 perf stat --repeat 10 -e cycles,instructions,task-clock perf bench sched pipe

on AMD 3900X to verify any potential overhead because of the addition at
context switch against v6.8.7 stable kernel

v6.8.7: schedutil:
------------------

 Performance counter stats for 'perf bench sched pipe' (10 runs):

       850,276,689      cycles:u                  #    0.078 GHz                      ( +-  0.88% )
        82,724,245      instructions:u            #    0.10  insn per cycle           ( +-  0.00% )
         10,881.41 msec task-clock:u              #    0.995 CPUs utilized            ( +-  0.12% )

           10.9377 +- 0.0135 seconds time elapsed  ( +-  0.12% )

v6.8.7: performance:
--------------------

 Performance counter stats for 'perf bench sched pipe' (10 runs):

       874,154,415      cycles:u                  #    0.080 GHz                      ( +-  0.78% )
        82,724,420      instructions:u            #    0.10  insn per cycle           ( +-  0.00% )
         10,916.47 msec task-clock:u              #    0.999 CPUs utilized            ( +-  0.09% )

           10.9308 +- 0.0100 seconds time elapsed  ( +-  0.09% )

v6.8.7+patch: schedutil:
------------------------

 Performance counter stats for 'perf bench sched pipe' (10 runs):

       816,938,281      cycles:u                  #    0.075 GHz                      ( +-  0.84% )
        82,724,163      instructions:u            #    0.10  insn per cycle           ( +-  0.00% )
         10,907.62 msec task-clock:u              #    1.004 CPUs utilized            ( +-  0.11% )

           10.8627 +- 0.0121 seconds time elapsed  ( +-  0.11% )

v6.8.7+patch: performance:
--------------------------

 Performance counter stats for 'perf bench sched pipe' (10 runs):

       814,038,416      cycles:u                  #    0.074 GHz                      ( +-  1.21% )
        82,724,356      instructions:u            #    0.10  insn per cycle           ( +-  0.00% )
         10,886.69 msec task-clock:u              #    0.996 CPUs utilized            ( +-  0.17% )

           10.9298 +- 0.0181 seconds time elapsed  ( +-  0.17% )

Note worthy that we still have the following race condition on systems
that have shared policy:

* CPUs with shared policy can end up sending simultaneous cpufreq
  updates requests where the 2nd one will be unlucky and get blocked by
  the rate_limit_us (schedutil).

We can potentially address this limitation later, but it is out of the
scope of this patch.

Signed-off-by: Qais Yousef <qyousef@layalina.io>
---

Changes since v1:

	* Use taskset and measure with performance governor as Ingo suggested
	* Remove the static key as I found out we always register a function
	  for cpu_dbs in cpufreq_governor.c; and as Christian pointed out it
	  trigger a lock debug warning.
	* Improve detection of sugov workers by using SCHED_FLAG_SUGOV
	* Guard against NSEC_PER_MSEC instead of TICK_USEC to avoid prematurely
	  reducing iowait boost as the latter was a NOP and like
	  sugov_iowait_reset() like Christian pointed out.

v1 discussion: https://lore.kernel.org/all/20240324020139.1032473-1-qyousef@layalina.io/

 include/linux/sched/cpufreq.h    |  3 +-
 kernel/sched/core.c              | 68 +++++++++++++++++++++++++++++++-
 kernel/sched/cpufreq_schedutil.c | 55 +++++++++++++++++++-------
 kernel/sched/deadline.c          |  4 --
 kernel/sched/fair.c              | 53 ++++---------------------
 kernel/sched/rt.c                |  8 +---
 kernel/sched/sched.h             |  5 +++
 7 files changed, 122 insertions(+), 74 deletions(-)

Comments

Peter Zijlstra May 6, 2024, 10:05 a.m. UTC | #1
On Mon, May 06, 2024 at 12:31:03AM +0100, Qais Yousef wrote:

> +static inline void update_cpufreq_ctx_switch(struct rq *rq, struct task_struct *prev)
> +{
> +#ifdef CONFIG_CPU_FREQ
> +	unsigned int flags = 0;
> +
> +#ifdef CONFIG_SMP
> +	if (unlikely(current->sched_class == &stop_sched_class))
> +		return;
> +#endif

why do we care about the stop class? It shouldn't, in general, consume a
lot of cycles.

> +
> +	if (unlikely(current->sched_class == &idle_sched_class))
> +		return;

And why do we care about idle? Specifically this test doesn't capture
force-idle threads. Notably see is_idle_task().

> +
> +	if (unlikely(task_has_idle_policy(current)))
> +		return;
> +
> +	if (likely(fair_policy(current->policy))) {
> +
> +		if (unlikely(current->in_iowait)) {
> +			flags |= SCHED_CPUFREQ_IOWAIT | SCHED_CPUFREQ_FORCE_UPDATE;
> +			goto force_update;
> +		}
> +
> +#ifdef CONFIG_SMP
> +		/*
> +		 * Allow cpufreq updates once for every update_load_avg() decay.
> +		 */
> +		if (unlikely(rq->cfs.decayed)) {
> +			rq->cfs.decayed = false;
> +			goto force_update;
> +		}
> +#endif
> +		return;
> +	}
> +
> +	/*
> +	 * RT and DL should always send a freq update. But we can do some
> +	 * simple checks to avoid it when we know it's not necessary.
> +	 */
> +	if (rt_task(current) && rt_task(prev)) {

IIRC dl tasks also match rt_task, so your else clause might not work the
way you've intended.

> +#ifdef CONFIG_UCLAMP_TASK
> +		unsigned long curr_uclamp_min = uclamp_eff_value(current, UCLAMP_MIN);
> +		unsigned long prev_uclamp_min = uclamp_eff_value(prev, UCLAMP_MIN);
> +
> +		if (curr_uclamp_min == prev_uclamp_min)
> +#endif
> +			return;
> +	} else if (dl_task(current) && current->dl.flags & SCHED_FLAG_SUGOV) {

Notably DL tasks also match rt_task(), so I don't think this clause
exactly does as you expect. Also, isn't the flags check sufficient on
it's own?

> +		/* Ignore sugov kthreads, they're responding to our requests */
> +		return;
> +	}
> +
> +	flags |= SCHED_CPUFREQ_FORCE_UPDATE;
> +
> +force_update:
> +	cpufreq_update_util(rq, flags);
> +#endif
> +}

But over-all the thing seems very messy, mixing sched_class, policy and
prio based selection methods.

Can't this be cleaned up somewhat?


Notably, if you structure it something like so:

	if (fair_policy(current)) {
		...
		return;
	}

	if (rt_policy(current)) {
		if (dl_task(current) && current->dl.flags & SCHED_FLAG_SUGOV)
			return;
		if (rt_policy(prev) && uclamps_match(current, prev))
			return;
		...
		return;
	}

	/* everybody else gets nothing */
	return;

You get a lot less branches in the common paths, no?
Qais Yousef May 7, 2024, 12:56 a.m. UTC | #2
On 05/06/24 12:05, Peter Zijlstra wrote:
> On Mon, May 06, 2024 at 12:31:03AM +0100, Qais Yousef wrote:
> 
> > +static inline void update_cpufreq_ctx_switch(struct rq *rq, struct task_struct *prev)
> > +{
> > +#ifdef CONFIG_CPU_FREQ
> > +	unsigned int flags = 0;
> > +
> > +#ifdef CONFIG_SMP
> > +	if (unlikely(current->sched_class == &stop_sched_class))
> > +		return;
> > +#endif
> 
> why do we care about the stop class? It shouldn't, in general, consume a
> lot of cycles.
> 
> > +
> > +	if (unlikely(current->sched_class == &idle_sched_class))
> > +		return;
> 
> And why do we care about idle? Specifically this test doesn't capture
> force-idle threads. Notably see is_idle_task().

It's just We don't want these tasks to 'pollute' cpufreq updates since they
shouldn't care or contribute to what frequency the CPU should be running at.

Yes I missed the is_idle_task() from the exclusion list - which can be
simplified as you suggest later.

> 
> > +
> > +	if (unlikely(task_has_idle_policy(current)))
> > +		return;
> > +
> > +	if (likely(fair_policy(current->policy))) {
> > +
> > +		if (unlikely(current->in_iowait)) {
> > +			flags |= SCHED_CPUFREQ_IOWAIT | SCHED_CPUFREQ_FORCE_UPDATE;
> > +			goto force_update;
> > +		}
> > +
> > +#ifdef CONFIG_SMP
> > +		/*
> > +		 * Allow cpufreq updates once for every update_load_avg() decay.
> > +		 */
> > +		if (unlikely(rq->cfs.decayed)) {
> > +			rq->cfs.decayed = false;
> > +			goto force_update;
> > +		}
> > +#endif
> > +		return;
> > +	}
> > +
> > +	/*
> > +	 * RT and DL should always send a freq update. But we can do some
> > +	 * simple checks to avoid it when we know it's not necessary.
> > +	 */
> > +	if (rt_task(current) && rt_task(prev)) {
> 
> IIRC dl tasks also match rt_task, so your else clause might not work the
> way you've intended.
> 
> > +#ifdef CONFIG_UCLAMP_TASK
> > +		unsigned long curr_uclamp_min = uclamp_eff_value(current, UCLAMP_MIN);
> > +		unsigned long prev_uclamp_min = uclamp_eff_value(prev, UCLAMP_MIN);
> > +
> > +		if (curr_uclamp_min == prev_uclamp_min)
> > +#endif
> > +			return;
> > +	} else if (dl_task(current) && current->dl.flags & SCHED_FLAG_SUGOV) {
> 
> Notably DL tasks also match rt_task(), so I don't think this clause

Hmm yes. dl priority is negative and rt_task() will capture this. Shouldn't we
fix the function? Can send a separate patch.

	static inline int rt_task(struct task_struct *p)
	{
		return rt_prio(p->prio) && !dl_prio();
	}

> exactly does as you expect. Also, isn't the flags check sufficient on
> it's own?

I considered this, but opted to keep the dl_task() reservedly assuming access
to dl structure should only be considered valid for dl tasks. It seemed safer
to me against potential future changes to the access pattern.

Happy to drop it if this is too reserved.

> 
> > +		/* Ignore sugov kthreads, they're responding to our requests */
> > +		return;
> > +	}
> > +
> > +	flags |= SCHED_CPUFREQ_FORCE_UPDATE;
> > +
> > +force_update:
> > +	cpufreq_update_util(rq, flags);
> > +#endif
> > +}
> 
> But over-all the thing seems very messy, mixing sched_class, policy and
> prio based selection methods.

Yeah, I started with basic conditions and started walking my way on what things
should be excluded. We don't have fair_task() so used fair_policy() and out of
habit continued to use rt/dl_task() without realizing the caveat you
highlighted.

> 
> Can't this be cleaned up somewhat?
> 
> 
> Notably, if you structure it something like so:
> 
> 	if (fair_policy(current)) {
> 		...
> 		return;
> 	}
> 
> 	if (rt_policy(current)) {
> 		if (dl_task(current) && current->dl.flags & SCHED_FLAG_SUGOV)
> 			return;
> 		if (rt_policy(prev) && uclamps_match(current, prev))
> 			return;
> 		...
> 		return;
> 	}
> 
> 	/* everybody else gets nothing */
> 	return;
> 
> You get a lot less branches in the common paths, no?

Yes. How about this? Since stopper class appears as RT, we should still check
for this class specifically.

Thanks!

--->8---

static inline void update_cpufreq_ctx_switch(struct rq *rq, struct task_struct *prev)
{
#ifdef CONFIG_CPU_FREQ
	if (likely(fair_policy(current->policy))) {

		if (unlikely(current->in_iowait)) {
			cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT | SCHED_CPUFREQ_FORCE_UPDATE);
			return;
		}

#ifdef CONFIG_SMP
		/*
		 * Allow cpufreq updates once for every update_load_avg() decay.
		 */
		if (unlikely(rq->cfs.decayed)) {
			rq->cfs.decayed = false;
			cpufreq_update_util(rq, 0);
			return;
		}
#endif
		return;
	}

	/*
	 * RT and DL should always send a freq update. But we can do some
	 * simple checks to avoid it when we know it's not necessary.
	 */
	if (task_is_realtime(current)) {
		if (dl_task(current) && current->dl.flags & SCHED_FLAG_SUGOV) {
			/* Ignore sugov kthreads, they're responding to our requests */
			return;
		}

		if (rt_task(current) && rt_task(prev)) {
#ifdef CONFIG_UCLAMP_TASK
			unsigned long curr_uclamp_min = uclamp_eff_value(current, UCLAMP_MIN);
			unsigned long prev_uclamp_min = uclamp_eff_value(prev, UCLAMP_MIN);

			if (curr_uclamp_min == prev_uclamp_min)
#endif
				return;
		}

#ifdef CONFIG_SMP
		if (unlikely(current->sched_class == &stop_sched_class))
			return;
#endif

		cpufreq_update_util(rq, SCHED_CPUFREQ_FORCE_UPDATE);
		return;
	}

	/* Everything else shouldn't trigger a cpufreq update */
	return;
#endif
}
Peter Zijlstra May 7, 2024, 8:02 a.m. UTC | #3
On Tue, May 07, 2024 at 01:56:59AM +0100, Qais Yousef wrote:

> Yes. How about this? Since stopper class appears as RT, we should still check
> for this class specifically.

Much nicer!

> static inline void update_cpufreq_ctx_switch(struct rq *rq, struct task_struct *prev)
> {
> #ifdef CONFIG_CPU_FREQ
> 	if (likely(fair_policy(current->policy))) {
> 
> 		if (unlikely(current->in_iowait)) {
> 			cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT | SCHED_CPUFREQ_FORCE_UPDATE);
> 			return;
> 		}
> 
> #ifdef CONFIG_SMP
> 		/*
> 		 * Allow cpufreq updates once for every update_load_avg() decay.
> 		 */
> 		if (unlikely(rq->cfs.decayed)) {
> 			rq->cfs.decayed = false;
> 			cpufreq_update_util(rq, 0);
> 			return;
> 		}
> #endif
> 		return;
> 	}
> 
> 	/*
> 	 * RT and DL should always send a freq update. But we can do some
> 	 * simple checks to avoid it when we know it's not necessary.
> 	 */
> 	if (task_is_realtime(current)) {
> 		if (dl_task(current) && current->dl.flags & SCHED_FLAG_SUGOV) {
> 			/* Ignore sugov kthreads, they're responding to our requests */
> 			return;
> 		}
> 
> 		if (rt_task(current) && rt_task(prev)) {

doesn't task_is_realtime() impy rt_task() ?

Also, this clause still includes DL tasks, is that okay?

> #ifdef CONFIG_UCLAMP_TASK
> 			unsigned long curr_uclamp_min = uclamp_eff_value(current, UCLAMP_MIN);
> 			unsigned long prev_uclamp_min = uclamp_eff_value(prev, UCLAMP_MIN);
> 
> 			if (curr_uclamp_min == prev_uclamp_min)
> #endif
> 				return;
> 		}
> 
> #ifdef CONFIG_SMP
> 		if (unlikely(current->sched_class == &stop_sched_class))
> 			return;
> #endif
> 
> 		cpufreq_update_util(rq, SCHED_CPUFREQ_FORCE_UPDATE);
> 		return;
> 	}
> 
> 	/* Everything else shouldn't trigger a cpufreq update */
> 	return;
> #endif
> }
Vincent Guittot May 7, 2024, 8:58 a.m. UTC | #4
On Mon, 6 May 2024 at 01:31, Qais Yousef <qyousef@layalina.io> wrote:
>
> Improve the interaction with cpufreq governors by making the
> cpufreq_update_util() calls more intentional.
>
> At the moment we send them when load is updated for CFS, bandwidth for
> DL and at enqueue/dequeue for RT. But this can lead to too many updates
> sent in a short period of time and potentially be ignored at a critical
> moment due to the rate_limit_us in schedutil.
>
> For example, simultaneous task enqueue on the CPU where 2nd task is
> bigger and requires higher freq. The trigger to cpufreq_update_util() by
> the first task will lead to dropping the 2nd request until tick. Or
> another CPU in the same policy triggers a freq update shortly after.
>
> Updates at enqueue for RT are not strictly required. Though they do help
> to reduce the delay for switching the frequency and the potential
> observation of lower frequency during this delay. But current logic
> doesn't intentionally (at least to my understanding) try to speed up the
> request.
>
> To help reduce the amount of cpufreq updates and make them more
> purposeful, consolidate them into these locations:
>
> 1. context_switch()

I don't see any cpufreq update when switching from idle to CFS. We
have to wait for the next tick to get a freq update whatever the value
of util_est and uclamp

> 2. task_tick_fair()

Updating only during tick is ok with a tick at 1000hz/1000us when we
compare it with the1048us slice of pelt but what about 4ms or even
10ms tick ? we can have an increase of almost 200 in 10ms

> 3. {attach, detach}_entity_load_avg()

At enqueue/dequeue, the util_est will be updated and can make cpu
utilization quite different especially with long sleeping tasks. The
same applies for uclamp_min/max hints of a newly enqueued task. We
might end up waiting 4/10ms depending of the tick period.

> 4. update_blocked_averages()
>
> The update at context switch should help guarantee that DL and RT get
> the right frequency straightaway when they're RUNNING. As mentioned
> though the update will happen slightly after enqueue_task(); though in
> an ideal world these tasks should be RUNNING ASAP and this additional
> delay should be negligible. For fair tasks we need to make sure we send
> a single update for every decay for the root cfs_rq. Any changes to the
> rq will be deferred until the next task is ready to run, or we hit TICK.
> But we are guaranteed the task is running at a level that meets its
> requirements after enqueue.
>
> To guarantee RT and DL tasks updates are never missed, we add a new
> SCHED_CPUFREQ_FORCE_UPDATE to ignore the rate_limit_us. If we are
> already running at the right freq, the governor will end up doing
> nothing, but we eliminate the risk of the task ending up accidentally
> running at the wrong freq due to rate_limit_us.
>
> Similarly for iowait boost, we ignore rate limits. We also handle a case
> of a boost reset prematurely by adding a guard in sugov_iowait_apply()
> to reduce the boost after 1ms which seems iowait boost mechanism relied
> on rate_limit_us and cfs_rq.decay preventing any updates to happen soon
> after iowait boost.
>
> The new SCHED_CPUFREQ_FORCE_UPDATE should not impact the rate limit
> time stamps otherwise we can end up delaying updates for normal
> requests.
>
> As a simple optimization, we avoid sending cpufreq updates when
> switching from RT to another RT as RT tasks run at max freq by default.
> If CONFIG_UCLAMP_TASK is enabled, we can do a simple check to see if
> uclamp_min is different to avoid unnecessary cpufreq update as most RT
> tasks are likely to be running at the same performance level, so we can
> avoid unnecessary overhead of forced updates when there's nothing to do.
>
> We also also ensure to ignore cpufreq udpates for sugov workers at
> context switch. It doesn't make sense for the kworker that applies the
> frequency update (which is a DL task) to trigger a frequency update
> itself.
>
> The update at task_tick_fair will guarantee that the governor will
> follow any updates to load for tasks/CPU or due to new enqueues/dequeues
> to the rq. Since DL and RT always run at constant frequencies and have
> no load tracking, this is only required for fair tasks.
>
> The update at attach/detach_entity_load_avg() will ensure we adapt to
> big changes when tasks are added/removed from cgroups.
>
> The update at update_blocked_averages() will ensure we decay frequency
> as the CPU becomes idle for long enough.
>
> Results of
>
>         taskset 1 perf stat --repeat 10 -e cycles,instructions,task-clock perf bench sched pipe
>
> on AMD 3900X to verify any potential overhead because of the addition at
> context switch against v6.8.7 stable kernel
>
> v6.8.7: schedutil:
> ------------------
>
>  Performance counter stats for 'perf bench sched pipe' (10 runs):
>
>        850,276,689      cycles:u                  #    0.078 GHz                      ( +-  0.88% )
>         82,724,245      instructions:u            #    0.10  insn per cycle           ( +-  0.00% )
>          10,881.41 msec task-clock:u              #    0.995 CPUs utilized            ( +-  0.12% )
>
>            10.9377 +- 0.0135 seconds time elapsed  ( +-  0.12% )
>
> v6.8.7: performance:
> --------------------
>
>  Performance counter stats for 'perf bench sched pipe' (10 runs):
>
>        874,154,415      cycles:u                  #    0.080 GHz                      ( +-  0.78% )
>         82,724,420      instructions:u            #    0.10  insn per cycle           ( +-  0.00% )
>          10,916.47 msec task-clock:u              #    0.999 CPUs utilized            ( +-  0.09% )
>
>            10.9308 +- 0.0100 seconds time elapsed  ( +-  0.09% )
>
> v6.8.7+patch: schedutil:
> ------------------------
>
>  Performance counter stats for 'perf bench sched pipe' (10 runs):
>
>        816,938,281      cycles:u                  #    0.075 GHz                      ( +-  0.84% )
>         82,724,163      instructions:u            #    0.10  insn per cycle           ( +-  0.00% )
>          10,907.62 msec task-clock:u              #    1.004 CPUs utilized            ( +-  0.11% )
>
>            10.8627 +- 0.0121 seconds time elapsed  ( +-  0.11% )
>
> v6.8.7+patch: performance:
> --------------------------
>
>  Performance counter stats for 'perf bench sched pipe' (10 runs):
>
>        814,038,416      cycles:u                  #    0.074 GHz                      ( +-  1.21% )
>         82,724,356      instructions:u            #    0.10  insn per cycle           ( +-  0.00% )
>          10,886.69 msec task-clock:u              #    0.996 CPUs utilized            ( +-  0.17% )
>
>            10.9298 +- 0.0181 seconds time elapsed  ( +-  0.17% )
>
> Note worthy that we still have the following race condition on systems
> that have shared policy:
>
> * CPUs with shared policy can end up sending simultaneous cpufreq
>   updates requests where the 2nd one will be unlucky and get blocked by
>   the rate_limit_us (schedutil).
>
> We can potentially address this limitation later, but it is out of the
> scope of this patch.
>
> Signed-off-by: Qais Yousef <qyousef@layalina.io>
> ---
>
> Changes since v1:
>
>         * Use taskset and measure with performance governor as Ingo suggested
>         * Remove the static key as I found out we always register a function
>           for cpu_dbs in cpufreq_governor.c; and as Christian pointed out it
>           trigger a lock debug warning.
>         * Improve detection of sugov workers by using SCHED_FLAG_SUGOV
>         * Guard against NSEC_PER_MSEC instead of TICK_USEC to avoid prematurely
>           reducing iowait boost as the latter was a NOP and like
>           sugov_iowait_reset() like Christian pointed out.
>
> v1 discussion: https://lore.kernel.org/all/20240324020139.1032473-1-qyousef@layalina.io/
>
>  include/linux/sched/cpufreq.h    |  3 +-
>  kernel/sched/core.c              | 68 +++++++++++++++++++++++++++++++-
>  kernel/sched/cpufreq_schedutil.c | 55 +++++++++++++++++++-------
>  kernel/sched/deadline.c          |  4 --
>  kernel/sched/fair.c              | 53 ++++---------------------
>  kernel/sched/rt.c                |  8 +---
>  kernel/sched/sched.h             |  5 +++
>  7 files changed, 122 insertions(+), 74 deletions(-)
>
> diff --git a/include/linux/sched/cpufreq.h b/include/linux/sched/cpufreq.h
> index bdd31ab93bc5..2d0a45aba16f 100644
> --- a/include/linux/sched/cpufreq.h
> +++ b/include/linux/sched/cpufreq.h
> @@ -8,7 +8,8 @@
>   * Interface between cpufreq drivers and the scheduler:
>   */
>
> -#define SCHED_CPUFREQ_IOWAIT   (1U << 0)
> +#define SCHED_CPUFREQ_IOWAIT           (1U << 0)
> +#define SCHED_CPUFREQ_FORCE_UPDATE     (1U << 1) /* ignore transition_delay_us */
>
>  #ifdef CONFIG_CPU_FREQ
>  struct cpufreq_policy;
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 1a914388144a..e6fe7dbd1f89 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5134,6 +5134,65 @@ static inline void balance_callbacks(struct rq *rq, struct balance_callback *hea
>
>  #endif
>
> +static inline void update_cpufreq_ctx_switch(struct rq *rq, struct task_struct *prev)
> +{
> +#ifdef CONFIG_CPU_FREQ
> +       unsigned int flags = 0;
> +
> +#ifdef CONFIG_SMP
> +       if (unlikely(current->sched_class == &stop_sched_class))
> +               return;
> +#endif
> +
> +       if (unlikely(current->sched_class == &idle_sched_class))
> +               return;
> +
> +       if (unlikely(task_has_idle_policy(current)))
> +               return;
> +
> +       if (likely(fair_policy(current->policy))) {
> +
> +               if (unlikely(current->in_iowait)) {
> +                       flags |= SCHED_CPUFREQ_IOWAIT | SCHED_CPUFREQ_FORCE_UPDATE;
> +                       goto force_update;
> +               }
> +
> +#ifdef CONFIG_SMP
> +               /*
> +                * Allow cpufreq updates once for every update_load_avg() decay.
> +                */
> +               if (unlikely(rq->cfs.decayed)) {
> +                       rq->cfs.decayed = false;
> +                       goto force_update;
> +               }
> +#endif
> +               return;
> +       }
> +
> +       /*
> +        * RT and DL should always send a freq update. But we can do some
> +        * simple checks to avoid it when we know it's not necessary.
> +        */
> +       if (rt_task(current) && rt_task(prev)) {
> +#ifdef CONFIG_UCLAMP_TASK
> +               unsigned long curr_uclamp_min = uclamp_eff_value(current, UCLAMP_MIN);
> +               unsigned long prev_uclamp_min = uclamp_eff_value(prev, UCLAMP_MIN);
> +
> +               if (curr_uclamp_min == prev_uclamp_min)
> +#endif
> +                       return;
> +       } else if (dl_task(current) && current->dl.flags & SCHED_FLAG_SUGOV) {
> +               /* Ignore sugov kthreads, they're responding to our requests */
> +               return;
> +       }
> +
> +       flags |= SCHED_CPUFREQ_FORCE_UPDATE;
> +
> +force_update:
> +       cpufreq_update_util(rq, flags);
> +#endif
> +}
> +
>  static inline void
>  prepare_lock_switch(struct rq *rq, struct task_struct *next, struct rq_flags *rf)
>  {
> @@ -5151,7 +5210,7 @@ prepare_lock_switch(struct rq *rq, struct task_struct *next, struct rq_flags *rf
>  #endif
>  }
>
> -static inline void finish_lock_switch(struct rq *rq)
> +static inline void finish_lock_switch(struct rq *rq, struct task_struct *prev)
>  {
>         /*
>          * If we are tracking spinlock dependencies then we have to
> @@ -5160,6 +5219,11 @@ static inline void finish_lock_switch(struct rq *rq)
>          */
>         spin_acquire(&__rq_lockp(rq)->dep_map, 0, 0, _THIS_IP_);
>         __balance_callbacks(rq);
> +       /*
> +        * Request freq update after __balance_callbacks to take into account
> +        * any changes to rq.
> +        */
> +       update_cpufreq_ctx_switch(rq, prev);
>         raw_spin_rq_unlock_irq(rq);
>  }
>
> @@ -5278,7 +5342,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
>         perf_event_task_sched_in(prev, current);
>         finish_task(prev);
>         tick_nohz_task_switch();
> -       finish_lock_switch(rq);
> +       finish_lock_switch(rq, prev);
>         finish_arch_post_lock_switch();
>         kcov_finish_switch(current);
>         /*
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index eece6244f9d2..e8b65b75e7f3 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -59,7 +59,8 @@ static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu);
>
>  /************************ Governor internals ***********************/
>
> -static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
> +static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time,
> +                                    unsigned int flags)
>  {
>         s64 delta_ns;
>
> @@ -87,13 +88,16 @@ static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
>                 return true;
>         }
>
> +       if (unlikely(flags & SCHED_CPUFREQ_FORCE_UPDATE))
> +               return true;
> +
>         delta_ns = time - sg_policy->last_freq_update_time;
>
>         return delta_ns >= sg_policy->freq_update_delay_ns;
>  }
>
>  static bool sugov_update_next_freq(struct sugov_policy *sg_policy, u64 time,
> -                                  unsigned int next_freq)
> +                                  unsigned int next_freq, unsigned int flags)
>  {
>         if (sg_policy->need_freq_update)
>                 sg_policy->need_freq_update = cpufreq_driver_test_flags(CPUFREQ_NEED_UPDATE_LIMITS);
> @@ -101,7 +105,9 @@ static bool sugov_update_next_freq(struct sugov_policy *sg_policy, u64 time,
>                 return false;
>
>         sg_policy->next_freq = next_freq;
> -       sg_policy->last_freq_update_time = time;
> +
> +       if (!unlikely(flags & SCHED_CPUFREQ_FORCE_UPDATE))
> +               sg_policy->last_freq_update_time = time;
>
>         return true;
>  }
> @@ -249,9 +255,10 @@ static void sugov_iowait_boost(struct sugov_cpu *sg_cpu, u64 time,
>                                unsigned int flags)
>  {
>         bool set_iowait_boost = flags & SCHED_CPUFREQ_IOWAIT;
> +       bool forced_update = flags & SCHED_CPUFREQ_FORCE_UPDATE;
>
>         /* Reset boost if the CPU appears to have been idle enough */
> -       if (sg_cpu->iowait_boost &&
> +       if (sg_cpu->iowait_boost && !forced_update &&
>             sugov_iowait_reset(sg_cpu, time, set_iowait_boost))
>                 return;
>
> @@ -294,17 +301,34 @@ static void sugov_iowait_boost(struct sugov_cpu *sg_cpu, u64 time,
>   * being more conservative on tasks which does sporadic IO operations.
>   */
>  static unsigned long sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time,
> -                              unsigned long max_cap)
> +                              unsigned long max_cap, unsigned int flags)
>  {
> +       bool forced_update = flags & SCHED_CPUFREQ_FORCE_UPDATE;
> +       s64 delta_ns = time - sg_cpu->last_update;
> +
>         /* No boost currently required */
>         if (!sg_cpu->iowait_boost)
>                 return 0;
>
> +       if (forced_update)
> +               goto apply_boost;
> +
>         /* Reset boost if the CPU appears to have been idle enough */
>         if (sugov_iowait_reset(sg_cpu, time, false))
>                 return 0;
>
>         if (!sg_cpu->iowait_boost_pending) {
> +               /*
> +                * This logic relied on PELT signal decays happening once every
> +                * 1ms. But due to changes to how updates are done now, we can
> +                * end up with more request coming up leading to iowait boost
> +                * to be prematurely reduced. Make the assumption explicit
> +                * until we improve the iowait boost logic to be better in
> +                * general as it is due for an overhaul.
> +                */
> +               if (delta_ns <= NSEC_PER_MSEC)
> +                       goto apply_boost;
> +
>                 /*
>                  * No boost pending; reduce the boost value.
>                  */
> @@ -315,6 +339,7 @@ static unsigned long sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time,
>                 }
>         }
>
> +apply_boost:
>         sg_cpu->iowait_boost_pending = false;
>
>         /*
> @@ -358,10 +383,10 @@ static inline bool sugov_update_single_common(struct sugov_cpu *sg_cpu,
>
>         ignore_dl_rate_limit(sg_cpu);
>
> -       if (!sugov_should_update_freq(sg_cpu->sg_policy, time))
> +       if (!sugov_should_update_freq(sg_cpu->sg_policy, time, flags))
>                 return false;
>
> -       boost = sugov_iowait_apply(sg_cpu, time, max_cap);
> +       boost = sugov_iowait_apply(sg_cpu, time, max_cap, flags);
>         sugov_get_util(sg_cpu, boost);
>
>         return true;
> @@ -397,7 +422,7 @@ static void sugov_update_single_freq(struct update_util_data *hook, u64 time,
>                 sg_policy->cached_raw_freq = cached_freq;
>         }
>
> -       if (!sugov_update_next_freq(sg_policy, time, next_f))
> +       if (!sugov_update_next_freq(sg_policy, time, next_f, flags))
>                 return;
>
>         /*
> @@ -449,10 +474,12 @@ static void sugov_update_single_perf(struct update_util_data *hook, u64 time,
>         cpufreq_driver_adjust_perf(sg_cpu->cpu, sg_cpu->bw_min,
>                                    sg_cpu->util, max_cap);
>
> -       sg_cpu->sg_policy->last_freq_update_time = time;
> +       if (!unlikely(flags & SCHED_CPUFREQ_FORCE_UPDATE))
> +               sg_cpu->sg_policy->last_freq_update_time = time;
>  }
>
> -static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
> +static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time,
> +                                          unsigned int flags)
>  {
>         struct sugov_policy *sg_policy = sg_cpu->sg_policy;
>         struct cpufreq_policy *policy = sg_policy->policy;
> @@ -465,7 +492,7 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
>                 struct sugov_cpu *j_sg_cpu = &per_cpu(sugov_cpu, j);
>                 unsigned long boost;
>
> -               boost = sugov_iowait_apply(j_sg_cpu, time, max_cap);
> +               boost = sugov_iowait_apply(j_sg_cpu, time, max_cap, flags);
>                 sugov_get_util(j_sg_cpu, boost);
>
>                 util = max(j_sg_cpu->util, util);
> @@ -488,10 +515,10 @@ sugov_update_shared(struct update_util_data *hook, u64 time, unsigned int flags)
>
>         ignore_dl_rate_limit(sg_cpu);
>
> -       if (sugov_should_update_freq(sg_policy, time)) {
> -               next_f = sugov_next_freq_shared(sg_cpu, time);
> +       if (sugov_should_update_freq(sg_policy, time, flags)) {
> +               next_f = sugov_next_freq_shared(sg_cpu, time, flags);
>
> -               if (!sugov_update_next_freq(sg_policy, time, next_f))
> +               if (!sugov_update_next_freq(sg_policy, time, next_f, flags))
>                         goto unlock;
>
>                 if (sg_policy->policy->fast_switch_enabled)
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index a04a436af8cc..02c9c2488091 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -252,8 +252,6 @@ void __add_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
>         dl_rq->running_bw += dl_bw;
>         SCHED_WARN_ON(dl_rq->running_bw < old); /* overflow */
>         SCHED_WARN_ON(dl_rq->running_bw > dl_rq->this_bw);
> -       /* kick cpufreq (see the comment in kernel/sched/sched.h). */
> -       cpufreq_update_util(rq_of_dl_rq(dl_rq), 0);
>  }
>
>  static inline
> @@ -266,8 +264,6 @@ void __sub_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
>         SCHED_WARN_ON(dl_rq->running_bw > old); /* underflow */
>         if (dl_rq->running_bw > old)
>                 dl_rq->running_bw = 0;
> -       /* kick cpufreq (see the comment in kernel/sched/sched.h). */
> -       cpufreq_update_util(rq_of_dl_rq(dl_rq), 0);
>  }
>
>  static inline
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 9eb63573110c..cbe79c8ac2ed 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3982,29 +3982,6 @@ static inline void update_cfs_group(struct sched_entity *se)
>  }
>  #endif /* CONFIG_FAIR_GROUP_SCHED */
>
> -static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq, int flags)
> -{
> -       struct rq *rq = rq_of(cfs_rq);
> -
> -       if (&rq->cfs == cfs_rq) {
> -               /*
> -                * There are a few boundary cases this might miss but it should
> -                * get called often enough that that should (hopefully) not be
> -                * a real problem.
> -                *
> -                * It will not get called when we go idle, because the idle
> -                * thread is a different class (!fair), nor will the utilization
> -                * number include things like RT tasks.
> -                *
> -                * As is, the util number is not freq-invariant (we'd have to
> -                * implement arch_scale_freq_capacity() for that).
> -                *
> -                * See cpu_util_cfs().
> -                */
> -               cpufreq_update_util(rq, flags);
> -       }
> -}
> -
>  #ifdef CONFIG_SMP
>  static inline bool load_avg_is_decayed(struct sched_avg *sa)
>  {
> @@ -4682,7 +4659,7 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>
>         add_tg_cfs_propagate(cfs_rq, se->avg.load_sum);
>
> -       cfs_rq_util_change(cfs_rq, 0);
> +       cpufreq_update_util(rq_of(cfs_rq), 0);
>
>         trace_pelt_cfs_tp(cfs_rq);
>  }
> @@ -4712,7 +4689,7 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>
>         add_tg_cfs_propagate(cfs_rq, -se->avg.load_sum);
>
> -       cfs_rq_util_change(cfs_rq, 0);
> +       cpufreq_update_util(rq_of(cfs_rq), 0);
>
>         trace_pelt_cfs_tp(cfs_rq);
>  }
> @@ -4729,7 +4706,6 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>  static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>  {
>         u64 now = cfs_rq_clock_pelt(cfs_rq);
> -       int decayed;
>
>         /*
>          * Track task load average for carrying it to new CPU after migrated, and
> @@ -4738,8 +4714,8 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>         if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD))
>                 __update_load_avg_se(now, cfs_rq, se);
>
> -       decayed  = update_cfs_rq_load_avg(now, cfs_rq);
> -       decayed |= propagate_entity_load_avg(se);
> +       cfs_rq->decayed  = update_cfs_rq_load_avg(now, cfs_rq);
> +       cfs_rq->decayed |= propagate_entity_load_avg(se);
>
>         if (!se->avg.last_update_time && (flags & DO_ATTACH)) {
>
> @@ -4760,11 +4736,8 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>                  */
>                 detach_entity_load_avg(cfs_rq, se);
>                 update_tg_load_avg(cfs_rq);
> -       } else if (decayed) {
> -               cfs_rq_util_change(cfs_rq, 0);
> -
> -               if (flags & UPDATE_TG)
> -                       update_tg_load_avg(cfs_rq);
> +       } else if (cfs_rq->decayed && (flags & UPDATE_TG)) {
> +               update_tg_load_avg(cfs_rq);
>         }
>  }
>
> @@ -5139,7 +5112,6 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
>
>  static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int not_used1)
>  {
> -       cfs_rq_util_change(cfs_rq, 0);
>  }
>
>  static inline void remove_entity_load_avg(struct sched_entity *se) {}
> @@ -6754,14 +6726,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>          */
>         util_est_enqueue(&rq->cfs, p);
>
> -       /*
> -        * If in_iowait is set, the code below may not trigger any cpufreq
> -        * utilization updates, so do it here explicitly with the IOWAIT flag
> -        * passed.
> -        */
> -       if (p->in_iowait)
> -               cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT);
> -
>         for_each_sched_entity(se) {
>                 if (se->on_rq)
>                         break;
> @@ -9351,10 +9315,6 @@ static bool __update_blocked_others(struct rq *rq, bool *done)
>         unsigned long hw_pressure;
>         bool decayed;
>
> -       /*
> -        * update_load_avg() can call cpufreq_update_util(). Make sure that RT,
> -        * DL and IRQ signals have been updated before updating CFS.
> -        */
>         curr_class = rq->curr->sched_class;
>
>         hw_pressure = arch_scale_hw_pressure(cpu_of(rq));
> @@ -12685,6 +12645,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
>
>         update_misfit_status(curr, rq);
>         check_update_overutilized_status(task_rq(curr));
> +       cpufreq_update_util(rq, 0);
>
>         task_tick_core(rq, curr);
>  }
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index 3261b067b67e..fe6d8b0ffa95 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -556,11 +556,8 @@ static void sched_rt_rq_dequeue(struct rt_rq *rt_rq)
>
>         rt_se = rt_rq->tg->rt_se[cpu];
>
> -       if (!rt_se) {
> +       if (!rt_se)
>                 dequeue_top_rt_rq(rt_rq, rt_rq->rt_nr_running);
> -               /* Kick cpufreq (see the comment in kernel/sched/sched.h). */
> -               cpufreq_update_util(rq_of_rt_rq(rt_rq), 0);
> -       }
>         else if (on_rt_rq(rt_se))
>                 dequeue_rt_entity(rt_se, 0);
>  }
> @@ -1065,9 +1062,6 @@ enqueue_top_rt_rq(struct rt_rq *rt_rq)
>                 add_nr_running(rq, rt_rq->rt_nr_running);
>                 rt_rq->rt_queued = 1;
>         }
> -
> -       /* Kick cpufreq (see the comment in kernel/sched/sched.h). */
> -       cpufreq_update_util(rq, 0);
>  }
>
>  #if defined CONFIG_SMP
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index cb3792c04eea..86cec2145221 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -632,6 +632,11 @@ struct cfs_rq {
>                 unsigned long   runnable_avg;
>         } removed;
>
> +       /*
> +        * Store whether last update_load_avg() has decayed
> +        */
> +       bool                    decayed;
> +
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>         u64                     last_update_tg_load_avg;
>         unsigned long           tg_load_avg_contrib;
> --
> 2.34.1
>
Vincent Guittot May 7, 2024, 10:21 a.m. UTC | #5
On Tue, 7 May 2024 at 10:58, Vincent Guittot <vincent.guittot@linaro.org> wrote:
>
> On Mon, 6 May 2024 at 01:31, Qais Yousef <qyousef@layalina.io> wrote:
> >
> > Improve the interaction with cpufreq governors by making the
> > cpufreq_update_util() calls more intentional.
> >
> > At the moment we send them when load is updated for CFS, bandwidth for
> > DL and at enqueue/dequeue for RT. But this can lead to too many updates
> > sent in a short period of time and potentially be ignored at a critical
> > moment due to the rate_limit_us in schedutil.
> >
> > For example, simultaneous task enqueue on the CPU where 2nd task is
> > bigger and requires higher freq. The trigger to cpufreq_update_util() by
> > the first task will lead to dropping the 2nd request until tick. Or
> > another CPU in the same policy triggers a freq update shortly after.
> >
> > Updates at enqueue for RT are not strictly required. Though they do help
> > to reduce the delay for switching the frequency and the potential
> > observation of lower frequency during this delay. But current logic
> > doesn't intentionally (at least to my understanding) try to speed up the
> > request.
> >
> > To help reduce the amount of cpufreq updates and make them more
> > purposeful, consolidate them into these locations:
> >
> > 1. context_switch()
>
> I don't see any cpufreq update when switching from idle to CFS. We
> have to wait for the next tick to get a freq update whatever the value
> of util_est and uclamp

This seems to happen when the tick is not stopped

>
> > 2. task_tick_fair()
>
> Updating only during tick is ok with a tick at 1000hz/1000us when we
> compare it with the1048us slice of pelt but what about 4ms or even
> 10ms tick ? we can have an increase of almost 200 in 10ms
>
> > 3. {attach, detach}_entity_load_avg()
>
> At enqueue/dequeue, the util_est will be updated and can make cpu
> utilization quite different especially with long sleeping tasks. The
> same applies for uclamp_min/max hints of a newly enqueued task. We
> might end up waiting 4/10ms depending of the tick period.
>
> > 4. update_blocked_averages()
> >
> > The update at context switch should help guarantee that DL and RT get
> > the right frequency straightaway when they're RUNNING. As mentioned
> > though the update will happen slightly after enqueue_task(); though in
> > an ideal world these tasks should be RUNNING ASAP and this additional
> > delay should be negligible. For fair tasks we need to make sure we send
> > a single update for every decay for the root cfs_rq. Any changes to the
> > rq will be deferred until the next task is ready to run, or we hit TICK.
> > But we are guaranteed the task is running at a level that meets its
> > requirements after enqueue.
> >
> > To guarantee RT and DL tasks updates are never missed, we add a new
> > SCHED_CPUFREQ_FORCE_UPDATE to ignore the rate_limit_us. If we are
> > already running at the right freq, the governor will end up doing
> > nothing, but we eliminate the risk of the task ending up accidentally
> > running at the wrong freq due to rate_limit_us.
> >
> > Similarly for iowait boost, we ignore rate limits. We also handle a case
> > of a boost reset prematurely by adding a guard in sugov_iowait_apply()
> > to reduce the boost after 1ms which seems iowait boost mechanism relied
> > on rate_limit_us and cfs_rq.decay preventing any updates to happen soon
> > after iowait boost.
> >
> > The new SCHED_CPUFREQ_FORCE_UPDATE should not impact the rate limit
> > time stamps otherwise we can end up delaying updates for normal
> > requests.
> >
> > As a simple optimization, we avoid sending cpufreq updates when
> > switching from RT to another RT as RT tasks run at max freq by default.
> > If CONFIG_UCLAMP_TASK is enabled, we can do a simple check to see if
> > uclamp_min is different to avoid unnecessary cpufreq update as most RT
> > tasks are likely to be running at the same performance level, so we can
> > avoid unnecessary overhead of forced updates when there's nothing to do.
> >
> > We also also ensure to ignore cpufreq udpates for sugov workers at
> > context switch. It doesn't make sense for the kworker that applies the
> > frequency update (which is a DL task) to trigger a frequency update
> > itself.
> >
> > The update at task_tick_fair will guarantee that the governor will
> > follow any updates to load for tasks/CPU or due to new enqueues/dequeues
> > to the rq. Since DL and RT always run at constant frequencies and have
> > no load tracking, this is only required for fair tasks.
> >
> > The update at attach/detach_entity_load_avg() will ensure we adapt to
> > big changes when tasks are added/removed from cgroups.
> >
> > The update at update_blocked_averages() will ensure we decay frequency
> > as the CPU becomes idle for long enough.
> >
> > Results of
> >
> >         taskset 1 perf stat --repeat 10 -e cycles,instructions,task-clock perf bench sched pipe
> >
> > on AMD 3900X to verify any potential overhead because of the addition at
> > context switch against v6.8.7 stable kernel
> >
> > v6.8.7: schedutil:
> > ------------------
> >
> >  Performance counter stats for 'perf bench sched pipe' (10 runs):
> >
> >        850,276,689      cycles:u                  #    0.078 GHz                      ( +-  0.88% )
> >         82,724,245      instructions:u            #    0.10  insn per cycle           ( +-  0.00% )
> >          10,881.41 msec task-clock:u              #    0.995 CPUs utilized            ( +-  0.12% )
> >
> >            10.9377 +- 0.0135 seconds time elapsed  ( +-  0.12% )
> >
> > v6.8.7: performance:
> > --------------------
> >
> >  Performance counter stats for 'perf bench sched pipe' (10 runs):
> >
> >        874,154,415      cycles:u                  #    0.080 GHz                      ( +-  0.78% )
> >         82,724,420      instructions:u            #    0.10  insn per cycle           ( +-  0.00% )
> >          10,916.47 msec task-clock:u              #    0.999 CPUs utilized            ( +-  0.09% )
> >
> >            10.9308 +- 0.0100 seconds time elapsed  ( +-  0.09% )
> >
> > v6.8.7+patch: schedutil:
> > ------------------------
> >
> >  Performance counter stats for 'perf bench sched pipe' (10 runs):
> >
> >        816,938,281      cycles:u                  #    0.075 GHz                      ( +-  0.84% )
> >         82,724,163      instructions:u            #    0.10  insn per cycle           ( +-  0.00% )
> >          10,907.62 msec task-clock:u              #    1.004 CPUs utilized            ( +-  0.11% )
> >
> >            10.8627 +- 0.0121 seconds time elapsed  ( +-  0.11% )
> >
> > v6.8.7+patch: performance:
> > --------------------------
> >
> >  Performance counter stats for 'perf bench sched pipe' (10 runs):
> >
> >        814,038,416      cycles:u                  #    0.074 GHz                      ( +-  1.21% )
> >         82,724,356      instructions:u            #    0.10  insn per cycle           ( +-  0.00% )
> >          10,886.69 msec task-clock:u              #    0.996 CPUs utilized            ( +-  0.17% )
> >
> >            10.9298 +- 0.0181 seconds time elapsed  ( +-  0.17% )
> >
> > Note worthy that we still have the following race condition on systems
> > that have shared policy:
> >
> > * CPUs with shared policy can end up sending simultaneous cpufreq
> >   updates requests where the 2nd one will be unlucky and get blocked by
> >   the rate_limit_us (schedutil).
> >
> > We can potentially address this limitation later, but it is out of the
> > scope of this patch.
> >
> > Signed-off-by: Qais Yousef <qyousef@layalina.io>
> > ---
> >
> > Changes since v1:
> >
> >         * Use taskset and measure with performance governor as Ingo suggested
> >         * Remove the static key as I found out we always register a function
> >           for cpu_dbs in cpufreq_governor.c; and as Christian pointed out it
> >           trigger a lock debug warning.
> >         * Improve detection of sugov workers by using SCHED_FLAG_SUGOV
> >         * Guard against NSEC_PER_MSEC instead of TICK_USEC to avoid prematurely
> >           reducing iowait boost as the latter was a NOP and like
> >           sugov_iowait_reset() like Christian pointed out.
> >
> > v1 discussion: https://lore.kernel.org/all/20240324020139.1032473-1-qyousef@layalina.io/
> >
> >  include/linux/sched/cpufreq.h    |  3 +-
> >  kernel/sched/core.c              | 68 +++++++++++++++++++++++++++++++-
> >  kernel/sched/cpufreq_schedutil.c | 55 +++++++++++++++++++-------
> >  kernel/sched/deadline.c          |  4 --
> >  kernel/sched/fair.c              | 53 ++++---------------------
> >  kernel/sched/rt.c                |  8 +---
> >  kernel/sched/sched.h             |  5 +++
> >  7 files changed, 122 insertions(+), 74 deletions(-)
> >
> > diff --git a/include/linux/sched/cpufreq.h b/include/linux/sched/cpufreq.h
> > index bdd31ab93bc5..2d0a45aba16f 100644
> > --- a/include/linux/sched/cpufreq.h
> > +++ b/include/linux/sched/cpufreq.h
> > @@ -8,7 +8,8 @@
> >   * Interface between cpufreq drivers and the scheduler:
> >   */
> >
> > -#define SCHED_CPUFREQ_IOWAIT   (1U << 0)
> > +#define SCHED_CPUFREQ_IOWAIT           (1U << 0)
> > +#define SCHED_CPUFREQ_FORCE_UPDATE     (1U << 1) /* ignore transition_delay_us */
> >
> >  #ifdef CONFIG_CPU_FREQ
> >  struct cpufreq_policy;
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 1a914388144a..e6fe7dbd1f89 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -5134,6 +5134,65 @@ static inline void balance_callbacks(struct rq *rq, struct balance_callback *hea
> >
> >  #endif
> >
> > +static inline void update_cpufreq_ctx_switch(struct rq *rq, struct task_struct *prev)
> > +{
> > +#ifdef CONFIG_CPU_FREQ
> > +       unsigned int flags = 0;
> > +
> > +#ifdef CONFIG_SMP
> > +       if (unlikely(current->sched_class == &stop_sched_class))
> > +               return;
> > +#endif
> > +
> > +       if (unlikely(current->sched_class == &idle_sched_class))
> > +               return;
> > +
> > +       if (unlikely(task_has_idle_policy(current)))
> > +               return;
> > +
> > +       if (likely(fair_policy(current->policy))) {
> > +
> > +               if (unlikely(current->in_iowait)) {
> > +                       flags |= SCHED_CPUFREQ_IOWAIT | SCHED_CPUFREQ_FORCE_UPDATE;
> > +                       goto force_update;
> > +               }
> > +
> > +#ifdef CONFIG_SMP
> > +               /*
> > +                * Allow cpufreq updates once for every update_load_avg() decay.
> > +                */
> > +               if (unlikely(rq->cfs.decayed)) {
> > +                       rq->cfs.decayed = false;
> > +                       goto force_update;
> > +               }
> > +#endif
> > +               return;
> > +       }
> > +
> > +       /*
> > +        * RT and DL should always send a freq update. But we can do some
> > +        * simple checks to avoid it when we know it's not necessary.
> > +        */
> > +       if (rt_task(current) && rt_task(prev)) {
> > +#ifdef CONFIG_UCLAMP_TASK
> > +               unsigned long curr_uclamp_min = uclamp_eff_value(current, UCLAMP_MIN);
> > +               unsigned long prev_uclamp_min = uclamp_eff_value(prev, UCLAMP_MIN);
> > +
> > +               if (curr_uclamp_min == prev_uclamp_min)
> > +#endif
> > +                       return;
> > +       } else if (dl_task(current) && current->dl.flags & SCHED_FLAG_SUGOV) {
> > +               /* Ignore sugov kthreads, they're responding to our requests */
> > +               return;
> > +       }
> > +
> > +       flags |= SCHED_CPUFREQ_FORCE_UPDATE;
> > +
> > +force_update:
> > +       cpufreq_update_util(rq, flags);
> > +#endif
> > +}
> > +
> >  static inline void
> >  prepare_lock_switch(struct rq *rq, struct task_struct *next, struct rq_flags *rf)
> >  {
> > @@ -5151,7 +5210,7 @@ prepare_lock_switch(struct rq *rq, struct task_struct *next, struct rq_flags *rf
> >  #endif
> >  }
> >
> > -static inline void finish_lock_switch(struct rq *rq)
> > +static inline void finish_lock_switch(struct rq *rq, struct task_struct *prev)
> >  {
> >         /*
> >          * If we are tracking spinlock dependencies then we have to
> > @@ -5160,6 +5219,11 @@ static inline void finish_lock_switch(struct rq *rq)
> >          */
> >         spin_acquire(&__rq_lockp(rq)->dep_map, 0, 0, _THIS_IP_);
> >         __balance_callbacks(rq);
> > +       /*
> > +        * Request freq update after __balance_callbacks to take into account
> > +        * any changes to rq.
> > +        */
> > +       update_cpufreq_ctx_switch(rq, prev);
> >         raw_spin_rq_unlock_irq(rq);
> >  }
> >
> > @@ -5278,7 +5342,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
> >         perf_event_task_sched_in(prev, current);
> >         finish_task(prev);
> >         tick_nohz_task_switch();
> > -       finish_lock_switch(rq);
> > +       finish_lock_switch(rq, prev);
> >         finish_arch_post_lock_switch();
> >         kcov_finish_switch(current);
> >         /*
> > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > index eece6244f9d2..e8b65b75e7f3 100644
> > --- a/kernel/sched/cpufreq_schedutil.c
> > +++ b/kernel/sched/cpufreq_schedutil.c
> > @@ -59,7 +59,8 @@ static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu);
> >
> >  /************************ Governor internals ***********************/
> >
> > -static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
> > +static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time,
> > +                                    unsigned int flags)
> >  {
> >         s64 delta_ns;
> >
> > @@ -87,13 +88,16 @@ static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
> >                 return true;
> >         }
> >
> > +       if (unlikely(flags & SCHED_CPUFREQ_FORCE_UPDATE))
> > +               return true;
> > +
> >         delta_ns = time - sg_policy->last_freq_update_time;
> >
> >         return delta_ns >= sg_policy->freq_update_delay_ns;
> >  }
> >
> >  static bool sugov_update_next_freq(struct sugov_policy *sg_policy, u64 time,
> > -                                  unsigned int next_freq)
> > +                                  unsigned int next_freq, unsigned int flags)
> >  {
> >         if (sg_policy->need_freq_update)
> >                 sg_policy->need_freq_update = cpufreq_driver_test_flags(CPUFREQ_NEED_UPDATE_LIMITS);
> > @@ -101,7 +105,9 @@ static bool sugov_update_next_freq(struct sugov_policy *sg_policy, u64 time,
> >                 return false;
> >
> >         sg_policy->next_freq = next_freq;
> > -       sg_policy->last_freq_update_time = time;
> > +
> > +       if (!unlikely(flags & SCHED_CPUFREQ_FORCE_UPDATE))
> > +               sg_policy->last_freq_update_time = time;
> >
> >         return true;
> >  }
> > @@ -249,9 +255,10 @@ static void sugov_iowait_boost(struct sugov_cpu *sg_cpu, u64 time,
> >                                unsigned int flags)
> >  {
> >         bool set_iowait_boost = flags & SCHED_CPUFREQ_IOWAIT;
> > +       bool forced_update = flags & SCHED_CPUFREQ_FORCE_UPDATE;
> >
> >         /* Reset boost if the CPU appears to have been idle enough */
> > -       if (sg_cpu->iowait_boost &&
> > +       if (sg_cpu->iowait_boost && !forced_update &&
> >             sugov_iowait_reset(sg_cpu, time, set_iowait_boost))
> >                 return;
> >
> > @@ -294,17 +301,34 @@ static void sugov_iowait_boost(struct sugov_cpu *sg_cpu, u64 time,
> >   * being more conservative on tasks which does sporadic IO operations.
> >   */
> >  static unsigned long sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time,
> > -                              unsigned long max_cap)
> > +                              unsigned long max_cap, unsigned int flags)
> >  {
> > +       bool forced_update = flags & SCHED_CPUFREQ_FORCE_UPDATE;
> > +       s64 delta_ns = time - sg_cpu->last_update;
> > +
> >         /* No boost currently required */
> >         if (!sg_cpu->iowait_boost)
> >                 return 0;
> >
> > +       if (forced_update)
> > +               goto apply_boost;
> > +
> >         /* Reset boost if the CPU appears to have been idle enough */
> >         if (sugov_iowait_reset(sg_cpu, time, false))
> >                 return 0;
> >
> >         if (!sg_cpu->iowait_boost_pending) {
> > +               /*
> > +                * This logic relied on PELT signal decays happening once every
> > +                * 1ms. But due to changes to how updates are done now, we can
> > +                * end up with more request coming up leading to iowait boost
> > +                * to be prematurely reduced. Make the assumption explicit
> > +                * until we improve the iowait boost logic to be better in
> > +                * general as it is due for an overhaul.
> > +                */
> > +               if (delta_ns <= NSEC_PER_MSEC)
> > +                       goto apply_boost;
> > +
> >                 /*
> >                  * No boost pending; reduce the boost value.
> >                  */
> > @@ -315,6 +339,7 @@ static unsigned long sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time,
> >                 }
> >         }
> >
> > +apply_boost:
> >         sg_cpu->iowait_boost_pending = false;
> >
> >         /*
> > @@ -358,10 +383,10 @@ static inline bool sugov_update_single_common(struct sugov_cpu *sg_cpu,
> >
> >         ignore_dl_rate_limit(sg_cpu);
> >
> > -       if (!sugov_should_update_freq(sg_cpu->sg_policy, time))
> > +       if (!sugov_should_update_freq(sg_cpu->sg_policy, time, flags))
> >                 return false;
> >
> > -       boost = sugov_iowait_apply(sg_cpu, time, max_cap);
> > +       boost = sugov_iowait_apply(sg_cpu, time, max_cap, flags);
> >         sugov_get_util(sg_cpu, boost);
> >
> >         return true;
> > @@ -397,7 +422,7 @@ static void sugov_update_single_freq(struct update_util_data *hook, u64 time,
> >                 sg_policy->cached_raw_freq = cached_freq;
> >         }
> >
> > -       if (!sugov_update_next_freq(sg_policy, time, next_f))
> > +       if (!sugov_update_next_freq(sg_policy, time, next_f, flags))
> >                 return;
> >
> >         /*
> > @@ -449,10 +474,12 @@ static void sugov_update_single_perf(struct update_util_data *hook, u64 time,
> >         cpufreq_driver_adjust_perf(sg_cpu->cpu, sg_cpu->bw_min,
> >                                    sg_cpu->util, max_cap);
> >
> > -       sg_cpu->sg_policy->last_freq_update_time = time;
> > +       if (!unlikely(flags & SCHED_CPUFREQ_FORCE_UPDATE))
> > +               sg_cpu->sg_policy->last_freq_update_time = time;
> >  }
> >
> > -static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
> > +static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time,
> > +                                          unsigned int flags)
> >  {
> >         struct sugov_policy *sg_policy = sg_cpu->sg_policy;
> >         struct cpufreq_policy *policy = sg_policy->policy;
> > @@ -465,7 +492,7 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
> >                 struct sugov_cpu *j_sg_cpu = &per_cpu(sugov_cpu, j);
> >                 unsigned long boost;
> >
> > -               boost = sugov_iowait_apply(j_sg_cpu, time, max_cap);
> > +               boost = sugov_iowait_apply(j_sg_cpu, time, max_cap, flags);
> >                 sugov_get_util(j_sg_cpu, boost);
> >
> >                 util = max(j_sg_cpu->util, util);
> > @@ -488,10 +515,10 @@ sugov_update_shared(struct update_util_data *hook, u64 time, unsigned int flags)
> >
> >         ignore_dl_rate_limit(sg_cpu);
> >
> > -       if (sugov_should_update_freq(sg_policy, time)) {
> > -               next_f = sugov_next_freq_shared(sg_cpu, time);
> > +       if (sugov_should_update_freq(sg_policy, time, flags)) {
> > +               next_f = sugov_next_freq_shared(sg_cpu, time, flags);
> >
> > -               if (!sugov_update_next_freq(sg_policy, time, next_f))
> > +               if (!sugov_update_next_freq(sg_policy, time, next_f, flags))
> >                         goto unlock;
> >
> >                 if (sg_policy->policy->fast_switch_enabled)
> > diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> > index a04a436af8cc..02c9c2488091 100644
> > --- a/kernel/sched/deadline.c
> > +++ b/kernel/sched/deadline.c
> > @@ -252,8 +252,6 @@ void __add_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
> >         dl_rq->running_bw += dl_bw;
> >         SCHED_WARN_ON(dl_rq->running_bw < old); /* overflow */
> >         SCHED_WARN_ON(dl_rq->running_bw > dl_rq->this_bw);
> > -       /* kick cpufreq (see the comment in kernel/sched/sched.h). */
> > -       cpufreq_update_util(rq_of_dl_rq(dl_rq), 0);
> >  }
> >
> >  static inline
> > @@ -266,8 +264,6 @@ void __sub_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
> >         SCHED_WARN_ON(dl_rq->running_bw > old); /* underflow */
> >         if (dl_rq->running_bw > old)
> >                 dl_rq->running_bw = 0;
> > -       /* kick cpufreq (see the comment in kernel/sched/sched.h). */
> > -       cpufreq_update_util(rq_of_dl_rq(dl_rq), 0);
> >  }
> >
> >  static inline
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 9eb63573110c..cbe79c8ac2ed 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -3982,29 +3982,6 @@ static inline void update_cfs_group(struct sched_entity *se)
> >  }
> >  #endif /* CONFIG_FAIR_GROUP_SCHED */
> >
> > -static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq, int flags)
> > -{
> > -       struct rq *rq = rq_of(cfs_rq);
> > -
> > -       if (&rq->cfs == cfs_rq) {
> > -               /*
> > -                * There are a few boundary cases this might miss but it should
> > -                * get called often enough that that should (hopefully) not be
> > -                * a real problem.
> > -                *
> > -                * It will not get called when we go idle, because the idle
> > -                * thread is a different class (!fair), nor will the utilization
> > -                * number include things like RT tasks.
> > -                *
> > -                * As is, the util number is not freq-invariant (we'd have to
> > -                * implement arch_scale_freq_capacity() for that).
> > -                *
> > -                * See cpu_util_cfs().
> > -                */
> > -               cpufreq_update_util(rq, flags);
> > -       }
> > -}
> > -
> >  #ifdef CONFIG_SMP
> >  static inline bool load_avg_is_decayed(struct sched_avg *sa)
> >  {
> > @@ -4682,7 +4659,7 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> >
> >         add_tg_cfs_propagate(cfs_rq, se->avg.load_sum);
> >
> > -       cfs_rq_util_change(cfs_rq, 0);
> > +       cpufreq_update_util(rq_of(cfs_rq), 0);
> >
> >         trace_pelt_cfs_tp(cfs_rq);
> >  }
> > @@ -4712,7 +4689,7 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> >
> >         add_tg_cfs_propagate(cfs_rq, -se->avg.load_sum);
> >
> > -       cfs_rq_util_change(cfs_rq, 0);
> > +       cpufreq_update_util(rq_of(cfs_rq), 0);
> >
> >         trace_pelt_cfs_tp(cfs_rq);
> >  }
> > @@ -4729,7 +4706,6 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> >  static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> >  {
> >         u64 now = cfs_rq_clock_pelt(cfs_rq);
> > -       int decayed;
> >
> >         /*
> >          * Track task load average for carrying it to new CPU after migrated, and
> > @@ -4738,8 +4714,8 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> >         if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD))
> >                 __update_load_avg_se(now, cfs_rq, se);
> >
> > -       decayed  = update_cfs_rq_load_avg(now, cfs_rq);
> > -       decayed |= propagate_entity_load_avg(se);
> > +       cfs_rq->decayed  = update_cfs_rq_load_avg(now, cfs_rq);
> > +       cfs_rq->decayed |= propagate_entity_load_avg(se);
> >
> >         if (!se->avg.last_update_time && (flags & DO_ATTACH)) {
> >
> > @@ -4760,11 +4736,8 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> >                  */
> >                 detach_entity_load_avg(cfs_rq, se);
> >                 update_tg_load_avg(cfs_rq);
> > -       } else if (decayed) {
> > -               cfs_rq_util_change(cfs_rq, 0);
> > -
> > -               if (flags & UPDATE_TG)
> > -                       update_tg_load_avg(cfs_rq);
> > +       } else if (cfs_rq->decayed && (flags & UPDATE_TG)) {
> > +               update_tg_load_avg(cfs_rq);
> >         }
> >  }
> >
> > @@ -5139,7 +5112,6 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
> >
> >  static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int not_used1)
> >  {
> > -       cfs_rq_util_change(cfs_rq, 0);
> >  }
> >
> >  static inline void remove_entity_load_avg(struct sched_entity *se) {}
> > @@ -6754,14 +6726,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >          */
> >         util_est_enqueue(&rq->cfs, p);
> >
> > -       /*
> > -        * If in_iowait is set, the code below may not trigger any cpufreq
> > -        * utilization updates, so do it here explicitly with the IOWAIT flag
> > -        * passed.
> > -        */
> > -       if (p->in_iowait)
> > -               cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT);
> > -
> >         for_each_sched_entity(se) {
> >                 if (se->on_rq)
> >                         break;
> > @@ -9351,10 +9315,6 @@ static bool __update_blocked_others(struct rq *rq, bool *done)
> >         unsigned long hw_pressure;
> >         bool decayed;
> >
> > -       /*
> > -        * update_load_avg() can call cpufreq_update_util(). Make sure that RT,
> > -        * DL and IRQ signals have been updated before updating CFS.
> > -        */
> >         curr_class = rq->curr->sched_class;
> >
> >         hw_pressure = arch_scale_hw_pressure(cpu_of(rq));
> > @@ -12685,6 +12645,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
> >
> >         update_misfit_status(curr, rq);
> >         check_update_overutilized_status(task_rq(curr));
> > +       cpufreq_update_util(rq, 0);
> >
> >         task_tick_core(rq, curr);
> >  }
> > diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> > index 3261b067b67e..fe6d8b0ffa95 100644
> > --- a/kernel/sched/rt.c
> > +++ b/kernel/sched/rt.c
> > @@ -556,11 +556,8 @@ static void sched_rt_rq_dequeue(struct rt_rq *rt_rq)
> >
> >         rt_se = rt_rq->tg->rt_se[cpu];
> >
> > -       if (!rt_se) {
> > +       if (!rt_se)
> >                 dequeue_top_rt_rq(rt_rq, rt_rq->rt_nr_running);
> > -               /* Kick cpufreq (see the comment in kernel/sched/sched.h). */
> > -               cpufreq_update_util(rq_of_rt_rq(rt_rq), 0);
> > -       }
> >         else if (on_rt_rq(rt_se))
> >                 dequeue_rt_entity(rt_se, 0);
> >  }
> > @@ -1065,9 +1062,6 @@ enqueue_top_rt_rq(struct rt_rq *rt_rq)
> >                 add_nr_running(rq, rt_rq->rt_nr_running);
> >                 rt_rq->rt_queued = 1;
> >         }
> > -
> > -       /* Kick cpufreq (see the comment in kernel/sched/sched.h). */
> > -       cpufreq_update_util(rq, 0);
> >  }
> >
> >  #if defined CONFIG_SMP
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index cb3792c04eea..86cec2145221 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -632,6 +632,11 @@ struct cfs_rq {
> >                 unsigned long   runnable_avg;
> >         } removed;
> >
> > +       /*
> > +        * Store whether last update_load_avg() has decayed
> > +        */
> > +       bool                    decayed;
> > +
> >  #ifdef CONFIG_FAIR_GROUP_SCHED
> >         u64                     last_update_tg_load_avg;
> >         unsigned long           tg_load_avg_contrib;
> > --
> > 2.34.1
> >
Qais Yousef May 7, 2024, 10:42 a.m. UTC | #6
On 05/07/24 10:02, Peter Zijlstra wrote:
> On Tue, May 07, 2024 at 01:56:59AM +0100, Qais Yousef wrote:
> 
> > Yes. How about this? Since stopper class appears as RT, we should still check
> > for this class specifically.
> 
> Much nicer!
> 
> > static inline void update_cpufreq_ctx_switch(struct rq *rq, struct task_struct *prev)
> > {
> > #ifdef CONFIG_CPU_FREQ
> > 	if (likely(fair_policy(current->policy))) {
> > 
> > 		if (unlikely(current->in_iowait)) {
> > 			cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT | SCHED_CPUFREQ_FORCE_UPDATE);
> > 			return;
> > 		}
> > 
> > #ifdef CONFIG_SMP
> > 		/*
> > 		 * Allow cpufreq updates once for every update_load_avg() decay.
> > 		 */
> > 		if (unlikely(rq->cfs.decayed)) {
> > 			rq->cfs.decayed = false;
> > 			cpufreq_update_util(rq, 0);
> > 			return;
> > 		}
> > #endif
> > 		return;
> > 	}
> > 
> > 	/*
> > 	 * RT and DL should always send a freq update. But we can do some
> > 	 * simple checks to avoid it when we know it's not necessary.
> > 	 */
> > 	if (task_is_realtime(current)) {
> > 		if (dl_task(current) && current->dl.flags & SCHED_FLAG_SUGOV) {
> > 			/* Ignore sugov kthreads, they're responding to our requests */
> > 			return;
> > 		}
> > 
> > 		if (rt_task(current) && rt_task(prev)) {
> 
> doesn't task_is_realtime() impy rt_task() ?
> 
> Also, this clause still includes DL tasks, is that okay?

Ugh, yes. The earlier check for dl_task() is not good enough. I should send
a patch to fix the definition of rt_task()!

I think at this stage open coding the policy check with a switch statement
is the best thing to do

static inline void update_cpufreq_ctx_switch(struct rq *rq, struct task_struct *prev)
{
#ifdef CONFIG_CPU_FREQ
	/*
	 * RT and DL should always send a freq update. But we can do some
	 * simple checks to avoid it when we know it's not necessary.
	 *
	 * iowait_boost will always trigger a freq update too.
	 *
	 * Fair tasks will only trigger an update if the root cfs_rq has
	 * decayed.
	 *
	 * Everything else should do nothing.
	 */
	switch (current->policy) {
	case SCHED_NORMAL:
	case SCHED_BATCH:
		if (unlikely(current->in_iowait)) {
			cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT | SCHED_CPUFREQ_FORCE_UPDATE);
			return;
		}

#ifdef CONFIG_SMP
		if (unlikely(rq->cfs.decayed)) {
			rq->cfs.decayed = false;
			cpufreq_update_util(rq, 0);
			return;
		}
#endif
		return;
	case SCHED_FIFO:
	case SCHED_RR:
		if (rt_policy(prev)) {
#ifdef CONFIG_UCLAMP_TASK
			unsigned long curr_uclamp_min = uclamp_eff_value(current, UCLAMP_MIN);
			unsigned long prev_uclamp_min = uclamp_eff_value(prev, UCLAMP_MIN);

			if (curr_uclamp_min == prev_uclamp_min)
#endif
				return;
		}
#ifdef CONFIG_SMP
		/* Stopper task masquerades as RT */
		if (unlikely(current->sched_class == &stop_sched_class))
			return;
#endif
		cpufreq_update_util(rq, SCHED_CPUFREQ_FORCE_UPDATE);
		return;
	case SCHED_DEADLINE:
		if (current->dl.flags & SCHED_FLAG_SUGOV) {
			/* Ignore sugov kthreads, they're responding to our requests */
			return;
		}
		cpufreq_update_util(rq, SCHED_CPUFREQ_FORCE_UPDATE);
		return;
	default:
		return;
	}
#endif
}
Qais Yousef May 7, 2024, 11:08 a.m. UTC | #7
On 05/07/24 10:58, Vincent Guittot wrote:
> On Mon, 6 May 2024 at 01:31, Qais Yousef <qyousef@layalina.io> wrote:
> >
> > Improve the interaction with cpufreq governors by making the
> > cpufreq_update_util() calls more intentional.
> >
> > At the moment we send them when load is updated for CFS, bandwidth for
> > DL and at enqueue/dequeue for RT. But this can lead to too many updates
> > sent in a short period of time and potentially be ignored at a critical
> > moment due to the rate_limit_us in schedutil.
> >
> > For example, simultaneous task enqueue on the CPU where 2nd task is
> > bigger and requires higher freq. The trigger to cpufreq_update_util() by
> > the first task will lead to dropping the 2nd request until tick. Or
> > another CPU in the same policy triggers a freq update shortly after.
> >
> > Updates at enqueue for RT are not strictly required. Though they do help
> > to reduce the delay for switching the frequency and the potential
> > observation of lower frequency during this delay. But current logic
> > doesn't intentionally (at least to my understanding) try to speed up the
> > request.
> >
> > To help reduce the amount of cpufreq updates and make them more
> > purposeful, consolidate them into these locations:
> >
> > 1. context_switch()
> 
> I don't see any cpufreq update when switching from idle to CFS. We

You mean SCHED_IDLE to SCHED_NORMAL, right? Yes, if we switch policies even
from fair to RT an update could be missed.

I'll need to think more about it, but I think adding an update when we switch
policies in the syscall looks sufficient to me, if the task is on rq already.
Agreed?

> have to wait for the next tick to get a freq update whatever the value
> of util_est and uclamp
> 
> > 2. task_tick_fair()
> 
> Updating only during tick is ok with a tick at 1000hz/1000us when we
> compare it with the1048us slice of pelt but what about 4ms or even
> 10ms tick ? we can have an increase of almost 200 in 10ms

IMHO the current code can still fail with these setups to update frequencies in
time. If there's a single task on the rq, then the only freq update will happen
at tick. So this is an existing problem.

The way I see it is that setting such high TICK values implies low
responsiveness by definition. So the person who selects this setup needs to
cater that their worst case scenario is that and be happy with it. And this
worst case scenario does not change.

That said, the right way to cater for this is via my other series to remove the
magic margins. DVFS headroom should rely on TICK value to ensure we run at
adequate frequency until the next worst case scenario update, which relies on
TICK. Which is sufficient to handle util_est changes. See below for uclamp.

Wake up preemption should cause context switches to happen sooner than a tick
too as we add more tasks on the rq. So I think the worst case scenario is not
really changing that much. In my view, it's better to be consistent about the
behavior.

> 
> > 3. {attach, detach}_entity_load_avg()
> 
> At enqueue/dequeue, the util_est will be updated and can make cpu
> utilization quite different especially with long sleeping tasks. The
> same applies for uclamp_min/max hints of a newly enqueued task. We
> might end up waiting 4/10ms depending of the tick period.

uclamp_min is a property of the task. And waiting for the task that needs the
boost to run is fine IMHO. And I am actually hoping to remove uclamp max()
aggregation in favour of applying boosts/caps when tasks are RUNNING. But more
things need to be improved first.

We are missing a freq update when uclamp values change by the way. This is
a known bug and I keep forgetting to post a patch to fix it. Let me do this
along update freq when policy changes.

Thanks!
Vincent Guittot May 7, 2024, 12:53 p.m. UTC | #8
On Tue, 7 May 2024 at 13:08, Qais Yousef <qyousef@layalina.io> wrote:
>
> On 05/07/24 10:58, Vincent Guittot wrote:
> > On Mon, 6 May 2024 at 01:31, Qais Yousef <qyousef@layalina.io> wrote:
> > >
> > > Improve the interaction with cpufreq governors by making the
> > > cpufreq_update_util() calls more intentional.
> > >
> > > At the moment we send them when load is updated for CFS, bandwidth for
> > > DL and at enqueue/dequeue for RT. But this can lead to too many updates
> > > sent in a short period of time and potentially be ignored at a critical
> > > moment due to the rate_limit_us in schedutil.
> > >
> > > For example, simultaneous task enqueue on the CPU where 2nd task is
> > > bigger and requires higher freq. The trigger to cpufreq_update_util() by
> > > the first task will lead to dropping the 2nd request until tick. Or
> > > another CPU in the same policy triggers a freq update shortly after.
> > >
> > > Updates at enqueue for RT are not strictly required. Though they do help
> > > to reduce the delay for switching the frequency and the potential
> > > observation of lower frequency during this delay. But current logic
> > > doesn't intentionally (at least to my understanding) try to speed up the
> > > request.
> > >
> > > To help reduce the amount of cpufreq updates and make them more
> > > purposeful, consolidate them into these locations:
> > >
> > > 1. context_switch()
> >
> > I don't see any cpufreq update when switching from idle to CFS. We
>
> You mean SCHED_IDLE to SCHED_NORMAL, right? Yes, if we switch policies even
> from fair to RT an update could be missed.

No I mean going out of idle. On an idle cpu, nothing happens at CFS
task wakeup and we have to wait for the next tick to apply the new
freq. This happens for both short task with uclamp min or long
running/sleeping task (i.e. with high util_est)

>
> I'll need to think more about it, but I think adding an update when we switch
> policies in the syscall looks sufficient to me, if the task is on rq already.
> Agreed?
>
> > have to wait for the next tick to get a freq update whatever the value
> > of util_est and uclamp
> >
> > > 2. task_tick_fair()
> >
> > Updating only during tick is ok with a tick at 1000hz/1000us when we
> > compare it with the1048us slice of pelt but what about 4ms or even
> > 10ms tick ? we can have an increase of almost 200 in 10ms
>
> IMHO the current code can still fail with these setups to update frequencies in
> time. If there's a single task on the rq, then the only freq update will happen
> at tick. So this is an existing problem.

But any newly enqueued task can trigger a freq update without waiting
1/4/10ms whereas we need to wait for next tick with this patch

>
> The way I see it is that setting such high TICK values implies low
> responsiveness by definition. So the person who selects this setup needs to
> cater that their worst case scenario is that and be happy with it. And this
> worst case scenario does not change.
>
> That said, the right way to cater for this is via my other series to remove the
> magic margins. DVFS headroom should rely on TICK value to ensure we run at
> adequate frequency until the next worst case scenario update, which relies on
> TICK. Which is sufficient to handle util_est changes. See below for uclamp.
>
> Wake up preemption should cause context switches to happen sooner than a tick
> too as we add more tasks on the rq. So I think the worst case scenario is not
> really changing that much. In my view, it's better to be consistent about the
> behavior.
>
> >
> > > 3. {attach, detach}_entity_load_avg()
> >
> > At enqueue/dequeue, the util_est will be updated and can make cpu
> > utilization quite different especially with long sleeping tasks. The
> > same applies for uclamp_min/max hints of a newly enqueued task. We
> > might end up waiting 4/10ms depending of the tick period.
>
> uclamp_min is a property of the task. And waiting for the task that needs the
> boost to run is fine IMHO. And I am actually hoping to remove uclamp max()

But you will delay all CPU work and the running time fo the task

And what about util_est ?

> aggregation in favour of applying boosts/caps when tasks are RUNNING. But more
> things need to be improved first.
>
> We are missing a freq update when uclamp values change by the way. This is
> a known bug and I keep forgetting to post a patch to fix it. Let me do this
> along update freq when policy changes.
>
> Thanks!
Qais Yousef May 9, 2024, 12:40 p.m. UTC | #9
On 05/07/24 14:53, Vincent Guittot wrote:
> On Tue, 7 May 2024 at 13:08, Qais Yousef <qyousef@layalina.io> wrote:
> >
> > On 05/07/24 10:58, Vincent Guittot wrote:
> > > On Mon, 6 May 2024 at 01:31, Qais Yousef <qyousef@layalina.io> wrote:
> > > >
> > > > Improve the interaction with cpufreq governors by making the
> > > > cpufreq_update_util() calls more intentional.
> > > >
> > > > At the moment we send them when load is updated for CFS, bandwidth for
> > > > DL and at enqueue/dequeue for RT. But this can lead to too many updates
> > > > sent in a short period of time and potentially be ignored at a critical
> > > > moment due to the rate_limit_us in schedutil.
> > > >
> > > > For example, simultaneous task enqueue on the CPU where 2nd task is
> > > > bigger and requires higher freq. The trigger to cpufreq_update_util() by
> > > > the first task will lead to dropping the 2nd request until tick. Or
> > > > another CPU in the same policy triggers a freq update shortly after.
> > > >
> > > > Updates at enqueue for RT are not strictly required. Though they do help
> > > > to reduce the delay for switching the frequency and the potential
> > > > observation of lower frequency during this delay. But current logic
> > > > doesn't intentionally (at least to my understanding) try to speed up the
> > > > request.
> > > >
> > > > To help reduce the amount of cpufreq updates and make them more
> > > > purposeful, consolidate them into these locations:
> > > >
> > > > 1. context_switch()
> > >
> > > I don't see any cpufreq update when switching from idle to CFS. We
> >
> > You mean SCHED_IDLE to SCHED_NORMAL, right? Yes, if we switch policies even
> > from fair to RT an update could be missed.
> 
> No I mean going out of idle. On an idle cpu, nothing happens at CFS
> task wakeup and we have to wait for the next tick to apply the new
> freq. This happens for both short task with uclamp min or long
> running/sleeping task (i.e. with high util_est)

And without my patch you see a freq change? If no stats were updated to cause
a decay, we will skip the cpufreq update at this context switch.

I'll audit the code again in case I missed a place where there's a decay. You
could be hitting a race condition with update_blocked_avg() sending a cpufreq
update and this could cause the context switch cpufreq update to be dropped by
rate limit.. You could try to reduce your rate_limit_us to see if this helps.

I'll try to reproduce and investigate. FWIW, I did test this on M1 mac mini and
pixel device running speedometer and some iowait workloads and didn't observe
problems.

> 
> >
> > I'll need to think more about it, but I think adding an update when we switch
> > policies in the syscall looks sufficient to me, if the task is on rq already.
> > Agreed?
> >
> > > have to wait for the next tick to get a freq update whatever the value
> > > of util_est and uclamp
> > >
> > > > 2. task_tick_fair()
> > >
> > > Updating only during tick is ok with a tick at 1000hz/1000us when we
> > > compare it with the1048us slice of pelt but what about 4ms or even
> > > 10ms tick ? we can have an increase of almost 200 in 10ms
> >
> > IMHO the current code can still fail with these setups to update frequencies in
> > time. If there's a single task on the rq, then the only freq update will happen
> > at tick. So this is an existing problem.
> 
> But any newly enqueued task can trigger a freq update without waiting
> 1/4/10ms whereas we need to wait for next tick with this patch

But it is racy. By deferring the decision (sampling point) we ensure better the
RUNNING task is reflecting the current state of the rq taken into account any
past events.

Note if there's no enqueue/dequeue, the problem is not fixed either way. If we
get two consecutive enqueues, the 2nd one will be dropped. And we'll end up
with delays.

I think this way we'd be just more consistently failing or working.

Systems with high rate_limit_us or TICK generally should ask for generous
headroom and I think this is the best way to address this issue in a scalable
way. People with fast systems/configurations can be more exact and frequent in
their requests. Systems/configurations that are slow will tend to exaggerate
each request to cater for the slow response. But the requests themselves are
done at better defined points of time. That's my hope at least, so I appreciate
the reviews :)

My other hope is that by doing the sampling at context switch we can better
handle uclamp and iowait boost requests which requires special cpufreq
constrains to be applied for this specifically RUNNING task. Ultimately leading
to removing uclamp max() aggregation at enqueue/dequeue.

> 
> >
> > The way I see it is that setting such high TICK values implies low
> > responsiveness by definition. So the person who selects this setup needs to
> > cater that their worst case scenario is that and be happy with it. And this
> > worst case scenario does not change.
> >
> > That said, the right way to cater for this is via my other series to remove the
> > magic margins. DVFS headroom should rely on TICK value to ensure we run at
> > adequate frequency until the next worst case scenario update, which relies on
> > TICK. Which is sufficient to handle util_est changes. See below for uclamp.
> >
> > Wake up preemption should cause context switches to happen sooner than a tick
> > too as we add more tasks on the rq. So I think the worst case scenario is not
> > really changing that much. In my view, it's better to be consistent about the
> > behavior.
> >
> > >
> > > > 3. {attach, detach}_entity_load_avg()
> > >
> > > At enqueue/dequeue, the util_est will be updated and can make cpu
> > > utilization quite different especially with long sleeping tasks. The
> > > same applies for uclamp_min/max hints of a newly enqueued task. We
> > > might end up waiting 4/10ms depending of the tick period.
> >
> > uclamp_min is a property of the task. And waiting for the task that needs the
> > boost to run is fine IMHO. And I am actually hoping to remove uclamp max()
> 
> But you will delay all CPU work and the running time fo the task

uclamp_min shouldn't cause other tasks to run faster? From my perspective this
is wasting power actually as what we want is only this task to run faster.

If the task wants better wake up latency (which I assume what you're referring
to by making things run faster then this task should get to RUNNING faster)
then we need to annotate this separately IMHO. We shouldn't rely on the boost
which is there to ensure this tasks gets work done at an acceptable rate, not
to impact its wake up latency.

> 
> And what about util_est ?

Hmm yeah this won't cause cfs_rq.decayed to trigger so we could miss it at
context switch. And could be what's causing the issue you're seeing above.
Qais Yousef May 11, 2024, 2:03 a.m. UTC | #10
On 05/07/24 14:53, Vincent Guittot wrote:

> No I mean going out of idle. On an idle cpu, nothing happens at CFS
> task wakeup and we have to wait for the next tick to apply the new
> freq. This happens for both short task with uclamp min or long
> running/sleeping task (i.e. with high util_est)

I think I found the problem, thanks for catching it!

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cbe79c8ac2ed..dea9383a906e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4714,7 +4714,7 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
        if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD))
                __update_load_avg_se(now, cfs_rq, se);

-       cfs_rq->decayed  = update_cfs_rq_load_avg(now, cfs_rq);
+       cfs_rq->decayed |= update_cfs_rq_load_avg(now, cfs_rq);
        cfs_rq->decayed |= propagate_entity_load_avg(se);

        if (!se->avg.last_update_time && (flags & DO_ATTACH)) {
Vincent Guittot May 13, 2024, 8:49 a.m. UTC | #11
On Thu, 9 May 2024 at 14:40, Qais Yousef <qyousef@layalina.io> wrote:
>
> On 05/07/24 14:53, Vincent Guittot wrote:
> > On Tue, 7 May 2024 at 13:08, Qais Yousef <qyousef@layalina.io> wrote:
> > >
> > > On 05/07/24 10:58, Vincent Guittot wrote:
> > > > On Mon, 6 May 2024 at 01:31, Qais Yousef <qyousef@layalina.io> wrote:
> > > > >
> > > > > Improve the interaction with cpufreq governors by making the
> > > > > cpufreq_update_util() calls more intentional.
> > > > >
> > > > > At the moment we send them when load is updated for CFS, bandwidth for
> > > > > DL and at enqueue/dequeue for RT. But this can lead to too many updates
> > > > > sent in a short period of time and potentially be ignored at a critical
> > > > > moment due to the rate_limit_us in schedutil.
> > > > >
> > > > > For example, simultaneous task enqueue on the CPU where 2nd task is
> > > > > bigger and requires higher freq. The trigger to cpufreq_update_util() by
> > > > > the first task will lead to dropping the 2nd request until tick. Or
> > > > > another CPU in the same policy triggers a freq update shortly after.
> > > > >
> > > > > Updates at enqueue for RT are not strictly required. Though they do help
> > > > > to reduce the delay for switching the frequency and the potential
> > > > > observation of lower frequency during this delay. But current logic
> > > > > doesn't intentionally (at least to my understanding) try to speed up the
> > > > > request.
> > > > >
> > > > > To help reduce the amount of cpufreq updates and make them more
> > > > > purposeful, consolidate them into these locations:
> > > > >
> > > > > 1. context_switch()
> > > >
> > > > I don't see any cpufreq update when switching from idle to CFS. We
> > >
> > > You mean SCHED_IDLE to SCHED_NORMAL, right? Yes, if we switch policies even
> > > from fair to RT an update could be missed.
> >
> > No I mean going out of idle. On an idle cpu, nothing happens at CFS
> > task wakeup and we have to wait for the next tick to apply the new
> > freq. This happens for both short task with uclamp min or long
> > running/sleeping task (i.e. with high util_est)
>
> And without my patch you see a freq change? If no stats were updated to cause

Yes, the behavior is correct without your patch

> a decay, we will skip the cpufreq update at this context switch.
>
> I'll audit the code again in case I missed a place where there's a decay. You
> could be hitting a race condition with update_blocked_avg() sending a cpufreq
> update and this could cause the context switch cpufreq update to be dropped by
> rate limit.. You could try to reduce your rate_limit_us to see if this helpsits

I already checked this. There is no update happening before the
wakeup. In fact I thought it was linked to a 1000 hz tick and the cpu
not stopping its tick but the problem is still there even after
removing such potential problems.

>
> I'll try to reproduce and investigate. FWIW, I did test this on M1 mac mini and
> pixel device running speedometer and some iowait workloads and didn't observe
> problems.

I'm using a dragonboard RB5 and I have a simple use case with a task
waking up periodically around 100ms for running 40ms.  As usual I'm
not using a strict 100ms to stay unsync with tick and cover more cases

>
> >
> > >
> > > I'll need to think more about it, but I think adding an update when we switch
> > > policies in the syscall looks sufficient to me, if the task is on rq already.
> > > Agreed?
> > >
> > > > have to wait for the next tick to get a freq update whatever the value
> > > > of util_est and uclamp
> > > >
> > > > > 2. task_tick_fair()
> > > >
> > > > Updating only during tick is ok with a tick at 1000hz/1000us when we
> > > > compare it with the1048us slice of pelt but what about 4ms or even
> > > > 10ms tick ? we can have an increase of almost 200 in 10ms
> > >
> > > IMHO the current code can still fail with these setups to update frequencies in
> > > time. If there's a single task on the rq, then the only freq update will happen
> > > at tick. So this is an existing problem.
> >
> > But any newly enqueued task can trigger a freq update without waiting
> > 1/4/10ms whereas we need to wait for next tick with this patch
>
> But it is racy. By deferring the decision (sampling point) we ensure better the
> RUNNING task is reflecting the current state of the rq taken into account any
> past events.
>
> Note if there's no enqueue/dequeue, the problem is not fixed either way. If we
> get two consecutive enqueues, the 2nd one will be dropped. And we'll end up
> with delays.
>
> I think this way we'd be just more consistently failing or working.
>
> Systems with high rate_limit_us or TICK generally should ask for generous
> headroom and I think this is the best way to address this issue in a scalable
> way. People with fast systems/configurations can be more exact and frequent in
> their requests. Systems/configurations that are slow will tend to exaggerate
> each request to cater for the slow response. But the requests themselves are
> done at better defined points of time. That's my hope at least, so I appreciate
> the reviews :)
>
> My other hope is that by doing the sampling at context switch we can better
> handle uclamp and iowait boost requests which requires special cpufreq
> constrains to be applied for this specifically RUNNING task. Ultimately leading
> to removing uclamp max() aggregation at enqueue/dequeue.
>
> >
> > >
> > > The way I see it is that setting such high TICK values implies low
> > > responsiveness by definition. So the person who selects this setup needs to
> > > cater that their worst case scenario is that and be happy with it. And this
> > > worst case scenario does not change.
> > >
> > > That said, the right way to cater for this is via my other series to remove the
> > > magic margins. DVFS headroom should rely on TICK value to ensure we run at
> > > adequate frequency until the next worst case scenario update, which relies on
> > > TICK. Which is sufficient to handle util_est changes. See below for uclamp.
> > >
> > > Wake up preemption should cause context switches to happen sooner than a tick
> > > too as we add more tasks on the rq. So I think the worst case scenario is not
> > > really changing that much. In my view, it's better to be consistent about the
> > > behavior.
> > >
> > > >
> > > > > 3. {attach, detach}_entity_load_avg()
> > > >
> > > > At enqueue/dequeue, the util_est will be updated and can make cpu
> > > > utilization quite different especially with long sleeping tasks. The
> > > > same applies for uclamp_min/max hints of a newly enqueued task. We
> > > > might end up waiting 4/10ms depending of the tick period.
> > >
> > > uclamp_min is a property of the task. And waiting for the task that needs the
> > > boost to run is fine IMHO. And I am actually hoping to remove uclamp max()
> >
> > But you will delay all CPU work and the running time fo the task
>
> uclamp_min shouldn't cause other tasks to run faster? From my perspective this
> is wasting power actually as what we want is only this task to run faster.
>
> If the task wants better wake up latency (which I assume what you're referring
> to by making things run faster then this task should get to RUNNING faster)
> then we need to annotate this separately IMHO. We shouldn't rely on the boost
> which is there to ensure this tasks gets work done at an acceptable rate, not
> to impact its wake up latency.
>
> >
> > And what about util_est ?
>
> Hmm yeah this won't cause cfs_rq.decayed to trigger so we could miss it at
> context switch. And could be what's causing the issue you're seeing above.
Vincent Guittot May 13, 2024, 8:50 a.m. UTC | #12
On Sat, 11 May 2024 at 04:03, Qais Yousef <qyousef@layalina.io> wrote:
>
> On 05/07/24 14:53, Vincent Guittot wrote:
>
> > No I mean going out of idle. On an idle cpu, nothing happens at CFS
> > task wakeup and we have to wait for the next tick to apply the new
> > freq. This happens for both short task with uclamp min or long
> > running/sleeping task (i.e. with high util_est)
>
> I think I found the problem, thanks for catching it!

Ok I will test it when I will be back and have access to the board

>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index cbe79c8ac2ed..dea9383a906e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4714,7 +4714,7 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>         if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD))
>                 __update_load_avg_se(now, cfs_rq, se);
>
> -       cfs_rq->decayed  = update_cfs_rq_load_avg(now, cfs_rq);
> +       cfs_rq->decayed |= update_cfs_rq_load_avg(now, cfs_rq);
>         cfs_rq->decayed |= propagate_entity_load_avg(se);
>
>         if (!se->avg.last_update_time && (flags & DO_ATTACH)) {
diff mbox series

Patch

diff --git a/include/linux/sched/cpufreq.h b/include/linux/sched/cpufreq.h
index bdd31ab93bc5..2d0a45aba16f 100644
--- a/include/linux/sched/cpufreq.h
+++ b/include/linux/sched/cpufreq.h
@@ -8,7 +8,8 @@ 
  * Interface between cpufreq drivers and the scheduler:
  */
 
-#define SCHED_CPUFREQ_IOWAIT	(1U << 0)
+#define SCHED_CPUFREQ_IOWAIT		(1U << 0)
+#define SCHED_CPUFREQ_FORCE_UPDATE	(1U << 1) /* ignore transition_delay_us */
 
 #ifdef CONFIG_CPU_FREQ
 struct cpufreq_policy;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1a914388144a..e6fe7dbd1f89 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5134,6 +5134,65 @@  static inline void balance_callbacks(struct rq *rq, struct balance_callback *hea
 
 #endif
 
+static inline void update_cpufreq_ctx_switch(struct rq *rq, struct task_struct *prev)
+{
+#ifdef CONFIG_CPU_FREQ
+	unsigned int flags = 0;
+
+#ifdef CONFIG_SMP
+	if (unlikely(current->sched_class == &stop_sched_class))
+		return;
+#endif
+
+	if (unlikely(current->sched_class == &idle_sched_class))
+		return;
+
+	if (unlikely(task_has_idle_policy(current)))
+		return;
+
+	if (likely(fair_policy(current->policy))) {
+
+		if (unlikely(current->in_iowait)) {
+			flags |= SCHED_CPUFREQ_IOWAIT | SCHED_CPUFREQ_FORCE_UPDATE;
+			goto force_update;
+		}
+
+#ifdef CONFIG_SMP
+		/*
+		 * Allow cpufreq updates once for every update_load_avg() decay.
+		 */
+		if (unlikely(rq->cfs.decayed)) {
+			rq->cfs.decayed = false;
+			goto force_update;
+		}
+#endif
+		return;
+	}
+
+	/*
+	 * RT and DL should always send a freq update. But we can do some
+	 * simple checks to avoid it when we know it's not necessary.
+	 */
+	if (rt_task(current) && rt_task(prev)) {
+#ifdef CONFIG_UCLAMP_TASK
+		unsigned long curr_uclamp_min = uclamp_eff_value(current, UCLAMP_MIN);
+		unsigned long prev_uclamp_min = uclamp_eff_value(prev, UCLAMP_MIN);
+
+		if (curr_uclamp_min == prev_uclamp_min)
+#endif
+			return;
+	} else if (dl_task(current) && current->dl.flags & SCHED_FLAG_SUGOV) {
+		/* Ignore sugov kthreads, they're responding to our requests */
+		return;
+	}
+
+	flags |= SCHED_CPUFREQ_FORCE_UPDATE;
+
+force_update:
+	cpufreq_update_util(rq, flags);
+#endif
+}
+
 static inline void
 prepare_lock_switch(struct rq *rq, struct task_struct *next, struct rq_flags *rf)
 {
@@ -5151,7 +5210,7 @@  prepare_lock_switch(struct rq *rq, struct task_struct *next, struct rq_flags *rf
 #endif
 }
 
-static inline void finish_lock_switch(struct rq *rq)
+static inline void finish_lock_switch(struct rq *rq, struct task_struct *prev)
 {
 	/*
 	 * If we are tracking spinlock dependencies then we have to
@@ -5160,6 +5219,11 @@  static inline void finish_lock_switch(struct rq *rq)
 	 */
 	spin_acquire(&__rq_lockp(rq)->dep_map, 0, 0, _THIS_IP_);
 	__balance_callbacks(rq);
+	/*
+	 * Request freq update after __balance_callbacks to take into account
+	 * any changes to rq.
+	 */
+	update_cpufreq_ctx_switch(rq, prev);
 	raw_spin_rq_unlock_irq(rq);
 }
 
@@ -5278,7 +5342,7 @@  static struct rq *finish_task_switch(struct task_struct *prev)
 	perf_event_task_sched_in(prev, current);
 	finish_task(prev);
 	tick_nohz_task_switch();
-	finish_lock_switch(rq);
+	finish_lock_switch(rq, prev);
 	finish_arch_post_lock_switch();
 	kcov_finish_switch(current);
 	/*
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index eece6244f9d2..e8b65b75e7f3 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -59,7 +59,8 @@  static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu);
 
 /************************ Governor internals ***********************/
 
-static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
+static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time,
+				     unsigned int flags)
 {
 	s64 delta_ns;
 
@@ -87,13 +88,16 @@  static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
 		return true;
 	}
 
+	if (unlikely(flags & SCHED_CPUFREQ_FORCE_UPDATE))
+		return true;
+
 	delta_ns = time - sg_policy->last_freq_update_time;
 
 	return delta_ns >= sg_policy->freq_update_delay_ns;
 }
 
 static bool sugov_update_next_freq(struct sugov_policy *sg_policy, u64 time,
-				   unsigned int next_freq)
+				   unsigned int next_freq, unsigned int flags)
 {
 	if (sg_policy->need_freq_update)
 		sg_policy->need_freq_update = cpufreq_driver_test_flags(CPUFREQ_NEED_UPDATE_LIMITS);
@@ -101,7 +105,9 @@  static bool sugov_update_next_freq(struct sugov_policy *sg_policy, u64 time,
 		return false;
 
 	sg_policy->next_freq = next_freq;
-	sg_policy->last_freq_update_time = time;
+
+	if (!unlikely(flags & SCHED_CPUFREQ_FORCE_UPDATE))
+		sg_policy->last_freq_update_time = time;
 
 	return true;
 }
@@ -249,9 +255,10 @@  static void sugov_iowait_boost(struct sugov_cpu *sg_cpu, u64 time,
 			       unsigned int flags)
 {
 	bool set_iowait_boost = flags & SCHED_CPUFREQ_IOWAIT;
+	bool forced_update = flags & SCHED_CPUFREQ_FORCE_UPDATE;
 
 	/* Reset boost if the CPU appears to have been idle enough */
-	if (sg_cpu->iowait_boost &&
+	if (sg_cpu->iowait_boost && !forced_update &&
 	    sugov_iowait_reset(sg_cpu, time, set_iowait_boost))
 		return;
 
@@ -294,17 +301,34 @@  static void sugov_iowait_boost(struct sugov_cpu *sg_cpu, u64 time,
  * being more conservative on tasks which does sporadic IO operations.
  */
 static unsigned long sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time,
-			       unsigned long max_cap)
+			       unsigned long max_cap, unsigned int flags)
 {
+	bool forced_update = flags & SCHED_CPUFREQ_FORCE_UPDATE;
+	s64 delta_ns = time - sg_cpu->last_update;
+
 	/* No boost currently required */
 	if (!sg_cpu->iowait_boost)
 		return 0;
 
+	if (forced_update)
+		goto apply_boost;
+
 	/* Reset boost if the CPU appears to have been idle enough */
 	if (sugov_iowait_reset(sg_cpu, time, false))
 		return 0;
 
 	if (!sg_cpu->iowait_boost_pending) {
+		/*
+		 * This logic relied on PELT signal decays happening once every
+		 * 1ms. But due to changes to how updates are done now, we can
+		 * end up with more request coming up leading to iowait boost
+		 * to be prematurely reduced. Make the assumption explicit
+		 * until we improve the iowait boost logic to be better in
+		 * general as it is due for an overhaul.
+		 */
+		if (delta_ns <= NSEC_PER_MSEC)
+			goto apply_boost;
+
 		/*
 		 * No boost pending; reduce the boost value.
 		 */
@@ -315,6 +339,7 @@  static unsigned long sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time,
 		}
 	}
 
+apply_boost:
 	sg_cpu->iowait_boost_pending = false;
 
 	/*
@@ -358,10 +383,10 @@  static inline bool sugov_update_single_common(struct sugov_cpu *sg_cpu,
 
 	ignore_dl_rate_limit(sg_cpu);
 
-	if (!sugov_should_update_freq(sg_cpu->sg_policy, time))
+	if (!sugov_should_update_freq(sg_cpu->sg_policy, time, flags))
 		return false;
 
-	boost = sugov_iowait_apply(sg_cpu, time, max_cap);
+	boost = sugov_iowait_apply(sg_cpu, time, max_cap, flags);
 	sugov_get_util(sg_cpu, boost);
 
 	return true;
@@ -397,7 +422,7 @@  static void sugov_update_single_freq(struct update_util_data *hook, u64 time,
 		sg_policy->cached_raw_freq = cached_freq;
 	}
 
-	if (!sugov_update_next_freq(sg_policy, time, next_f))
+	if (!sugov_update_next_freq(sg_policy, time, next_f, flags))
 		return;
 
 	/*
@@ -449,10 +474,12 @@  static void sugov_update_single_perf(struct update_util_data *hook, u64 time,
 	cpufreq_driver_adjust_perf(sg_cpu->cpu, sg_cpu->bw_min,
 				   sg_cpu->util, max_cap);
 
-	sg_cpu->sg_policy->last_freq_update_time = time;
+	if (!unlikely(flags & SCHED_CPUFREQ_FORCE_UPDATE))
+		sg_cpu->sg_policy->last_freq_update_time = time;
 }
 
-static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
+static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time,
+					   unsigned int flags)
 {
 	struct sugov_policy *sg_policy = sg_cpu->sg_policy;
 	struct cpufreq_policy *policy = sg_policy->policy;
@@ -465,7 +492,7 @@  static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
 		struct sugov_cpu *j_sg_cpu = &per_cpu(sugov_cpu, j);
 		unsigned long boost;
 
-		boost = sugov_iowait_apply(j_sg_cpu, time, max_cap);
+		boost = sugov_iowait_apply(j_sg_cpu, time, max_cap, flags);
 		sugov_get_util(j_sg_cpu, boost);
 
 		util = max(j_sg_cpu->util, util);
@@ -488,10 +515,10 @@  sugov_update_shared(struct update_util_data *hook, u64 time, unsigned int flags)
 
 	ignore_dl_rate_limit(sg_cpu);
 
-	if (sugov_should_update_freq(sg_policy, time)) {
-		next_f = sugov_next_freq_shared(sg_cpu, time);
+	if (sugov_should_update_freq(sg_policy, time, flags)) {
+		next_f = sugov_next_freq_shared(sg_cpu, time, flags);
 
-		if (!sugov_update_next_freq(sg_policy, time, next_f))
+		if (!sugov_update_next_freq(sg_policy, time, next_f, flags))
 			goto unlock;
 
 		if (sg_policy->policy->fast_switch_enabled)
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index a04a436af8cc..02c9c2488091 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -252,8 +252,6 @@  void __add_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
 	dl_rq->running_bw += dl_bw;
 	SCHED_WARN_ON(dl_rq->running_bw < old); /* overflow */
 	SCHED_WARN_ON(dl_rq->running_bw > dl_rq->this_bw);
-	/* kick cpufreq (see the comment in kernel/sched/sched.h). */
-	cpufreq_update_util(rq_of_dl_rq(dl_rq), 0);
 }
 
 static inline
@@ -266,8 +264,6 @@  void __sub_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
 	SCHED_WARN_ON(dl_rq->running_bw > old); /* underflow */
 	if (dl_rq->running_bw > old)
 		dl_rq->running_bw = 0;
-	/* kick cpufreq (see the comment in kernel/sched/sched.h). */
-	cpufreq_update_util(rq_of_dl_rq(dl_rq), 0);
 }
 
 static inline
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9eb63573110c..cbe79c8ac2ed 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3982,29 +3982,6 @@  static inline void update_cfs_group(struct sched_entity *se)
 }
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
-static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq, int flags)
-{
-	struct rq *rq = rq_of(cfs_rq);
-
-	if (&rq->cfs == cfs_rq) {
-		/*
-		 * There are a few boundary cases this might miss but it should
-		 * get called often enough that that should (hopefully) not be
-		 * a real problem.
-		 *
-		 * It will not get called when we go idle, because the idle
-		 * thread is a different class (!fair), nor will the utilization
-		 * number include things like RT tasks.
-		 *
-		 * As is, the util number is not freq-invariant (we'd have to
-		 * implement arch_scale_freq_capacity() for that).
-		 *
-		 * See cpu_util_cfs().
-		 */
-		cpufreq_update_util(rq, flags);
-	}
-}
-
 #ifdef CONFIG_SMP
 static inline bool load_avg_is_decayed(struct sched_avg *sa)
 {
@@ -4682,7 +4659,7 @@  static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 
 	add_tg_cfs_propagate(cfs_rq, se->avg.load_sum);
 
-	cfs_rq_util_change(cfs_rq, 0);
+	cpufreq_update_util(rq_of(cfs_rq), 0);
 
 	trace_pelt_cfs_tp(cfs_rq);
 }
@@ -4712,7 +4689,7 @@  static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 
 	add_tg_cfs_propagate(cfs_rq, -se->avg.load_sum);
 
-	cfs_rq_util_change(cfs_rq, 0);
+	cpufreq_update_util(rq_of(cfs_rq), 0);
 
 	trace_pelt_cfs_tp(cfs_rq);
 }
@@ -4729,7 +4706,6 @@  static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
 	u64 now = cfs_rq_clock_pelt(cfs_rq);
-	int decayed;
 
 	/*
 	 * Track task load average for carrying it to new CPU after migrated, and
@@ -4738,8 +4714,8 @@  static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 	if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD))
 		__update_load_avg_se(now, cfs_rq, se);
 
-	decayed  = update_cfs_rq_load_avg(now, cfs_rq);
-	decayed |= propagate_entity_load_avg(se);
+	cfs_rq->decayed  = update_cfs_rq_load_avg(now, cfs_rq);
+	cfs_rq->decayed |= propagate_entity_load_avg(se);
 
 	if (!se->avg.last_update_time && (flags & DO_ATTACH)) {
 
@@ -4760,11 +4736,8 @@  static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 		 */
 		detach_entity_load_avg(cfs_rq, se);
 		update_tg_load_avg(cfs_rq);
-	} else if (decayed) {
-		cfs_rq_util_change(cfs_rq, 0);
-
-		if (flags & UPDATE_TG)
-			update_tg_load_avg(cfs_rq);
+	} else if (cfs_rq->decayed && (flags & UPDATE_TG)) {
+		update_tg_load_avg(cfs_rq);
 	}
 }
 
@@ -5139,7 +5112,6 @@  static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
 
 static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int not_used1)
 {
-	cfs_rq_util_change(cfs_rq, 0);
 }
 
 static inline void remove_entity_load_avg(struct sched_entity *se) {}
@@ -6754,14 +6726,6 @@  enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	 */
 	util_est_enqueue(&rq->cfs, p);
 
-	/*
-	 * If in_iowait is set, the code below may not trigger any cpufreq
-	 * utilization updates, so do it here explicitly with the IOWAIT flag
-	 * passed.
-	 */
-	if (p->in_iowait)
-		cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT);
-
 	for_each_sched_entity(se) {
 		if (se->on_rq)
 			break;
@@ -9351,10 +9315,6 @@  static bool __update_blocked_others(struct rq *rq, bool *done)
 	unsigned long hw_pressure;
 	bool decayed;
 
-	/*
-	 * update_load_avg() can call cpufreq_update_util(). Make sure that RT,
-	 * DL and IRQ signals have been updated before updating CFS.
-	 */
 	curr_class = rq->curr->sched_class;
 
 	hw_pressure = arch_scale_hw_pressure(cpu_of(rq));
@@ -12685,6 +12645,7 @@  static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 
 	update_misfit_status(curr, rq);
 	check_update_overutilized_status(task_rq(curr));
+	cpufreq_update_util(rq, 0);
 
 	task_tick_core(rq, curr);
 }
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 3261b067b67e..fe6d8b0ffa95 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -556,11 +556,8 @@  static void sched_rt_rq_dequeue(struct rt_rq *rt_rq)
 
 	rt_se = rt_rq->tg->rt_se[cpu];
 
-	if (!rt_se) {
+	if (!rt_se)
 		dequeue_top_rt_rq(rt_rq, rt_rq->rt_nr_running);
-		/* Kick cpufreq (see the comment in kernel/sched/sched.h). */
-		cpufreq_update_util(rq_of_rt_rq(rt_rq), 0);
-	}
 	else if (on_rt_rq(rt_se))
 		dequeue_rt_entity(rt_se, 0);
 }
@@ -1065,9 +1062,6 @@  enqueue_top_rt_rq(struct rt_rq *rt_rq)
 		add_nr_running(rq, rt_rq->rt_nr_running);
 		rt_rq->rt_queued = 1;
 	}
-
-	/* Kick cpufreq (see the comment in kernel/sched/sched.h). */
-	cpufreq_update_util(rq, 0);
 }
 
 #if defined CONFIG_SMP
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index cb3792c04eea..86cec2145221 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -632,6 +632,11 @@  struct cfs_rq {
 		unsigned long	runnable_avg;
 	} removed;
 
+	/*
+	 * Store whether last update_load_avg() has decayed
+	 */
+	bool			decayed;
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	u64			last_update_tg_load_avg;
 	unsigned long		tg_load_avg_contrib;