[1/2] sched/fair: move cpufreq hook to update_cfs_rq_load_avg()

Message ID 1458606068-7476-1-git-send-email-smuckle@linaro.org
State Accepted
Commit 21e96f88776deead303ecd30a17d1d7c2a1776e3

Commit Message

Steve Muckle March 22, 2016, 12:21 a.m. UTC
The cpufreq hook should be called whenever the root cfs_rq
utilization changes so update_cfs_rq_load_avg() is a better
place for it. The current location is not invoked in the
enqueue_entity() or update_blocked_averages() paths.

Suggested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Steve Muckle <smuckle@linaro.org>

---
 kernel/sched/fair.c | 50 ++++++++++++++++++++++++++------------------------
 1 file changed, 26 insertions(+), 24 deletions(-)

-- 
2.4.10
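
Both paths named in the commit message call update_cfs_rq_load_avg()
directly rather than going through update_load_avg(), which is why
moving the hook there covers them. A condensed sketch of the two
callers (shapes simplified from the v4.5-era kernel/sched/fair.c this
patch applies to; illustration only, not part of the patch):

/* enqueue_task_fair() -> enqueue_entity() -> enqueue_entity_load_avg() */
static inline void
enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	u64 now = cfs_rq_clock_task(cfs_rq);

	/* after this patch, the cpufreq hook fires from in here */
	update_cfs_rq_load_avg(now, cfs_rq);

	/* ... attach the entity's own load/util contribution ... */
}

/* idle_balance()/rebalance_domains() -> update_blocked_averages() */
static void update_blocked_averages(int cpu)
{
	struct rq *rq = cpu_rq(cpu);
	struct cfs_rq *cfs_rq;

	/* decay the blocked load/util of every cfs_rq on this CPU */
	for_each_leaf_cfs_rq(rq, cfs_rq)
		update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq);

	/* ... rq lock handling trimmed ... */
}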

Comments

Steve Muckle March 28, 2016, 7:38 p.m. UTC | #1
On 03/28/2016 11:30 AM, Dietmar Eggemann wrote:
> On 03/28/2016 06:34 PM, Steve Muckle wrote:
>> Hi Dietmar,
>>
>> On 03/28/2016 05:02 AM, Dietmar Eggemann wrote:
>>> Hi Steve,
>>>
>>> these patches fall into the bucket of 'optimization of updating the
>>> value only if the root cfs_rq util has changed' as discussed in '[PATCH
>>> 5/8] sched/cpufreq: pass sched class into cpufreq_update_util' of Mike
>>> T's current series '[PATCH 0/8] schedutil enhancements', right?
>>
>> I would say just the second patch is an optimization. The first and
>> third patches cover additional paths in CFS where the hook should be
>> called but currently is not, which I think is a correctness issue.
>
> Not disagreeing here, but I don't know if this level of accuracy is
> really needed. I mean we currently miss updates in
> enqueue_task_fair()->enqueue_entity()->enqueue_entity_load_avg() and
> idle_balance()/rebalance_domains()->update_blocked_averages(), but there
> are plenty of call sites of update_load_avg(se, ...) with
> '&rq_of(cfs_rq_of(se))->cfs == cfs_rq_of(se)'.
>
> The question for me is: does schedutil work better with this new, more
> accurate signal? IMO, not receiving a bunch of consecutive
> cpufreq_update_util() calls w/ the same 'util' value is probably a good
> thing, unless we see the interaction with the RT/DL classes as mentioned
> by Sai. Here, an agreement on the design for the 'capacity vote
> aggregation from CFS/RT/DL' would help to clarify things.

Without covering all the paths where CFS utilization changes, it's
possible to have to wait up to a tick to act on some changes, since the
tick is the only guaranteed regularly-occurring instance of the hook.
That's an unacceptable amount of latency IMO...

thanks,
Steve
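
The tick path Steve refers to reaches the hook roughly like this (a
trimmed sketch of entity_tick() from the v4.5-era fair.c this patch
applies to; illustration only, not part of the patch):

/*
 * scheduler_tick() -> task_tick_fair() -> entity_tick(): the periodic
 * tick is the one caller guaranteed to run on every CPU, so it bounds
 * how stale the cpufreq hook's view of utilization can get.
 */
static void
entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
{
	update_curr(cfs_rq);

	/* reaches update_cfs_rq_load_avg() and hence the cpufreq hook */
	update_load_avg(curr, 1);

	/* ... preemption/hrtick handling trimmed ... */
}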

Vincent Guittot March 31, 2016, 9:27 a.m. UTC | #2

On 30 March 2016 at 21:35, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, Mar 28, 2016 at 12:38:26PM -0700, Steve Muckle wrote:
>> Without covering all the paths where CFS utilization changes, it's
>> possible to have to wait up to a tick to act on some changes, since the
>> tick is the only guaranteed regularly-occurring instance of the hook.
>> That's an unacceptable amount of latency IMO...
>
> Note that even with your patches that might still be the case. Remote
> wakeups might not happen on the destination CPU at all, so it might not
> be until the next tick (which always happens locally) that we'll
> 'observe' the utilization change brought with the wakeups.
>
> We could force all the remote wakeups to IPI the destination CPU, but
> that comes at a significant performance cost.

Isn't a reschedule IPI already sent in this case?

Vincent Guittot March 31, 2016, 12:50 p.m. UTC | #3

On 31 March 2016 at 14:34, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, Mar 31, 2016 at 02:14:50PM +0200, Vincent Guittot wrote:
>> In fact, I looked for the sequence where the utilization of a rq is not
>> updated until the next tick, but I can't find it.
>
> No, util is always updated, however..
>
>> If the cpu doesn't share its cache, the task is added to the wake list,
>> an IPI is sent, and the utilization is updated on the target cpu.
>
> Here we run:
>
>  ttwu_do_activate()
>    ttwu_activate()
>      activate_task()
>        enqueue_task()
>          p->sched_class->enqueue_task() := enqueue_task_fair()
>            update_load_avg()
>              update_cfs_rq_load_avg()
>                cfs_rq_util_change()
>
> on the local cpu, and we can indeed call out to have the frequency
> changed.
>
>> Otherwise, we directly enqueue the task on
>> the rq and the utilization is updated.
>
> But here we run it on a remote cpu, so we cannot call out and the
> frequency remains the same.
>
> So if a remote wakeup on the same LLC domain happens, utilization will
> increase but we will not observe it until the next tick.

Ok, I forgot that we have the condition cpu == smp_processor_id() in
cfs_rq_util_change().
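
For reference, the condition Vincent mentions, as it appears in the hunk
below (wrapped here in the cfs_rq_util_change() helper name from Peter's
call chain; in this patch the check still sits inline in
update_cfs_rq_load_avg()):

/*
 * A remote update still refreshes util_avg, but only an update running
 * on the CPU that owns the rq, and against the root cfs_rq, may call
 * out to cpufreq; hence the remote-wakeup blind spot discussed above.
 */
static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq)
{
	struct rq *rq = rq_of(cfs_rq);

	if (cpu_of(rq) == smp_processor_id() && &rq->cfs == cfs_rq) {
		unsigned long max = rq->cpu_capacity_orig;

		cpufreq_update_util(rq_clock(rq),
				    min(cfs_rq->avg.util_avg, max), max);
	}
}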

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 46d64e4ccfde..d418deb04049 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2825,7 +2825,9 @@  static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq);
 static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
 {
 	struct sched_avg *sa = &cfs_rq->avg;
+	struct rq *rq = rq_of(cfs_rq);
 	int decayed, removed = 0;
+	int cpu = cpu_of(rq);
 
 	if (atomic_long_read(&cfs_rq->removed_load_avg)) {
 		s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
@@ -2840,7 +2842,7 @@  static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
 		sa->util_sum = max_t(s32, sa->util_sum - r * LOAD_AVG_MAX, 0);
 	}
 
-	decayed = __update_load_avg(now, cpu_of(rq_of(cfs_rq)), sa,
+	decayed = __update_load_avg(now, cpu, sa,
 		scale_load_down(cfs_rq->load.weight), cfs_rq->curr != NULL, cfs_rq);
 
 #ifndef CONFIG_64BIT
@@ -2848,28 +2850,6 @@  static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
 	cfs_rq->load_last_update_time_copy = sa->last_update_time;
 #endif
 
-	return decayed || removed;
-}
-
-/* Update task and its cfs_rq load average */
-static inline void update_load_avg(struct sched_entity *se, int update_tg)
-{
-	struct cfs_rq *cfs_rq = cfs_rq_of(se);
-	u64 now = cfs_rq_clock_task(cfs_rq);
-	struct rq *rq = rq_of(cfs_rq);
-	int cpu = cpu_of(rq);
-
-	/*
-	 * Track task load average for carrying it to new CPU after migrated, and
-	 * track group sched_entity load average for task_h_load calc in migration
-	 */
-	__update_load_avg(now, cpu, &se->avg,
-			  se->on_rq * scale_load_down(se->load.weight),
-			  cfs_rq->curr == se, NULL);
-
-	if (update_cfs_rq_load_avg(now, cfs_rq) && update_tg)
-		update_tg_load_avg(cfs_rq, 0);
-
 	if (cpu == smp_processor_id() && &rq->cfs == cfs_rq) {
 		unsigned long max = rq->cpu_capacity_orig;
 
@@ -2890,8 +2870,30 @@  static inline void update_load_avg(struct sched_entity *se, int update_tg)
 		 * See cpu_util().
 		 */
 		cpufreq_update_util(rq_clock(rq),
-				    min(cfs_rq->avg.util_avg, max), max);
+				    min(sa->util_avg, max), max);
 	}
+
+	return decayed || removed;
+}
+
+/* Update task and its cfs_rq load average */
+static inline void update_load_avg(struct sched_entity *se, int update_tg)
+{
+	struct cfs_rq *cfs_rq = cfs_rq_of(se);
+	u64 now = cfs_rq_clock_task(cfs_rq);
+	struct rq *rq = rq_of(cfs_rq);
+	int cpu = cpu_of(rq);
+
+	/*
+	 * Track task load average for carrying it to new CPU after migrated, and
+	 * track group sched_entity load average for task_h_load calc in migration
+	 */
+	__update_load_avg(now, cpu, &se->avg,
+			  se->on_rq * scale_load_down(se->load.weight),
+			  cfs_rq->curr == se, NULL);
+
+	if (update_cfs_rq_load_avg(now, cfs_rq) && update_tg)
+		update_tg_load_avg(cfs_rq, 0);
 }
 
 static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)