[RESEND,v2,1/2] sched/rt: add utilization tracking

Message ID 1501854022-22128-2-git-send-email-vincent.guittot@linaro.org
State New
Headers show
Series
  • sched/rt: track rt rq utilization
Related show

Commit Message

Vincent Guittot Aug. 4, 2017, 1:40 p.m.
schedutil governor relies on cfs_rq's util_avg to choose the OPP when cfs
tasks are running. When the CPU is overloaded by cfs and rt tasks, cfs tasks
are preempted by rt tasks and in this case util_avg reflects the remaining
capacity that is used by cfs tasks but not what cfs tasks want to use. In such
case, schedutil can select a lower OPP when cfs task runs whereas the CPU is
overloaded. In order to have a more accurate view of the utilization of the
CPU, we track the utilization that is used by RT tasks.
DL tasks are not taken into account as they have their own utilization
tracking mecanism.

We don't use rt_avg which doesn't have the same dynamic as PELT and which
can include IRQ time.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>


---

Change since v1:
- rebase on tip/sched/core

There were several comments on v1:
- As raised by Peter for v1, if IRQ time is taken into account in
  rt_avg, it will not be accounted in rq->clock_task. This means that cfs
  utilization is not affected by some extra contributions or decays
  because of IRQ.  
- Regading the sync of rt and cfs utilization, both cfs and rt use the same
  rq->clock_task. Then, we have the same issue than cfs regarding blocked value.
  The utilization of idle cfs/rt rqs are not updated regularly but only when a
  load_balance is triggered (more precisely a call to update_blocked_average).
  I'd like to fix this issue for both cfs and rt with a separate patch that
  will ensure that utilization (and load) are updated regularly even for
  idle CPUs
- One last open question is the location of rt utilization function in fair.c
  file. PELT related funtions should probably move in a dedicated pelt.c file.
  This would also help to address one comment about having a place to update
  metrics of NOHZ idle CPUs. Thought ?

 kernel/sched/fair.c  | 21 +++++++++++++++++++++
 kernel/sched/rt.c    |  9 +++++++++
 kernel/sched/sched.h |  3 +++
 3 files changed, 33 insertions(+)

-- 
2.7.4

Comments

Peter Zijlstra Aug. 7, 2017, 4:44 p.m. | #1
On Fri, Aug 04, 2017 at 03:40:21PM +0200, Vincent Guittot wrote:

> There were several comments on v1:

> - As raised by Peter for v1, if IRQ time is taken into account in

>   rt_avg, it will not be accounted in rq->clock_task. This means that cfs

>   utilization is not affected by some extra contributions or decays

>   because of IRQ.  


Right.

> - Regading the sync of rt and cfs utilization, both cfs and rt use the same

>   rq->clock_task. Then, we have the same issue than cfs regarding blocked value.

>   The utilization of idle cfs/rt rqs are not updated regularly but only when a

>   load_balance is triggered (more precisely a call to update_blocked_average).

>   I'd like to fix this issue for both cfs and rt with a separate patch that

>   will ensure that utilization (and load) are updated regularly even for

>   idle CPUs


Yeah, that needs help.

> - One last open question is the location of rt utilization function in fair.c

>   file. PELT related funtions should probably move in a dedicated pelt.c file.

>   This would also help to address one comment about having a place to update

>   metrics of NOHZ idle CPUs. Thought ?


Probably, but I have a bunch of patches lined up changing that code, so
lets not do that now.

In any case, would something like the attached patches make sense? It
completely replaces rt_avg with separate IRQ,RT and DL tracking.Subject: sched: Replace rt_avg
From: Peter Zijlstra <peterz@infradead.org>

Date: Mon Aug  7 18:26:25 CEST 2017

Now that we have separate IRQ, RT and DL utilization tracking, we can
fully replace the old rt_avg code, reducing the amount of statistics.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

---
 include/linux/sched/sysctl.h |    1 -
 kernel/sched/core.c          |   40 ++--------------------------------------
 kernel/sched/deadline.c      |    2 --
 kernel/sched/fair.c          |   21 +++++----------------
 kernel/sched/rt.c            |    2 --
 kernel/sched/sched.h         |   26 +-------------------------
 kernel/sysctl.c              |    7 -------
 7 files changed, 8 insertions(+), 91 deletions(-)

--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -39,7 +39,6 @@ extern unsigned int sysctl_numa_balancin
 #ifdef CONFIG_SCHED_DEBUG
 extern unsigned int sysctl_sched_migration_cost;
 extern unsigned int sysctl_sched_nr_migrate;
-extern unsigned int sysctl_sched_time_avg;
 
 int sched_proc_update_handler(struct ctl_table *table, int write,
 		void __user *buffer, size_t *length,
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -62,14 +62,6 @@ const_debug unsigned int sysctl_sched_fe
 const_debug unsigned int sysctl_sched_nr_migrate = 32;
 
 /*
- * period over which we average the RT time consumption, measured
- * in ms.
- *
- * default: 1s
- */
-const_debug unsigned int sysctl_sched_time_avg = MSEC_PER_SEC;
-
-/*
  * period over which we measure -rt task CPU usage in us.
  * default: 1s
  */
@@ -161,7 +153,7 @@ static void update_rq_clock_task(struct
 {
 /*
  * In theory, the compile should just see 0 here, and optimize out the call
- * to sched_rt_avg_update. But I don't trust it...
+ * to update_irq_load_avg(). But I don't trust it...
  */
 #if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
 	s64 steal = 0, irq_delta = 0;
@@ -206,10 +198,8 @@ static void update_rq_clock_task(struct
 	rq->clock_task += delta;
 
 #if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
-	if ((irq_delta + steal) && sched_feat(NONTASK_CAPACITY)) {
-		sched_rt_avg_update(rq, irq_delta + steal);
+	if ((irq_delta + steal) && sched_feat(NONTASK_CAPACITY))
 		update_irq_load_avg(rq->clock, cpu_of(rq), rq, 1);
-	}
 #endif
 }
 
@@ -673,23 +663,6 @@ bool sched_can_stop_tick(struct rq *rq)
 	return true;
 }
 #endif /* CONFIG_NO_HZ_FULL */
-
-void sched_avg_update(struct rq *rq)
-{
-	s64 period = sched_avg_period();
-
-	while ((s64)(rq_clock(rq) - rq->age_stamp) > period) {
-		/*
-		 * Inline assembly required to prevent the compiler
-		 * optimising this loop into a divmod call.
-		 * See __iter_div_u64_rem() for another example of this.
-		 */
-		asm("" : "+rm" (rq->age_stamp));
-		rq->age_stamp += period;
-		rq->rt_avg /= 2;
-	}
-}
-
 #endif /* CONFIG_SMP */
 
 #if defined(CONFIG_RT_GROUP_SCHED) || (defined(CONFIG_FAIR_GROUP_SCHED) && \
@@ -5515,13 +5488,6 @@ void set_rq_offline(struct rq *rq)
 	}
 }
 
-static void set_cpu_rq_start_time(unsigned int cpu)
-{
-	struct rq *rq = cpu_rq(cpu);
-
-	rq->age_stamp = sched_clock_cpu(cpu);
-}
-
 /*
  * used to mark begin/end of suspend/resume:
  */
@@ -5640,7 +5606,6 @@ static void sched_rq_cpu_starting(unsign
 
 int sched_cpu_starting(unsigned int cpu)
 {
-	set_cpu_rq_start_time(cpu);
 	sched_rq_cpu_starting(cpu);
 	return 0;
 }
@@ -5919,7 +5884,6 @@ void __init sched_init(void)
 	if (cpu_isolated_map == NULL)
 		zalloc_cpumask_var(&cpu_isolated_map, GFP_NOWAIT);
 	idle_thread_set_boot_cpu();
-	set_cpu_rq_start_time(smp_processor_id());
 #endif
 	init_sched_fair_class();
 
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1147,8 +1147,6 @@ static void update_curr_dl(struct rq *rq
 	curr->se.exec_start = rq_clock_task(rq);
 	cpuacct_charge(curr, delta_exec);
 
-	sched_rt_avg_update(rq, delta_exec);
-
 	if (unlikely(dl_se->flags & SCHED_FLAG_RECLAIM))
 		delta_exec = grub_reclaim(delta_exec, rq, &curr->dl);
 	dl_se->runtime -= delta_exec;
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5151,8 +5151,6 @@ static void cpu_load_update(struct rq *t
 
 		this_rq->cpu_load[i] = (old_load * (scale - 1) + new_load) >> i;
 	}
-
-	sched_avg_update(this_rq);
 }
 
 /* Used instead of source_load when we know the type == 0 */
@@ -7134,23 +7132,14 @@ static inline int get_sd_load_idx(struct
 static unsigned long scale_rt_capacity(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
-	u64 total, used, age_stamp, avg;
-	s64 delta;
+	unsigned long used;
 
 	/*
-	 * Since we're reading these variables without serialization make sure
-	 * we read them once before doing sanity checks on them.
+	 * Subtract IRQ, RT and DL time as all of those preempt CFS.
 	 */
-	age_stamp = READ_ONCE(rq->age_stamp);
-	avg = READ_ONCE(rq->rt_avg);
-	delta = __rq_clock_broken(rq) - age_stamp;
-
-	if (unlikely(delta < 0))
-		delta = 0;
-
-	total = sched_avg_period() + delta;
-
-	used = div_u64(avg, total);
+	used = READ_ONCE(rq->irq_avg.util_avg) +
+	       READ_ONCE(rq->rt.avg.util_avg)  +
+	       READ_ONCE(rq->dl.avg.util_avg);
 
 	if (likely(used < SCHED_CAPACITY_SCALE))
 		return SCHED_CAPACITY_SCALE - used;
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -981,8 +981,6 @@ static void update_curr_rt(struct rq *rq
 	curr->se.exec_start = rq_clock_task(rq);
 	cpuacct_charge(curr, delta_exec);
 
-	sched_rt_avg_update(rq, delta_exec);
-
 	if (!rt_bandwidth_enabled())
 		return;
 
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -750,8 +750,6 @@ struct rq {
 
 	struct sched_avg irq_avg;
 
-	u64 rt_avg;
-	u64 age_stamp;
 	u64 idle_stamp;
 	u64 avg_idle;
 
@@ -843,11 +841,6 @@ DECLARE_PER_CPU_SHARED_ALIGNED(struct rq
 #define cpu_curr(cpu)		(cpu_rq(cpu)->curr)
 #define raw_rq()		raw_cpu_ptr(&runqueues)
 
-static inline u64 __rq_clock_broken(struct rq *rq)
-{
-	return READ_ONCE(rq->clock);
-}
-
 /*
  * rq::clock_update_flags bits
  *
@@ -1621,15 +1614,9 @@ extern void deactivate_task(struct rq *r
 
 extern void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags);
 
-extern const_debug unsigned int sysctl_sched_time_avg;
 extern const_debug unsigned int sysctl_sched_nr_migrate;
 extern const_debug unsigned int sysctl_sched_migration_cost;
 
-static inline u64 sched_avg_period(void)
-{
-	return (u64)sysctl_sched_time_avg * NSEC_PER_MSEC / 2;
-}
-
 #ifdef CONFIG_SCHED_HRTICK
 
 /*
@@ -1658,8 +1645,6 @@ static inline int hrtick_enabled(struct
 #endif /* CONFIG_SCHED_HRTICK */
 
 #ifdef CONFIG_SMP
-extern void sched_avg_update(struct rq *rq);
-
 #ifndef arch_scale_freq_capacity
 static __always_inline
 unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
@@ -1678,16 +1663,7 @@ unsigned long arch_scale_cpu_capacity(st
 	return SCHED_CAPACITY_SCALE;
 }
 #endif
-
-static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
-{
-	rq->rt_avg += rt_delta * arch_scale_freq_capacity(NULL, cpu_of(rq));
-	sched_avg_update(rq);
-}
-#else
-static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta) { }
-static inline void sched_avg_update(struct rq *rq) { }
-#endif
+#endif /* CONFIG_SMP */
 
 struct rq *__task_rq_lock(struct task_struct *p, struct rq_flags *rf)
 	__acquires(rq->lock);
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -362,13 +362,6 @@ static struct ctl_table kern_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
-	{
-		.procname	= "sched_time_avg_ms",
-		.data		= &sysctl_sched_time_avg,
-		.maxlen		= sizeof(unsigned int),
-		.mode		= 0644,
-		.proc_handler	= proc_dointvec,
-	},
 #ifdef CONFIG_SCHEDSTATS
 	{
 		.procname	= "sched_schedstats",

Vincent Guittot Aug. 8, 2017, 1:56 p.m. | #2
On 7 August 2017 at 18:44, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Aug 04, 2017 at 03:40:21PM +0200, Vincent Guittot wrote:

>

>> There were several comments on v1:

>> - As raised by Peter for v1, if IRQ time is taken into account in

>>   rt_avg, it will not be accounted in rq->clock_task. This means that cfs

>>   utilization is not affected by some extra contributions or decays

>>   because of IRQ.

>

> Right.

>

>> - Regading the sync of rt and cfs utilization, both cfs and rt use the same

>>   rq->clock_task. Then, we have the same issue than cfs regarding blocked value.

>>   The utilization of idle cfs/rt rqs are not updated regularly but only when a

>>   load_balance is triggered (more precisely a call to update_blocked_average).

>>   I'd like to fix this issue for both cfs and rt with a separate patch that

>>   will ensure that utilization (and load) are updated regularly even for

>>   idle CPUs

>

> Yeah, that needs help.

>

>> - One last open question is the location of rt utilization function in fair.c

>>   file. PELT related funtions should probably move in a dedicated pelt.c file.

>>   This would also help to address one comment about having a place to update

>>   metrics of NOHZ idle CPUs. Thought ?

>

> Probably, but I have a bunch of patches lined up changing that code, so

> lets not do that now.


ok. I can rebase and move the code once your patches will be there

>

> In any case, would something like the attached patches make sense? It

> completely replaces rt_avg with separate IRQ,RT and DL tracking.


That would be nice if we can replace rt_avg by something that has the
same dynamic as PELT.

The DL patch looks fine but can't we rely on deadline running
bandwidth to get the figures instead ?

I don't think that IRQ tracking patch is working.
update_irq_load_avg(rq->clock, cpu_of(rq), rq, 1); is called in
update_rq_clock_task() which is never called in irq context. In order
to use PELT for tracking irq and paravirt, we should call
update_irq_load_avg() for every context switch between irq/paravirt
and task which will probably be too heavy. Nevertheless, we can

Because PELT is cpu invariant,  the used value must now be subtracted
to cpu_capacity_orig of the local cpu in scale_rt_capacity,

>

>
Peter Zijlstra Aug. 8, 2017, 2:01 p.m. | #3
On Tue, Aug 08, 2017 at 03:56:26PM +0200, Vincent Guittot wrote:
> 

> I don't think that IRQ tracking patch is working.

> update_irq_load_avg(rq->clock, cpu_of(rq), rq, 1); is called in

> update_rq_clock_task() which is never called in irq context. In order

> to use PELT for tracking irq and paravirt, we should call

> update_irq_load_avg() for every context switch between irq/paravirt

> and task which will probably be too heavy. Nevertheless, we can

> 


Right, realized the same much later yesterday...

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 008c514..50fe0a2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3012,6 +3012,17 @@  __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq)
 			cfs_rq->curr != NULL, cfs_rq);
 }
 
+int update_rt_rq_load_avg(u64 now, int cpu, struct rt_rq *rt_rq, int running)
+{
+	int ret;
+
+	ret = ___update_load_avg(now, cpu, &rt_rq->avg, 0, running, NULL);
+
+
+	return ret;
+}
+
+
 /*
  * Signed add and clamp on underflow.
  *
@@ -3539,6 +3550,11 @@  update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq)
 	return 0;
 }
 
+int update_rt_rq_load_avg(u64 now, int cpu, struct rt_rq *rt_rq, int running)
+{
+	return 0;
+}
+
 #define UPDATE_TG	0x0
 #define SKIP_AGE_LOAD	0x0
 
@@ -6932,6 +6948,10 @@  static void update_blocked_averages(int cpu)
 		if (cfs_rq_is_decayed(cfs_rq))
 			list_del_leaf_cfs_rq(cfs_rq);
 	}
+
+	update_rt_rq_load_avg(rq_clock_task(rq), cpu, &rq->rt, 0);
+
+
 	rq_unlock_irqrestore(rq, &rf);
 }
 
@@ -6991,6 +7011,7 @@  static inline void update_blocked_averages(int cpu)
 	rq_lock_irqsave(rq, &rf);
 	update_rq_clock(rq);
 	update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq, true);
+	update_rt_rq_load_avg(rq_clock_task(rq), cpu, &rq->rt, 0);
 	rq_unlock_irqrestore(rq, &rf);
 }
 
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 45caf93..e72d572 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1534,6 +1534,8 @@  static struct task_struct *_pick_next_task_rt(struct rq *rq)
 	return p;
 }
 
+extern int update_rt_rq_load_avg(u64 now, int cpu, struct rt_rq *rt_rq, int running);
+
 static struct task_struct *
 pick_next_task_rt(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
@@ -1579,6 +1581,10 @@  pick_next_task_rt(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 
 	queue_push_tasks(rq);
 
+	if (p)
+		update_rt_rq_load_avg(rq_clock_task(rq), cpu_of(rq), rt_rq,
+				rq->curr->sched_class == &rt_sched_class);
+
 	return p;
 }
 
@@ -1586,6 +1592,8 @@  static void put_prev_task_rt(struct rq *rq, struct task_struct *p)
 {
 	update_curr_rt(rq);
 
+	update_rt_rq_load_avg(rq_clock_task(rq), cpu_of(rq), &rq->rt, 1);
+
 	/*
 	 * The previous task needs to be made eligible for pushing
 	 * if it is still active
@@ -2368,6 +2376,7 @@  static void task_tick_rt(struct rq *rq, struct task_struct *p, int queued)
 	struct sched_rt_entity *rt_se = &p->rt;
 
 	update_curr_rt(rq);
+	update_rt_rq_load_avg(rq_clock_task(rq), cpu_of(rq), &rq->rt, 1);
 
 	watchdog(rq, p);
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index eeef1a3..e7ee5b0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -524,6 +524,9 @@  struct rt_rq {
 	unsigned long rt_nr_total;
 	int overloaded;
 	struct plist_head pushable_tasks;
+
+	struct sched_avg avg;
+
 #ifdef HAVE_RT_PUSH_IPI
 	int push_flags;
 	int push_cpu;