[v6,03/11] sched/rt: add rt_rq utilization tracking

Message ID 1528459794-13066-4-git-send-email-vincent.guittot@linaro.org
State Superseded
Headers show
Series
  • track CPU utilization
Related show

Commit Message

Vincent Guittot June 8, 2018, 12:09 p.m.
schedutil governor relies on cfs_rq's util_avg to choose the OPP when cfs
tasks are running. When the CPU is overloaded by cfs and rt tasks, cfs tasks
are preempted by rt tasks and in this case util_avg reflects the remaining
capacity but not what cfs want to use. In such case, schedutil can select a
lower OPP whereas the CPU is overloaded. In order to have a more accurate
view of the utilization of the CPU, we track the utilization of rt tasks.

rt_rq uses rq_clock_task and cfs_rq uses cfs_rq_clock_task but they are
the same at the root group level, so the PELT windows of the util_sum are
aligned.

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

---
 kernel/sched/fair.c  | 15 ++++++++++++++-
 kernel/sched/pelt.c  | 22 ++++++++++++++++++++++
 kernel/sched/pelt.h  |  7 +++++++
 kernel/sched/rt.c    | 13 +++++++++++++
 kernel/sched/sched.h |  7 +++++++
 5 files changed, 63 insertions(+), 1 deletion(-)

-- 
2.7.4

Comments

Dietmar Eggemann June 15, 2018, 11:52 a.m. | #1
On 06/08/2018 02:09 PM, Vincent Guittot wrote:
> schedutil governor relies on cfs_rq's util_avg to choose the OPP when cfs

> tasks are running. When the CPU is overloaded by cfs and rt tasks, cfs tasks

> are preempted by rt tasks and in this case util_avg reflects the remaining

> capacity but not what cfs want to use. In such case, schedutil can select a

> lower OPP whereas the CPU is overloaded. In order to have a more accurate

> view of the utilization of the CPU, we track the utilization of rt tasks.

> 

> rt_rq uses rq_clock_task and cfs_rq uses cfs_rq_clock_task but they are

> the same at the root group level, so the PELT windows of the util_sum are

> aligned.

> 

> Cc: Ingo Molnar <mingo@redhat.com>

> Cc: Peter Zijlstra <peterz@infradead.org>

> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>


[...]

;
> diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c

> index 4174582..81c0d7e 100644

> --- a/kernel/sched/pelt.c

> +++ b/kernel/sched/pelt.c

> @@ -307,3 +307,25 @@ int __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq)

>   

>   	return 0;

>   }

> +

> +/*

> + * rt_rq:

> + *

> + *   util_sum = \Sum se->avg.util_sum but se->avg.util_sum is not tracked

> + *   util_sum = cpu_scale * load_sum

> + *   runnable_load_sum = load_sum

> + *

> + */

> +

> +int update_rt_rq_load_avg(u64 now, struct rq *rq, int running)

> +{

> +	if (___update_load_sum(now, rq->cpu, &rq->avg_rt,

> +				running,

> +				running,

> +				running)) {


The patch clearly says that this is about utilization but what happens
to load and runnable load for the rt rq part when you call
___update_load_sum() with load=[0,1] and runnable=[0,1]?

It looks like that the math would require 1024 instead of 1 for load and
runnable so that we would see a load_avg or runnable_load_avg != 0.

1594.075128: bprint: update_rt_rq_load_avg: now=1593937228087 cpu=4 running=1
1594.075129: bprint: update_rt_rq_load_avg: delta=3068 cpu=4 load=1 runnable=1 running=1 scale_freq=1024 scale_cpu=1024 periods=2
1594.075130: bprint: update_rt_rq_load_avg: load_sum=23927 +2879 runnable_load_sum=23927 +2879 util_sum=24506165 +2948096
1594.075130: bprint: update_rt_rq_load_avg: load_avg=0 runnable_load_avg=0 util_avg=513

IMHO, the patch should say whether load and runnable load are supported as well or not. 

[...]
Vincent Guittot June 15, 2018, 12:18 p.m. | #2
Hi Dietmar,

On 15 June 2018 at 13:52, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> On 06/08/2018 02:09 PM, Vincent Guittot wrote:

>> schedutil governor relies on cfs_rq's util_avg to choose the OPP when cfs

>> tasks are running. When the CPU is overloaded by cfs and rt tasks, cfs tasks

>> are preempted by rt tasks and in this case util_avg reflects the remaining

>> capacity but not what cfs want to use. In such case, schedutil can select a

>> lower OPP whereas the CPU is overloaded. In order to have a more accurate

>> view of the utilization of the CPU, we track the utilization of rt tasks.

>>

>> rt_rq uses rq_clock_task and cfs_rq uses cfs_rq_clock_task but they are

>> the same at the root group level, so the PELT windows of the util_sum are

>> aligned.

>>

>> Cc: Ingo Molnar <mingo@redhat.com>

>> Cc: Peter Zijlstra <peterz@infradead.org>

>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

>

> [...]

>

> ;

>> diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c

>> index 4174582..81c0d7e 100644

>> --- a/kernel/sched/pelt.c

>> +++ b/kernel/sched/pelt.c

>> @@ -307,3 +307,25 @@ int __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq)

>>

>>       return 0;

>>   }

>> +

>> +/*

>> + * rt_rq:

>> + *

>> + *   util_sum = \Sum se->avg.util_sum but se->avg.util_sum is not tracked

>> + *   util_sum = cpu_scale * load_sum

>> + *   runnable_load_sum = load_sum

>> + *

>> + */

>> +

>> +int update_rt_rq_load_avg(u64 now, struct rq *rq, int running)

>> +{

>> +     if (___update_load_sum(now, rq->cpu, &rq->avg_rt,

>> +                             running,

>> +                             running,

>> +                             running)) {

>

> The patch clearly says that this is about utilization but what happens

> to load and runnable load for the rt rq part when you call

> ___update_load_sum() with load=[0,1] and runnable=[0,1]?


I would say the same than what happens for se which has
___update_load_sum(now, cpu, &se->avg, !!se->on_rq, !!se->on_rq,
cfs_rq->curr == se))

>

> It looks like that the math would require 1024 instead of 1 for load and

> runnable so that we would see a load_avg or runnable_load_avg != 0.


why does it require 1024 ? the min  weight of a task is 15 and the min
share of a sched group is 2. AFAICT, there is no requirement mainly
because we are not using them as they will not give any additional
information compare to util_avg

>

> 1594.075128: bprint: update_rt_rq_load_avg: now=1593937228087 cpu=4 running=1

> 1594.075129: bprint: update_rt_rq_load_avg: delta=3068 cpu=4 load=1 runnable=1 running=1 scale_freq=1024 scale_cpu=1024 periods=2

> 1594.075130: bprint: update_rt_rq_load_avg: load_sum=23927 +2879 runnable_load_sum=23927 +2879 util_sum=24506165 +2948096

> 1594.075130: bprint: update_rt_rq_load_avg: load_avg=0 runnable_load_avg=0 util_avg=513

>

> IMHO, the patch should say whether load and runnable load are supported as well or not.


Although it is stated that we track only utilization , i can probably
mentioned clearly that load_avg and runnable_load_avg are useless

Vincent
>

> [...]
Dietmar Eggemann June 15, 2018, 2:55 p.m. | #3
On 06/15/2018 02:18 PM, Vincent Guittot wrote:
> Hi Dietmar,

> 

> On 15 June 2018 at 13:52, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:

>> On 06/08/2018 02:09 PM, Vincent Guittot wrote:

>>> schedutil governor relies on cfs_rq's util_avg to choose the OPP when cfs

>>> tasks are running. When the CPU is overloaded by cfs and rt tasks, cfs tasks

>>> are preempted by rt tasks and in this case util_avg reflects the remaining

>>> capacity but not what cfs want to use. In such case, schedutil can select a

>>> lower OPP whereas the CPU is overloaded. In order to have a more accurate

>>> view of the utilization of the CPU, we track the utilization of rt tasks.

>>>

>>> rt_rq uses rq_clock_task and cfs_rq uses cfs_rq_clock_task but they are

>>> the same at the root group level, so the PELT windows of the util_sum are

>>> aligned.

>>>

>>> Cc: Ingo Molnar <mingo@redhat.com>

>>> Cc: Peter Zijlstra <peterz@infradead.org>

>>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

>>

>> [...]

>>

>> ;

>>> diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c

>>> index 4174582..81c0d7e 100644

>>> --- a/kernel/sched/pelt.c

>>> +++ b/kernel/sched/pelt.c

>>> @@ -307,3 +307,25 @@ int __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq)

>>>

>>>        return 0;

>>>    }

>>> +

>>> +/*

>>> + * rt_rq:

>>> + *

>>> + *   util_sum = \Sum se->avg.util_sum but se->avg.util_sum is not tracked

>>> + *   util_sum = cpu_scale * load_sum

>>> + *   runnable_load_sum = load_sum

>>> + *

>>> + */

>>> +

>>> +int update_rt_rq_load_avg(u64 now, struct rq *rq, int running)

>>> +{

>>> +     if (___update_load_sum(now, rq->cpu, &rq->avg_rt,

>>> +                             running,

>>> +                             running,

>>> +                             running)) {

>>

>> The patch clearly says that this is about utilization but what happens

>> to load and runnable load for the rt rq part when you call

>> ___update_load_sum() with load=[0,1] and runnable=[0,1]?

> 

> I would say the same than what happens for se which has

> ___update_load_sum(now, cpu, &se->avg, !!se->on_rq, !!se->on_rq,

> cfs_rq->curr == se))

>


That's correct but I was referring to the results of the PELT math operations
for these cpu related entities. With 1024 for load and runnable you get the
same avg for all three of them:

295.879574: bprint: update_rt_rq_load_avg: now=295694598492 cpu=4 running=1
295.879575: bprint: update_rt_rq_load_avg: delta=4448 cpu=4 load=1024 runnable=1024 running=1 scale_freq=391 scale_cpu=1024 periods=4
295.879577: bprint: update_rt_rq_load_avg: load_sum=18398068 +1451008 runnable_load_sum=18398068 +1451008 util_sum=18398068 +1451008
295.879578: bprint: update_rt_rq_load_avg: load_avg=390 runnable_load_avg=390 util_avg=390

Which is meaningless since for load and runnable load, you would have to have
different call points.

>> It looks like that the math would require 1024 instead of 1 for load and

>> runnable so that we would see a load_avg or runnable_load_avg != 0.

> 

> why does it require 1024 ? the min  weight of a task is 15 and the min

> share of a sched group is 2. AFAICT, there is no requirement mainly

> because we are not using them as they will not give any additional

> information compare to util_avg


Agreed.

>> 1594.075128: bprint: update_rt_rq_load_avg: now=1593937228087 cpu=4 running=1

>> 1594.075129: bprint: update_rt_rq_load_avg: delta=3068 cpu=4 load=1 runnable=1 running=1 scale_freq=1024 scale_cpu=1024 periods=2

>> 1594.075130: bprint: update_rt_rq_load_avg: load_sum=23927 +2879 runnable_load_sum=23927 +2879 util_sum=24506165 +2948096

>> 1594.075130: bprint: update_rt_rq_load_avg: load_avg=0 runnable_load_avg=0 util_avg=513

>>

>> IMHO, the patch should say whether load and runnable load are supported as well or not.

> 

> Although it is stated that we track only utilization , i can probably

> mentioned clearly that load_avg and runnable_load_avg are useless


That would be helpful. Reading Peter's answer https://lkml.org/lkml/2018/6/4/757
made me wondering ...
 
[...]
Peter Zijlstra June 21, 2018, 6:50 p.m. | #4
On Fri, Jun 08, 2018 at 02:09:46PM +0200, Vincent Guittot wrote:
> +int update_rt_rq_load_avg(u64 now, struct rq *rq, int running)

> +{

> +	if (___update_load_sum(now, rq->cpu, &rq->avg_rt,

> +				running,

> +				running,

> +				running)) {


For code like this I wish C had grown named arguments for calls, just
like it has named initializers.

Something like:

	___update_load_sum(now, rq->cpu, &rq->avg_rt,
			   .load = running, .runnable = running,
			   .running = running)

would be so much easier to read... a well, maybe in another 30 years or
so.

> +		___update_load_avg(&rq->avg_rt, 1, 1);

> +		return 1;

> +	}

> +

> +	return 0;

> +}

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6390c66..e471fae 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7290,6 +7290,14 @@  static inline bool cfs_rq_has_blocked(struct cfs_rq *cfs_rq)
 	return false;
 }
 
+static inline bool rt_rq_has_blocked(struct rq *rq)
+{
+	if (READ_ONCE(rq->avg_rt.util_avg))
+		return true;
+
+	return false;
+}
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 
 static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
@@ -7349,6 +7357,10 @@  static void update_blocked_averages(int cpu)
 		if (cfs_rq_has_blocked(cfs_rq))
 			done = false;
 	}
+	update_rt_rq_load_avg(rq_clock_task(rq), rq, 0);
+	/* Don't need periodic decay once load/util_avg are null */
+	if (rt_rq_has_blocked(rq))
+		done = false;
 
 #ifdef CONFIG_NO_HZ_COMMON
 	rq->last_blocked_load_update_tick = jiffies;
@@ -7414,9 +7426,10 @@  static inline void update_blocked_averages(int cpu)
 	rq_lock_irqsave(rq, &rf);
 	update_rq_clock(rq);
 	update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq);
+	update_rt_rq_load_avg(rq_clock_task(rq), rq, 0);
 #ifdef CONFIG_NO_HZ_COMMON
 	rq->last_blocked_load_update_tick = jiffies;
-	if (!cfs_rq_has_blocked(cfs_rq))
+	if (!cfs_rq_has_blocked(cfs_rq) && !rt_rq_has_blocked(rq))
 		rq->has_blocked_load = 0;
 #endif
 	rq_unlock_irqrestore(rq, &rf);
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index 4174582..81c0d7e 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -307,3 +307,25 @@  int __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq)
 
 	return 0;
 }
+
+/*
+ * rt_rq:
+ *
+ *   util_sum = \Sum se->avg.util_sum but se->avg.util_sum is not tracked
+ *   util_sum = cpu_scale * load_sum
+ *   runnable_load_sum = load_sum
+ *
+ */
+
+int update_rt_rq_load_avg(u64 now, struct rq *rq, int running)
+{
+	if (___update_load_sum(now, rq->cpu, &rq->avg_rt,
+				running,
+				running,
+				running)) {
+		___update_load_avg(&rq->avg_rt, 1, 1);
+		return 1;
+	}
+
+	return 0;
+}
diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
index 9cac73e..b2983b7 100644
--- a/kernel/sched/pelt.h
+++ b/kernel/sched/pelt.h
@@ -3,6 +3,7 @@ 
 int __update_load_avg_blocked_se(u64 now, int cpu, struct sched_entity *se);
 int __update_load_avg_se(u64 now, int cpu, struct cfs_rq *cfs_rq, struct sched_entity *se);
 int __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq);
+int update_rt_rq_load_avg(u64 now, struct rq *rq, int running);
 
 /*
  * When a task is dequeued, its estimated utilization should not be update if
@@ -38,6 +39,12 @@  update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
 	return 0;
 }
 
+static inline int
+update_rt_rq_load_avg(u64 now, struct rq *rq, int running)
+{
+	return 0;
+}
+
 #endif
 
 
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index ef3c4e6..e8c08a8 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -5,6 +5,8 @@ 
  */
 #include "sched.h"
 
+#include "pelt.h"
+
 int sched_rr_timeslice = RR_TIMESLICE;
 int sysctl_sched_rr_timeslice = (MSEC_PER_SEC / HZ) * RR_TIMESLICE;
 
@@ -1572,6 +1574,14 @@  pick_next_task_rt(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 
 	rt_queue_push_tasks(rq);
 
+	/*
+	 * If prev task was rt, put_prev_task() has already updated the
+	 * utilization. We only care of the case where we start to schedule a
+	 * rt task
+	 */
+	if (rq->curr->sched_class != &rt_sched_class)
+		update_rt_rq_load_avg(rq_clock_task(rq), rq, 0);
+
 	return p;
 }
 
@@ -1579,6 +1589,8 @@  static void put_prev_task_rt(struct rq *rq, struct task_struct *p)
 {
 	update_curr_rt(rq);
 
+	update_rt_rq_load_avg(rq_clock_task(rq), rq, 1);
+
 	/*
 	 * The previous task needs to be made eligible for pushing
 	 * if it is still active
@@ -2308,6 +2320,7 @@  static void task_tick_rt(struct rq *rq, struct task_struct *p, int queued)
 	struct sched_rt_entity *rt_se = &p->rt;
 
 	update_curr_rt(rq);
+	update_rt_rq_load_avg(rq_clock_task(rq), rq, 1);
 
 	watchdog(rq, p);
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 757a3ee..7a16de9 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -592,6 +592,7 @@  struct rt_rq {
 	unsigned long		rt_nr_total;
 	int			overloaded;
 	struct plist_head	pushable_tasks;
+
 #endif /* CONFIG_SMP */
 	int			rt_queued;
 
@@ -847,6 +848,7 @@  struct rq {
 
 	u64			rt_avg;
 	u64			age_stamp;
+	struct sched_avg	avg_rt;
 	u64			idle_stamp;
 	u64			avg_idle;
 
@@ -2205,4 +2207,9 @@  static inline unsigned long cpu_util_cfs(struct rq *rq)
 
 	return util;
 }
+
+static inline unsigned long cpu_util_rt(struct rq *rq)
+{
+	return rq->avg_rt.util_avg;
+}
 #endif