[v7,00/11] track CPU utilization

Message ID 1530200714-4504-1-git-send-email-vincent.guittot@linaro.org

Vincent Guittot June 28, 2018, 3:45 p.m. UTC
This patchset initially tracked only the utilization of the RT rq. During
the OSPM summit, we discussed the opportunity to extend it in order to get
an estimate of the utilization of the whole CPU.

- Patch 1 moves the pelt code into a dedicated file and removes some blank lines
  
- Patches 2-3 add utilization tracking for rt_rq.

When both cfs and rt tasks compete to run on a CPU, we can see some frequency
drops with the schedutil governor. In that case, the cfs_rq's utilization no
longer reflects the utilization of cfs tasks but only the remaining part that
is not used by rt tasks. We should monitor the stolen utilization and take
it into account when selecting the OPP. This patchset doesn't change the OPP
selection policy for RT tasks but only for CFS tasks.

An rt-app use case which creates an always-running cfs thread and an rt
thread that wakes up periodically, with both threads pinned on the same CPU,
shows a lot of frequency switches of the CPU whereas the CPU never goes idle
during the test. I can share the json file that I used for the test if
someone is interested.

For a 15 seconds long test on a hikey 6220 (octo core cortex A53 platform),
the cpufreq statistics output (stats are reset just before the test):
$ cat /sys/devices/system/cpu/cpufreq/policy0/stats/total_trans
without patchset : 1230
with patchset : 14

If we replace the cfs thread of rt-app by a sysbench cpu test, we can see
performance improvements:

- Without patchset :
Test execution summary:
    total time:                          15.0009s
    total number of events:              4903
    total time taken by event execution: 14.9972
    per-request statistics:
         min:                                  1.23ms
         avg:                                  3.06ms
         max:                                 13.16ms
         approx.  95 percentile:              12.73ms

Threads fairness:
    events (avg/stddev):           4903.0000/0.00
    execution time (avg/stddev):   14.9972/0.00

- With patchset:
Test execution summary:
    total time:                          15.0014s
    total number of events:              7694
    total time taken by event execution: 14.9979
    per-request statistics:
         min:                                  1.23ms
         avg:                                  1.95ms
         max:                                 10.49ms
         approx.  95 percentile:              10.39ms

Threads fairness:
    events (avg/stddev):           7694.0000/0.00
    execution time (avg/stddev):   14.9979/0.00

The performance improvement is 56% for this use case.

- Patches 4-5 add utilization tracking for dl_rq in order to solve a similar
  problem as with rt_rq. Nevertheless, we keep using the dl bandwidth as the
  default level of requirement for dl tasks. The dl utilization is used to
  check that the CPU is not overloaded, which is not always reflected when
  using the dl bandwidth

- Patches 6-7 add utilization tracking for interrupt and use it to select the
  OPP. A test with iperf on hikey 6220 gives:
    w/o patchset	    w/ patchset
    Tx 276 Mbits/sec        304 Mbits/sec +10%
    Rx 299 Mbits/sec        328 Mbits/sec +09%
    
    8 iterations of iperf -c server_address -r -t 5
    stdev is lower than 1%
    Only the WFI idle state is enabled (the shallowest arm idle state)

- Patch 8 merges sugov_aggregate_util and sugov_get_util as proposed by Peter

- Patch 9 uses the rt, dl and interrupt utilization in scale_rt_capacity()
  and removes the use of sched_rt_avg_update.

- Patch 10 removes the unused sched_avg_update code

- Patch 11 removes the unused sched_time_avg_ms

Change since v6:
- add more comments about the load tracking metrics
- merge sugov_aggregate_util and sugov_get_util

Change since v4:
- add support of periodic update of blocked utilization
- rebase on latest tip/sched/core

Change since v3:
- add support of periodic update of blocked utilization
- rebase on latest tip/sched/core

Change since v2:
- move pelt code into a dedicated pelt.c file
- rebase on load tracking changes

Change since v1:
- Only a rebase. I have addressed the comments on previous version in
  patch 1/2


Vincent Guittot (11):
  sched/pelt: Move pelt related code in a dedicated file
  sched/rt: add rt_rq utilization tracking
  cpufreq/schedutil: use rt utilization tracking
  sched/dl: add dl_rq utilization tracking
  cpufreq/schedutil: use dl utilization tracking
  sched/irq: add irq utilization tracking
  cpufreq/schedutil: take into account interrupt
  sched: schedutil: remove sugov_aggregate_util()
  sched: use pelt for scale_rt_capacity()
  sched: remove rt_avg code
  proc/sched: remove unused sched_time_avg_ms

 include/linux/sched/sysctl.h     |   1 -
 kernel/sched/Makefile            |   2 +-
 kernel/sched/core.c              |  38 +---
 kernel/sched/cpufreq_schedutil.c |  65 ++++---
 kernel/sched/deadline.c          |   8 +-
 kernel/sched/fair.c              | 403 +++++----------------------------------
 kernel/sched/pelt.c              | 399 ++++++++++++++++++++++++++++++++++++++
 kernel/sched/pelt.h              |  72 +++++++
 kernel/sched/rt.c                |  15 +-
 kernel/sched/sched.h             |  68 +++++--
 kernel/sysctl.c                  |   8 -
 11 files changed, 632 insertions(+), 447 deletions(-)
 create mode 100644 kernel/sched/pelt.c
 create mode 100644 kernel/sched/pelt.h

-- 
2.7.4

Comments

Peter Zijlstra July 5, 2018, 12:36 p.m. UTC | #1
On Thu, Jun 28, 2018 at 05:45:03PM +0200, Vincent Guittot wrote:
> Vincent Guittot (11):
>   sched/pelt: Move pelt related code in a dedicated file
>   sched/rt: add rt_rq utilization tracking
>   cpufreq/schedutil: use rt utilization tracking
>   sched/dl: add dl_rq utilization tracking
>   cpufreq/schedutil: use dl utilization tracking
>   sched/irq: add irq utilization tracking
>   cpufreq/schedutil: take into account interrupt
>   sched: schedutil: remove sugov_aggregate_util()
>   sched: use pelt for scale_rt_capacity()
>   sched: remove rt_avg code
>   proc/sched: remove unused sched_time_avg_ms
>
>  include/linux/sched/sysctl.h     |   1 -
>  kernel/sched/Makefile            |   2 +-
>  kernel/sched/core.c              |  38 +---
>  kernel/sched/cpufreq_schedutil.c |  65 ++++---
>  kernel/sched/deadline.c          |   8 +-
>  kernel/sched/fair.c              | 403 +++++----------------------------------
>  kernel/sched/pelt.c              | 399 ++++++++++++++++++++++++++++++++++++++
>  kernel/sched/pelt.h              |  72 +++++++
>  kernel/sched/rt.c                |  15 +-
>  kernel/sched/sched.h             |  68 +++++--
>  kernel/sysctl.c                  |   8 -
>  11 files changed, 632 insertions(+), 447 deletions(-)
>  create mode 100644 kernel/sched/pelt.c
>  create mode 100644 kernel/sched/pelt.h

OK, this looks good I suppose. Rafael, are you OK with me taking these?

I have the below on top because I once again forgot how it all worked;
does this work for you Vincent?

---
Subject: sched/cpufreq: Clarify sugov_get_util()

Add a few comments (hopefully) clarifying some of the magic in
sugov_get_util().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

---
 cpufreq_schedutil.c |   69 ++++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 51 insertions(+), 18 deletions(-)

--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -177,6 +177,26 @@ static unsigned int get_next_freq(struct
 	return cpufreq_driver_resolve_freq(policy, freq);
 }
 
+/*
+ * This function computes an effective utilization for the given CPU, to be
+ * used for frequency selection given the linear relation: f = u * f_max.
+ *
+ * The scheduler tracks the following metrics:
+ *
+ *   cpu_util_{cfs,rt,dl,irq}()
+ *   cpu_bw_dl()
+ *
+ * Where the cfs,rt and dl util numbers are tracked with the same metric and
+ * synchronized windows and are thus directly comparable.
+ *
+ * The cfs,rt,dl utilization are the running times measured with rq->clock_task
+ * which excludes things like IRQ and steal-time. These latter are then accrued in
+ * the irq utilization.
+ *
+ * The DL bandwidth number otoh is not a measured meric but a value computed
+ * based on the task model parameters and gives the minimal u required to meet
+ * deadlines.
+ */
 static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu)
 {
 	struct rq *rq = cpu_rq(sg_cpu->cpu);
@@ -188,26 +208,50 @@ static unsigned long sugov_get_util(stru
 	if (rt_rq_is_runnable(&rq->rt))
 		return max;
 
+	/*
+	 * Early check to see if IRQ/steal time saturates the CPU, can be
+	 * because of inaccuracies in how we track these -- see
+	 * update_irq_load_avg().
+	 */
 	irq = cpu_util_irq(rq);
-
 	if (unlikely(irq >= max))
 		return max;
 
-	/* Sum rq utilization */
+	/*
+	 * Because the time spend on RT/DL tasks is visible as 'lost' time to
+	 * CFS tasks and we use the same metric to track the effective
+	 * utilization (PELT windows are synchronized) we can directly add them
+	 * to obtain the CPU's actual utilization.
+	 */
 	util = cpu_util_cfs(rq);
 	util += cpu_util_rt(rq);
 
 	/*
-	 * Interrupt time is not seen by rqs utilization nso we can compare
-	 * them with the CPU capacity
+	 * We do not make cpu_util_dl() a permanent part of this sum because we
+	 * want to use cpu_bw_dl() later on, but we need to check if the
+	 * CFS+RT+DL sum is saturated (ie. no idle time) such that we select
+	 * f_max when there is no idle time.
+	 *
+	 * NOTE: numerical errors or stop class might cause us to not quite hit
+	 * saturation when we should -- something for later.
 	 */
 	if ((util + cpu_util_dl(rq)) >= max)
 		return max;
 
 	/*
-	 * As there is still idle time on the CPU, we need to compute the
-	 * utilization level of the CPU.
+	 * There is still idle time; further improve the number by using the
+	 * irq metric. Because IRQ/steal time is hidden from the task clock we
+	 * need to scale the task numbers:
 	 *
+	 *              1 - irq
+	 *   U' = irq + ------- * U
+	 *                max
+	 */
+	util *= (max - irq);
+	util /= max;
+	util += irq;
+
+	/*
 	 * Bandwidth required by DEADLINE must always be granted while, for
 	 * FAIR and RT, we use blocked utilization of IDLE CPUs as a mechanism
 	 * to gracefully reduce the frequency when no tasks show up for longer
@@ -217,18 +261,7 @@ static unsigned long sugov_get_util(stru
 	 * util_cfs + util_dl as requested freq. However, cpufreq is not yet
 	 * ready for such an interface. So, we only do the latter for now.
 	 */
-
-	/* Weight rqs utilization to normal context window */
-	util *= (max - irq);
-	util /= max;
-
-	/* Add interrupt utilization */
-	util += irq;
-
-	/* Add DL bandwidth requirement */
-	util += sg_cpu->bw_dl;
-
-	return min(max, util);
+	return min(max, util + sg_cpu->bw_dl);
 }
 
 /**
Vincent Guittot July 5, 2018, 1:32 p.m. UTC | #2
Hi Peter

On Thu, 5 Jul 2018 at 14:36, Peter Zijlstra <peterz@infradead.org> wrote:
>
> OK, this looks good I suppose. Rafael, are you OK with me taking these?
>
> I have the below on top because I once again forgot how it all worked;
> does this work for you Vincent?

Yes looks good to me

Thanks

>
> ---
> Subject: sched/cpufreq: Clarify sugov_get_util()
>
> Add a few comments (hopefully) clarifying some of the magic in
> sugov_get_util().
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  cpufreq_schedutil.c |   69 ++++++++++++++++++++++++++++++++++++++--------------
>  1 file changed, 51 insertions(+), 18 deletions(-)
>
Viresh Kumar July 6, 2018, 6:05 a.m. UTC | #3
On 05-07-18, 14:36, Peter Zijlstra wrote:
> Subject: sched/cpufreq: Clarify sugov_get_util()
>
> Add a few comments (hopefully) clarifying some of the magic in
> sugov_get_util().
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  cpufreq_schedutil.c |   69 ++++++++++++++++++++++++++++++++++++++--------------
>  1 file changed, 51 insertions(+), 18 deletions(-)
>
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -177,6 +177,26 @@ static unsigned int get_next_freq(struct
>  	return cpufreq_driver_resolve_freq(policy, freq);
>  }
>  
> +/*
> + * This function computes an effective utilization for the given CPU, to be
> + * used for frequency selection given the linear relation: f = u * f_max.
> + *
> + * The scheduler tracks the following metrics:
> + *
> + *   cpu_util_{cfs,rt,dl,irq}()
> + *   cpu_bw_dl()
> + *
> + * Where the cfs,rt and dl util numbers are tracked with the same metric and
> + * synchronized windows and are thus directly comparable.
> + *
> + * The cfs,rt,dl utilization are the running times measured with rq->clock_task
> + * which excludes things like IRQ and steal-time. These latter are then accrued in
> + * the irq utilization.
> + *
> + * The DL bandwidth number otoh is not a measured meric but a value computed

                                                     metric

> + * based on the task model parameters and gives the minimal u required to meet

                                                               u ?

> + * deadlines.
> + */
>  static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu)
>  {
>  	struct rq *rq = cpu_rq(sg_cpu->cpu);
> @@ -188,26 +208,50 @@ static unsigned long sugov_get_util(stru
>  	if (rt_rq_is_runnable(&rq->rt))
>  		return max;
>  
> +	/*
> +	 * Early check to see if IRQ/steal time saturates the CPU, can be
> +	 * because of inaccuracies in how we track these -- see
> +	 * update_irq_load_avg().
> +	 */
>  	irq = cpu_util_irq(rq);
> -
>  	if (unlikely(irq >= max))
>  		return max;
>  
> -	/* Sum rq utilization */
> +	/*
> +	 * Because the time spend on RT/DL tasks is visible as 'lost' time to
> +	 * CFS tasks and we use the same metric to track the effective
> +	 * utilization (PELT windows are synchronized) we can directly add them
> +	 * to obtain the CPU's actual utilization.
> +	 */
>  	util = cpu_util_cfs(rq);
>  	util += cpu_util_rt(rq);
>  
>  	/*
> -	 * Interrupt time is not seen by rqs utilization nso we can compare
> -	 * them with the CPU capacity
> +	 * We do not make cpu_util_dl() a permanent part of this sum because we
> +	 * want to use cpu_bw_dl() later on, but we need to check if the
> +	 * CFS+RT+DL sum is saturated (ie. no idle time) such that we select
> +	 * f_max when there is no idle time.
> +	 *
> +	 * NOTE: numerical errors or stop class might cause us to not quite hit
> +	 * saturation when we should -- something for later.
>  	 */
>  	if ((util + cpu_util_dl(rq)) >= max)
>  		return max;
>  
>  	/*
> -	 * As there is still idle time on the CPU, we need to compute the
> -	 * utilization level of the CPU.
> +	 * There is still idle time; further improve the number by using the
> +	 * irq metric. Because IRQ/steal time is hidden from the task clock we
> +	 * need to scale the task numbers:
>  	 *
> +	 *              1 - irq
> +	 *   U' = irq + ------- * U
> +	 *                max
> +	 */
> +	util *= (max - irq);
> +	util /= max;
> +	util += irq;
> +
> +	/*
>  	 * Bandwidth required by DEADLINE must always be granted while, for
>  	 * FAIR and RT, we use blocked utilization of IDLE CPUs as a mechanism
>  	 * to gracefully reduce the frequency when no tasks show up for longer
> @@ -217,18 +261,7 @@ static unsigned long sugov_get_util(stru
>  	 * util_cfs + util_dl as requested freq. However, cpufreq is not yet
>  	 * ready for such an interface. So, we only do the latter for now.
>  	 */
> -
> -	/* Weight rqs utilization to normal context window */
> -	util *= (max - irq);
> -	util /= max;
> -
> -	/* Add interrupt utilization */
> -	util += irq;
> -
> -	/* Add DL bandwidth requirement */
> -	util += sg_cpu->bw_dl;
> -
> -	return min(max, util);
> +	return min(max, util + sg_cpu->bw_dl);
>  }
>  

Acked-by: Viresh Kumar <viresh.kumar@linaro.org>


-- 
viresh
Peter Zijlstra July 6, 2018, 9:18 a.m. UTC | #4
On Fri, Jul 06, 2018 at 11:35:22AM +0530, Viresh Kumar wrote:
> On 05-07-18, 14:36, Peter Zijlstra wrote:
> > +/*
> > + * This function computes an effective utilization for the given CPU, to be
> > + * used for frequency selection given the linear relation: f = u * f_max.
> > + *
> > + * The scheduler tracks the following metrics:
> > + *
> > + *   cpu_util_{cfs,rt,dl,irq}()
> > + *   cpu_bw_dl()
> > + *
> > + * Where the cfs,rt and dl util numbers are tracked with the same metric and
> > + * synchronized windows and are thus directly comparable.
> > + *
> > + * The cfs,rt,dl utilization are the running times measured with rq->clock_task
> > + * which excludes things like IRQ and steal-time. These latter are then accrued in
> > + * the irq utilization.
> > + *
> > + * The DL bandwidth number otoh is not a measured meric but a value computed
>
>                                                      metric

Indeed, fixed.

> > + * based on the task model parameters and gives the minimal u required to meet
>
>                                                                u ?

utilization, but for lazy people :-) I'll use the whole word.

> > + * deadlines.
> > + */