[09/11] sched: use pelt for scale_rt_capacity()

Message ID 1530200714-4504-10-git-send-email-vincent.guittot@linaro.org
State New
Series
  • track CPU utilization

Commit Message

Vincent Guittot June 28, 2018, 3:45 p.m.
The utilization of the CPU by rt, dl and interrupts is now tracked with
PELT, so we can use these metrics instead of rt_avg to evaluate the remaining
capacity available for the cfs class.

scale_rt_capacity() behavior has changed: it now returns the remaining
capacity available for cfs instead of a scaling factor, because rt, dl and
interrupts now provide absolute utilization values.

The same formula as schedutil is used:
  irq util_avg + (1 - irq util_avg / max capacity) * \Sum rq util_avg
but the implementation is different because it doesn't return the same value
and doesn't benefit from the same optimizations.

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

---
 kernel/sched/deadline.c |  2 --
 kernel/sched/fair.c     | 41 +++++++++++++++++++----------------------
 kernel/sched/pelt.c     |  2 +-
 kernel/sched/rt.c       |  2 --
 4 files changed, 20 insertions(+), 27 deletions(-)

-- 
2.7.4

Comments

Ingo Molnar July 15, 2018, 10:15 p.m. | #1
* Vincent Guittot <vincent.guittot@linaro.org> wrote:

> The utilization of the CPU by rt, dl and interrupts are now tracked with
> PELT so we can use these metrics instead of rt_avg to evaluate the remaining
> capacity available for cfs class.
> 
> scale_rt_capacity() behavior has been changed and now returns the remaining
> capacity available for cfs instead of a scaling factor because rt, dl and
> interrupt provide now absolute utilization value.
> 
> The same formula as schedutil is used:
>   irq util_avg + (1 - irq util_avg / max capacity ) * \Sum rq util_avg
> but the implementation is different because it doesn't return the same value
> and doesn't benefit of the same optimization
> 
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>  kernel/sched/deadline.c |  2 --
>  kernel/sched/fair.c     | 41 +++++++++++++++++++----------------------
>  kernel/sched/pelt.c     |  2 +-
>  kernel/sched/rt.c       |  2 --
>  4 files changed, 20 insertions(+), 27 deletions(-)

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d2758e3..ce0dcbf 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7550,39 +7550,36 @@ static inline int get_sd_load_idx(struct sched_domain *sd,
>  static unsigned long scale_rt_capacity(int cpu)
>  {
>  	struct rq *rq = cpu_rq(cpu);
> -	u64 total, used, age_stamp, avg;
> -	s64 delta;
> -
> -	/*
> -	 * Since we're reading these variables without serialization make sure
> -	 * we read them once before doing sanity checks on them.
> -	 */
> -	age_stamp = READ_ONCE(rq->age_stamp);
> -	avg = READ_ONCE(rq->rt_avg);
> -	delta = __rq_clock_broken(rq) - age_stamp;
> +	unsigned long max = arch_scale_cpu_capacity(NULL, cpu);
> +	unsigned long used, irq, free;
>  
> -	if (unlikely(delta < 0))
> -		delta = 0;
> +#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
> +	irq = READ_ONCE(rq->avg_irq.util_avg);
>  
> -	total = sched_avg_period() + delta;
> +	if (unlikely(irq >= max))
> +		return 1;
> +#endif

Note that 'irq' is unused outside that macro block, resulting in a new warning on 
defconfig builds:

 CC      kernel/sched/fair.o
 kernel/sched/fair.c: In function ‘scale_rt_capacity’:
 kernel/sched/fair.c:7553:22: warning: unused variable ‘irq’ [-Wunused-variable]
   unsigned long used, irq, free;
                       ^~~

I have applied the delta fix below for simplicity, but what we really want is a 
cleanup of that function to eliminate the #ifdefs. One solution would be to factor 
out the 'irq' utilization value into a helper inline, and double check that if the 
configs are off the compiler does the right thing and eliminates this identity 
transformation for the irq==0 case:

        free *= (max - irq);
        free /= max;

If the compiler refuses to optimize this away (due to the zero and overflow 
cases), try to find something more clever?

Thanks,

	Ingo

 kernel/sched/fair.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e3221db0511a..d5f7d521e448 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7550,7 +7550,10 @@ static unsigned long scale_rt_capacity(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
 	unsigned long max = arch_scale_cpu_capacity(NULL, cpu);
-	unsigned long used, irq, free;
+	unsigned long used, free;
+#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
+	unsigned long irq;
+#endif
 
 #if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
 	irq = READ_ONCE(rq->avg_irq.util_avg);
Joe Perches July 15, 2018, 10:46 p.m. | #2
On Mon, 2018-07-16 at 00:15 +0200, Ingo Molnar wrote:
> * Vincent Guittot <vincent.guittot@linaro.org> wrote:
> 
> > The utilization of the CPU by rt, dl and interrupts are now tracked with
> > PELT so we can use these metrics instead of rt_avg to evaluate the remaining
> > capacity available for cfs class.
> > 
> > scale_rt_capacity() behavior has been changed and now returns the remaining
> > capacity available for cfs instead of a scaling factor because rt, dl and
> > interrupt provide now absolute utilization value.
> > 
> > The same formula as schedutil is used:
> >   irq util_avg + (1 - irq util_avg / max capacity ) * \Sum rq util_avg
> > but the implementation is different because it doesn't return the same value
> > and doesn't benefit of the same optimization
[]
> I have applied the delta fix below for simplicity, but what we really want is a 
> cleanup of that function to eliminate the #ifdefs. One solution would be to factor 
> out the 'irq' utilization value into a helper inline, and double check that if the 
> configs are off the compiler does the right thing and eliminates this identity 
> transformation for the irq==0 case:
[]
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
[]
> @@ -7550,7 +7550,10 @@ static unsigned long scale_rt_capacity(int cpu)
>  {
>  	struct rq *rq = cpu_rq(cpu);
>  	unsigned long max = arch_scale_cpu_capacity(NULL, cpu);
> -	unsigned long used, irq, free;
> +	unsigned long used, free;
> +#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
> +	unsigned long irq;
> +#endif
>  
>  #if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)

Perhaps combine these two #if defined blocks into a single block.

>  	irq = READ_ONCE(rq->avg_irq.util_avg);
Vincent Guittot July 16, 2018, 11:24 a.m. | #3
Hi Ingo,

On Mon, 16 Jul 2018 at 00:15, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Vincent Guittot <vincent.guittot@linaro.org> wrote:
>
> > @@ -7550,39 +7550,36 @@ static inline int get_sd_load_idx(struct sched_domain *sd,
> >  static unsigned long scale_rt_capacity(int cpu)
> >  {
> >       struct rq *rq = cpu_rq(cpu);
> > -     u64 total, used, age_stamp, avg;
> > -     s64 delta;
> > -
> > -     /*
> > -      * Since we're reading these variables without serialization make sure
> > -      * we read them once before doing sanity checks on them.
> > -      */
> > -     age_stamp = READ_ONCE(rq->age_stamp);
> > -     avg = READ_ONCE(rq->rt_avg);
> > -     delta = __rq_clock_broken(rq) - age_stamp;
> > +     unsigned long max = arch_scale_cpu_capacity(NULL, cpu);
> > +     unsigned long used, irq, free;
> >
> > -     if (unlikely(delta < 0))
> > -             delta = 0;
> > +#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
> > +     irq = READ_ONCE(rq->avg_irq.util_avg);
> >
> > -     total = sched_avg_period() + delta;
> > +     if (unlikely(irq >= max))
> > +             return 1;
> > +#endif
>
> Note that 'irq' is unused outside that macro block, resulting in a new warning on
> defconfig builds:
>
>  CC      kernel/sched/fair.o
>  kernel/sched/fair.c: In function ‘scale_rt_capacity’:
>  kernel/sched/fair.c:7553:22: warning: unused variable ‘irq’ [-Wunused-variable]
>    unsigned long used, irq, free;
>                        ^~~
>
> I have applied the delta fix below for simplicity, but what we really want is a
> cleanup of that function to eliminate the #ifdefs. One solution would be to factor
> out the 'irq' utilization value into a helper inline, and double check that if the
> configs are off the compiler does the right thing and eliminates this identity
> transformation for the irq==0 case:
>
>         free *= (max - irq);
>         free /= max;
>
> If the compiler refuses to optimize this away (due to the zero and overflow
> cases), try to find something more clever?

Thanks for the fix.
I'm off for now and will look at your proposal above once back

Regards,
Vincent

>
> Thanks,
>
>         Ingo
>
>  kernel/sched/fair.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e3221db0511a..d5f7d521e448 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7550,7 +7550,10 @@ static unsigned long scale_rt_capacity(int cpu)
>  {
>         struct rq *rq = cpu_rq(cpu);
>         unsigned long max = arch_scale_cpu_capacity(NULL, cpu);
> -       unsigned long used, irq, free;
> +       unsigned long used, free;
> +#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
> +       unsigned long irq;
> +#endif
>
>  #if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
>         irq = READ_ONCE(rq->avg_irq.util_avg);
Ingo Molnar July 16, 2018, 11:39 a.m. | #4
* Vincent Guittot <vincent.guittot@linaro.org> wrote:

> > If the compiler refuses to optimize this away (due to the zero and overflow
> > cases), try to find something more clever?
> 
> Thanks for the fix.
> I'm off for now and will look at your proposal above once back


Sounds good, there's no rush, we've still got time until ~rc7.

Thanks,

	Ingo

Patch

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index f4de2698..68b8a9f 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1180,8 +1180,6 @@  static void update_curr_dl(struct rq *rq)
 	curr->se.exec_start = now;
 	cgroup_account_cputime(curr, delta_exec);
 
-	sched_rt_avg_update(rq, delta_exec);
-
 	if (dl_entity_is_special(dl_se))
 		return;
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d2758e3..ce0dcbf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7550,39 +7550,36 @@  static inline int get_sd_load_idx(struct sched_domain *sd,
 static unsigned long scale_rt_capacity(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
-	u64 total, used, age_stamp, avg;
-	s64 delta;
-
-	/*
-	 * Since we're reading these variables without serialization make sure
-	 * we read them once before doing sanity checks on them.
-	 */
-	age_stamp = READ_ONCE(rq->age_stamp);
-	avg = READ_ONCE(rq->rt_avg);
-	delta = __rq_clock_broken(rq) - age_stamp;
+	unsigned long max = arch_scale_cpu_capacity(NULL, cpu);
+	unsigned long used, irq, free;
 
-	if (unlikely(delta < 0))
-		delta = 0;
+#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
+	irq = READ_ONCE(rq->avg_irq.util_avg);
 
-	total = sched_avg_period() + delta;
+	if (unlikely(irq >= max))
+		return 1;
+#endif
 
-	used = div_u64(avg, total);
+	used = READ_ONCE(rq->avg_rt.util_avg);
+	used += READ_ONCE(rq->avg_dl.util_avg);
 
-	if (likely(used < SCHED_CAPACITY_SCALE))
-		return SCHED_CAPACITY_SCALE - used;
+	if (unlikely(used >= max))
+		return 1;
 
-	return 1;
+	free = max - used;
+#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
+	free *= (max - irq);
+	free /= max;
+#endif
+	return free;
 }
 
 static void update_cpu_capacity(struct sched_domain *sd, int cpu)
 {
-	unsigned long capacity = arch_scale_cpu_capacity(sd, cpu);
+	unsigned long capacity = scale_rt_capacity(cpu);
 	struct sched_group *sdg = sd->groups;
 
-	cpu_rq(cpu)->cpu_capacity_orig = capacity;
-
-	capacity *= scale_rt_capacity(cpu);
-	capacity >>= SCHED_CAPACITY_SHIFT;
+	cpu_rq(cpu)->cpu_capacity_orig = arch_scale_cpu_capacity(sd, cpu);
 
 	if (!capacity)
 		capacity = 1;
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index ead6d8b..35475c0 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -237,7 +237,7 @@  ___update_load_avg(struct sched_avg *sa, unsigned long load, unsigned long runna
 	 */
 	sa->load_avg = div_u64(load * sa->load_sum, divider);
 	sa->runnable_load_avg =	div_u64(runnable * sa->runnable_load_sum, divider);
-	sa->util_avg = sa->util_sum / divider;
+	WRITE_ONCE(sa->util_avg, sa->util_sum / divider);
 }
 
 /*
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 0e3e57a..2a881bd 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -970,8 +970,6 @@  static void update_curr_rt(struct rq *rq)
 	curr->se.exec_start = now;
 	cgroup_account_cputime(curr, delta_exec);
 
-	sched_rt_avg_update(rq, delta_exec);
-
 	if (!rt_bandwidth_enabled())
 		return;