
[v5,00/10] track CPU utilization

Message ID 1527253951-22709-1-git-send-email-vincent.guittot@linaro.org

Message

Vincent Guittot May 25, 2018, 1:12 p.m. UTC
This patchset initially tracked only the utilization of the RT rq. During
the OSPM summit, we discussed the opportunity to extend it in order to get
an estimate of the utilization of the CPU.

- Patches 1-3 correspond to the content of patchset v4 and add utilization
  tracking for rt_rq.

When both cfs and rt tasks compete to run on a CPU, we can see some frequency
drops with the schedutil governor. In such a case, the cfs_rq's utilization no
longer reflects the utilization of cfs tasks but only the remaining part that
is not used by rt tasks. We should monitor the stolen utilization and take
it into account when selecting the OPP. This patchset doesn't change the OPP
selection policy for RT tasks but only for CFS tasks.

An rt-app use case which creates an always-running cfs thread and an rt thread
that wakes up periodically, with both threads pinned on the same CPU, shows a
lot of frequency switches of the CPU whereas the CPU never goes idle during
the test. I can share the json file that I used for the test if someone is
interested.

For a 15 second long test on a hikey 6220 (octo core cortex A53 platform),
the cpufreq statistics output (stats are reset just before the test):
$ cat /sys/devices/system/cpu/cpufreq/policy0/stats/total_trans
without patchset : 1230
with patchset : 14

If we replace the cfs thread of rt-app by a sysbench cpu test, we can see
performance improvements:

- Without patchset :
Test execution summary:
    total time:                          15.0009s
    total number of events:              4903
    total time taken by event execution: 14.9972
    per-request statistics:
         min:                                  1.23ms
         avg:                                  3.06ms
         max:                                 13.16ms
         approx.  95 percentile:              12.73ms

Threads fairness:
    events (avg/stddev):           4903.0000/0.00
    execution time (avg/stddev):   14.9972/0.00

- With patchset:
Test execution summary:
    total time:                          15.0014s
    total number of events:              7694
    total time taken by event execution: 14.9979
    per-request statistics:
         min:                                  1.23ms
         avg:                                  1.95ms
         max:                                 10.49ms
         approx.  95 percentile:              10.39ms

Threads fairness:
    events (avg/stddev):           7694.0000/0.00
    execution time (avg/stddev):   14.9979/0.00

The performance improvement is 56% for this use case.

- Patches 4-5 add utilization tracking for dl_rq in order to solve a similar
  problem as with rt_rq

- Patch 6 uses dl and rt utilization in scale_rt_capacity() and removes
  dl and rt from sched_rt_avg_update()

- Patches 7-8 add utilization tracking for interrupts and use it to select the OPP.
  A test with iperf on hikey 6220 gives: 
    w/o patchset	    w/ patchset
    Tx 276 Mbits/sec        304 Mbits/sec +10%
    Rx 299 Mbits/sec        328 Mbits/sec +09%
    
    8 iterations of iperf -c server_address -r -t 5
    stdev is lower than 1%
    Only the WFI idle state is enabled (shallowest arm idle state)

- Patch 9 removes the unused sched_avg_update() code

- Patch 10 removes the unused sched_time_avg_ms

Change since v3:
- add support of periodic update of blocked utilization
- rebase on latest tip/sched/core

Change since v2:
- move pelt code into a dedicated pelt.c file
- rebase on load tracking changes

Change since v1:
- Only a rebase. I have addressed the comments on the previous version in
  patch 1/2

Vincent Guittot (10):
  sched/pelt: Move pelt related code in a dedicated file
  sched/rt: add rt_rq utilization tracking
  cpufreq/schedutil: add rt utilization tracking
  sched/dl: add dl_rq utilization tracking
  cpufreq/schedutil: get max utilization
  sched: remove rt and dl from sched_avg
  sched/irq: add irq utilization tracking
  cpufreq/schedutil: take into account interrupt
  sched: remove rt_avg code
  proc/sched: remove unused sched_time_avg_ms

 include/linux/sched/sysctl.h     |   1 -
 kernel/sched/Makefile            |   2 +-
 kernel/sched/core.c              |  38 +---
 kernel/sched/cpufreq_schedutil.c |  24 ++-
 kernel/sched/deadline.c          |   7 +-
 kernel/sched/fair.c              | 381 +++----------------------------------
 kernel/sched/pelt.c              | 395 +++++++++++++++++++++++++++++++++++++++
 kernel/sched/pelt.h              |  63 +++++++
 kernel/sched/rt.c                |  10 +-
 kernel/sched/sched.h             |  57 ++++--
 kernel/sysctl.c                  |   8 -
 11 files changed, 563 insertions(+), 423 deletions(-)
 create mode 100644 kernel/sched/pelt.c
 create mode 100644 kernel/sched/pelt.h

-- 
2.7.4

Comments

Peter Zijlstra June 4, 2018, 4:50 p.m. UTC | #1
On Fri, May 25, 2018 at 03:12:21PM +0200, Vincent Guittot wrote:
> When both cfs and rt tasks compete to run on a CPU, we can see some frequency
> drops with the schedutil governor. In such a case, the cfs_rq's utilization no
> longer reflects the utilization of cfs tasks but only the remaining part that
> is not used by rt tasks. We should monitor the stolen utilization and take
> it into account when selecting the OPP. This patchset doesn't change the OPP
> selection policy for RT tasks but only for CFS tasks.


So the problem is that when RT/DL/stop/IRQ happens and preempts CFS
tasks, time continues and the CFS load tracking will see !running and
decay things.

Then, when we get back to CFS, we'll have lower load/util than we
expected.

In particular, your focus is on OPP selection, where we would have,
say, u=1 (always running task); after being preempted by our RT task for
a while, it will now have u=.5. With the effect that when the RT task
goes to sleep we'll drop our OPP to .5 max -- which is 'wrong', right?
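
(For scale: with the default PELT half-life of ~32ms, being preempted for
~32ms is already enough to halve the utilization tracked for the CFS task.)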

Your solution is to track RT/DL/stop/IRQ with the identical PELT average
as we track cfs util. Such that we can then add the various averages to
reconstruct the actual utilisation signal.

This should work for the case of the utilization signal on UP. It gets
less exact on SMP though: PELT migrates the signal around between CPUs,
but we don't do that to the per-rq signals we have for RT/DL/stop/IRQ.

There is also the 'complaint' that this ends up with 2 util signals for
DL, complicating things.


So this patch-set tracks the !cfs occupation using the same function,
which is all good. But what if, instead of using that to compensate the
OPP selection, we employ it to renormalize the util signal?

If we normalize util against the dynamic (rt_avg affected) cpu_capacity,
then I think your initial problem goes away. Because while the RT task
will push the util to .5, it will at the same time push the CPU capacity
to .5, and renormalized that gives 1.

  NOTE: the renorm would then become something like:
        scale_cpu = arch_scale_cpu_capacity() / rt_frac();
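
As a purely illustrative sketch of that renormalization (rt_frac() does not
exist in the kernel; the helper below and its fixed-point handling are
assumptions for the example, not a proposed implementation):

	/*
	 * Illustration only: renormalize CFS utilization against the
	 * capacity left once RT/DL/IRQ pressure is removed. rt_frac is
	 * assumed to be the fraction of capacity NOT stolen, expressed
	 * in SCHED_CAPACITY_SCALE units (512 == half of the CPU stolen).
	 */
	static unsigned long renorm_cfs_util(unsigned long util_cfs,
					     unsigned long scale_cpu,
					     unsigned long rt_frac)
	{
		unsigned long cap = (scale_cpu * rt_frac) >> SCHED_CAPACITY_SHIFT;

		if (!cap)
			return scale_cpu;

		/* u=.5 measured against a .5 capacity renormalizes back to 1 */
		return min(scale_cpu, util_cfs * scale_cpu / cap);
	}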


On IRC I mentioned stopping the CFS clock when preempted, and while that
would result in fixed numbers, Vincent was right in pointing out the
numbers will be difficult to interpret, since the meaning will be purely
CPU local and I'm not sure you can actually fix it again with
normalization.

Imagine, running a .3 RT task, that would push the (always running) CFS
down to .7, but because we discard all !cfs time, it actually has 1. If
we try and normalize that we'll end up with ~1.43, which is of course
completely broken.


_However_, all that happens for util, also happens for load. So the above
scenario will also make the CPU appear less loaded than it actually is.

Now, we actually try and compensate for that by decreasing the capacity
of the CPU. But because the existing rt_avg and PELT signals are so
out-of-tune, this is likely to be less than ideal. With that fixed
however, the best this appears to do is, as per the above, preserve the
actual load. But what we really wanted is to actually inflate the load,
such that someone will take load from us -- we're doing less actual work
after all.

Possibly, we can do something like:

	scale_cpu_capacity / (rt_frac^2)

for load, then we inflate the load and could maybe get rid of all this
capacity_of() sprinkling, but that needs more thinking.
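
(Reading rt_frac as the fraction of capacity left to CFS, purely as an
illustration of the arithmetic above: with rt_frac = .5, a load tracked at .5
would then be scaled against .25 and appear as 2 -- i.e. clearly overloaded --
instead of merely being restored to 1 as with a single division by rt_frac.)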


But I really feel we need to consider both util and load, as this issue
affects both.
Quentin Perret June 4, 2018, 5:13 p.m. UTC | #2
On Monday 04 Jun 2018 at 18:50:47 (+0200), Peter Zijlstra wrote:
> [...]
>
> So this patch-set tracks the !cfs occupation using the same function,
> which is all good. But what if, instead of using that to compensate the
> OPP selection, we employ it to renormalize the util signal?
>
> If we normalize util against the dynamic (rt_avg affected) cpu_capacity,
> then I think your initial problem goes away. Because while the RT task
> will push the util to .5, it will at the same time push the CPU capacity
> to .5, and renormalized that gives 1.
>
>   NOTE: the renorm would then become something like:
>         scale_cpu = arch_scale_cpu_capacity() / rt_frac();


Isn't it equivalent ? I mean, you can remove RT/DL/stop/IRQ from the CPU
capacity and compare the CFS util_avg against that, or you can add
RT/DL/stop/IRQ to the CFS util_avg and compare it to arch_scale_cpu_capacity().
Both should be interchangeable no ? By adding RT/DL/IRQ PELT signals
to the CFS util_avg, Vincent is proposing to go with the latter I think.

But aren't the signals we currently use to account for RT/DL/stop/IRQ in
cpu_capacity good enough for that ? Can't we just add the diff between
capacity_orig_of and capacity_of to the CFS util and do OPP selection with
that (for !nr_rt_running) ? Maybe add a min with dl running_bw to be on
the safe side ... ?

Vincent Guittot June 4, 2018, 6:08 p.m. UTC | #3
On 4 June 2018 at 18:50, Peter Zijlstra <peterz@infradead.org> wrote:
> [...]
>
> In particular, your focus is on OPP selection, where we would have,
> say, u=1 (always running task); after being preempted by our RT task for
> a while, it will now have u=.5. With the effect that when the RT task
> goes to sleep we'll drop our OPP to .5 max -- which is 'wrong', right?


yes that's the typical example

> Your solution is to track RT/DL/stop/IRQ with the identical PELT average
> as we track cfs util. Such that we can then add the various averages to
> reconstruct the actual utilisation signal.


yes and get the whole cpu utilization

> This should work for the case of the utilization signal on UP. It gets
> less exact on SMP though: PELT migrates the signal around between CPUs,
> but we don't do that to the per-rq signals we have for RT/DL/stop/IRQ.
>
> There is also the 'complaint' that this ends up with 2 util signals for
> DL, complicating things.


yes, that's the main point of discussion: how to balance dl bandwidth
and dl utilization

> [...]
>
> _However_, all that happens for util, also happens for load. So the above
> scenario will also make the CPU appear less loaded than it actually is.


The load will continue to increase because we track runnable state and
not running for the load

> [...]
>
> But I really feel we need to consider both util and load, as this issue
> affects both.


my initial idea was to get the max between dl bandwidth and dl util_avg,
but util_avg can be higher than the bandwidth and using it will make
schedutil select a higher OPP for no good reason if nothing is running
and needs the compute capacity

As you mentioned, scale_rt_capacity gives the remaining capacity for
cfs and it will behave like cfs util_avg now that it uses PELT. So as
long as cfs util_avg < scale_rt_capacity (we probably need a margin),
we keep using dl bandwidth + cfs util_avg + rt util_avg for selecting
the OPP because we have remaining spare capacity, but if cfs util_avg ==
scale_rt_capacity, we make sure to use the max OPP.

I will run some tests to make sure that all my tests still behave
correctly with such a policy
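
As a rough, purely illustrative sketch of the policy described above (the
helper name and its arguments are made up for the example, not existing
kernel code):

	/*
	 * Illustration of the OPP selection policy sketched above; all
	 * values are in SCHED_CAPACITY_SCALE units. scale_rt_cap is
	 * assumed to be the PELT-based capacity remaining for CFS once
	 * rt/dl/irq pressure is removed.
	 */
	static unsigned long pick_util(unsigned long util_cfs,
				       unsigned long util_rt,
				       unsigned long dl_bw,
				       unsigned long scale_rt_cap,
				       unsigned long max_cap)
	{
		/* cfs consumes all the remaining capacity: go to max OPP */
		if (util_cfs >= scale_rt_cap)	/* a margin would go here */
			return max_cap;

		/* spare capacity remains: sum the individual contributions */
		return min(max_cap, dl_bw + util_cfs + util_rt);
	}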
Vincent Guittot June 5, 2018, 8:36 a.m. UTC | #4
Hi Quentin,

On 25 May 2018 at 15:12, Vincent Guittot <vincent.guittot@linaro.org> wrote:
> [...]
>
> An rt-app use case which creates an always-running cfs thread and an rt thread
> that wakes up periodically, with both threads pinned on the same CPU, shows a
> lot of frequency switches of the CPU whereas the CPU never goes idle during
> the test. I can share the json file that I used for the test if someone is
> interested.
>
> For a 15 second long test on a hikey 6220 (octo core cortex A53 platform),
> the cpufreq statistics output (stats are reset just before the test):
> $ cat /sys/devices/system/cpu/cpufreq/policy0/stats/total_trans
> without patchset : 1230
> with patchset : 14


I have attached the rt-app json file that I use for this test

Quentin Perret June 5, 2018, 10:57 a.m. UTC | #5
Hi Vincent,

On Tuesday 05 Jun 2018 at 10:36:26 (+0200), Vincent Guittot wrote:
> Hi Quentin,
>
> On 25 May 2018 at 15:12, Vincent Guittot <vincent.guittot@linaro.org> wrote:
> > [...]
> >
> > For a 15 second long test on a hikey 6220 (octo core cortex A53 platform),
> > the cpufreq statistics output (stats are reset just before the test):
> > $ cat /sys/devices/system/cpu/cpufreq/policy0/stats/total_trans
> > without patchset : 1230
> > with patchset : 14
>
> I have attached the rt-app json file that I use for this test


Thank you very much ! I did a quick test with a much simpler fix to this
RT-steals-time-from-CFS issue using just the existing scale_rt_capacity().
I get the following results on Hikey960:

Without patch:
   cat /sys/devices/system/cpu/cpufreq/policy0/stats/total_trans
   12
   cat /sys/devices/system/cpu/cpufreq/policy4/stats/total_trans
   640
With patch
   cat /sys/devices/system/cpu/cpufreq/policy0/stats/total_trans
   8
   cat /sys/devices/system/cpu/cpufreq/policy4/stats/total_trans
   12

Yes the rt_avg stuff is out of sync with the PELT signal, but do you think
this is an actual issue for realistic use-cases ?

What about the diff below (just a quick hack to show the idea) applied
on tip/sched/core ?

---8<---
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index a8ba6d1f262a..23a4fb1c2c25 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -180,9 +180,12 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu)
 	sg_cpu->util_dl  = cpu_util_dl(rq);
 }
 
+unsigned long scale_rt_capacity(int cpu);
 static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
 {
 	struct rq *rq = cpu_rq(sg_cpu->cpu);
+	int cpu = sg_cpu->cpu;
+	unsigned long util, dl_bw;
 
 	if (rq->rt.rt_nr_running)
 		return sg_cpu->max;
@@ -197,7 +200,14 @@ static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
 	 * util_cfs + util_dl as requested freq. However, cpufreq is not yet
 	 * ready for such an interface. So, we only do the latter for now.
 	 */
-	return min(sg_cpu->max, (sg_cpu->util_dl + sg_cpu->util_cfs));
+	util = arch_scale_cpu_capacity(NULL, cpu) * scale_rt_capacity(cpu);
+	util >>= SCHED_CAPACITY_SHIFT;
+	util = arch_scale_cpu_capacity(NULL, cpu) - util;
+	util += sg_cpu->util_cfs;
+	dl_bw = (rq->dl.this_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
+
+	/* Make sure to always provide the reserved freq to DL. */
+	return max(util, dl_bw);
 }
 
 static void sugov_set_iowait_boost(struct sugov_cpu *sg_cpu, u64 time, unsigned int flags)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f01f0f395f9a..0e87cbe47c8b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7868,7 +7868,7 @@ static inline int get_sd_load_idx(struct sched_domain *sd,
 	return load_idx;
 }
 
-static unsigned long scale_rt_capacity(int cpu)
+unsigned long scale_rt_capacity(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
 	u64 total, used, age_stamp, avg;
--->8---
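
(To make the dl_bw conversion in sugov_aggregate_util() above concrete:
this_bw is expressed in BW_UNIT (1 << BW_SHIFT, i.e. 2^20) units, so a DL
reservation of runtime/deadline = 0.5 gives this_bw ~= 524288, and
(524288 * 1024) >> 20 = 512, i.e. half of SCHED_CAPACITY_SCALE, as expected.)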



Vincent Guittot June 5, 2018, 11:59 a.m. UTC | #6
On 5 June 2018 at 12:57, Quentin Perret <quentin.perret@arm.com> wrote:
> Hi Vincent,
>
> [...]
>
> Thank you very much ! I did a quick test with a much simpler fix to this
> RT-steals-time-from-CFS issue using just the existing scale_rt_capacity().
> I get the following results on Hikey960:
>
> Without patch:
>    cat /sys/devices/system/cpu/cpufreq/policy0/stats/total_trans
>    12
>    cat /sys/devices/system/cpu/cpufreq/policy4/stats/total_trans
>    640
> With patch
>    cat /sys/devices/system/cpu/cpufreq/policy0/stats/total_trans
>    8
>    cat /sys/devices/system/cpu/cpufreq/policy4/stats/total_trans
>    12
>
> Yes the rt_avg stuff is out of sync with the PELT signal, but do you think
> this is an actual issue for realistic use-cases ?


yes I think that it's worth syncing and consolidating things on the
same metric. The result will be saner and more robust as we will have
the same behavior

Juri Lelli June 5, 2018, 12:11 p.m. UTC | #7
Hi Quentin,

On 05/06/18 11:57, Quentin Perret wrote:

[...]

> What about the diff below (just a quick hack to show the idea) applied
> on tip/sched/core ?
>
> ---8<---
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> [...]
> @@ -197,7 +200,14 @@ static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
>  	 * util_cfs + util_dl as requested freq. However, cpufreq is not yet
>  	 * ready for such an interface. So, we only do the latter for now.
>  	 */
> -	return min(sg_cpu->max, (sg_cpu->util_dl + sg_cpu->util_cfs));
> +	util = arch_scale_cpu_capacity(NULL, cpu) * scale_rt_capacity(cpu);


Sorry to be pedantic, but this (ATM) includes DL avg contribution, so,
since we use max below, we will probably have the same problem that we
discussed on Vincent's approach (overestimation of DL contribution while
we could use running_bw).

> +	util >>= SCHED_CAPACITY_SHIFT;
> +	util = arch_scale_cpu_capacity(NULL, cpu) - util;
> +	util += sg_cpu->util_cfs;
> +	dl_bw = (rq->dl.this_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;


Why this_bw instead of running_bw?

Thanks,

- Juri
Quentin Perret June 5, 2018, 1:05 p.m. UTC | #8
On Tuesday 05 Jun 2018 at 14:11:53 (+0200), Juri Lelli wrote:
> [...]
>
> Sorry to be pedantic, but this (ATM) includes DL avg contribution, so,
> since we use max below, we will probably have the same problem that we
> discussed on Vincent's approach (overestimation of DL contribution while
> we could use running_bw).


Ah no, you're right, this isn't great for long running deadline tasks.
We should definitely account for the running_bw here, not the dl avg...

I was trying to address the issue of RT stealing time from CFS here, but
the DL integration isn't quite right with this patch as-is, I agree ...

> > +	util >>= SCHED_CAPACITY_SHIFT;
> > +	util = arch_scale_cpu_capacity(NULL, cpu) - util;
> > +	util += sg_cpu->util_cfs;
> > +	dl_bw = (rq->dl.this_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
>
> Why this_bw instead of running_bw?


So IIUC, this_bw should basically give you the absolute reservation (== the
sum of runtime/deadline ratios of all DL tasks on that rq).

The reason I added this max is because I'm still not sure I understand
how we can safely drop the freq below that point ? If we don't guarantee
to always stay at least at the freq required by DL, aren't we risking
starting a deadline task stuck at a low freq because of rate limiting ? In
this case, if that task uses all of its runtime then you might start
missing deadlines ...

My feeling is that the only safe thing to do is to guarantee to never go
below the freq required by DL, and to optimistically add CFS tasks
without raising the OPP if we have good reasons to think that DL is
using less than it required (which is what we should get by using
running_bw above I suppose). Does that make any sense ?

Thanks !
Quentin
Quentin Perret June 5, 2018, 1:12 p.m. UTC | #9
On Tuesday 05 Jun 2018 at 13:59:56 (+0200), Vincent Guittot wrote:
> On 5 June 2018 at 12:57, Quentin Perret <quentin.perret@arm.com> wrote:
> > [...]
> >
> > Yes the rt_avg stuff is out of sync with the PELT signal, but do you think
> > this is an actual issue for realistic use-cases ?
>
> yes I think that it's worth syncing and consolidating things on the
> same metric. The result will be saner and more robust as we will have
> the same behavior


TBH I'm not disagreeing with that, the PELT-everywhere approach feels
cleaner in a way, but do you have a use-case in mind where this will
definitely help ?

I mean, yes the rt_avg is a slow response to the RT pressure, but is
this always a problem ? Ramping down slower might actually help in some
cases no ?

Juri Lelli June 5, 2018, 1:15 p.m. UTC | #10
On 05/06/18 14:05, Quentin Perret wrote:
> [...]
>
> So IIUC, this_bw should basically give you the absolute reservation (== the
> sum of runtime/deadline ratios of all DL tasks on that rq).


Yep.

> The reason I added this max is because I'm still not sure I understand
> how we can safely drop the freq below that point ? If we don't guarantee
> to always stay at least at the freq required by DL, aren't we risking
> starting a deadline task stuck at a low freq because of rate limiting ? In
> this case, if that task uses all of its runtime then you might start
> missing deadlines ...


We decided to avoid (software) rate limiting for DL with e97a90f7069b
("sched/cpufreq: Rate limits for SCHED_DEADLINE").

> My feeling is that the only safe thing to do is to guarantee to never go

> below the freq required by DL, and to optimistically add CFS tasks

> without raising the OPP if we have good reasons to think that DL is

> using less than it required (which is what we should get by using

> running_bw above I suppose). Does that make any sense ?


Then we still can't avoid the hardware limits, so using running_bw is a
trade off between safety (especially considering soft real-time
scenarios) and energy consumption (which seems to be working in
practice).
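
A minimal sketch of what that would look like, based on the hack quoted
above (illustrative only, untested), with the floor taken from running_bw
instead of this_bw so that it follows the DL tasks that are actually active:

static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
{
	struct rq *rq = cpu_rq(sg_cpu->cpu);
	int cpu = sg_cpu->cpu;
	unsigned long util, dl_bw;

	if (rq->rt.rt_nr_running)
		return sg_cpu->max;

	/* capacity left for cfs once the rt_avg pressure is removed ... */
	util = arch_scale_cpu_capacity(NULL, cpu) * scale_rt_capacity(cpu);
	util >>= SCHED_CAPACITY_SHIFT;
	/* ... turned into the capacity stolen from cfs ... */
	util = arch_scale_cpu_capacity(NULL, cpu) - util;
	/* ... plus what cfs itself is asking for */
	util += sg_cpu->util_cfs;

	/*
	 * Floor at the bandwidth of the DL tasks that are currently
	 * active (running_bw) rather than the full reservation
	 * (this_bw): a bit of safety traded for energy.
	 */
	dl_bw = (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;

	return max(util, dl_bw);
}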

Thanks,

- Juri
Vincent Guittot June 5, 2018, 1:18 p.m. UTC | #11
On 5 June 2018 at 15:12, Quentin Perret <quentin.perret@arm.com> wrote:
> On Tuesday 05 Jun 2018 at 13:59:56 (+0200), Vincent Guittot wrote:

>> On 5 June 2018 at 12:57, Quentin Perret <quentin.perret@arm.com> wrote:

>> > Hi Vincent,

>> >

>> > On Tuesday 05 Jun 2018 at 10:36:26 (+0200), Vincent Guittot wrote:

>> >> Hi Quentin,

>> >>

>> >> On 25 May 2018 at 15:12, Vincent Guittot <vincent.guittot@linaro.org> wrote:

>> >> > This patchset initially tracked only the utilization of RT rq. During

>> >> > OSPM summit, it has been discussed the opportunity to extend it in order

>> >> > to get an estimate of the utilization of the CPU.

>> >> >

>> >> > - Patches 1-3 correspond to the content of patchset v4 and add utilization

>> >> >   tracking for rt_rq.

>> >> >

>> >> > When both cfs and rt tasks compete to run on a CPU, we can see some frequency

>> >> > drops with schedutil governor. In such case, the cfs_rq's utilization doesn't

>> >> > reflect anymore the utilization of cfs tasks but only the remaining part that

>> >> > is not used by rt tasks. We should monitor the stolen utilization and take

>> >> > it into account when selecting OPP. This patchset doesn't change the OPP

>> >> > selection policy for RT tasks but only for CFS tasks

>> >> >

>> >> > A rt-app use case which creates an always running cfs thread and a rt threads

>> >> > that wakes up periodically with both threads pinned on same CPU, show lot of

>> >> > frequency switches of the CPU whereas the CPU never goes idles during the

>> >> > test. I can share the json file that I used for the test if someone is

>> >> > interested in.

>> >> >

>> >> > For a 15 seconds long test on a hikey 6220 (octo core cortex A53 platfrom),

>> >> > the cpufreq statistics outputs (stats are reset just before the test) :

>> >> > $ cat /sys/devices/system/cpu/cpufreq/policy0/stats/total_trans

>> >> > without patchset : 1230

>> >> > with patchset : 14

>> >>

>> >> I have attached the rt-app json file that I use for this test

>> >

>> > Thank you very much ! I did a quick test with a much simpler fix to this

>> > RT-steals-time-from-CFS issue using just the existing scale_rt_capacity().

>> > I get the following results on Hikey960:

>> >

>> > Without patch:

>> >    cat /sys/devices/system/cpu/cpufreq/policy0/stats/total_trans

>> >    12

>> >    cat /sys/devices/system/cpu/cpufreq/policy4/stats/total_trans

>> >    640

>> > With patch

>> >    cat /sys/devices/system/cpu/cpufreq/policy0/stats/total_trans

>> >    8

>> >    cat /sys/devices/system/cpu/cpufreq/policy4/stats/total_trans

>> >    12

>> >

>> > Yes the rt_avg stuff is out of sync with the PELT signal, but do you think

>> > this is an actual issue for realistic use-cases ?

>>

>> yes I think that it's worth syncing and consolidating things on the

>> same metric. The result will be saner and more robust as we will have

>> the same behavior

>

> TBH I'm not disagreeing with that, the PELT-everywhere approach feels

> cleaner in a way, but do you have a use-case in mind where this will

> definitely help ?

>

> I mean, yes the rt_avg is a slow response to the RT pressure, but is

> this always a problem ? Ramping down slower might actually help in some

> cases no ?


I would say no because when one will decrease the other one will not
increase at the same pace and we will have some wrong behavior or
decision

>

>>

>> >

>> > What about the diff below (just a quick hack to show the idea) applied

>> > on tip/sched/core ?

>> >

>> > ---8<---

>> > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c

>> > index a8ba6d1f262a..23a4fb1c2c25 100644

>> > --- a/kernel/sched/cpufreq_schedutil.c

>> > +++ b/kernel/sched/cpufreq_schedutil.c

>> > @@ -180,9 +180,12 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu)

>> >         sg_cpu->util_dl  = cpu_util_dl(rq);

>> >  }

>> >

>> > +unsigned long scale_rt_capacity(int cpu);

>> >  static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)

>> >  {

>> >         struct rq *rq = cpu_rq(sg_cpu->cpu);

>> > +       int cpu = sg_cpu->cpu;

>> > +       unsigned long util, dl_bw;

>> >

>> >         if (rq->rt.rt_nr_running)

>> >                 return sg_cpu->max;

>> > @@ -197,7 +200,14 @@ static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)

>> >          * util_cfs + util_dl as requested freq. However, cpufreq is not yet

>> >          * ready for such an interface. So, we only do the latter for now.

>> >          */

>> > -       return min(sg_cpu->max, (sg_cpu->util_dl + sg_cpu->util_cfs));

>> > +       util = arch_scale_cpu_capacity(NULL, cpu) * scale_rt_capacity(cpu);

>> > +       util >>= SCHED_CAPACITY_SHIFT;

>> > +       util = arch_scale_cpu_capacity(NULL, cpu) - util;

>> > +       util += sg_cpu->util_cfs;

>> > +       dl_bw = (rq->dl.this_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;

>> > +

>> > +       /* Make sure to always provide the reserved freq to DL. */

>> > +       return max(util, dl_bw);

>> >  }

>> >

>> >  static void sugov_set_iowait_boost(struct sugov_cpu *sg_cpu, u64 time, unsigned int flags)

>> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c

>> > index f01f0f395f9a..0e87cbe47c8b 100644

>> > --- a/kernel/sched/fair.c

>> > +++ b/kernel/sched/fair.c

>> > @@ -7868,7 +7868,7 @@ static inline int get_sd_load_idx(struct sched_domain *sd,

>> >         return load_idx;

>> >  }

>> >

>> > -static unsigned long scale_rt_capacity(int cpu)

>> > +unsigned long scale_rt_capacity(int cpu)

>> >  {

>> >         struct rq *rq = cpu_rq(cpu);

>> >         u64 total, used, age_stamp, avg;

>> > --->8---

[...]
Quentin Perret June 5, 2018, 1:52 p.m. UTC | #12
On Tuesday 05 Jun 2018 at 15:18:38 (+0200), Vincent Guittot wrote:
> On 5 June 2018 at 15:12, Quentin Perret <quentin.perret@arm.com> wrote:

> > On Tuesday 05 Jun 2018 at 13:59:56 (+0200), Vincent Guittot wrote:

> >> On 5 June 2018 at 12:57, Quentin Perret <quentin.perret@arm.com> wrote:

> >> > Hi Vincent,

> >> >

> >> > On Tuesday 05 Jun 2018 at 10:36:26 (+0200), Vincent Guittot wrote:

> >> >> Hi Quentin,

> >> >>

> >> >> On 25 May 2018 at 15:12, Vincent Guittot <vincent.guittot@linaro.org> wrote:

> >> >> > This patchset initially tracked only the utilization of RT rq. During

> >> >> > OSPM summit, it has been discussed the opportunity to extend it in order

> >> >> > to get an estimate of the utilization of the CPU.

> >> >> >

> >> >> > - Patches 1-3 correspond to the content of patchset v4 and add utilization

> >> >> >   tracking for rt_rq.

> >> >> >

> >> >> > When both cfs and rt tasks compete to run on a CPU, we can see some frequency

> >> >> > drops with schedutil governor. In such case, the cfs_rq's utilization doesn't

> >> >> > reflect anymore the utilization of cfs tasks but only the remaining part that

> >> >> > is not used by rt tasks. We should monitor the stolen utilization and take

> >> >> > it into account when selecting OPP. This patchset doesn't change the OPP

> >> >> > selection policy for RT tasks but only for CFS tasks

> >> >> >

> >> >> > A rt-app use case which creates an always running cfs thread and a rt threads

> >> >> > that wakes up periodically with both threads pinned on same CPU, show lot of

> >> >> > frequency switches of the CPU whereas the CPU never goes idles during the

> >> >> > test. I can share the json file that I used for the test if someone is

> >> >> > interested in.

> >> >> >

> >> >> > For a 15 seconds long test on a hikey 6220 (octo core cortex A53 platfrom),

> >> >> > the cpufreq statistics outputs (stats are reset just before the test) :

> >> >> > $ cat /sys/devices/system/cpu/cpufreq/policy0/stats/total_trans

> >> >> > without patchset : 1230

> >> >> > with patchset : 14

> >> >>

> >> >> I have attached the rt-app json file that I use for this test

> >> >

> >> > Thank you very much ! I did a quick test with a much simpler fix to this

> >> > RT-steals-time-from-CFS issue using just the existing scale_rt_capacity().

> >> > I get the following results on Hikey960:

> >> >

> >> > Without patch:

> >> >    cat /sys/devices/system/cpu/cpufreq/policy0/stats/total_trans

> >> >    12

> >> >    cat /sys/devices/system/cpu/cpufreq/policy4/stats/total_trans

> >> >    640

> >> > With patch

> >> >    cat /sys/devices/system/cpu/cpufreq/policy0/stats/total_trans

> >> >    8

> >> >    cat /sys/devices/system/cpu/cpufreq/policy4/stats/total_trans

> >> >    12

> >> >

> >> > Yes the rt_avg stuff is out of sync with the PELT signal, but do you think

> >> > this is an actual issue for realistic use-cases ?

> >>

> >> yes I think that it's worth syncing and consolidating things on the

> >> same metric. The result will be saner and more robust as we will have

> >> the same behavior

> >

> > TBH I'm not disagreeing with that, the PELT-everywhere approach feels

> > cleaner in a way, but do you have a use-case in mind where this will

> > definitely help ?

> >

> > I mean, yes the rt_avg is a slow response to the RT pressure, but is

> > this always a problem ? Ramping down slower might actually help in some

> > cases no ?

> 

> I would say no because when one will decrease the other one will not

> increase at the same pace and we will have some wrong behavior or

> decision


I think I get your point. Yes, sometimes, the slow-moving rt_avg can be
off a little bit (which can be good or bad, depending on the case) if your
RT task runs a lot with very changing behaviour. And again, I'm not
fundamentally against the idea of having extra complexity for RT/IRQ PELT
signals _if_ we have a use-case. But is there a real use-case where we
really need all of that ? That's a true question, I honestly don't have
the answer :-)

[...]
Vincent Guittot June 5, 2018, 1:55 p.m. UTC | #13
On 5 June 2018 at 15:52, Quentin Perret <quentin.perret@arm.com> wrote:
> On Tuesday 05 Jun 2018 at 15:18:38 (+0200), Vincent Guittot wrote:

>> On 5 June 2018 at 15:12, Quentin Perret <quentin.perret@arm.com> wrote:

>> > On Tuesday 05 Jun 2018 at 13:59:56 (+0200), Vincent Guittot wrote:

>> >> On 5 June 2018 at 12:57, Quentin Perret <quentin.perret@arm.com> wrote:

>> >> > Hi Vincent,

>> >> >

>> >> > On Tuesday 05 Jun 2018 at 10:36:26 (+0200), Vincent Guittot wrote:

>> >> >> Hi Quentin,

>> >> >>

>> >> >> On 25 May 2018 at 15:12, Vincent Guittot <vincent.guittot@linaro.org> wrote:

>> >> >> > This patchset initially tracked only the utilization of RT rq. During

>> >> >> > OSPM summit, it has been discussed the opportunity to extend it in order

>> >> >> > to get an estimate of the utilization of the CPU.

>> >> >> >

>> >> >> > - Patches 1-3 correspond to the content of patchset v4 and add utilization

>> >> >> >   tracking for rt_rq.

>> >> >> >

>> >> >> > When both cfs and rt tasks compete to run on a CPU, we can see some frequency

>> >> >> > drops with schedutil governor. In such case, the cfs_rq's utilization doesn't

>> >> >> > reflect anymore the utilization of cfs tasks but only the remaining part that

>> >> >> > is not used by rt tasks. We should monitor the stolen utilization and take

>> >> >> > it into account when selecting OPP. This patchset doesn't change the OPP

>> >> >> > selection policy for RT tasks but only for CFS tasks

>> >> >> >

>> >> >> > A rt-app use case which creates an always running cfs thread and a rt threads

>> >> >> > that wakes up periodically with both threads pinned on same CPU, show lot of

>> >> >> > frequency switches of the CPU whereas the CPU never goes idles during the

>> >> >> > test. I can share the json file that I used for the test if someone is

>> >> >> > interested in.

>> >> >> >

>> >> >> > For a 15 seconds long test on a hikey 6220 (octo core cortex A53 platfrom),

>> >> >> > the cpufreq statistics outputs (stats are reset just before the test) :

>> >> >> > $ cat /sys/devices/system/cpu/cpufreq/policy0/stats/total_trans

>> >> >> > without patchset : 1230

>> >> >> > with patchset : 14

>> >> >>

>> >> >> I have attached the rt-app json file that I use for this test

>> >> >

>> >> > Thank you very much ! I did a quick test with a much simpler fix to this

>> >> > RT-steals-time-from-CFS issue using just the existing scale_rt_capacity().

>> >> > I get the following results on Hikey960:

>> >> >

>> >> > Without patch:

>> >> >    cat /sys/devices/system/cpu/cpufreq/policy0/stats/total_trans

>> >> >    12

>> >> >    cat /sys/devices/system/cpu/cpufreq/policy4/stats/total_trans

>> >> >    640

>> >> > With patch

>> >> >    cat /sys/devices/system/cpu/cpufreq/policy0/stats/total_trans

>> >> >    8

>> >> >    cat /sys/devices/system/cpu/cpufreq/policy4/stats/total_trans

>> >> >    12

>> >> >

>> >> > Yes the rt_avg stuff is out of sync with the PELT signal, but do you think

>> >> > this is an actual issue for realistic use-cases ?

>> >>

>> >> yes I think that it's worth syncing and consolidating things on the

>> >> same metric. The result will be saner and more robust as we will have

>> >> the same behavior

>> >

>> > TBH I'm not disagreeing with that, the PELT-everywhere approach feels

>> > cleaner in a way, but do you have a use-case in mind where this will

>> > definitely help ?

>> >

>> > I mean, yes the rt_avg is a slow response to the RT pressure, but is

>> > this always a problem ? Ramping down slower might actually help in some

>> > cases no ?

>>

>> I would say no because when one will decrease the other one will not

>> increase at the same pace and we will have some wrong behavior or

>> decision

>

> I think I get your point. Yes, sometimes, the slow-moving rt_avg can be

> > off a little bit (which can be good or bad, depending on the case) if your

> RT task runs a lot with very changing behaviour. And again, I'm not

> fundamentally against the idea of having extra complexity for RT/IRQ PELT

> signals _if_ we have a use-case. But is there a real use-case where we

> really need all of that ? That's a true question, I honestly don't have

> the answer :-)


The iperf test result is another example of the benefit

[...]
Quentin Perret June 5, 2018, 2:01 p.m. UTC | #14
On Tuesday 05 Jun 2018 at 15:15:18 (+0200), Juri Lelli wrote:
> On 05/06/18 14:05, Quentin Perret wrote:

> > On Tuesday 05 Jun 2018 at 14:11:53 (+0200), Juri Lelli wrote:

> > > Hi Quentin,

> > > 

> > > On 05/06/18 11:57, Quentin Perret wrote:

> > > 

> > > [...]

> > > 

> > > > What about the diff below (just a quick hack to show the idea) applied

> > > > on tip/sched/core ?

> > > > 

> > > > ---8<---

> > > > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c

> > > > index a8ba6d1f262a..23a4fb1c2c25 100644

> > > > --- a/kernel/sched/cpufreq_schedutil.c

> > > > +++ b/kernel/sched/cpufreq_schedutil.c

> > > > @@ -180,9 +180,12 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu)

> > > >  	sg_cpu->util_dl  = cpu_util_dl(rq);

> > > >  }

> > > >  

> > > > +unsigned long scale_rt_capacity(int cpu);

> > > >  static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)

> > > >  {

> > > >  	struct rq *rq = cpu_rq(sg_cpu->cpu);

> > > > +	int cpu = sg_cpu->cpu;

> > > > +	unsigned long util, dl_bw;

> > > >  

> > > >  	if (rq->rt.rt_nr_running)

> > > >  		return sg_cpu->max;

> > > > @@ -197,7 +200,14 @@ static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)

> > > >  	 * util_cfs + util_dl as requested freq. However, cpufreq is not yet

> > > >  	 * ready for such an interface. So, we only do the latter for now.

> > > >  	 */

> > > > -	return min(sg_cpu->max, (sg_cpu->util_dl + sg_cpu->util_cfs));

> > > > +	util = arch_scale_cpu_capacity(NULL, cpu) * scale_rt_capacity(cpu);

> > > 

> > > Sorry to be pedantinc, but this (ATM) includes DL avg contribution, so,

> > > since we use max below, we will probably have the same problem that we

> > > discussed on Vincent's approach (overestimation of DL contribution while

> > > we could use running_bw).

> > 

> > Ah no, you're right, this isn't great for long running deadline tasks.

> > We should definitely account for the running_bw here, not the dl avg...

> > 

> > I was trying to address the issue of RT stealing time from CFS here, but

> > the DL integration isn't quite right which this patch as-is, I agree ...

> > 

> > > 

> > > > +	util >>= SCHED_CAPACITY_SHIFT;

> > > > +	util = arch_scale_cpu_capacity(NULL, cpu) - util;

> > > > +	util += sg_cpu->util_cfs;

> > > > +	dl_bw = (rq->dl.this_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;

> > > 

> > > Why this_bw instead of running_bw?

> > 

> > So IIUC, this_bw should basically give you the absolute reservation (== the

> > sum of runtime/deadline ratios of all DL tasks on that rq).

> 

> Yep.

> 

> > The reason I added this max is because I'm still not sure to understand

> > how we can safely drop the freq below that point ? If we don't guarantee

> > to always stay at least at the freq required by DL, aren't we risking to

> > start a deadline tasks stuck at a low freq because of rate limiting ? In

> > this case, if that tasks uses all of its runtime then you might start

> > missing deadlines ...

> 

> We decided to avoid (software) rate limiting for DL with e97a90f7069b

> ("sched/cpufreq: Rate limits for SCHED_DEADLINE").


Right, I spotted that one, but yeah you could also be limited by HW ...

> 

> > My feeling is that the only safe thing to do is to guarantee to never go

> > below the freq required by DL, and to optimistically add CFS tasks

> > without raising the OPP if we have good reasons to think that DL is

> > using less than it required (which is what we should get by using

> > running_bw above I suppose). Does that make any sense ?

> 

> Then we still can't avoid the hardware limits, so using running_bw is a

> trade off between safety (especially considering soft real-time

> scenarios) and energy consumption (which seems to be working in

> practice).


Ok, I see ... Have you guys already tried something like my patch above
(keeping the freq >= this_bw) in real world use cases ? Is this costing
that much energy in practice ? If we fill the gaps left by DL (when it
doesn't use all the runtime) with CFS tasks that might not be so bad ...

Thank you very much for taking the time to explain all this, I really
appreciate :-)

Quentin
Quentin Perret June 5, 2018, 2:09 p.m. UTC | #15
On Tuesday 05 Jun 2018 at 15:55:43 (+0200), Vincent Guittot wrote:
> On 5 June 2018 at 15:52, Quentin Perret <quentin.perret@arm.com> wrote:

> > On Tuesday 05 Jun 2018 at 15:18:38 (+0200), Vincent Guittot wrote:

> >> On 5 June 2018 at 15:12, Quentin Perret <quentin.perret@arm.com> wrote:

> >> I would say no because when one will decrease the other one will not

> >> increase at the same pace and we will have some wrong behavior or

> >> decision

> >

> > I think I get your point. Yes, sometimes, the slow-moving rt_avg can be

> > off a little bit (which can be good or bad, depending on the case) if your

> > RT task runs a lot with very changing behaviour. And again, I'm not

> > fundamentally against the idea of having extra complexity for RT/IRQ PELT

> > signals _if_ we have a use-case. But is there a real use-case where we

> > really need all of that ? That's a true question, I honestly don't have

> > the answer :-)

> 

> The iperf test result is another example of the benefit


The iperf test result ? The sysbench test you mean ?
Juri Lelli June 5, 2018, 2:13 p.m. UTC | #16
On 05/06/18 15:01, Quentin Perret wrote:
> On Tuesday 05 Jun 2018 at 15:15:18 (+0200), Juri Lelli wrote:

> > On 05/06/18 14:05, Quentin Perret wrote:

> > > On Tuesday 05 Jun 2018 at 14:11:53 (+0200), Juri Lelli wrote:

> > > > Hi Quentin,

> > > > 

> > > > On 05/06/18 11:57, Quentin Perret wrote:

> > > > 

> > > > [...]

> > > > 

> > > > > What about the diff below (just a quick hack to show the idea) applied

> > > > > on tip/sched/core ?

> > > > > 

> > > > > ---8<---

> > > > > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c

> > > > > index a8ba6d1f262a..23a4fb1c2c25 100644

> > > > > --- a/kernel/sched/cpufreq_schedutil.c

> > > > > +++ b/kernel/sched/cpufreq_schedutil.c

> > > > > @@ -180,9 +180,12 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu)

> > > > >  	sg_cpu->util_dl  = cpu_util_dl(rq);

> > > > >  }

> > > > >  

> > > > > +unsigned long scale_rt_capacity(int cpu);

> > > > >  static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)

> > > > >  {

> > > > >  	struct rq *rq = cpu_rq(sg_cpu->cpu);

> > > > > +	int cpu = sg_cpu->cpu;

> > > > > +	unsigned long util, dl_bw;

> > > > >  

> > > > >  	if (rq->rt.rt_nr_running)

> > > > >  		return sg_cpu->max;

> > > > > @@ -197,7 +200,14 @@ static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)

> > > > >  	 * util_cfs + util_dl as requested freq. However, cpufreq is not yet

> > > > >  	 * ready for such an interface. So, we only do the latter for now.

> > > > >  	 */

> > > > > -	return min(sg_cpu->max, (sg_cpu->util_dl + sg_cpu->util_cfs));

> > > > > +	util = arch_scale_cpu_capacity(NULL, cpu) * scale_rt_capacity(cpu);

> > > > 

> > > > Sorry to be pedantinc, but this (ATM) includes DL avg contribution, so,

> > > > since we use max below, we will probably have the same problem that we

> > > > discussed on Vincent's approach (overestimation of DL contribution while

> > > > we could use running_bw).

> > > 

> > > Ah no, you're right, this isn't great for long running deadline tasks.

> > > We should definitely account for the running_bw here, not the dl avg...

> > > 

> > > I was trying to address the issue of RT stealing time from CFS here, but

> > > the DL integration isn't quite right which this patch as-is, I agree ...

> > > 

> > > > 

> > > > > +	util >>= SCHED_CAPACITY_SHIFT;

> > > > > +	util = arch_scale_cpu_capacity(NULL, cpu) - util;

> > > > > +	util += sg_cpu->util_cfs;

> > > > > +	dl_bw = (rq->dl.this_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;

> > > > 

> > > > Why this_bw instead of running_bw?

> > > 

> > > So IIUC, this_bw should basically give you the absolute reservation (== the

> > > sum of runtime/deadline ratios of all DL tasks on that rq).

> > 

> > Yep.

> > 

> > > The reason I added this max is because I'm still not sure to understand

> > > how we can safely drop the freq below that point ? If we don't guarantee

> > > to always stay at least at the freq required by DL, aren't we risking to

> > > start a deadline tasks stuck at a low freq because of rate limiting ? In

> > > this case, if that tasks uses all of its runtime then you might start

> > > missing deadlines ...

> > 

> > We decided to avoid (software) rate limiting for DL with e97a90f7069b

> > ("sched/cpufreq: Rate limits for SCHED_DEADLINE").

> 

> Right, I spotted that one, but yeah you could also be limited by HW ...

> 

> > 

> > > My feeling is that the only safe thing to do is to guarantee to never go

> > > below the freq required by DL, and to optimistically add CFS tasks

> > > without raising the OPP if we have good reasons to think that DL is

> > > using less than it required (which is what we should get by using

> > > running_bw above I suppose). Does that make any sense ?

> > 

> > Then we still can't avoid the hardware limits, so using running_bw is a

> > trade off between safety (especially considering soft real-time

> > scenarios) and energy consumption (which seems to be working in

> > practice).

> 

> Ok, I see ... Have you guys already tried something like my patch above

> (keeping the freq >= this_bw) in real world use cases ? Is this costing

> that much energy in practice ? If we fill the gaps left by DL (when it


IIRC, Claudio (now Cc-ed) did experiment a bit with both approaches, so
he might add some numbers to my words above. I didn't (yet). But, please
consider that I might be reserving (for example) 50% of bandwidth for my
heavy and time sensitive task and then have that task wake up only once
in a while (but I'll be keeping clock speed up for the whole time). :/

> doesn't use all the runtime) with CFS tasks that might not be so bad ...

> 

> Thank you very much for taking the time to explain all this, I really

> appreciate :-)


Sure. Thanks for participating in the discussion!

Best,

- Juri
Peter Zijlstra June 5, 2018, 2:18 p.m. UTC | #17
On Mon, Jun 04, 2018 at 08:08:58PM +0200, Vincent Guittot wrote:
> On 4 June 2018 at 18:50, Peter Zijlstra <peterz@infradead.org> wrote:


> > So this patch-set tracks the !cfs occupation using the same function,

> > which is all good. But what, if instead of using that to compensate the

> > OPP selection, we employ that to renormalize the util signal?

> >

> > If we normalize util against the dynamic (rt_avg affected) cpu_capacity,

> > then I think your initial problem goes away. Because while the RT task

> > will push the util to .5, it will at the same time push the CPU capacity

> > to .5, and renormalized that gives 1.

> >

> >   NOTE: the renorm would then become something like:

> >         scale_cpu = arch_scale_cpu_capacity() / rt_frac();


Should probably be:

	scale_cpu = arch_scale_cpu_capacity() / (1 - rt_frac())
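
To make that concrete, here is the equivalent post-hoc rescaling as a
minimal sketch (illustrative only, the helper name is made up;
scale_rt_capacity() is assumed to return the remaining fraction in
SCHED_CAPACITY_SCALE units, as in the hack quoted earlier in the thread):

static unsigned long renorm_util_cfs(struct sugov_cpu *sg_cpu)
{
	unsigned long frac = scale_rt_capacity(sg_cpu->cpu);	/* ~ (1 - rt_frac()) << 10 */
	unsigned long util = sg_cpu->util_cfs;

	if (!frac)
		return sg_cpu->max;

	/* util / (1 - rt_frac()), i.e. the demand if cfs had the CPU to itself */
	util = (util << SCHED_CAPACITY_SHIFT) / frac;

	return min(util, sg_cpu->max);
}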

> >

> >

> > On IRC I mentioned stopping the CFS clock when preempted, and while that

> > would result in fixed numbers, Vincent was right in pointing out the

> > numbers will be difficult to interpret, since the meaning will be purely

> > CPU local and I'm not sure you can actually fix it again with

> > normalization.

> >

> > Imagine, running a .3 RT task, that would push the (always running) CFS

> > down to .7, but because we discard all !cfs time, it actually has 1. If

> > we try and normalize that we'll end up with ~1.43, which is of course

> > completely broken.

> >

> >

> > _However_, all that happens for util, also happens for load. So the above

> > scenario will also make the CPU appear less loaded than it actually is.

> 

> The load will continue to increase because we track runnable state and

> not running for the load


Duh yes. So renormalizing it once, like proposed for util would actually
do the right thing there too.  Would not that allow us to get rid of
much of the capacity magic in the load balance code?

/me thinks more..

Bah, no.. because you don't want this dynamic renormalization part of
the sums. So you want to keep it after the fact. :/

> As you mentioned, scale_rt_capacity give the remaining capacity for

> cfs and it will behave like cfs util_avg now that it uses PELT. So as

> long as cfs util_avg <  scale_rt_capacity(we probably need a margin)

> we keep using dl bandwidth + cfs util_avg + rt util_avg for selecting

> OPP because we have remaining spare capacity but if  cfs util_avg ==

> scale_rt_capacity, we make sure to use max OPP.


Good point, when cfs-util < cfs-cap then there is idle time and the util
number is 'right', when cfs-util == cfs-cap we're overcommitted and
should go max.

Since the util and cap values are aligned that should track nicely.
Quentin Perret June 5, 2018, 2:21 p.m. UTC | #18
On Tuesday 05 Jun 2018 at 15:09:54 (+0100), Quentin Perret wrote:
> On Tuesday 05 Jun 2018 at 15:55:43 (+0200), Vincent Guittot wrote:

> > On 5 June 2018 at 15:52, Quentin Perret <quentin.perret@arm.com> wrote:

> > > On Tuesday 05 Jun 2018 at 15:18:38 (+0200), Vincent Guittot wrote:

> > >> On 5 June 2018 at 15:12, Quentin Perret <quentin.perret@arm.com> wrote:

> > >> I would say no because when one will decrease the other one will not

> > >> increase at the same pace and we will have some wrong behavior or

> > >> decision

> > >

> > > I think I get your point. Yes, sometimes, the slow-moving rt_avg can be

> > > off a little bit (which can be good or bad, depending on the case) if your

> > > RT task runs a lot with very changing behaviour. And again, I'm not

> > > fundamentally against the idea of having extra complexity for RT/IRQ PELT

> > > signals _if_ we have a use-case. But is there a real use-case where we

> > > really need all of that ? That's a true question, I honestly don't have

> > > the answer :-)

> > 

> > The iperf test result is another example of the benefit

> 

> The iperf test result ? The sysbench test you mean ?


Ah sorry I missed that one from the cover letter ... I'll look into that
then :-)

Thanks,
Quentin
Juri Lelli June 5, 2018, 3:03 p.m. UTC | #19
On 05/06/18 16:18, Peter Zijlstra wrote:
> On Mon, Jun 04, 2018 at 08:08:58PM +0200, Vincent Guittot wrote:


[...]

> > As you mentioned, scale_rt_capacity give the remaining capacity for

> > cfs and it will behave like cfs util_avg now that it uses PELT. So as

> > long as cfs util_avg <  scale_rt_capacity(we probably need a margin)

> > we keep using dl bandwidth + cfs util_avg + rt util_avg for selecting

> > OPP because we have remaining spare capacity but if  cfs util_avg ==

> > scale_rt_capacity, we make sure to use max OPP.

> 

> Good point, when cfs-util < cfs-cap then there is idle time and the util

> number is 'right', when cfs-util == cfs-cap we're overcommitted and

> should go max.

> 

> Since the util and cap values are aligned that should track nicely.


Yeah. Makes sense to me as well. :)
Patrick Bellasi June 5, 2018, 3:38 p.m. UTC | #20
On 05-Jun 16:18, Peter Zijlstra wrote:
> On Mon, Jun 04, 2018 at 08:08:58PM +0200, Vincent Guittot wrote:

> > On 4 June 2018 at 18:50, Peter Zijlstra <peterz@infradead.org> wrote:

> 

> > > So this patch-set tracks the !cfs occupation using the same function,

> > > which is all good. But what, if instead of using that to compensate the

> > > OPP selection, we employ that to renormalize the util signal?

> > >

> > > If we normalize util against the dynamic (rt_avg affected) cpu_capacity,

> > > then I think your initial problem goes away. Because while the RT task

> > > will push the util to .5, it will at the same time push the CPU capacity

> > > to .5, and renormalized that gives 1.


And wouldn't that also mean that a 50% task co-scheduled with the
same 50% RT task will be reported as a 100% util_avg task?

> > >

> > >   NOTE: the renorm would then become something like:

> > >         scale_cpu = arch_scale_cpu_capacity() / rt_frac();

> 

> Should probably be:

> 

> 	scale_cpu = atch_scale_cpu_capacity() / (1 - rt_frac())

> 

> > >

> > >

> > > On IRC I mentioned stopping the CFS clock when preempted, and while that

> > > would result in fixed numbers, Vincent was right in pointing out the

> > > numbers will be difficult to interpret, since the meaning will be purely

> > > CPU local and I'm not sure you can actually fix it again with

> > > normalization.

> > >

> > > Imagine, running a .3 RT task, that would push the (always running) CFS

> > > down to .7, but because we discard all !cfs time, it actually has 1. If

> > > we try and normalize that we'll end up with ~1.43, which is of course

> > > completely broken.

> > >

> > >

> > > _However_, all that happens for util, also happens for load. So the above

> > > scenario will also make the CPU appear less loaded than it actually is.

> > 

> > The load will continue to increase because we track runnable state and

> > not running for the load

> 

> Duh yes. So renormalizing it once, like proposed for util would actually

> do the right thing there too.  Would not that allow us to get rid of

> much of the capacity magic in the load balance code?

> 

> /me thinks more..

> 

> Bah, no.. because you don't want this dynamic renormalization part of

> the sums. So you want to keep it after the fact. :/

> 

> > As you mentioned, scale_rt_capacity give the remaining capacity for

> > cfs and it will behave like cfs util_avg now that it uses PELT. So as

> > long as cfs util_avg <  scale_rt_capacity(we probably need a margin)

> > we keep using dl bandwidth + cfs util_avg + rt util_avg for selecting

> > OPP because we have remaining spare capacity but if  cfs util_avg ==

> > scale_rt_capacity, we make sure to use max OPP.


What will happen for the 50% task of the example above?

> Good point, when cfs-util < cfs-cap then there is idle time and the util

> number is 'right', when cfs-util == cfs-cap we're overcommitted and

> should go max.


Again I cannot easily read the example above...

Would that mean that a 50% CFS task, preempted by a 50% RT task (which
already set OPP to max while RUNNABLE) will end up running at the max
OPP too?

> Since the util and cap values are aligned that should track nicely.


True... the only potential issue I see is that we are steering PELT
behaviors towards better driving schedutil to run high-demand
workloads while _maybe_ affecting quite considerably the capacity of PELT
to describe how much CPU a task uses.

Ultimately, utilization has always been a metric on "how much you
use"... while here it seems to me we are bending it to be something to
define "how fast you have to run".

-- 
#include <best/regards.h>

Patrick Bellasi
Peter Zijlstra June 5, 2018, 10:27 p.m. UTC | #21
On Tue, Jun 05, 2018 at 04:38:26PM +0100, Patrick Bellasi wrote:
> On 05-Jun 16:18, Peter Zijlstra wrote:

> > On Mon, Jun 04, 2018 at 08:08:58PM +0200, Vincent Guittot wrote:


> > > As you mentioned, scale_rt_capacity give the remaining capacity for

> > > cfs and it will behave like cfs util_avg now that it uses PELT. So as

> > > long as cfs util_avg <  scale_rt_capacity(we probably need a margin)

> > > we keep using dl bandwidth + cfs util_avg + rt util_avg for selecting

> > > OPP because we have remaining spare capacity but if  cfs util_avg ==

> > > scale_rt_capacity, we make sure to use max OPP.

> 

> What will happen for the 50% task of the example above?


When the cfs-cap reaches 50% (where cfs_cap := 1 - rt_avg - dl_avg -
stop_avg - irq_avg) a cfs-util of 50% means that there is no idle time.

So util will still be 50%, nothing funny. But frequency selection will
see util==cap and select max (it might not have because reduction could
be due to IRQ pressure for example).

At the moment cfs-cap rises (>50%), and the cfs-util stays at 50%, we'll
have 50% utilization. We know there is idle time, the task could use
more if it wanted to.
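
In sketch form the rule is nothing more than this (hypothetical helper,
not an existing schedutil function; cfs_cap stands for the capacity left
to cfs once rt/dl/irq pressure is removed, other_util for the dl
bandwidth plus rt/irq utilization discussed above):

static unsigned long sugov_freq_util(unsigned long cfs_util, unsigned long cfs_cap,
				     unsigned long other_util, unsigned long max)
{
	/* no idle time left for cfs: the util signal saturated, go max */
	if (cfs_util >= cfs_cap)
		return max;

	/* there is idle time: the aggregated utilization is meaningful */
	return min(max, cfs_util + other_util);
}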

> > Good point, when cfs-util < cfs-cap then there is idle time and the util

> > number is 'right', when cfs-util == cfs-cap we're overcommitted and

> > should go max.

> 

> Again I cannot easily read the example above...

> 

> Would that mean that a 50% CFS task, preempted by a 50% RT task (which

> already set OPP to max while RUNNABLE) will end up running at the max

> OPP too?


Yes, because there is no idle time. When there is no idle time, max freq
is the right frequency.

The moment cfs-util drops below cfs-cap, we'll stop running at max,
because at that point we've found idle time to reduce frequency with.

> > Since the util and cap values are aligned that should track nicely.

> 

> True... the only potential issue I see is that we are steering PELT

> behaviors towards better driving schedutil to run high-demand

> workloads while _maybe_ affecting quite considerably the capacity of PELT

> to describe how much CPU a task uses.

> 

> Ultimately, utilization has always been a metric on "how much you

> use"... while here it seems to me we are bending it to be something to

> define "how fast you have to run".


This latest proposal does not in fact change the util tracking. But in
general, 'how much do you use' can be a very difficult question, see the
whole turbo / hardware managed dvfs discussion a week or so ago.
Quentin Perret June 6, 2018, 9:44 a.m. UTC | #22
On Tuesday 05 Jun 2018 at 16:18:09 (+0200), Peter Zijlstra wrote:
> On Mon, Jun 04, 2018 at 08:08:58PM +0200, Vincent Guittot wrote:

> > On 4 June 2018 at 18:50, Peter Zijlstra <peterz@infradead.org> wrote:

> 

> > > So this patch-set tracks the !cfs occupation using the same function,

> > > which is all good. But what, if instead of using that to compensate the

> > > OPP selection, we employ that to renormalize the util signal?

> > >

> > > If we normalize util against the dynamic (rt_avg affected) cpu_capacity,

> > > then I think your initial problem goes away. Because while the RT task

> > > will push the util to .5, it will at the same time push the CPU capacity

> > > to .5, and renormalized that gives 1.

> > >

> > >   NOTE: the renorm would then become something like:

> > >         scale_cpu = arch_scale_cpu_capacity() / rt_frac();

> 

> Should probably be:

> 

> 	scale_cpu = arch_scale_cpu_capacity() / (1 - rt_frac())

> 

> > >

> > >

> > > On IRC I mentioned stopping the CFS clock when preempted, and while that

> > > would result in fixed numbers, Vincent was right in pointing out the

> > > numbers will be difficult to interpret, since the meaning will be purely

> > > CPU local and I'm not sure you can actually fix it again with

> > > normalization.

> > >

> > > Imagine, running a .3 RT task, that would push the (always running) CFS

> > > down to .7, but because we discard all !cfs time, it actually has 1. If

> > > we try and normalize that we'll end up with ~1.43, which is of course

> > > completely broken.

> > >

> > >

> > > _However_, all that happens for util, also happens for load. So the above

> > > scenario will also make the CPU appear less loaded than it actually is.

> > 

> > The load will continue to increase because we track runnable state and

> > not running for the load

> 

> Duh yes. So renormalizing it once, like proposed for util would actually

> do the right thing there too.  Would not that allow us to get rid of

> much of the capacity magic in the load balance code?

> 

> /me thinks more..

> 

> Bah, no.. because you don't want this dynamic renormalization part of

> the sums. So you want to keep it after the fact. :/

> 

> > As you mentioned, scale_rt_capacity give the remaining capacity for

> > cfs and it will behave like cfs util_avg now that it uses PELT. So as

> > long as cfs util_avg <  scale_rt_capacity(we probably need a margin)

> > we keep using dl bandwidth + cfs util_avg + rt util_avg for selecting

> > OPP because we have remaining spare capacity but if  cfs util_avg ==

> > scale_rt_capacity, we make sure to use max OPP.

> 

> Good point, when cfs-util < cfs-cap then there is idle time and the util

> number is 'right', when cfs-util == cfs-cap we're overcommitted and

> should go max.

> 

> Since the util and cap values are aligned that should track nicely.


So Vincent proposed to have a margin between cfs util and cfs cap to be
sure there is a little bit of idle time. This is _exactly_ what the
overutilized flag in EAS does. It would actually make a lot of sense
to use that flag in schedutil. The idea is basically to say: if there
isn't enough idle time on all CPUs, the util signals are kinda wrong, so
let's not make any decisions (task placement or OPP selection) based on
them. If overutilized, go to max freq. Does that make sense?
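
A rough sketch of that suggestion, assuming a root-domain 'overutilized'
flag kept up to date as in the EAS patches (the field name and its update
sites are assumptions here, not something from this series):

static unsigned long sugov_overutilized_util_sketch(struct sugov_cpu *sg_cpu)
{
	struct rq *rq = cpu_rq(sg_cpu->cpu);

	/* not enough idle time somewhere: util signals are unreliable, go max */
	if (READ_ONCE(rq->rd->overutilized))
		return sg_cpu->max;

	return min(sg_cpu->max, sg_cpu->util_cfs + sg_cpu->util_dl);
}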

Thanks,
Quentin
Vincent Guittot June 6, 2018, 9:59 a.m. UTC | #23
On 6 June 2018 at 11:44, Quentin Perret <quentin.perret@arm.com> wrote:
> On Tuesday 05 Jun 2018 at 16:18:09 (+0200), Peter Zijlstra wrote:

>> On Mon, Jun 04, 2018 at 08:08:58PM +0200, Vincent Guittot wrote:

>> > On 4 June 2018 at 18:50, Peter Zijlstra <peterz@infradead.org> wrote:

>>

>> > > So this patch-set tracks the !cfs occupation using the same function,

>> > > which is all good. But what, if instead of using that to compensate the

>> > > OPP selection, we employ that to renormalize the util signal?

>> > >

>> > > If we normalize util against the dynamic (rt_avg affected) cpu_capacity,

>> > > then I think your initial problem goes away. Because while the RT task

>> > > will push the util to .5, it will at the same time push the CPU capacity

>> > > to .5, and renormalized that gives 1.

>> > >

>> > >   NOTE: the renorm would then become something like:

>> > >         scale_cpu = arch_scale_cpu_capacity() / rt_frac();

>>

>> Should probably be:

>>

>>       scale_cpu = arch_scale_cpu_capacity() / (1 - rt_frac())

>>

>> > >

>> > >

>> > > On IRC I mentioned stopping the CFS clock when preempted, and while that

>> > > would result in fixed numbers, Vincent was right in pointing out the

>> > > numbers will be difficult to interpret, since the meaning will be purely

>> > > CPU local and I'm not sure you can actually fix it again with

>> > > normalization.

>> > >

>> > > Imagine, running a .3 RT task, that would push the (always running) CFS

>> > > down to .7, but because we discard all !cfs time, it actually has 1. If

>> > > we try and normalize that we'll end up with ~1.43, which is of course

>> > > completely broken.

>> > >

>> > >

>> > > _However_, all that happens for util, also happens for load. So the above

>> > > scenario will also make the CPU appear less loaded than it actually is.

>> >

>> > The load will continue to increase because we track runnable state and

>> > not running for the load

>>

>> Duh yes. So renormalizing it once, like proposed for util would actually

>> do the right thing there too.  Would not that allow us to get rid of

>> much of the capacity magic in the load balance code?

>>

>> /me thinks more..

>>

>> Bah, no.. because you don't want this dynamic renormalization part of

>> the sums. So you want to keep it after the fact. :/

>>

>> > As you mentioned, scale_rt_capacity give the remaining capacity for

>> > cfs and it will behave like cfs util_avg now that it uses PELT. So as

>> > long as cfs util_avg <  scale_rt_capacity(we probably need a margin)

>> > we keep using dl bandwidth + cfs util_avg + rt util_avg for selecting

>> > OPP because we have remaining spare capacity but if  cfs util_avg ==

>> > scale_rt_capacity, we make sure to use max OPP.

>>

>> Good point, when cfs-util < cfs-cap then there is idle time and the util

>> number is 'right', when cfs-util == cfs-cap we're overcommitted and

>> should go max.

>>

>> Since the util and cap values are aligned that should track nicely.

>

> So Vincent proposed to have a margin between cfs util and cfs cap to be

> sure there is a little bit of idle time. This is _exactly_ what the

> overutilized flag in EAS does. That would actually make a lot of sense

> to use that flag in schedutil. The idea is basically to say, if there

> isn't enough idle time on all CPUs, the util signal are kinda wrong, so

> let's not make any decisions (task placement or OPP selection) based on

> that. If overutilized, go to max freq. Does that make sense ?


Yes, it's similar to the overutilized flag, except that:
- this is done per CPU, whereas overutilization is for the whole system
- the test is done at every freq update, not only on some cfs events,
  and it uses the latest up-to-date value rather than a periodically
  updated snapshot of the value
- this also works without EAS

As for the margin, whether it is really needed still has to be discussed.
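
Purely as an illustration of what such a per-CPU check at frequency-update
time could look like (the margin value is arbitrary, and whether a margin
is needed at all is exactly the open question above):

static bool cfs_has_idle_time(unsigned long cfs_util, unsigned long cfs_cap)
{
	/* require ~20% headroom between cfs util and the capacity left for cfs */
	return cfs_util * 1280 < cfs_cap * 1024;
}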

>

> Thanks,

> Quentin
Vincent Guittot June 6, 2018, 10:02 a.m. UTC | #24
On 6 June 2018 at 11:59, Vincent Guittot <vincent.guittot@linaro.org> wrote:
> On 6 June 2018 at 11:44, Quentin Perret <quentin.perret@arm.com> wrote:

>> On Tuesday 05 Jun 2018 at 16:18:09 (+0200), Peter Zijlstra wrote:


[snip]

>>>

>>> > As you mentioned, scale_rt_capacity give the remaining capacity for

>>> > cfs and it will behave like cfs util_avg now that it uses PELT. So as

>>> > long as cfs util_avg <  scale_rt_capacity(we probably need a margin)

>>> > we keep using dl bandwidth + cfs util_avg + rt util_avg for selecting

>>> > OPP because we have remaining spare capacity but if  cfs util_avg ==

>>> > scale_rt_capacity, we make sure to use max OPP.

>>>

>>> Good point, when cfs-util < cfs-cap then there is idle time and the util

>>> number is 'right', when cfs-util == cfs-cap we're overcommitted and

>>> should go max.

>>>

>>> Since the util and cap values are aligned that should track nicely.


I have re-run my tests and the results seem to be OK so far.

I'm going to clean up the code used for the test and send a new
version of the proposal.

>>

>> So Vincent proposed to have a margin between cfs util and cfs cap to be

>> sure there is a little bit of idle time. This is _exactly_ what the

>> overutilized flag in EAS does. That would actually make a lot of sense

>> to use that flag in schedutil. The idea is basically to say, if there

>> isn't enough idle time on all CPUs, the util signal are kinda wrong, so

>> let's not make any decisions (task placement or OPP selection) based on

>> that. If overutilized, go to max freq. Does that make sense ?

>

> Yes it's similar to the overutilized except that

> - this is done per cpu and whereas overutilization is for the whole system

> - the test is done at every freq update and not only during some cfs

> event and it uses the last up to date value and not a periodically

> updated snapshot of the value

> - this is done also without EAS

>

> Then for the margin, it has to be discussed if it is really needed or not

>

>>

>> Thanks,

>> Quentin
Quentin Perret June 6, 2018, 10:12 a.m. UTC | #25
On Wednesday 06 Jun 2018 at 11:59:04 (+0200), Vincent Guittot wrote:
> On 6 June 2018 at 11:44, Quentin Perret <quentin.perret@arm.com> wrote:

> > On Tuesday 05 Jun 2018 at 16:18:09 (+0200), Peter Zijlstra wrote:

> >> On Mon, Jun 04, 2018 at 08:08:58PM +0200, Vincent Guittot wrote:

> >> > On 4 June 2018 at 18:50, Peter Zijlstra <peterz@infradead.org> wrote:

> >>

> >> > > So this patch-set tracks the !cfs occupation using the same function,

> >> > > which is all good. But what, if instead of using that to compensate the

> >> > > OPP selection, we employ that to renormalize the util signal?

> >> > >

> >> > > If we normalize util against the dynamic (rt_avg affected) cpu_capacity,

> >> > > then I think your initial problem goes away. Because while the RT task

> >> > > will push the util to .5, it will at the same time push the CPU capacity

> >> > > to .5, and renormalized that gives 1.

> >> > >

> >> > >   NOTE: the renorm would then become something like:

> >> > >         scale_cpu = arch_scale_cpu_capacity() / rt_frac();

> >>

> >> Should probably be:

> >>

> >>       scale_cpu = arch_scale_cpu_capacity() / (1 - rt_frac())

> >>

> >> > >

> >> > >

> >> > > On IRC I mentioned stopping the CFS clock when preempted, and while that

> >> > > would result in fixed numbers, Vincent was right in pointing out the

> >> > > numbers will be difficult to interpret, since the meaning will be purely

> >> > > CPU local and I'm not sure you can actually fix it again with

> >> > > normalization.

> >> > >

> >> > > Imagine, running a .3 RT task, that would push the (always running) CFS

> >> > > down to .7, but because we discard all !cfs time, it actually has 1. If

> >> > > we try and normalize that we'll end up with ~1.43, which is of course

> >> > > completely broken.

> >> > >

> >> > >

> >> > > _However_, all that happens for util, also happens for load. So the above

> >> > > scenario will also make the CPU appear less loaded than it actually is.

> >> >

> >> > The load will continue to increase because we track runnable state and

> >> > not running for the load

> >>

> >> Duh yes. So renormalizing it once, like proposed for util would actually

> >> do the right thing there too.  Would not that allow us to get rid of

> >> much of the capacity magic in the load balance code?

> >>

> >> /me thinks more..

> >>

> >> Bah, no.. because you don't want this dynamic renormalization part of

> >> the sums. So you want to keep it after the fact. :/

> >>

> >> > As you mentioned, scale_rt_capacity give the remaining capacity for

> >> > cfs and it will behave like cfs util_avg now that it uses PELT. So as

> >> > long as cfs util_avg <  scale_rt_capacity(we probably need a margin)

> >> > we keep using dl bandwidth + cfs util_avg + rt util_avg for selecting

> >> > OPP because we have remaining spare capacity but if  cfs util_avg ==

> >> > scale_rt_capacity, we make sure to use max OPP.

> >>

> >> Good point, when cfs-util < cfs-cap then there is idle time and the util

> >> number is 'right', when cfs-util == cfs-cap we're overcommitted and

> >> should go max.

> >>

> >> Since the util and cap values are aligned that should track nicely.

> >

> > So Vincent proposed to have a margin between cfs util and cfs cap to be

> > sure there is a little bit of idle time. This is _exactly_ what the

> > overutilized flag in EAS does. That would actually make a lot of sense

> > to use that flag in schedutil. The idea is basically to say, if there

> > isn't enough idle time on all CPUs, the util signal are kinda wrong, so

> > let's not make any decisions (task placement or OPP selection) based on

> > that. If overutilized, go to max freq. Does that make sense ?

> 

> Yes it's similar to the overutilized except that

> - this is done per cpu and whereas overutilization is for the whole system


Is this a good thing? It has to be discussed. Anyway, the patch from
Morten which is part of the latest EAS posting (v3) introduces a
cpu_overutilized() function which I think does what you want.

> - the test is done at every freq update and not only during some cfs

> event and it uses the last up to date value and not a periodically

> updated snapshot of the value


Yeah good point. Now, the overutilized flag is attached to the root domain
so you should be able to set/clear it from RT/DL whenever that makes sense
I suppose. That's just a flag about the current state of the system so I
don't see why it should be touched only by CFS.

> - this is done also without EAS


The overutilized flag doesn't have to come with EAS if it is useful for
something else (OPP selection).

> 

> Then for the margin, it has to be discussed if it is really needed or not


+1

Thanks,
Quentin
Claudio Scordino June 6, 2018, 1:05 p.m. UTC | #26
Hi Quentin,

On 05/06/2018 16:13, Juri Lelli wrote:
> On 05/06/18 15:01, Quentin Perret wrote:

>> On Tuesday 05 Jun 2018 at 15:15:18 (+0200), Juri Lelli wrote:

>>> On 05/06/18 14:05, Quentin Perret wrote:

>>>> On Tuesday 05 Jun 2018 at 14:11:53 (+0200), Juri Lelli wrote:

>>>>> Hi Quentin,

>>>>>

>>>>> On 05/06/18 11:57, Quentin Perret wrote:

>>>>>

>>>>> [...]

>>>>>

>>>>>> What about the diff below (just a quick hack to show the idea) applied

>>>>>> on tip/sched/core ?

>>>>>>

>>>>>> ---8<---

>>>>>> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c

>>>>>> index a8ba6d1f262a..23a4fb1c2c25 100644

>>>>>> --- a/kernel/sched/cpufreq_schedutil.c

>>>>>> +++ b/kernel/sched/cpufreq_schedutil.c

>>>>>> @@ -180,9 +180,12 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu)

>>>>>>   	sg_cpu->util_dl  = cpu_util_dl(rq);

>>>>>>   }

>>>>>>   

>>>>>> +unsigned long scale_rt_capacity(int cpu);

>>>>>>   static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)

>>>>>>   {

>>>>>>   	struct rq *rq = cpu_rq(sg_cpu->cpu);

>>>>>> +	int cpu = sg_cpu->cpu;

>>>>>> +	unsigned long util, dl_bw;

>>>>>>   

>>>>>>   	if (rq->rt.rt_nr_running)

>>>>>>   		return sg_cpu->max;

>>>>>> @@ -197,7 +200,14 @@ static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)

>>>>>>   	 * util_cfs + util_dl as requested freq. However, cpufreq is not yet

>>>>>>   	 * ready for such an interface. So, we only do the latter for now.

>>>>>>   	 */

>>>>>> -	return min(sg_cpu->max, (sg_cpu->util_dl + sg_cpu->util_cfs));

>>>>>> +	util = arch_scale_cpu_capacity(NULL, cpu) * scale_rt_capacity(cpu);

>>>>>

>>>>> Sorry to be pedantic, but this (ATM) includes DL avg contribution, so,

>>>>> since we use max below, we will probably have the same problem that we

>>>>> discussed on Vincent's approach (overestimation of DL contribution while

>>>>> we could use running_bw).

>>>>

>>>> Ah no, you're right, this isn't great for long running deadline tasks.

>>>> We should definitely account for the running_bw here, not the dl avg...

>>>>

>>>> I was trying to address the issue of RT stealing time from CFS here, but

>>>> the DL integration isn't quite right with this patch as-is, I agree ...

>>>>

>>>>>

>>>>>> +	util >>= SCHED_CAPACITY_SHIFT;

>>>>>> +	util = arch_scale_cpu_capacity(NULL, cpu) - util;

>>>>>> +	util += sg_cpu->util_cfs;

>>>>>> +	dl_bw = (rq->dl.this_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;

>>>>>

>>>>> Why this_bw instead of running_bw?

>>>>

>>>> So IIUC, this_bw should basically give you the absolute reservation (== the

>>>> sum of runtime/deadline ratios of all DL tasks on that rq).

>>>

>>> Yep.

>>>

>>>> The reason I added this max is because I'm still not sure to understand

>>>> how we can safely drop the freq below that point ? If we don't guarantee

>>>> to always stay at least at the freq required by DL, aren't we risking to

>>>> start a deadline tasks stuck at a low freq because of rate limiting ? In

>>>> this case, if that tasks uses all of its runtime then you might start

>>>> missing deadlines ...

>>>

>>> We decided to avoid (software) rate limiting for DL with e97a90f7069b

>>> ("sched/cpufreq: Rate limits for SCHED_DEADLINE").

>>

>> Right, I spotted that one, but yeah you could also be limited by HW ...

>>

>>>

>>>> My feeling is that the only safe thing to do is to guarantee to never go

>>>> below the freq required by DL, and to optimistically add CFS tasks

>>>> without raising the OPP if we have good reasons to think that DL is

>>>> using less than it required (which is what we should get by using

>>>> running_bw above I suppose). Does that make any sense ?

>>>

>>> Then we can't still avoid the hardware limits, so using running_bw is a

>>> trade off between safety (especially considering soft real-time

>>> scenarios) and energy consumption (which seems to be working in

>>> practice).

>>

>> Ok, I see ... Have you guys already tried something like my patch above

>> (keeping the freq >= this_bw) in real world use cases ? Is this costing

>> that much energy in practice ? If we fill the gaps left by DL (when it

> 

> IIRC, Claudio (now Cc-ed) did experiment a bit with both approaches, so

> he might add some numbers to my words above. I didn't (yet). But, please

> consider that I might be reserving (for example) 50% of bandwidth for my

> heavy and time sensitive task and then have that task wake up only once

> in a while (but I'll be keeping clock speed up for the whole time). :/


As far as I can remember, we never tested energy consumption of running_bw vs this_bw, as at OSPM'17 we had already decided to use running_bw when implementing GRUB-PA.
The rationale is that, as Juri pointed out, the amount of spare (i.e. reclaimable) bandwidth in this_bw is very user-dependent.
For example, the user can let this_bw be much higher than the measured bandwidth, just to be sure that the deadlines are met even in corner cases.
In practice, this means that the task executes for quite a short time and then blocks (with its bandwidth reclaimed, hence the CPU frequency reduced, at the 0lag time).
Using this_bw rather than running_bw, the CPU frequency would remain at the same fixed value even when the task is blocked.
I understand that in some cases it could even be better (i.e. no waste of energy in frequency switches).
However, IMHO, these are corner cases and in the average case it is better to rely on running_bw and reduce the CPU frequency accordingly.

Best regards,

              Claudio
Quentin Perret June 6, 2018, 1:20 p.m. UTC | #27
Hi Claudio,

On Wednesday 06 Jun 2018 at 15:05:58 (+0200), Claudio Scordino wrote:
> Hi Quentin,

> 

> On 05/06/2018 16:13, Juri Lelli wrote:

> > On 05/06/18 15:01, Quentin Perret wrote:

> > > On Tuesday 05 Jun 2018 at 15:15:18 (+0200), Juri Lelli wrote:

> > > > On 05/06/18 14:05, Quentin Perret wrote:

> > > > > On Tuesday 05 Jun 2018 at 14:11:53 (+0200), Juri Lelli wrote:

> > > > > > Hi Quentin,

> > > > > > 

> > > > > > On 05/06/18 11:57, Quentin Perret wrote:

> > > > > > 

> > > > > > [...]

> > > > > > 

> > > > > > > What about the diff below (just a quick hack to show the idea) applied

> > > > > > > on tip/sched/core ?

> > > > > > > 

> > > > > > > ---8<---

> > > > > > > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c

> > > > > > > index a8ba6d1f262a..23a4fb1c2c25 100644

> > > > > > > --- a/kernel/sched/cpufreq_schedutil.c

> > > > > > > +++ b/kernel/sched/cpufreq_schedutil.c

> > > > > > > @@ -180,9 +180,12 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu)

> > > > > > >   	sg_cpu->util_dl  = cpu_util_dl(rq);

> > > > > > >   }

> > > > > > > +unsigned long scale_rt_capacity(int cpu);

> > > > > > >   static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)

> > > > > > >   {

> > > > > > >   	struct rq *rq = cpu_rq(sg_cpu->cpu);

> > > > > > > +	int cpu = sg_cpu->cpu;

> > > > > > > +	unsigned long util, dl_bw;

> > > > > > >   	if (rq->rt.rt_nr_running)

> > > > > > >   		return sg_cpu->max;

> > > > > > > @@ -197,7 +200,14 @@ static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)

> > > > > > >   	 * util_cfs + util_dl as requested freq. However, cpufreq is not yet

> > > > > > >   	 * ready for such an interface. So, we only do the latter for now.

> > > > > > >   	 */

> > > > > > > -	return min(sg_cpu->max, (sg_cpu->util_dl + sg_cpu->util_cfs));

> > > > > > > +	util = arch_scale_cpu_capacity(NULL, cpu) * scale_rt_capacity(cpu);

> > > > > > 

> > > > > > Sorry to be pedantic, but this (ATM) includes DL avg contribution, so,

> > > > > > since we use max below, we will probably have the same problem that we

> > > > > > discussed on Vincent's approach (overestimation of DL contribution while

> > > > > > we could use running_bw).

> > > > > 

> > > > > Ah no, you're right, this isn't great for long running deadline tasks.

> > > > > We should definitely account for the running_bw here, not the dl avg...

> > > > > 

> > > > > I was trying to address the issue of RT stealing time from CFS here, but

> > > > > the DL integration isn't quite right with this patch as-is, I agree ...

> > > > > 

> > > > > > 

> > > > > > > +	util >>= SCHED_CAPACITY_SHIFT;

> > > > > > > +	util = arch_scale_cpu_capacity(NULL, cpu) - util;

> > > > > > > +	util += sg_cpu->util_cfs;

> > > > > > > +	dl_bw = (rq->dl.this_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;

> > > > > > 

> > > > > > Why this_bw instead of running_bw?

> > > > > 

> > > > > So IIUC, this_bw should basically give you the absolute reservation (== the

> > > > > sum of runtime/deadline ratios of all DL tasks on that rq).

> > > > 

> > > > Yep.

> > > > 

> > > > > The reason I added this max is because I'm still not sure to understand

> > > > > how we can safely drop the freq below that point ? If we don't guarantee

> > > > > to always stay at least at the freq required by DL, aren't we risking to

> > > > > start a deadline tasks stuck at a low freq because of rate limiting ? In

> > > > > this case, if that tasks uses all of its runtime then you might start

> > > > > missing deadlines ...

> > > > 

> > > > We decided to avoid (software) rate limiting for DL with e97a90f7069b

> > > > ("sched/cpufreq: Rate limits for SCHED_DEADLINE").

> > > 

> > > Right, I spotted that one, but yeah you could also be limited by HW ...

> > > 

> > > > 

> > > > > My feeling is that the only safe thing to do is to guarantee to never go

> > > > > below the freq required by DL, and to optimistically add CFS tasks

> > > > > without raising the OPP if we have good reasons to think that DL is

> > > > > using less than it required (which is what we should get by using

> > > > > running_bw above I suppose). Does that make any sense ?

> > > > 

> > > > Then we can't still avoid the hardware limits, so using running_bw is a

> > > > trade off between safety (especially considering soft real-time

> > > > scenarios) and energy consumption (which seems to be working in

> > > > practice).

> > > 

> > > Ok, I see ... Have you guys already tried something like my patch above

> > > (keeping the freq >= this_bw) in real world use cases ? Is this costing

> > > that much energy in practice ? If we fill the gaps left by DL (when it

> > 

> > IIRC, Claudio (now Cc-ed) did experiment a bit with both approaches, so

> > he might add some numbers to my words above. I didn't (yet). But, please

> > consider that I might be reserving (for example) 50% of bandwidth for my

> > heavy and time sensitive task and then have that task wake up only once

> > in a while (but I'll be keeping clock speed up for the whole time). :/

> 

> As far as I can remember, we never tested energy consumption of running_bw

> vs this_bw, as at OSPM'17 we had already decided to use running_bw

> implementing GRUB-PA.

> The rationale is that, as Juri pointed out, the amount of spare (i.e.

> reclaimable) bandwidth in this_bw is very user-dependent. For example,

> the user can let this_bw be much higher than the measured bandwidth, just

> to be sure that the deadlines are met even in corner cases.


Ok I see the issue. Trusting userspace isn't necessarily the right thing
to do, I totally agree with that.

> In practice, this means that the task executes for quite a short time and

> then blocks (with its bandwidth reclaimed, hence the CPU frequency reduced,

> at the 0lag time).

> Using this_bw rather than running_bw, the CPU frequency would remain at

> the same fixed value even when the task is blocked.

> I understand that on some cases it could even be better (i.e. no waste

> of energy in frequency switch).


+1, I'm pretty sure using this_bw is pretty much always worse than
using running_bw from an energy standpoint. The waste of energy in
frequency changes should be less than the energy wasted by staying at
too high a frequency for a long time, otherwise DVFS isn't a good idea to
begin with :-)

> However, IMHO, these are corner cases and in the average case it is better

> to rely on running_bw and reduce the CPU frequency accordingly.


My point was that accepting to run at a lower frequency than required by
this_bw is fundamentally unsafe. If you're at a low frequency when a DL
task starts, there are real situations where you won't be able to
increase the frequency immediately, which can eventually lead to missing
deadlines. Now, if this risk is known, has been discussed, and is
accepted, that's fair enough. I'm just too late for the discussion :-)

Thank you for your time !
Quentin
Claudio Scordino June 6, 2018, 1:53 p.m. UTC | #28
Hi Quentin,

2018-06-06 15:20 GMT+02:00 Quentin Perret <quentin.perret@arm.com>:
>

> Hi Claudio,

>

> On Wednesday 06 Jun 2018 at 15:05:58 (+0200), Claudio Scordino wrote:

> > Hi Quentin,

> >

> > On 05/06/2018 16:13, Juri Lelli wrote:

> > > On 05/06/18 15:01, Quentin Perret wrote:

> > > > On Tuesday 05 Jun 2018 at 15:15:18 (+0200), Juri Lelli wrote:

> > > > > On 05/06/18 14:05, Quentin Perret wrote:

> > > > > > On Tuesday 05 Jun 2018 at 14:11:53 (+0200), Juri Lelli wrote:

> > > > > > > Hi Quentin,

> > > > > > >

> > > > > > > On 05/06/18 11:57, Quentin Perret wrote:

> > > > > > >

> > > > > > > [...]

> > > > > > >

> > > > > > > > What about the diff below (just a quick hack to show the idea) applied

> > > > > > > > on tip/sched/core ?

> > > > > > > >

> > > > > > > > ---8<---

> > > > > > > > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c

> > > > > > > > index a8ba6d1f262a..23a4fb1c2c25 100644

> > > > > > > > --- a/kernel/sched/cpufreq_schedutil.c

> > > > > > > > +++ b/kernel/sched/cpufreq_schedutil.c

> > > > > > > > @@ -180,9 +180,12 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu)

> > > > > > > >           sg_cpu->util_dl  = cpu_util_dl(rq);

> > > > > > > >   }

> > > > > > > > +unsigned long scale_rt_capacity(int cpu);

> > > > > > > >   static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)

> > > > > > > >   {

> > > > > > > >           struct rq *rq = cpu_rq(sg_cpu->cpu);

> > > > > > > > + int cpu = sg_cpu->cpu;

> > > > > > > > + unsigned long util, dl_bw;

> > > > > > > >           if (rq->rt.rt_nr_running)

> > > > > > > >                   return sg_cpu->max;

> > > > > > > > @@ -197,7 +200,14 @@ static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)

> > > > > > > >            * util_cfs + util_dl as requested freq. However, cpufreq is not yet

> > > > > > > >            * ready for such an interface. So, we only do the latter for now.

> > > > > > > >            */

> > > > > > > > - return min(sg_cpu->max, (sg_cpu->util_dl + sg_cpu->util_cfs));

> > > > > > > > + util = arch_scale_cpu_capacity(NULL, cpu) * scale_rt_capacity(cpu);

> > > > > > >

> > > > > > > Sorry to be pedantic, but this (ATM) includes DL avg contribution, so,

> > > > > > > since we use max below, we will probably have the same problem that we

> > > > > > > discussed on Vincent's approach (overestimation of DL contribution while

> > > > > > > we could use running_bw).

> > > > > >

> > > > > > Ah no, you're right, this isn't great for long running deadline tasks.

> > > > > > We should definitely account for the running_bw here, not the dl avg...

> > > > > >

> > > > > > I was trying to address the issue of RT stealing time from CFS here, but

> > > > > > the DL integration isn't quite right with this patch as-is, I agree ...

> > > > > >

> > > > > > >

> > > > > > > > + util >>= SCHED_CAPACITY_SHIFT;

> > > > > > > > + util = arch_scale_cpu_capacity(NULL, cpu) - util;

> > > > > > > > + util += sg_cpu->util_cfs;

> > > > > > > > + dl_bw = (rq->dl.this_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;

> > > > > > >

> > > > > > > Why this_bw instead of running_bw?

> > > > > >

> > > > > > So IIUC, this_bw should basically give you the absolute reservation (== the

> > > > > > sum of runtime/deadline ratios of all DL tasks on that rq).

> > > > >

> > > > > Yep.

> > > > >

> > > > > > The reason I added this max is because I'm still not sure to understand

> > > > > > how we can safely drop the freq below that point ? If we don't guarantee

> > > > > > to always stay at least at the freq required by DL, aren't we risking to

> > > > > > start a deadline tasks stuck at a low freq because of rate limiting ? In

> > > > > > this case, if that tasks uses all of its runtime then you might start

> > > > > > missing deadlines ...

> > > > >

> > > > > We decided to avoid (software) rate limiting for DL with e97a90f7069b

> > > > > ("sched/cpufreq: Rate limits for SCHED_DEADLINE").

> > > >

> > > > Right, I spotted that one, but yeah you could also be limited by HW ...

> > > >

> > > > >

> > > > > > My feeling is that the only safe thing to do is to guarantee to never go

> > > > > > below the freq required by DL, and to optimistically add CFS tasks

> > > > > > without raising the OPP if we have good reasons to think that DL is

> > > > > > using less than it required (which is what we should get by using

> > > > > > running_bw above I suppose). Does that make any sense ?

> > > > >

> > > > > Then we can't still avoid the hardware limits, so using running_bw is a

> > > > > trade off between safety (especially considering soft real-time

> > > > > scenarios) and energy consumption (which seems to be working in

> > > > > practice).

> > > >

> > > > Ok, I see ... Have you guys already tried something like my patch above

> > > > (keeping the freq >= this_bw) in real world use cases ? Is this costing

> > > > that much energy in practice ? If we fill the gaps left by DL (when it

> > >

> > > IIRC, Claudio (now Cc-ed) did experiment a bit with both approaches, so

> > > he might add some numbers to my words above. I didn't (yet). But, please

> > > consider that I might be reserving (for example) 50% of bandwidth for my

> > > heavy and time sensitive task and then have that task wake up only once

> > > in a while (but I'll be keeping clock speed up for the whole time). :/

> >

> > As far as I can remember, we never tested energy consumption of running_bw

> > vs this_bw, as at OSPM'17 we had already decided to use running_bw

> > implementing GRUB-PA.

> > The rationale is that, as Juri pointed out, the amount of spare (i.e.

> > reclaimable) bandwidth in this_bw is very user-dependent. For example,

> > the user can let this_bw be much higher than the measured bandwidth, just

> > to be sure that the deadlines are met even in corner cases.

>

> Ok I see the issue. Trusting userspace isn't necessarily the right thing

> to do, I totally agree with that.

>

> > In practice, this means that the task executes for quite a short time and

> > then blocks (with its bandwidth reclaimed, hence the CPU frequency reduced,

> > at the 0lag time).

> > Using this_bw rather than running_bw, the CPU frequency would remain at

> > the same fixed value even when the task is blocked.

> > I understand that on some cases it could even be better (i.e. no waste

> > of energy in frequency switch).

>

> +1, I'm pretty sure using this_bw is pretty much always worst than

> using running_bw from an energy standpoint,. The waste of energy in

> frequency changes should be less than the energy wasted by staying at a

> too high frequency for a long time, otherwise DVFS isn't a good idea to

> begin with :-)

>

> > However, IMHO, these are corner cases and in the average case it is better

> > to rely on running_bw and reduce the CPU frequency accordingly.

>

> My point was that accepting to go at a lower frequency than required by

> this_bw is fundamentally unsafe. If you're at a low frequency when a DL

> task starts, there are real situations where you won't be able to

> increase the frequency immediately, which can eventually lead to missing

> deadlines.



I see. Unfortunately, I'm having quite crazy days so I couldn't follow
the original discussion on LKML properly. My fault.
Anyway, to answer your question (assuming this time I have understood it correctly):

You're right: the tests have shown that whenever the DL task period
gets comparable with the time for switching frequency, the amount of
missed deadlines becomes non-negligible.
To give you a rough idea, this already happens with periods of 10msec
on an Odroid XU4.
The reason is that the task instance starts at too low a frequency,
and the system can't switch frequency in time to meet the
deadline.

This is a known issue, partially discussed during the RT Summit'17.
However, the community has been more in favour of reducing energy
consumption than of meeting firm deadlines.
If you need a safe system, in fact, you'd better think about
disabling DVFS completely and relying on a fixed CPU frequency.

A possible trade-off could be a further entry in sysfs to let system
designers switch from (default) running_bw to (more pessimistic)
this_bw.
However, I'm not sure the community wants a further knob on sysfs just
to make RT people happier :)

Best,

              Claudio
Quentin Perret June 6, 2018, 2:10 p.m. UTC | #29
On Wednesday 06 Jun 2018 at 15:53:27 (+0200), Claudio Scordino wrote:
> Hi Quentin,

> 

> 2018-06-06 15:20 GMT+02:00 Quentin Perret <quentin.perret@arm.com>:

> >

> > Hi Claudio,

> >

> > On Wednesday 06 Jun 2018 at 15:05:58 (+0200), Claudio Scordino wrote:

> > > Hi Quentin,

> > >

> > > On 05/06/2018 16:13, Juri Lelli wrote:

> > > > On 05/06/18 15:01, Quentin Perret wrote:

> > > > > On Tuesday 05 Jun 2018 at 15:15:18 (+0200), Juri Lelli wrote:

> > > > > > On 05/06/18 14:05, Quentin Perret wrote:

> > > > > > > On Tuesday 05 Jun 2018 at 14:11:53 (+0200), Juri Lelli wrote:

> > > > > > > > Hi Quentin,

> > > > > > > >

> > > > > > > > On 05/06/18 11:57, Quentin Perret wrote:

> > > > > > > >

> > > > > > > > [...]

> > > > > > > >

> > > > > > > > > What about the diff below (just a quick hack to show the idea) applied

> > > > > > > > > on tip/sched/core ?

> > > > > > > > >

> > > > > > > > > ---8<---

> > > > > > > > > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c

> > > > > > > > > index a8ba6d1f262a..23a4fb1c2c25 100644

> > > > > > > > > --- a/kernel/sched/cpufreq_schedutil.c

> > > > > > > > > +++ b/kernel/sched/cpufreq_schedutil.c

> > > > > > > > > @@ -180,9 +180,12 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu)

> > > > > > > > >           sg_cpu->util_dl  = cpu_util_dl(rq);

> > > > > > > > >   }

> > > > > > > > > +unsigned long scale_rt_capacity(int cpu);

> > > > > > > > >   static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)

> > > > > > > > >   {

> > > > > > > > >           struct rq *rq = cpu_rq(sg_cpu->cpu);

> > > > > > > > > + int cpu = sg_cpu->cpu;

> > > > > > > > > + unsigned long util, dl_bw;

> > > > > > > > >           if (rq->rt.rt_nr_running)

> > > > > > > > >                   return sg_cpu->max;

> > > > > > > > > @@ -197,7 +200,14 @@ static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)

> > > > > > > > >            * util_cfs + util_dl as requested freq. However, cpufreq is not yet

> > > > > > > > >            * ready for such an interface. So, we only do the latter for now.

> > > > > > > > >            */

> > > > > > > > > - return min(sg_cpu->max, (sg_cpu->util_dl + sg_cpu->util_cfs));

> > > > > > > > > + util = arch_scale_cpu_capacity(NULL, cpu) * scale_rt_capacity(cpu);

> > > > > > > >

> > > > > > > > Sorry to be pedantic, but this (ATM) includes DL avg contribution, so,

> > > > > > > > since we use max below, we will probably have the same problem that we

> > > > > > > > discussed on Vincent's approach (overestimation of DL contribution while

> > > > > > > > we could use running_bw).

> > > > > > >

> > > > > > > Ah no, you're right, this isn't great for long running deadline tasks.

> > > > > > > We should definitely account for the running_bw here, not the dl avg...

> > > > > > >

> > > > > > > I was trying to address the issue of RT stealing time from CFS here, but

> > > > > > > the DL integration isn't quite right with this patch as-is, I agree ...

> > > > > > >

> > > > > > > >

> > > > > > > > > + util >>= SCHED_CAPACITY_SHIFT;

> > > > > > > > > + util = arch_scale_cpu_capacity(NULL, cpu) - util;

> > > > > > > > > + util += sg_cpu->util_cfs;

> > > > > > > > > + dl_bw = (rq->dl.this_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;

> > > > > > > >

> > > > > > > > Why this_bw instead of running_bw?

> > > > > > >

> > > > > > > So IIUC, this_bw should basically give you the absolute reservation (== the

> > > > > > > sum of runtime/deadline ratios of all DL tasks on that rq).

> > > > > >

> > > > > > Yep.

> > > > > >

> > > > > > > The reason I added this max is because I'm still not sure to understand

> > > > > > > how we can safely drop the freq below that point ? If we don't guarantee

> > > > > > > to always stay at least at the freq required by DL, aren't we risking to

> > > > > > > start a deadline tasks stuck at a low freq because of rate limiting ? In

> > > > > > > this case, if that tasks uses all of its runtime then you might start

> > > > > > > missing deadlines ...

> > > > > >

> > > > > > We decided to avoid (software) rate limiting for DL with e97a90f7069b

> > > > > > ("sched/cpufreq: Rate limits for SCHED_DEADLINE").

> > > > >

> > > > > Right, I spotted that one, but yeah you could also be limited by HW ...

> > > > >

> > > > > >

> > > > > > > My feeling is that the only safe thing to do is to guarantee to never go

> > > > > > > below the freq required by DL, and to optimistically add CFS tasks

> > > > > > > without raising the OPP if we have good reasons to think that DL is

> > > > > > > using less than it required (which is what we should get by using

> > > > > > > running_bw above I suppose). Does that make any sense ?

> > > > > >

> > > > > > Then we can't still avoid the hardware limits, so using running_bw is a

> > > > > > trade off between safety (especially considering soft real-time

> > > > > > scenarios) and energy consumption (which seems to be working in

> > > > > > practice).

> > > > >

> > > > > Ok, I see ... Have you guys already tried something like my patch above

> > > > > (keeping the freq >= this_bw) in real world use cases ? Is this costing

> > > > > that much energy in practice ? If we fill the gaps left by DL (when it

> > > >

> > > > IIRC, Claudio (now Cc-ed) did experiment a bit with both approaches, so

> > > > he might add some numbers to my words above. I didn't (yet). But, please

> > > > consider that I might be reserving (for example) 50% of bandwidth for my

> > > > heavy and time sensitive task and then have that task wake up only once

> > > > in a while (but I'll be keeping clock speed up for the whole time). :/

> > >

> > > As far as I can remember, we never tested energy consumption of running_bw

> > > vs this_bw, as at OSPM'17 we had already decided to use running_bw

> > > implementing GRUB-PA.

> > > The rationale is that, as Juri pointed out, the amount of spare (i.e.

> > > reclaimable) bandwidth in this_bw is very user-dependent. For example,

> > > the user can let this_bw be much higher than the measured bandwidth, just

> > > to be sure that the deadlines are met even in corner cases.

> >

> > Ok I see the issue. Trusting userspace isn't necessarily the right thing

> > to do, I totally agree with that.

> >

> > > In practice, this means that the task executes for quite a short time and

> > > then blocks (with its bandwidth reclaimed, hence the CPU frequency reduced,

> > > at the 0lag time).

> > > Using this_bw rather than running_bw, the CPU frequency would remain at

> > > the same fixed value even when the task is blocked.

> > > I understand that on some cases it could even be better (i.e. no waste

> > > of energy in frequency switch).

> >

> > +1, I'm pretty sure using this_bw is pretty much always worst than

> > using running_bw from an energy standpoint,. The waste of energy in

> > frequency changes should be less than the energy wasted by staying at a

> > too high frequency for a long time, otherwise DVFS isn't a good idea to

> > begin with :-)

> >

> > > However, IMHO, these are corner cases and in the average case it is better

> > > to rely on running_bw and reduce the CPU frequency accordingly.

> >

> > My point was that accepting to go at a lower frequency than required by

> > this_bw is fundamentally unsafe. If you're at a low frequency when a DL

> > task starts, there are real situations where you won't be able to

> > increase the frequency immediately, which can eventually lead to missing

> > deadlines.

> 

> 

> I see. Unfortunately, I'm having quite crazy days so I couldn't follow

> the original discussion on LKML properly. My fault.


No problem!

> Anyway, to answer your question (if this time I have understood it correctly).

> 

> You're right: the tests have shown that whenever the DL task period

> gets comparable with the time for switching frequency, the amount of

> missed deadlines becomes not negligible.

> To give you a rough idea, this already happens with periods of 10msec

> on a Odroid XU4.

> The reason is that the task instance starts at a too low frequency,

> and the system can't switch frequency in time for meeting the

> deadline.


Ok that's very interesting ...

> 

> This is a known issue, partially discussed during the RT Summit'17.

> However, the community has been more in favour of reducing the energy

> consumption than meeting firm deadlines.

> If you need a safe system, in fact, you'd better thinking about

> disabling DVFS completely and relying on a fixed CPU frequency.


Yeah, understood. My proposition was sort of a middle-ground solution. I
was basically suggesting to select OPPs with something like:

        max(this_bw, running_bw + cfs_util)

The idea is that you're guaranteed to always request a high enough
frequency for this_bw, and you can opportunistically try to run CFS tasks
without raising the OPP if running_bw is low. In the worst case, the CFS
tasks will be preempted and delayed a bit. But DL should always be safe.
And if the CFS activity is sufficient to fill the gap between running_bw
and this_bw, then you should be pretty energy efficient as well.
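
An illustrative version of that aggregation (this is not the diff from
earlier in the thread; only the bandwidth-to-capacity conversion is the
same one used there):

static unsigned long dl_cfs_util_sketch(struct rq *rq, unsigned long util_cfs,
					unsigned long max_cap)
{
	unsigned long this_bw    = (rq->dl.this_bw    * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
	unsigned long running_bw = (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;

	/* never request less than the DL reservation, fill the rest with CFS */
	return min(max(this_bw, running_bw + util_cfs), max_cap);
}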

Now, that doesn't always solve the issue you mentioned earlier. If there
isn't much CFS traffic going on, and if the delta between this_bw and
running_bw is high, then you'll be stuck at a high freq although the
system utilization is low ... As usual, it's just a trade-off :-)

> 

> A possible trade-off could be a further entry in sys to let system

> designers switching from (default) running_bw to (more pessimistic)

> this_bw.

> However, I'm not sure the community wants a further knob on sysfs just

> to make RT people happier :)

> 

> Best,

> 

>               Claudio
luca abeni June 6, 2018, 8:53 p.m. UTC | #30
Hi all,

sorry; I missed the beginning of this thread... Anyway, below I add
some comments:

On Wed, 6 Jun 2018 15:05:58 +0200
Claudio Scordino <claudio@evidence.eu.com> wrote:
[...]
> >> Ok, I see ... Have you guys already tried something like my patch

> >> above (keeping the freq >= this_bw) in real world use cases ? Is

> >> this costing that much energy in practice ? If we fill the gaps

> >> left by DL (when it  

> > 

> > IIRC, Claudio (now Cc-ed) did experiment a bit with both

> > approaches, so he might add some numbers to my words above. I

> > didn't (yet). But, please consider that I might be reserving (for

> > example) 50% of bandwidth for my heavy and time sensitive task and

> > then have that task wake up only once in a while (but I'll be

> > keeping clock speed up for the whole time). :/  

> 

> As far as I can remember, we never tested energy consumption of

> running_bw vs this_bw, as at OSPM'17 we had already decided to use

> running_bw implementing GRUB-PA. The rationale is that, as Juri

> pointed out, the amount of spare (i.e. reclaimable) bandwidth in

> this_bw is very user-dependent.


Yes, I agree with this. The appropriateness of using this_bw or
running_bw is highly workload-dependent... If a periodic task consumes
much less than its runtime (or if a sporadic task has inter-activation
times much larger than the SCHED_DEADLINE period), then running_bw has
to be preferred. But if a periodic task consumes almost all of its
runtime before blocking, then this_bw has to be preferred...

But this also depends on the hardware: if the frequency switch time is
small, then running_bw is more appropriate... On the other hand,
this_bw works much better if the frequency switch time is high.
(Talking about this, I remember Claudio measured frequency switch times
of almost 3ms... Is this really due to hardware issues? Or maybe
there is some software issue involved? 3ms looks like a lot of time...)

Anyway, this is why I proposed to use some kind of /sys/... file to
control the kind of deadline utilization used for frequency scaling: in
this way, the system designer / administrator, who hopefully has the
needed information about workload and hardware, can optimize the
frequency scaling behaviour by deciding if running_bw or this_bw will be
used.


			Luca

> For example, the user can let this_bw

> be much higher than the measured bandwidth, just to be sure that the

> deadlines are met even in corner cases. In practice, this means that

> the task executes for quite a short time and then blocks (with its

> bandwidth reclaimed, hence the CPU frequency reduced, at the 0lag

> time). Using this_bw rather than running_bw, the CPU frequency would

> remain at the same fixed value even when the task is blocked. I

> understand that on some cases it could even be better (i.e. no waste

> of energy in frequency switch). However, IMHO, these are corner cases

> and in the average case it is better to rely on running_bw and reduce

> the CPU frequency accordingly.

> 

> Best regards,

> 

>               Claudio
luca abeni June 6, 2018, 9:05 p.m. UTC | #31
Hi,

On Wed, 6 Jun 2018 14:20:46 +0100
Quentin Perret <quentin.perret@arm.com> wrote:
[...]
> > However, IMHO, these are corner cases and in the average case it is

> > better to rely on running_bw and reduce the CPU frequency

> > accordingly.  

> 

> My point was that accepting to go at a lower frequency than required

> by this_bw is fundamentally unsafe. If you're at a low frequency when

> a DL task starts, there are real situations where you won't be able

> to increase the frequency immediately, which can eventually lead to

> missing deadlines. Now, if this risk is known, has been discussed,

> and is accepted, that's fair enough. I'm just too late for the

> discussion :-)


Well, our conclusion was that this issue can be addressed when
designing the scheduling parameters:
- If we do not consider frequency scaling, a task can respect its
  deadlines if the SCHED_DEADLINE runtime is larger than the task's
  execution time and the SCHED_DEADLINE period is smaller than the
  task's period (and if some kind of "global" admission test is
  respected)
- Considering frequency scaling (and 0-time frequency switches), the
  SCHED_DEADLINE runtime must be larger than the task execution time at
  the highest frequency
- If the frequency switch time is larger than 0, then the
  SCHED_DEADLINE runtime must be larger than the task execution time
  (at the highest frequency) plus the frequency switch time

If this third condition is respected, I think that deadline misses can
be avoided even if running_bw is used (and the CPU takes a considerable
time to switch frequency). Of course, this requires an over-allocation
of runtime (and makes the global admission test more likely to
fail)...
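
As a purely illustrative example of that over-allocation: a task with a
2ms worst-case execution time at the highest OPP and a 10ms period, on a
platform where a frequency switch can take up to 3ms, would need a
SCHED_DEADLINE runtime of at least 2ms + 3ms = 5ms under the third
condition above, i.e. a 50% reservation for a task that typically needs
around 20% of the CPU.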



			Luca
Quentin Perret June 7, 2018, 8:25 a.m. UTC | #32
Hi Luca,

On Wednesday 06 Jun 2018 at 23:05:36 (+0200), luca abeni wrote:
> Hi,

> 

> On Wed, 6 Jun 2018 14:20:46 +0100

> Quentin Perret <quentin.perret@arm.com> wrote:

> [...]

> > > However, IMHO, these are corner cases and in the average case it is

> > > better to rely on running_bw and reduce the CPU frequency

> > > accordingly.  

> > 

> > My point was that accepting to go at a lower frequency than required

> > by this_bw is fundamentally unsafe. If you're at a low frequency when

> > a DL task starts, there are real situations where you won't be able

> > to increase the frequency immediately, which can eventually lead to

> > missing deadlines. Now, if this risk is known, has been discussed,

> > and is accepted, that's fair enough. I'm just too late for the

> > discussion :-)

> 

> Well, our conclusion was that this issue can be addressed when

> designing the scheduling parameters:

> - If we do not consider frequency scaling, a task can respect its

>   deadlines if the SCHED_DEADLINE runtime is larger than the task's

>   execution time and the SCHED_DEADLINE period is smaller than the

>   task's period (and if some kind of "global" admission test is

>   respected)

> - Considering frequency scaling (and 0-time frequency switches), the

>   SCHED_DEADLINE runtime must be larger than the task execution time at

>   the highest frequency

> - If the frequency switch time is larger than 0, then the

>   SCHED_DEADLINE runtime must be larger than the task execution time

>   (at the highest frequency) plus the frequency switch time

> 

> If this third condition is respected, I think that deadline misses can

> be avoided even if running_bw is used (and the CPU takes a considerable

> time to switch frequency). Of course, this requires an over-allocation

> of runtime (and the global admission test has more probabilities to

> fail)...


Ah, right, this third condition should definitely be a valid workaround
to the issue I mentioned earlier. And the runtime parameter is already
very much target-dependent I guess, so it should be fine to add another
target-specific component (the frequency-switching time) to the runtime
estimation.

And, if you really need to have tight runtimes to fit all of your tasks,
then you should just use a fixed frequency I guess ... At least the
current implementation gives a choice to the user between being
energy-efficient using sugov with over-allocated runtimes and having
tighter runtimes with the performance/powersave/userspace governor, so
that's all good :-)

Thank you very much,
Quentin