[v3,09/12] Revert "sched: Put rq's sched_avg under CONFIG_FAIR_GROUP_SCHED"

Message ID 1404144343-18720-10-git-send-email-vincent.guittot@linaro.org
State New

Commit Message

Vincent Guittot June 30, 2014, 4:05 p.m. UTC
This reverts commit f5f9739d7a0ccbdcf913a0b3604b134129d14f7e.

We are going to use runnable_avg_sum and runnable_avg_period in order to get
the utilization of the CPU. This statistic includes all tasks that run the CPU
and not only CFS tasks.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c  | 13 ++++++-------
 kernel/sched/sched.h |  4 ++--
 2 files changed, 8 insertions(+), 9 deletions(-)
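
As a rough illustration (not part of the patch itself), once rq->avg is
updated for all sched classes again, a CPU utilization figure could be
derived from it along these lines; the helper name and the percentage
scaling are made up for the example:

static inline unsigned long rq_utilization_pct(struct rq *rq)
{
        /*
         * runnable_avg_sum / runnable_avg_period is the fraction of time
         * the rq had work queued or running, across all classes once the
         * revert is applied.
         */
        if (!rq->avg.runnable_avg_period)
                return 0;

        return (rq->avg.runnable_avg_sum * 100) / rq->avg.runnable_avg_period;
}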

Comments

Peter Zijlstra July 10, 2014, 1:16 p.m. UTC | #1
On Mon, Jun 30, 2014 at 06:05:40PM +0200, Vincent Guittot wrote:
> This reverts commit f5f9739d7a0ccbdcf913a0b3604b134129d14f7e.
> 
> We are going to use runnable_avg_sum and runnable_avg_period in order to get
> the utilization of the CPU. This statistic includes all tasks that run the CPU
> and not only CFS tasks.

But this rq->avg is not the one that is migration aware, right? So why
use this?

We already compensate cpu_capacity for !fair tasks, so I don't see why
we can't use the migration aware one (and kill this one as Yuyang keeps
proposing) and compensate with the capacity factor.
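
For context, the compensation referred to here is, roughly, that the time
consumed by rt/deadline/irq work (tracked in rq->rt_avg in this era of the
code) scales the cpu's capacity down:

   cpu_capacity ~= capacity_orig * (period - rt_avg) / period

so a cpu that is busy with !fair work advertises less capacity to the
fair-class load balancer.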
Vincent Guittot July 11, 2014, 7:51 a.m. UTC | #2
On 10 July 2014 15:16, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, Jun 30, 2014 at 06:05:40PM +0200, Vincent Guittot wrote:
>> This reverts commit f5f9739d7a0ccbdcf913a0b3604b134129d14f7e.
>>
>> We are going to use runnable_avg_sum and runnable_avg_period in order to get
>> the utilization of the CPU. This statistic includes all tasks that run the CPU
>> and not only CFS tasks.
>
> But this rq->avg is not the one that is migration aware, right? So why
> use this?

Yes, it's not the one that is migration aware

>
> We already compensate cpu_capacity for !fair tasks, so I don't see why
> we can't use the migration aware one (and kill this one as Yuyang keeps
> proposing) and compensate with the capacity factor.

The first point is that cpu_capacity is compensated by both !fair_tasks
and frequency scaling, and we should not take frequency scaling into
account when detecting overload.

What we have now is the weighted load avg, which is the sum of the
weighted load of the entities on the run queue. This is not usable to
detect overload because of the weighting. An unweighted version of this
figure would be more useful, but it's not as accurate as the one I use,
IMHO. The example that was discussed during the review of the last
version has shown some limitations.

With the following schedule pattern from Morten's example

   | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms |
A:   run     rq     run  ----------- sleeping -------------  run
B:   rq      run    rq    run   ---- sleeping -------------  rq

The scheduler will see the following values:
Task A unweighted load value is 47%
Task B unweighted load is 60%
The maximum sum of unweighted load is 104%
rq->avg load is 60%

And the real CPU load is 50%
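
(For reference, the 50% follows from the pattern above if it repeats every
40ms, i.e. eight 5ms slots, the ninth slot shown being the start of the
next period:

   busy time = A running 2 slots + B running 2 slots = 4 * 5ms = 20ms
   real load = 20ms / 40ms = 50%)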

So we will reach opposite decisions depending on which value is used: the
rq->avg or the sum of unweighted load

The sum of unweighted load has the main advantage of immediately showing
what the relative impact of adding/removing a task will be. In the
example, we can see that removing task A or B will remove around half of
the CPU load, but it's not so good at giving the current utilization of
the CPU

Vincent
Peter Zijlstra July 11, 2014, 3:13 p.m. UTC | #3
On Fri, Jul 11, 2014 at 09:51:06AM +0200, Vincent Guittot wrote:
> On 10 July 2014 15:16, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Mon, Jun 30, 2014 at 06:05:40PM +0200, Vincent Guittot wrote:
> >> This reverts commit f5f9739d7a0ccbdcf913a0b3604b134129d14f7e.
> >>
> >> We are going to use runnable_avg_sum and runnable_avg_period in order to get
> >> the utilization of the CPU. This statistic includes all tasks that run the CPU
> >> and not only CFS tasks.
> >
> > But this rq->avg is not the one that is migration aware, right? So why
> > use this?
> 
> Yes, it's not the one that is migration aware
> 
> >
> > We already compensate cpu_capacity for !fair tasks, so I don't see why
> > we can't use the migration aware one (and kill this one as Yuyang keeps
> > proposing) and compensate with the capacity factor.
> 
> The 1st point is that cpu_capacity is compensated by both !fair_tasks
> and frequency scaling and we should not take into account frequency
> scaling for detecting overload

dvfs could help? Also we should not use arch_scale_freq_capacity() for
things like cpufreq-ondemand etc. Because for those the compute capacity
is still the max. We should only use it when we hard limit things.

> What we have now is the the weighted load avg that is the sum of the
> weight load of entities on the run queue. This is not usable to detect
> overload because of the weight. An unweighted version of this figure
> would be more usefull but it's not as accurate as the one I use IMHO.
> The example that has been discussed during the review of the last
> version has shown some limitations
> 
> With the following schedule pattern from Morten's example
> 
>    | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms |
> A:   run     rq     run  ----------- sleeping -------------  run
> B:   rq      run    rq    run   ---- sleeping -------------  rq
> 
> The scheduler will see the following values:
> Task A unweighted load value is 47%
> Task B unweight load is 60%
> The maximum Sum of unweighted load is 104%
> rq->avg load is 60%
> 
> And the real CPU load is 50%
> 
> So we will have opposite decision depending of the used values: the
> rq->avg or the Sum of unweighted load
> 
> The sum of unweighted load has the main advantage of showing
> immediately what will be the relative impact of adding/removing a
> task. In the example, we can see that removing task A or B will remove
> around half the CPU load but it's not so good for giving the current
> utilization of the CPU

In that same discussion ISTR a suggestion about adding avg_running time,
as opposed to the current avg_runnable. The sum of avg_running should be
much more accurate, and still react correctly to migrations.


Morten Rasmussen July 11, 2014, 4:13 p.m. UTC | #4
On Fri, Jul 11, 2014 at 08:51:06AM +0100, Vincent Guittot wrote:
> On 10 July 2014 15:16, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Mon, Jun 30, 2014 at 06:05:40PM +0200, Vincent Guittot wrote:
> >> This reverts commit f5f9739d7a0ccbdcf913a0b3604b134129d14f7e.
> >>
> >> We are going to use runnable_avg_sum and runnable_avg_period in order to get
> >> the utilization of the CPU. This statistic includes all tasks that run the CPU
> >> and not only CFS tasks.
> >
> > But this rq->avg is not the one that is migration aware, right? So why
> > use this?
> 
> Yes, it's not the one that is migration aware
> 
> >
> > We already compensate cpu_capacity for !fair tasks, so I don't see why
> > we can't use the migration aware one (and kill this one as Yuyang keeps
> > proposing) and compensate with the capacity factor.
> 
> The 1st point is that cpu_capacity is compensated by both !fair_tasks
> and frequency scaling and we should not take into account frequency
> scaling for detecting overload
> 
> What we have now is the the weighted load avg that is the sum of the
> weight load of entities on the run queue. This is not usable to detect
> overload because of the weight. An unweighted version of this figure
> would be more usefull but it's not as accurate as the one I use IMHO.

IMHO there is no perfect utilization metric, but I think it is
fundamentally wrong to use a metric that is migration unaware to make
migration decisions. I mentioned that during the last review as well. It
is like having a very fast controller with a really slow (large delay)
feedback loop. There is a high risk of getting an unstable balance when
your load-balance rate is faster than the feedback delay.

> The example that has been discussed during the review of the last
> version has shown some limitations
> 
> With the following schedule pattern from Morten's example
> 
>    | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms |
> A:   run     rq     run  ----------- sleeping -------------  run
> B:   rq      run    rq    run   ---- sleeping -------------  rq
> 
> The scheduler will see the following values:
> Task A unweighted load value is 47%
> Task B unweight load is 60%
> The maximum Sum of unweighted load is 104%
> rq->avg load is 60%
> 
> And the real CPU load is 50%
> 
> So we will have opposite decision depending of the used values: the
> rq->avg or the Sum of unweighted load
> 
> The sum of unweighted load has the main advantage of showing
> immediately what will be the relative impact of adding/removing a
> task. In the example, we can see that removing task A or B will remove
> around half the CPU load but it's not so good for giving the current
> utilization of the CPU

You forgot to mention the issues with rq->avg that were brought up last
time :-)

Here is a load-balancing example:

Tasks A, B, C, and D are all running/runnable constantly. To avoid
decimals we assume the sched tick to have a 9 ms period. We have four
cpus in a single sched_domain.

rq == rq->avg
uw == unweighted tracked load

cpu0:
    | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms |
 A:   run    rq     rq
 B:   rq     run    rq
 C:   rq     rq     run
 D:   rq     rq     rq     run    run    run    run    run    run
rq:  100%    100%   100%   100%   100%   100%   100%   100%   100%
uw:  400%    400%   400%   100%   100%   100%   100%   100%   100%

cpu1:
    | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms |
 A:                        run    rq     run    rq     run    rq
 B:                        rq     run    rq     run    rq     run
 C:
 D:
rq:    0%      0%     0%     0%     6%    12%    18%    23%    28%
uw:    0%      0%     0%   200%   200%   200%   200%   200%   200%

cpu2:
    | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms |
 A: 
 B:
 C:                        run    run    run    run    run    run
 D:
rq:    0%      0%     0%     0%     6%    12%    18%    23%    28%
uw:    0%      0%     0%   100%   100%   100%   100%   100%   100%

cpu3:
    | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms |
 A: 
 B:
 C:
 D:
rq:    0%      0%     0%     0%     0%     0%     0%     0%     0%
uw:    0%      0%     0%     0%     0%     0%     0%     0%     0%

A periodic load-balance occurs on cpu1 after 9 ms. cpu0 rq->avg
indicates overload. Consequently cpu1 pulls tasks A and B.

Shortly after (<1 ms) cpu2 does a periodic load-balance. cpu0 rq->avg
hasn't changed so cpu0 still appears overloaded. cpu2 pulls task C.

Shortly after (<1 ms) cpu3 does a periodic load-balance. cpu0 rq->avg
still indicates overload so cpu3 tries to pull tasks but fails since
there is only task D left.

9 ms later the sched tick causes periodic load-balances on all the cpus.
cpu0 rq->avg still indicates that it has the highest load since cpu1
rq->avg has not had time to indicate overload. Consequently cpu1, 2,
and 3 will try to pull from that and fail. The balance will only change
once cpu1 rq->avg has increased enough to indicate overload.

Unweighted load, on the other hand, reflects the load changes
instantaneously, so cpu3 would observe the overload of cpu1 immediately
and pull task A or B.

In this example using rq->avg leads to imbalance whereas unweighted load
would not. Correct me if I missed anything.
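
For reference, the slow ramp of the cpu1/cpu2 rq rows above follows from
the geometric decay used by the runnable average (half-life of 32ms),
assuming the average starts from zero when the tasks arrive:

   rq->avg(t) ~= 1 - (1/2)^(t/32ms)

   t = 3ms -> ~6%,  6ms -> ~12%,  9ms -> ~18%,  12ms -> ~23%,  15ms -> ~28%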

Coming back to the previous example. I'm not convinced that inflation of
the unweighted load sum when tasks overlap in time is a bad thing. I
have mentioned this before. The average cpu utilization over the 40ms
period is 50%. However the true compute capacity demand is 200% for the
first 15ms of the period, 100% for the next 5ms and 0% for the remaining
25ms. The cpu is actually overloaded for 15ms every 40ms. This fact is
factored into the unweighted load whereas rq->avg would give you the
same utilization no matter if the tasks are overlapped or not. Hence
unweighted load would give us an indication that the mix of tasks isn't
optimal even if the cpu has spare cycles.

If you don't care about overlap and latency, the unweighted sum of task
running time (that Peter has proposed a number of times) is a better
metric, IMHO, as long as the cpu isn't fully utilized.

Morten
Vincent Guittot July 11, 2014, 5:39 p.m. UTC | #5
On 11 July 2014 17:13, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Jul 11, 2014 at 09:51:06AM +0200, Vincent Guittot wrote:
>> On 10 July 2014 15:16, Peter Zijlstra <peterz@infradead.org> wrote:
>> > On Mon, Jun 30, 2014 at 06:05:40PM +0200, Vincent Guittot wrote:
>> >> This reverts commit f5f9739d7a0ccbdcf913a0b3604b134129d14f7e.
>> >>
>> >> We are going to use runnable_avg_sum and runnable_avg_period in order to get
>> >> the utilization of the CPU. This statistic includes all tasks that run the CPU
>> >> and not only CFS tasks.
>> >
>> > But this rq->avg is not the one that is migration aware, right? So why
>> > use this?
>>
>> Yes, it's not the one that is migration aware
>>
>> >
>> > We already compensate cpu_capacity for !fair tasks, so I don't see why
>> > we can't use the migration aware one (and kill this one as Yuyang keeps
>> > proposing) and compensate with the capacity factor.
>>
>> The 1st point is that cpu_capacity is compensated by both !fair_tasks
>> and frequency scaling and we should not take into account frequency
>> scaling for detecting overload
>
> dvfs could help? Also we should not use arch_scale_freq_capacity() for
> things like cpufreq-ondemand etc. Because for those the compute capacity
> is still the max. We should only use it when we hard limit things.

In my mind, arch_scale_cpu_freq was intended to scale the capacity of
the CPU according to the current dvfs operating point.
As it's no longer used anywhere now that we have arch_scale_cpu, we
could probably remove it ... and see when it becomes used.

>
>> What we have now is the the weighted load avg that is the sum of the
>> weight load of entities on the run queue. This is not usable to detect
>> overload because of the weight. An unweighted version of this figure
>> would be more usefull but it's not as accurate as the one I use IMHO.
>> The example that has been discussed during the review of the last
>> version has shown some limitations
>>
>> With the following schedule pattern from Morten's example
>>
>>    | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms |
>> A:   run     rq     run  ----------- sleeping -------------  run
>> B:   rq      run    rq    run   ---- sleeping -------------  rq
>>
>> The scheduler will see the following values:
>> Task A unweighted load value is 47%
>> Task B unweight load is 60%
>> The maximum Sum of unweighted load is 104%
>> rq->avg load is 60%
>>
>> And the real CPU load is 50%
>>
>> So we will have opposite decision depending of the used values: the
>> rq->avg or the Sum of unweighted load
>>
>> The sum of unweighted load has the main advantage of showing
>> immediately what will be the relative impact of adding/removing a
>> task. In the example, we can see that removing task A or B will remove
>> around half the CPU load but it's not so good for giving the current
>> utilization of the CPU
>
> In that same discussion ISTR a suggestion about adding avg_running time,
> as opposed to the current avg_runnable. The sum of avg_running should be
> much more accurate, and still react correctly to migrations.

I haven't looked at the details, but I agree that avg_running would be
much more accurate than avg_runnable and should probably fit the
requirement. Does it mean that we could re-add the avg_running (or
something similar) that disappeared during the review of the load avg
tracking patchset?

Peter Zijlstra July 11, 2014, 8:12 p.m. UTC | #6
On Fri, Jul 11, 2014 at 07:39:29PM +0200, Vincent Guittot wrote:
> In my mind, arch_scale_cpu_freq was intend to scale the capacity of
> the CPU according to the current dvfs operating point.
> As it's no more use anywhere now that we have arch_scale_cpu, we could
> probably remove it .. and see when it will become used.

I probably should have written comments when I wrote that code, but it
was meant to be used only where, as described above, we limit things.
Ondemand and such, which will temporarily decrease freq, will ramp it up
again at demand, and therefore lowering the capacity will skew things.

You'll put less load on it because it's run slower, and then you'll run
it slower because there's less load on it -> cyclic FAIL.

> > In that same discussion ISTR a suggestion about adding avg_running time,
> > as opposed to the current avg_runnable. The sum of avg_running should be
> > much more accurate, and still react correctly to migrations.
> 
> I haven't look in details but I agree that avg_running would be much
> more accurate than avg_runnable and should probably fit the
> requirement. Does it means that we could re-add the avg_running (or
> something similar) that has disappeared during the review of load avg
> tracking patchset ?

Sure, I think we killed it there because there wasn't an actual use for
it, and I'm always in favour of stripping everything to its bare bones,
esp. big and complex things.

And then later, add things back once we have need for it.
Morten Rasmussen July 14, 2014, 12:55 p.m. UTC | #7
On Fri, Jul 11, 2014 at 09:12:38PM +0100, Peter Zijlstra wrote:
> On Fri, Jul 11, 2014 at 07:39:29PM +0200, Vincent Guittot wrote:
> > In my mind, arch_scale_cpu_freq was intend to scale the capacity of
> > the CPU according to the current dvfs operating point.
> > As it's no more use anywhere now that we have arch_scale_cpu, we could
> > probably remove it .. and see when it will become used.
> 
> I probably should have written comments when I wrote that code, but it
> was meant to be used only where, as described above, we limit things.
> Ondemand and such, which will temporarily decrease freq, will ramp it up
> again at demand, and therefore lowering the capacity will skew things.
> 
> You'll put less load on because its run slower, and then you'll run it
> slower because there's less load on -> cyclic FAIL.

Agreed. We can't use a frequency-scaled compute capacity for all
load-balancing decisions. However, IMHO, it would be useful to know the
current compute capacity in addition to the max compute capacity when
considering energy costs. So we would have something like:

* capacity_max: cpu capacity at highest frequency.

* capacity_cur: cpu capacity at current frequency.

* capacity_avail: cpu capacity currently available. Basically
  capacity_cur taking rt, deadline, and irq accounting into account.

capacity_max should probably include rt, deadline, and irq accounting as
well. Or do we need both?

Based on your description, arch_scale_freq_capacity() can't be abused to
implement capacity_cur (and capacity_avail) unless it is repurposed.
Nobody seems to implement it. Otherwise we would need something similar
to update capacity_cur (and capacity_avail).

As a side note, we can potentially get into a similar fail cycle already
due to the lack of scale invariance in the entity load tracking.

> 
> > > In that same discussion ISTR a suggestion about adding avg_running time,
> > > as opposed to the current avg_runnable. The sum of avg_running should be
> > > much more accurate, and still react correctly to migrations.
> > 
> > I haven't look in details but I agree that avg_running would be much
> > more accurate than avg_runnable and should probably fit the
> > requirement. Does it means that we could re-add the avg_running (or
> > something similar) that has disappeared during the review of load avg
> > tracking patchset ?
> 
> Sure, I think we killed it there because there wasn't an actual use for
> it and I'm always in favour of stripping everything to their bare bones,
> esp big and complex things.
> 
> And then later, add things back once we have need for it.

I think it is a useful addition to the set of utilization metrics. I
don't think it is universally more accurate than runnable_avg. Actually
quite the opposite when the cpu is overloaded. But for partially loaded
cpus it is very useful if you don't want to factor in waiting time on
the rq.

Morten
Peter Zijlstra July 14, 2014, 1:20 p.m. UTC | #8
On Mon, Jul 14, 2014 at 01:55:29PM +0100, Morten Rasmussen wrote:
> On Fri, Jul 11, 2014 at 09:12:38PM +0100, Peter Zijlstra wrote:
> > On Fri, Jul 11, 2014 at 07:39:29PM +0200, Vincent Guittot wrote:
> > > In my mind, arch_scale_cpu_freq was intend to scale the capacity of
> > > the CPU according to the current dvfs operating point.
> > > As it's no more use anywhere now that we have arch_scale_cpu, we could
> > > probably remove it .. and see when it will become used.
> > 
> > I probably should have written comments when I wrote that code, but it
> > was meant to be used only where, as described above, we limit things.
> > Ondemand and such, which will temporarily decrease freq, will ramp it up
> > again at demand, and therefore lowering the capacity will skew things.
> > 
> > You'll put less load on because its run slower, and then you'll run it
> > slower because there's less load on -> cyclic FAIL.
> 
> Agreed. We can't use a frequency scaled compute capacity for all
> load-balancing decisions. However, IMHO, it would be useful to have know
> the current compute capacity in addition to the max compute capacity
> when considering energy costs. So we would have something like:
> 
> * capacity_max: cpu capacity at highest frequency.
> 
> * capacity_cur: cpu capacity at current frequency.
> 
> * capacity_avail: cpu capacity currently available. Basically
>   capacity_cur taking rt, deadline, and irq accounting into account.
> 
> capacity_max should probably include rt, deadline, and irq accounting as
> well. Or we need both?

I'm struggling to fully grasp your intent. We need DVFS-like accounting
for sure, and that means a current-freq hook, but I'm not entirely sure
how that relates to capacity.

> Based on your description arch_scale_freq_capacity() can't be abused to
> implement capacity_cur (and capacity_avail) unless it is repurposed.
> Nobody seems to implement it. Otherwise we would need something similar
> to update capacity_cur (and capacity_avail).

Yeah, I never got around to doing so. I started doing an APERF/MPERF SMT
capacity thing for x86 but never finished that. The naive implementation
suffered the same FAIL loop as above because APERF stops on idle. So
when idle your capacity drops to nothing, leading to no new work,
leading to more idle etc.

I never got around to fixing that -- adding an idle filter, and ever
since things have somewhat bitrotted.

> As a side note, we can potentially get into a similar fail cycle already
> due to the lack of scale invariance in the entity load tracking.

Yah, I think that got mentioned a long while ago.

> > > > In that same discussion ISTR a suggestion about adding avg_running time,
> > > > as opposed to the current avg_runnable. The sum of avg_running should be
> > > > much more accurate, and still react correctly to migrations.
> > > 
> > > I haven't look in details but I agree that avg_running would be much
> > > more accurate than avg_runnable and should probably fit the
> > > requirement. Does it means that we could re-add the avg_running (or
> > > something similar) that has disappeared during the review of load avg
> > > tracking patchset ?
> > 
> > Sure, I think we killed it there because there wasn't an actual use for
> > it and I'm always in favour of stripping everything to their bare bones,
> > esp big and complex things.
> > 
> > And then later, add things back once we have need for it.
> 
> I think it is a useful addition to the set of utilization metrics. I
> don't think it is universally more accurate than runnable_avg. Actually
> quite the opposite when the cpu is overloaded. But for partially loaded
> cpus it is very useful if you don't want to factor in waiting time on
> the rq.

Well, different things, different names. Utilization as per the literature is
simply the fraction of CPU time actually used. In that sense running_avg
is about right for that. Our current runnable_avg is entirely different
(as I think we all agree by now).

But yes, for application the tipping point is u == 1, up until that
point pure utilization makes sense, after that our runnable_avg makes
more sense.
Morten Rasmussen July 14, 2014, 2:04 p.m. UTC | #9
On Mon, Jul 14, 2014 at 02:20:52PM +0100, Peter Zijlstra wrote:
> On Mon, Jul 14, 2014 at 01:55:29PM +0100, Morten Rasmussen wrote:
> > On Fri, Jul 11, 2014 at 09:12:38PM +0100, Peter Zijlstra wrote:
> > > On Fri, Jul 11, 2014 at 07:39:29PM +0200, Vincent Guittot wrote:
> > > > In my mind, arch_scale_cpu_freq was intend to scale the capacity of
> > > > the CPU according to the current dvfs operating point.
> > > > As it's no more use anywhere now that we have arch_scale_cpu, we could
> > > > probably remove it .. and see when it will become used.
> > > 
> > > I probably should have written comments when I wrote that code, but it
> > > was meant to be used only where, as described above, we limit things.
> > > Ondemand and such, which will temporarily decrease freq, will ramp it up
> > > again at demand, and therefore lowering the capacity will skew things.
> > > 
> > > You'll put less load on because its run slower, and then you'll run it
> > > slower because there's less load on -> cyclic FAIL.
> > 
> > Agreed. We can't use a frequency scaled compute capacity for all
> > load-balancing decisions. However, IMHO, it would be useful to have know
> > the current compute capacity in addition to the max compute capacity
> > when considering energy costs. So we would have something like:
> > 
> > * capacity_max: cpu capacity at highest frequency.
> > 
> > * capacity_cur: cpu capacity at current frequency.
> > 
> > * capacity_avail: cpu capacity currently available. Basically
> >   capacity_cur taking rt, deadline, and irq accounting into account.
> > 
> > capacity_max should probably include rt, deadline, and irq accounting as
> > well. Or we need both?
> 
> I'm struggling to fully grasp your intent. We need DVFS like accounting
> for sure, and that means a current freq hook, but I'm not entirely sure
> how that relates to capacity.

We can abstract all the factors that affect current compute capacity
(frequency, P-states, big.LITTLE, ...) in the scheduler by having
something like capacity_{cur,avail} to tell us how much capacity a
particular cpu has in its current state. Assuming that we implement
scale invariance for entity load tracking (we are working on that), we
can directly compare task utilization with compute capacity for
balancing decisions. For example, we can figure out how much spare
capacity a cpu has in its current state by simply:

spare_capacity(cpu) = capacity_avail(cpu) - \sum_{tasks(cpu)}^{t} util(t)

If you put more than spare_capacity(cpu) worth of task utilization on
the cpu, you will cause the cpu (and any affected cpus) to change
P-state and potentially be less energy-efficient.

Does that make any sense?

Instead of dealing with frequencies directly in the scheduler code, we
can abstract it by just having scalable compute capacity.
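
A minimal sketch of that check, assuming a capacity_avail() helper and a
per-cpu sum of task utilization existed (neither does today):

static unsigned long spare_capacity(int cpu)
{
        /* capacity at the current P-state, minus rt/dl/irq time */
        unsigned long cap  = capacity_avail(cpu);
        /* \sum_{tasks(cpu)} util(t), assumed scale invariant */
        unsigned long util = cpu_utilization_sum(cpu);

        return cap > util ? cap - util : 0;
}

Placing more than the returned amount of task utilization on the cpu would
then be expected to force a P-state change.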

> 
> > Based on your description arch_scale_freq_capacity() can't be abused to
> > implement capacity_cur (and capacity_avail) unless it is repurposed.
> > Nobody seems to implement it. Otherwise we would need something similar
> > to update capacity_cur (and capacity_avail).
> 
> Yeah, I never got around to doing so. I started doing a APERF/MPERF SMT
> capacity thing for x86 but never finished that. The naive implementation
> suffered the same FAIL loop as above because APERF stops on idle. So
> when idle your capacity drops to nothing, leading to no new work,
> leading to more idle etc.
> 
> I never got around to fixing that -- adding an idle filter, and ever
> since things have somewhat bitrotted.

I see.

> 
> > As a side note, we can potentially get into a similar fail cycle already
> > due to the lack of scale invariance in the entity load tracking.
> 
> Yah, I think that got mentioned a long while ago.

It did :-)

> 
> > > > > In that same discussion ISTR a suggestion about adding avg_running time,
> > > > > as opposed to the current avg_runnable. The sum of avg_running should be
> > > > > much more accurate, and still react correctly to migrations.
> > > > 
> > > > I haven't look in details but I agree that avg_running would be much
> > > > more accurate than avg_runnable and should probably fit the
> > > > requirement. Does it means that we could re-add the avg_running (or
> > > > something similar) that has disappeared during the review of load avg
> > > > tracking patchset ?
> > > 
> > > Sure, I think we killed it there because there wasn't an actual use for
> > > it and I'm always in favour of stripping everything to their bare bones,
> > > esp big and complex things.
> > > 
> > > And then later, add things back once we have need for it.
> > 
> > I think it is a useful addition to the set of utilization metrics. I
> > don't think it is universally more accurate than runnable_avg. Actually
> > quite the opposite when the cpu is overloaded. But for partially loaded
> > cpus it is very useful if you don't want to factor in waiting time on
> > the rq.
> 
> Well, different things different names. Utilization as per literature is
> simply the fraction of CPU time actually used. In that sense running_avg
> is about right for that. Our current runnable_avg is entirely different
> (as I think we all agree by now).
> 
> But yes, for application the tipping point is u == 1, up until that
> point pure utilization makes sense, after that our runnable_avg makes
> more sense.

Agreed.

If you really care about latency/performance you might be interested in
comparing running_avg and runnable_avg even for u < 1. If the
running_avg/runnable_avg ratio is significantly less than one, tasks are
waiting on the rq to be scheduled.
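
As a plain sketch of that ratio check (running_avg does not exist as a
tracked signal yet, and the 75% threshold below is arbitrary):

static inline bool cpu_sees_queueing(u32 running_avg, u32 runnable_avg)
{
        if (!runnable_avg)
                return false;

        /* a ratio well below 1 means tasks sit runnable without running */
        return (running_avg * 100) / runnable_avg < 75;
}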
Peter Zijlstra July 14, 2014, 4:22 p.m. UTC | #10
On Mon, Jul 14, 2014 at 03:04:35PM +0100, Morten Rasmussen wrote:
> > I'm struggling to fully grasp your intent. We need DVFS like accounting
> > for sure, and that means a current freq hook, but I'm not entirely sure
> > how that relates to capacity.
> 
> We can abstract all the factors that affect current compute capacity
> (frequency, P-states, big.LITTLE,...) in the scheduler by having
> something like capacity_{cur,avail} to tell us how much capacity does a
> particular cpu have in its current state. Assuming that implement scale
> invariance for entity load tracking (we are working on that), we can
> directly compare task utilization with compute capacity for balancing
> decisions. For example, we can figure out how much spare capacity a cpu
> has in its current state by simply:
> 
> spare_capacity(cpu) = capacity_avail(cpu) - \sum_{tasks(cpu)}^{t} util(t)
> 
> If you put more than spare_capacity(cpu) worth of task utilization on
> the cpu, you will cause the cpu (and any affected cpus) to change
> P-state and potentially be less energy-efficient.
> 
> Does that make any sense?
> 
> Instead of dealing with frequencies directly in the scheduler code, we
> can abstract it by just having scalable compute capacity.

Ah, ok. Same thing then.

> > But yes, for application the tipping point is u == 1, up until that
> > point pure utilization makes sense, after that our runnable_avg makes
> > more sense.
> 
> Agreed.
> 
> If you really care about latency/performance you might be interested in
> comparing running_avg and runnable_avg even for u < 1. If the
> running_avg/runnable_avg ratio is significantly less than one, tasks are
> waiting on the rq to be scheduled.

Indeed, that gives a measure of queueing.
Dietmar Eggemann July 14, 2014, 5:54 p.m. UTC | #11
[...]

>> In that same discussion ISTR a suggestion about adding avg_running time,
>> as opposed to the current avg_runnable. The sum of avg_running should be
>> much more accurate, and still react correctly to migrations.
> 
> I haven't look in details but I agree that avg_running would be much
> more accurate than avg_runnable and should probably fit the
> requirement. Does it means that we could re-add the avg_running (or
> something similar) that has disappeared during the review of load avg
> tracking patchset ?

Are you referring to '[RFC PATCH 14/14] sched: implement usage tracking'
https://lkml.org/lkml/2012/2/1/769 from Paul Turner?

__update_entity_runnable_avg() has an additional parameter 'running' so
that it can be called for

a) sched_entities in update_entity_load_avg():

  __update_entity_runnable_avg(..., se->on_rq, cfs_rq->curr == se))


b) rq's in update_rq_runnable_avg():

  __update_entity_runnable_avg(..., runnable, runnable);

I can see how it gives us two different signals for a sched_entity, but
for an rq?

Am I missing something here?

-- Dietmar

[...]

Vincent Guittot July 15, 2014, 9:20 a.m. UTC | #12
On 11 July 2014 22:12, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Jul 11, 2014 at 07:39:29PM +0200, Vincent Guittot wrote:
>> In my mind, arch_scale_cpu_freq was intend to scale the capacity of
>> the CPU according to the current dvfs operating point.
>> As it's no more use anywhere now that we have arch_scale_cpu, we could
>> probably remove it .. and see when it will become used.
>
> I probably should have written comments when I wrote that code, but it
> was meant to be used only where, as described above, we limit things.
> Ondemand and such, which will temporarily decrease freq, will ramp it up
> again at demand, and therefore lowering the capacity will skew things.
>
> You'll put less load on because its run slower, and then you'll run it
> slower because there's less load on -> cyclic FAIL.
>
>> > In that same discussion ISTR a suggestion about adding avg_running time,
>> > as opposed to the current avg_runnable. The sum of avg_running should be
>> > much more accurate, and still react correctly to migrations.
>>
>> I haven't look in details but I agree that avg_running would be much
>> more accurate than avg_runnable and should probably fit the
>> requirement. Does it means that we could re-add the avg_running (or
>> something similar) that has disappeared during the review of load avg
>> tracking patchset ?
>
> Sure, I think we killed it there because there wasn't an actual use for
> it and I'm always in favour of stripping everything to their bare bones,
> esp big and complex things.
>
> And then later, add things back once we have need for it.

OK, I'm going to look at how to add it back, taking into account
Yuyang's current rework of load_avg.

Vincent Guittot July 15, 2014, 9:27 a.m. UTC | #13
On 11 July 2014 18:13, Morten Rasmussen <morten.rasmussen@arm.com> wrote:

[snip]

>
> In this example using rq->avg leads to imbalance whereas unweighted load
> would not. Correct me if I missed anything.

You just missed taking into account how the imbalance is computed.

>
> Coming back to the previous example. I'm not convinced that inflation of
> the unweighted load sum when tasks overlap in time is a bad thing. I
> have mentioned this before. The average cpu utilization over the 40ms
> period is 50%. However the true compute capacity demand is 200% for the
> first 15ms of the period, 100% for the next 5ms and 0% for the remaining
> 25ms. The cpu is actually overloaded for 15ms every 40ms. This fact is
> factored into the unweighted load whereas rq->avg would give you the
> same utilization no matter if the tasks are overlapped or not. Hence
> unweighted load would give us an indication that the mix of tasks isn't
> optimal even if the cpu has spare cycles.
>
> If you don't care about overlap and latency, the unweighted sum of task
> running time (that Peter has proposed a number of times) is better
> metric, IMHO. As long the cpu isn't fully utilized.
>
> Morten
Morten Rasmussen July 15, 2014, 9:32 a.m. UTC | #14
On Tue, Jul 15, 2014 at 10:27:19AM +0100, Vincent Guittot wrote:
> On 11 July 2014 18:13, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> 
> [snip]
> 
> >
> > In this example using rq->avg leads to imbalance whereas unweighted load
> > would not. Correct me if I missed anything.
> 
> You just miss to take into account how the imbalance is computed

I don't think so. I'm aware that the imbalance is calculated based on
the runnable_load_avg of the cpus. But if you pick the wrong cpus to
compare to begin with, it doesn't matter.

Morten
Vincent Guittot July 15, 2014, 9:53 a.m. UTC | #15
On 15 July 2014 11:32, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> On Tue, Jul 15, 2014 at 10:27:19AM +0100, Vincent Guittot wrote:
>> On 11 July 2014 18:13, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
>>
>> [snip]
>>
>> >
>> > In this example using rq->avg leads to imbalance whereas unweighted load
>> > would not. Correct me if I missed anything.
>>
>> You just miss to take into account how the imbalance is computed
>
> I don't think so. I'm aware that the imbalance is calculated based on
> the runnable_load_avg of the cpus. But if you pick the wrong cpus to
> compare to begin with, it doesn't matter.

The average load of your sched_domain is 1 task per CPU, so CPU1 will
pull only 1 task and not 2.

>
> Morten
Yuyang Du July 18, 2014, 1:27 a.m. UTC | #16
On Mon, Jul 14, 2014 at 06:54:11PM +0100, Dietmar Eggemann wrote:
> __update_entity_runnable_avg() has an additional parameter 'running' so
> that it can be called for
> 
> a) sched_entities in update_entity_load_avg():
> 
>   __update_entity_runnable_avg(..., se->on_rq, cfs_rq->curr == se))
> 
> 
> b) rq's in update_rq_runnable_avg():
> 
>   __update_entity_runnable_avg(..., runnable, runnable);
> 
> I can see how it gives us two different signals for a sched_entity but
> for a rq?
> 
For an rq,

__update_entity_runnable_avg(..., runnable, runnable > 0)

Then the first one would effectively be CPU ConCurrency (fair and !fair),
and the second one would be whether the CPU has a task running (i.e.
roughly CPU utilization, for fair and !fair). :)
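
A sketch of what that rq call could look like if the extra 'running'
argument from the old usage-tracking RFC were brought back (the signature
is assumed here, not current code):

static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
{
        /* 3rd arg: anything queued   -> concurrency signal         */
        /* 4th arg: something running -> running/utilization signal */
        __update_entity_runnable_avg(rq_clock_task(rq), &rq->avg,
                                     runnable, runnable > 0);
        __update_tg_runnable_avg(&rq->avg, &rq->cfs);
}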

Thanks,
Yuyang

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 742ad88..bdc5cb9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2428,19 +2428,12 @@  static inline void __update_group_entity_contrib(struct sched_entity *se)
 		se->avg.load_avg_contrib >>= NICE_0_SHIFT;
 	}
 }
-
-static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
-{
-	__update_entity_runnable_avg(rq_clock_task(rq), &rq->avg, runnable);
-	__update_tg_runnable_avg(&rq->avg, &rq->cfs);
-}
 #else /* CONFIG_FAIR_GROUP_SCHED */
 static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
 						 int force_update) {}
 static inline void __update_tg_runnable_avg(struct sched_avg *sa,
 						  struct cfs_rq *cfs_rq) {}
 static inline void __update_group_entity_contrib(struct sched_entity *se) {}
-static inline void update_rq_runnable_avg(struct rq *rq, int runnable) {}
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
 static inline void __update_task_entity_contrib(struct sched_entity *se)
@@ -2539,6 +2532,12 @@  static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
 	__update_cfs_rq_tg_load_contrib(cfs_rq, force_update);
 }
 
+static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
+{
+	__update_entity_runnable_avg(rq_clock_task(rq), &rq->avg, runnable);
+	__update_tg_runnable_avg(&rq->avg, &rq->cfs);
+}
+
 /* Add the load generated by se into cfs_rq's child load-average */
 static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
 						  struct sched_entity *se,
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 7b311de..fb53166 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -542,8 +542,6 @@  struct rq {
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	/* list of leaf cfs_rq on this cpu: */
 	struct list_head leaf_cfs_rq_list;
-
-	struct sched_avg avg;
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
 	/*
@@ -634,6 +632,8 @@  struct rq {
 #ifdef CONFIG_SMP
 	struct llist_head wake_list;
 #endif
+
+	struct sched_avg avg;
 };
 
 static inline int cpu_of(struct rq *rq)