
[v2,08/11] sched: get CPU's activity statistic

Message ID 1400860385-14555-9-git-send-email-vincent.guittot@linaro.org
State New

Commit Message

Vincent Guittot May 23, 2014, 3:53 p.m. UTC
Monitor the activity level of each group of each sched_domain level. The
activity is the amount of cpu_power that is currently used on a CPU or group
of CPUs. We use the runnable_avg_sum and _period to evaluate this activity
level. In the special use case where the CPU is fully loaded by more than one
task, the activity level is set above the cpu_power in order to reflect the
overload of the CPU.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

Comments

Peter Zijlstra May 27, 2014, 5:32 p.m. UTC | #1
On Fri, May 23, 2014 at 05:53:02PM +0200, Vincent Guittot wrote:
> Monitor the activity level of each group of each sched_domain level. The
> activity is the amount of cpu_power that is currently used on a CPU or group
> of CPUs. We use the runnable_avg_sum and _period to evaluate this activity
> level. In the special use case where the CPU is fully loaded by more than 1
> task, the activity level is set above the cpu_power in order to reflect the
> overload of the CPU
> 

> +static int get_cpu_activity(int cpu)
> +{
> +	struct rq *rq = cpu_rq(cpu);
> +	u32 sum = rq->avg.runnable_avg_sum;
> +	u32 period = rq->avg.runnable_avg_period;
> +
> +	if (sum >= period)
> +		return power_orig_of(cpu) + rq->nr_running - 1;
> +
> +	return (sum * power_orig_of(cpu)) / period;
> +}

While I appreciate the need to signify the overload situation, I don't
think adding nr_running makes sense. The number of tasks has no bearing
on the amount of overload.

Also, and note I've not yet seen the use site, it strikes me as odd to
use the orig power here. I would've thought the current capacity (not
the max capacity) is relevant to balancing.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Vincent Guittot May 28, 2014, 7:01 a.m. UTC | #2
On 27 May 2014 19:32, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, May 23, 2014 at 05:53:02PM +0200, Vincent Guittot wrote:
>> Monitor the activity level of each group of each sched_domain level. The
>> activity is the amount of cpu_power that is currently used on a CPU or group
>> of CPUs. We use the runnable_avg_sum and _period to evaluate this activity
>> level. In the special use case where the CPU is fully loaded by more than 1
>> task, the activity level is set above the cpu_power in order to reflect the
>> overload of the CPU
>>
>
>> +static int get_cpu_activity(int cpu)
>> +{
>> +     struct rq *rq = cpu_rq(cpu);
>> +     u32 sum = rq->avg.runnable_avg_sum;
>> +     u32 period = rq->avg.runnable_avg_period;
>> +
>> +     if (sum >= period)
>> +             return power_orig_of(cpu) + rq->nr_running - 1;
>> +
>> +     return (sum * power_orig_of(cpu)) / period;
>> +}
>
> While I appreciate the need to signify the overload situation, I don't
> think adding nr_running makes sense. The amount of tasks has no bearing
> on the amount of overload.

I agree that it's not the best way to evaluate overload, but it's the
only simple one that is available without additional computation. I'm
going to try to find a better metric, or come back to the solution which
only adds +1 and computes the overload somewhere else.

>
> Also, and note I've not yet seen the use, it strikes me as odd to use
> the orig power here. I would've thought the current capacity (not the
> max capacity) is relevant to balancing.

The activity also measures non-cfs task activity, whereas the current
capacity removes the capacity used by non-cfs tasks. So as long as
arch_scale_cpu_freq is not used (which is the case for now), the
original capacity is OK, but we might need a 3rd metric so we would have:
- original capacity (the max capacity that a CPU can provide)
- usable capacity (a new value that reflects the capacity currently
  available for any kind of processing: rt tasks, irq, cfs tasks)
- current capacity (the capacity available for cfs tasks)

Vincent
Morten Rasmussen May 28, 2014, 12:10 p.m. UTC | #3
On Fri, May 23, 2014 at 04:53:02PM +0100, Vincent Guittot wrote:
> Monitor the activity level of each group of each sched_domain level. The
> activity is the amount of cpu_power that is currently used on a CPU or group
> of CPUs. We use the runnable_avg_sum and _period to evaluate this activity
> level. In the special use case where the CPU is fully loaded by more than 1
> task, the activity level is set above the cpu_power in order to reflect the
> overload of the CPU
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>  kernel/sched/fair.c | 22 ++++++++++++++++++++++
>  1 file changed, 22 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b7c51be..c01d8b6 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4044,6 +4044,11 @@ static unsigned long power_of(int cpu)
>  	return cpu_rq(cpu)->cpu_power;
>  }
>  
> +static unsigned long power_orig_of(int cpu)
> +{
> +	return cpu_rq(cpu)->cpu_power_orig;
> +}
> +
>  static unsigned long cpu_avg_load_per_task(int cpu)
>  {
>  	struct rq *rq = cpu_rq(cpu);
> @@ -4438,6 +4443,18 @@ done:
>  	return target;
>  }
>  
> +static int get_cpu_activity(int cpu)
> +{
> +	struct rq *rq = cpu_rq(cpu);
> +	u32 sum = rq->avg.runnable_avg_sum;
> +	u32 period = rq->avg.runnable_avg_period;
> +
> +	if (sum >= period)
> +		return power_orig_of(cpu) + rq->nr_running - 1;
> +
> +	return (sum * power_orig_of(cpu)) / period;
> +}

The rq runnable_avg_{sum, period} give a very long term view of the cpu
utilization (I will use the term utilization instead of activity as I
think that is what we are talking about here). IMHO, it is too slow to
be used as a basis for load-balancing decisions. I think that was also
agreed upon in the last discussion related to this topic [1].

The basic problem is that worst case: sum starting from 0 and period
already at LOAD_AVG_MAX = 47742, it takes LOAD_AVG_MAX_N = 345 periods
(ms) for sum to reach 47742. In other words, the cpu might have been
fully utilized for 345 ms before it is considered fully utilized.
Periodic load-balancing happens much more frequently than that.

Also, if load-balancing actually moves tasks around it may take quite a
while before runnable_avg_sum actually reflects this change. The next
periodic load-balance is likely to happen before runnable_avg_sum has
reflected the result of the previous periodic load-balance.

To avoid these problems, we need to base utilization on a metric which
is updated instantaneously when we add/remove tasks to a cpu (or at
least fast enough that we don't see the above problems). In the previous
discussion [1] it was suggested to use a sum of unweighted task
runnable_avg_{sum,period} ratios instead, that is, an unweighted
equivalent to weighted_cpuload(). That isn't a perfect solution either.
It is fine as long as the cpus are not fully utilized, but when they are
we need to use weighted_cpuload() to preserve smp_nice. What to do
around the tipping point needs more thought, but I think that is
currently the best proposal for a solution for task and cpu utilization.

rq runnable_avg_sum is useful for decisions where we need a longer-term
view of the cpu utilization, but I don't see how we can use it as a cpu
utilization metric for load-balancing decisions at wakeup or
periodically.

Morten

[1] https://lkml.org/lkml/2014/1/8/251
Vincent Guittot May 28, 2014, 1:15 p.m. UTC | #4
On 28 May 2014 14:10, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> On Fri, May 23, 2014 at 04:53:02PM +0100, Vincent Guittot wrote:
>> Monitor the activity level of each group of each sched_domain level. The
>> activity is the amount of cpu_power that is currently used on a CPU or group
>> of CPUs. We use the runnable_avg_sum and _period to evaluate this activity
>> level. In the special use case where the CPU is fully loaded by more than 1
>> task, the activity level is set above the cpu_power in order to reflect the
>> overload of the CPU
>>
>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>> ---
>>  kernel/sched/fair.c | 22 ++++++++++++++++++++++
>>  1 file changed, 22 insertions(+)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index b7c51be..c01d8b6 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -4044,6 +4044,11 @@ static unsigned long power_of(int cpu)
>>       return cpu_rq(cpu)->cpu_power;
>>  }
>>
>> +static unsigned long power_orig_of(int cpu)
>> +{
>> +     return cpu_rq(cpu)->cpu_power_orig;
>> +}
>> +
>>  static unsigned long cpu_avg_load_per_task(int cpu)
>>  {
>>       struct rq *rq = cpu_rq(cpu);
>> @@ -4438,6 +4443,18 @@ done:
>>       return target;
>>  }
>>
>> +static int get_cpu_activity(int cpu)
>> +{
>> +     struct rq *rq = cpu_rq(cpu);
>> +     u32 sum = rq->avg.runnable_avg_sum;
>> +     u32 period = rq->avg.runnable_avg_period;
>> +
>> +     if (sum >= period)
>> +             return power_orig_of(cpu) + rq->nr_running - 1;
>> +
>> +     return (sum * power_orig_of(cpu)) / period;
>> +}
>
> The rq runnable_avg_{sum, period} give a very long term view of the cpu
> utilization (I will use the term utilization instead of activity as I
> think that is what we are talking about here). IMHO, it is too slow to
> be used as basis for load balancing decisions. I think that was also
> agreed upon in the last discussion related to this topic [1].
>
> The basic problem is that worst case: sum starting from 0 and period
> already at LOAD_AVG_MAX = 47742, it takes LOAD_AVG_MAX_N = 345 periods
> (ms) for sum to reach 47742. In other words, the cpu might have been
> fully utilized for 345 ms before it is considered fully utilized.
> Periodic load-balancing happens much more frequently than that.

I agree that it's not really responsive, but several statistics of the
scheduler use the same kind of metrics and have the same kind of
responsiveness.
I agree that it's not enough, and that's why I'm not using only this
metric, but it gives information that the unweighted load_avg_contrib
(that you are speaking about below) can't give. So I would be less
categorical than you and would say that we probably need additional
metrics.

>
> Also, if load-balancing actually moves tasks around it may take quite a
> while before runnable_avg_sum actually reflects this change. The next
> periodic load-balance is likely to happen before runnable_avg_sum has
> reflected the result of the previous periodic load-balance.

runnable_avg_sum uses a 1 ms unit step, so I tend to disagree with your
point above.

>
> To avoid these problems, we need to base utilization on a metric which
> is updated instantaneously when we add/remove tasks to a cpu (or a least
> fast enough that we don't see the above problems). In the previous
> discussion [1] it was suggested that a sum of unweighted task
> runnable_avg_{sum,period} ratio instead. That is, an unweighted
> equivalent to weighted_cpuload(). That isn't a perfect solution either.

Regarding the unweighted load_avg_contrib, you will have a similar issue
because of the slowness in the variation of each sched_entity load
that will be added to/removed from the unweighted load_avg_contrib.

The update of the runnable_avg_{sum,period} of a sched_entity is
quite similar to the cpu utilization. This value is linked to the CPU on
which the task has run previously because of the time sharing with other
tasks, so the unweighted load of a freshly migrated task will reflect
its load on the previous CPU (including the time sharing with other
tasks on the previous CPU).

I'm not saying that such a metric is useless, but it's not perfect either.

Vincent

> It is fine as long as the cpus are not fully utilized, but when they are
> we need to use weighted_cpuload() to preserve smp_nice. What to do
> around the tipping point needs more thought, but I think that is
> currently the best proposal for a solution for task and cpu utilization.
>
> rq runnable_avg_sum is useful for decisions where we need a longer term
> view of the cpu utilization, but I don't see how we can use as cpu
> utilization metric for load-balancing decisions at wakeup or
> periodically.
>
> Morten
>
> [1] https://lkml.org/lkml/2014/1/8/251
Peter Zijlstra May 28, 2014, 3:17 p.m. UTC | #5
On Wed, May 28, 2014 at 01:10:01PM +0100, Morten Rasmussen wrote:
> The basic problem is that worst case: sum starting from 0 and period
> already at LOAD_AVG_MAX = 47742, it takes LOAD_AVG_MAX_N = 345 periods
> (ms) for sum to reach 47742. 

But it takes only 4 half-lives (4*32 ms) to get ~94% of the way there.

Our half-time is 32 * 1024 us (~32 ms); now look at \Sum_{i=1..4}
1/(2^i) = 0.9375. So 4 half-lives (4*32*1024 us = 131072 us, ~131 ms)
get us ~94% of the way there.

Now, granted, 4 half-lives of 32-ish ms each is still very long, but
nowhere near as horrid as the full convergence.
Morten Rasmussen May 28, 2014, 3:47 p.m. UTC | #6
On Wed, May 28, 2014 at 02:15:03PM +0100, Vincent Guittot wrote:
> On 28 May 2014 14:10, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> > On Fri, May 23, 2014 at 04:53:02PM +0100, Vincent Guittot wrote:
> >> Monitor the activity level of each group of each sched_domain level. The
> >> activity is the amount of cpu_power that is currently used on a CPU or group
> >> of CPUs. We use the runnable_avg_sum and _period to evaluate this activity
> >> level. In the special use case where the CPU is fully loaded by more than 1
> >> task, the activity level is set above the cpu_power in order to reflect the
> >> overload of the CPU
> >>
> >> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> >> ---
> >>  kernel/sched/fair.c | 22 ++++++++++++++++++++++
> >>  1 file changed, 22 insertions(+)
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index b7c51be..c01d8b6 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -4044,6 +4044,11 @@ static unsigned long power_of(int cpu)
> >>       return cpu_rq(cpu)->cpu_power;
> >>  }
> >>
> >> +static unsigned long power_orig_of(int cpu)
> >> +{
> >> +     return cpu_rq(cpu)->cpu_power_orig;
> >> +}
> >> +
> >>  static unsigned long cpu_avg_load_per_task(int cpu)
> >>  {
> >>       struct rq *rq = cpu_rq(cpu);
> >> @@ -4438,6 +4443,18 @@ done:
> >>       return target;
> >>  }
> >>
> >> +static int get_cpu_activity(int cpu)
> >> +{
> >> +     struct rq *rq = cpu_rq(cpu);
> >> +     u32 sum = rq->avg.runnable_avg_sum;
> >> +     u32 period = rq->avg.runnable_avg_period;
> >> +
> >> +     if (sum >= period)
> >> +             return power_orig_of(cpu) + rq->nr_running - 1;
> >> +
> >> +     return (sum * power_orig_of(cpu)) / period;
> >> +}
> >
> > The rq runnable_avg_{sum, period} give a very long term view of the cpu
> > utilization (I will use the term utilization instead of activity as I
> > think that is what we are talking about here). IMHO, it is too slow to
> > be used as basis for load balancing decisions. I think that was also
> > agreed upon in the last discussion related to this topic [1].
> >
> > The basic problem is that worst case: sum starting from 0 and period
> > already at LOAD_AVG_MAX = 47742, it takes LOAD_AVG_MAX_N = 345 periods
> > (ms) for sum to reach 47742. In other words, the cpu might have been
> > fully utilized for 345 ms before it is considered fully utilized.
> > Periodic load-balancing happens much more frequently than that.
> 
> I agree that it's not really responsive but several statistics of the
> scheduler use the same kind of metrics and have the same kind of
> responsiveness.

I might be wrong, but I don't think we use anything similar to this to
estimate cpu load/utilization for load-balancing purposes except for
{source, target}_load() where it is used to bias the decisions whether
or not to balance if the difference is small. That is what the
discussion was about last time.

> I agree that it's not enough and that's why i'm not using only this
> metric but it gives information that the unweighted load_avg_contrib
> (that you are speaking about below) can't give. So i would be less
> contrasted than you and would say that we probably need additional
> metrics

I'm not saying that we shouldn't use this metric at all, I'm just saying
that I don't think it is suitable for estimating the short-term cpu
utilization, which is what you need to make load-balancing decisions. We
can't observe the effect of recent load-balancing decisions if the
metric is too slow to react.

I realize that what I mean by 'slow' might be unclear. Load tracking
(both task and rq) takes a certain amount of history into account in
runnable_avg_{sum, period}. This amount is determined by the 'y'-weight,
which has been chosen such that we consider the load in the past 345
time units, where the time unit is ~1 ms. The contribution is smaller
the further you go back due to y^n, which diminishes to 0 for n > 345.
So, if a task or cpu goes from having been idle for >345 ms to being
constantly busy, it will take 345 ms until the entire history that we
care about reflects this change. Only then will runnable_avg_sum reach
47742. The rate of change is faster to begin with since the weight of
the most recent history is higher; runnable_avg_sum will get to 47742/2
in just 32 ms.

Since we may do periodic load-balance every 10 ms or so, we will perform
a number of load-balances where runnable_avg_sum mostly reflects the
state of the world before a change (a new task queued, or a task moved
to a different cpu). If you have two tasks running continuously on one
cpu while the other cpu is idle, and you move one of the tasks to the
other cpu, runnable_avg_sum will remain unchanged, 47742, on the first
cpu while it starts from 0 on the other one. 10 ms later it will have
increased a bit, 32 ms later it will be 47742/2, and 345 ms later it
reaches 47742. In the meantime the cpu doesn't appear fully utilized and
we might decide to put more tasks on it, because we don't know if
runnable_avg_sum represents a partially utilized cpu (for example a 50%
task) or if it will continue to rise and eventually get to 47742.

IMO, we need cpu utilization to clearly represent the current
utilization of the cpu such that any changes since the last wakeup or
load-balance are clearly visible.

> 
> >
> > Also, if load-balancing actually moves tasks around it may take quite a
> > while before runnable_avg_sum actually reflects this change. The next
> > periodic load-balance is likely to happen before runnable_avg_sum has
> > reflected the result of the previous periodic load-balance.
> 
> runnable_avg_sum uses a 1ms unit step so i tend to disagree with your
> point above

See the explanation above. The important thing is how much history we
take into account. That is 345x 1 ms time units. The rate at which the
sum is updated doesn't change anything: 1 ms after a change (wakeup,
load-balance, ...) runnable_avg_sum can only have changed by 1024. The
remaining ~98% of your weighted history still reflects the world before
the change.

> > To avoid these problems, we need to base utilization on a metric which
> > is updated instantaneously when we add/remove tasks to a cpu (or a least
> > fast enough that we don't see the above problems). In the previous
> > discussion [1] it was suggested that a sum of unweighted task
> > runnable_avg_{sum,period} ratio instead. That is, an unweighted
> > equivalent to weighted_cpuload(). That isn't a perfect solution either.
> 
> Regarding the unweighted load_avg_contrib, you will have similar issue
> because of the slowness in the variation of each sched_entity load
> that will be added/removed in the unweighted load_avg_contrib.
> 
> The update of the runnable_avg_{sum,period}  of an sched_entity is
> quite similar to cpu utilization.

Yes, runnable_avg_{sum, period} for tasks and rqs are exactly the same.
No difference there :)

> This value is linked to the CPU on
> which it has run previously because of the time sharing with others
> tasks, so the unweighted load of a freshly migrated task will reflect
> its load on the previous CPU (with the time sharing with other tasks
> on prev CPU).

I agree that the task runnable_avg_sum is always affected by the
circumstances on the cpu where it is running, and that it takes this
history with it. However, I think cfs.runnable_load_avg leads to fewer
problems than using the rq runnable_avg_sum. It would work nicely for
the two-tasks-on-two-cpus example I mentioned earlier. We don't need to
add something on top when the cpu is fully utilized by more than one
task; it comes more naturally with cfs.runnable_load_avg. If it is much
larger than 47742, it should be fairly safe to assume that you shouldn't
stick more tasks on that cpu.

> 
> I'm not saying that such metric is useless but it's not perfect as well.

It comes with its own set of problems, agreed. Based on my current
understanding (or lack thereof) they just seem smaller :)

Morten
Vincent Guittot May 28, 2014, 4:39 p.m. UTC | #7
On 28 May 2014 17:47, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> On Wed, May 28, 2014 at 02:15:03PM +0100, Vincent Guittot wrote:
>> On 28 May 2014 14:10, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
>> > On Fri, May 23, 2014 at 04:53:02PM +0100, Vincent Guittot wrote:

[snip]

>
>> This value is linked to the CPU on
>> which it has run previously because of the time sharing with others
>> tasks, so the unweighted load of a freshly migrated task will reflect
>> its load on the previous CPU (with the time sharing with other tasks
>> on prev CPU).
>
> I agree that the task runnable_avg_sum is always affected by the
> circumstances on the cpu where it is running, and that it takes this
> history with it. However, I think cfs.runnable_load_avg leads to less
> problems than using the rq runnable_avg_sum. It would work nicely for
> the two tasks on two cpus example I mentioned earlier. We don't need add

I would say that nr_running is an even better metric for such a
situation, as the load doesn't give any additional information.
Just to point out that we could spend a lot of time listing which use
cases are better covered by which metrics :-)

> something on top when the cpu is fully utilized by more than one task.
> It comes more naturally with cfs.runnable_load_avg. If it is much larger
> than 47742, it should be fairly safe to assume that you shouldn't stick
> more tasks on that cpu.
>
>>
>> I'm not saying that such metric is useless but it's not perfect as well.
>
> It comes with its own set of problems, agreed. Based on my current
> understanding (or lack thereof) they just seem smaller :)

I think it's worth using the cpu utilization for some cases because it
carries information that is not available elsewhere, and the replacement
of the current capacity computation is one example.
As explained previously, I'm not against adding other metrics, and I'm
not sure I understand why you oppose these two metrics when they could
be complementary.

Vincent

>
> Morten
Dietmar Eggemann May 30, 2014, 9:50 a.m. UTC | #8
On 23/05/14 16:53, Vincent Guittot wrote:
> Monitor the activity level of each group of each sched_domain level. The
> activity is the amount of cpu_power that is currently used on a CPU or group
> of CPUs. We use the runnable_avg_sum and _period to evaluate this activity
> level. In the special use case where the CPU is fully loaded by more than 1
> task, the activity level is set above the cpu_power in order to reflect the
> overload of the CPU
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

[...]

>  /*
>   * select_task_rq_fair: Select target runqueue for the waking task in domains
>   * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
> @@ -5518,6 +5535,7 @@ struct sg_lb_stats {
>  	unsigned long sum_weighted_load; /* Weighted load of group's tasks */
>  	unsigned long load_per_task;
>  	unsigned long group_power;
> +	unsigned long group_activity; /* Total activity of the group */
>  	unsigned int sum_nr_running; /* Nr tasks running in the group */
>  	unsigned int group_capacity;
>  	unsigned int idle_cpus;
> @@ -5538,6 +5556,7 @@ struct sd_lb_stats {
>  	struct sched_group *busiest;	/* Busiest group in this sd */
>  	struct sched_group *local;	/* Local group in this sd */
>  	unsigned long total_load;	/* Total load of all groups in sd */
> +	unsigned long total_activity;  /* Total activity of all groups in sd */
>  	unsigned long total_pwr;	/* Total power of all groups in sd */
>  	unsigned long avg_load;	/* Average load across all groups in sd */
>  
> @@ -5557,6 +5576,7 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
>  		.busiest = NULL,
>  		.local = NULL,
>  		.total_load = 0UL,
> +		.total_activity = 0UL,

AFAICS, total_activity is not used right now. Do you intend to use it to
calculate something like avg_activity later (like total_load/avg_load)?

>  		.total_pwr = 0UL,
>  		.busiest_stat = {
>  			.avg_load = 0UL,

[...]

Vincent Guittot May 30, 2014, 7:20 p.m. UTC | #9
On 30 May 2014 11:50, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> On 23/05/14 16:53, Vincent Guittot wrote:
>> Monitor the activity level of each group of each sched_domain level. The
>> activity is the amount of cpu_power that is currently used on a CPU or group
>> of CPUs. We use the runnable_avg_sum and _period to evaluate this activity
>> level. In the special use case where the CPU is fully loaded by more than 1
>> task, the activity level is set above the cpu_power in order to reflect the
>> overload of the CPU
>>
>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>
> [...]
>
>>  /*
>>   * select_task_rq_fair: Select target runqueue for the waking task in domains
>>   * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
>> @@ -5518,6 +5535,7 @@ struct sg_lb_stats {
>>       unsigned long sum_weighted_load; /* Weighted load of group's tasks */
>>       unsigned long load_per_task;
>>       unsigned long group_power;
>> +     unsigned long group_activity; /* Total activity of the group */
>>       unsigned int sum_nr_running; /* Nr tasks running in the group */
>>       unsigned int group_capacity;
>>       unsigned int idle_cpus;
>> @@ -5538,6 +5556,7 @@ struct sd_lb_stats {
>>       struct sched_group *busiest;    /* Busiest group in this sd */
>>       struct sched_group *local;      /* Local group in this sd */
>>       unsigned long total_load;       /* Total load of all groups in sd */
>> +     unsigned long total_activity;  /* Total activity of all groups in sd */
>>       unsigned long total_pwr;        /* Total power of all groups in sd */
>>       unsigned long avg_load; /* Average load across all groups in sd */
>>
>> @@ -5557,6 +5576,7 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
>>               .busiest = NULL,
>>               .local = NULL,
>>               .total_load = 0UL,
>> +             .total_activity = 0UL,
>
> AFAICS, total_activity is not used right now. Do you intend to use it to
> calculate something like avg_activity later (like total_load/avg_load)?

This patch only creates group_activity and total_activity. The use of
these statistics appears in the next patches. I have not planned to
compute such an avg_activity for the moment, mainly because I haven't
needed it yet.

Do you have in mind any use for such an avg_activity?

Vincent

>
>>               .total_pwr = 0UL,
>>               .busiest_stat = {
>>                       .avg_load = 0UL,
>
> [...]
>
Dietmar Eggemann June 1, 2014, 11:33 a.m. UTC | #10
On 30/05/14 20:20, Vincent Guittot wrote:
> On 30 May 2014 11:50, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>> On 23/05/14 16:53, Vincent Guittot wrote:
>>> Monitor the activity level of each group of each sched_domain level. The
>>> activity is the amount of cpu_power that is currently used on a CPU or group
>>> of CPUs. We use the runnable_avg_sum and _period to evaluate this activity
>>> level. In the special use case where the CPU is fully loaded by more than 1
>>> task, the activity level is set above the cpu_power in order to reflect the
>>> overload of the CPU
>>>
>>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>>
>> [...]
>>
>>>   /*
>>>    * select_task_rq_fair: Select target runqueue for the waking task in domains
>>>    * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
>>> @@ -5518,6 +5535,7 @@ struct sg_lb_stats {
>>>        unsigned long sum_weighted_load; /* Weighted load of group's tasks */
>>>        unsigned long load_per_task;
>>>        unsigned long group_power;
>>> +     unsigned long group_activity; /* Total activity of the group */
>>>        unsigned int sum_nr_running; /* Nr tasks running in the group */
>>>        unsigned int group_capacity;
>>>        unsigned int idle_cpus;
>>> @@ -5538,6 +5556,7 @@ struct sd_lb_stats {
>>>        struct sched_group *busiest;    /* Busiest group in this sd */
>>>        struct sched_group *local;      /* Local group in this sd */
>>>        unsigned long total_load;       /* Total load of all groups in sd */
>>> +     unsigned long total_activity;  /* Total activity of all groups in sd */
>>>        unsigned long total_pwr;        /* Total power of all groups in sd */
>>>        unsigned long avg_load; /* Average load across all groups in sd */
>>>
>>> @@ -5557,6 +5576,7 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
>>>                .busiest = NULL,
>>>                .local = NULL,
>>>                .total_load = 0UL,
>>> +             .total_activity = 0UL,
>>
>> AFAICS, total_activity is not used right now. Do you intend to use it to
>> calculate something like avg_activity later (like total_load/avg_load)?
>
> this patch only creates group_activity and total_activity.The use of
> these statistics appears on the next patches. I have not planned to
> compute such avg_activity for the moment mainly because i have need it
> yet.
>
> Do you have in mind any use of such avg_activity ?

No, not really, but I still don't see (taking all your patches into
account) where sds->total_activity is used.

$ find . -name "*.[ch]" | xargs grep total_activity
./kernel/sched/fair.c:	unsigned long total_activity;  /* Total activity of all groups in sd */
./kernel/sched/fair.c:		.total_activity = 0UL,
./kernel/sched/fair.c:		sds->total_activity += sgs->group_activity;

OTOH, sgs->group_activity is used to set sgs->group_capacity which is 
then used in if conditions in update_sd_pick_busiest(), 
update_sd_lb_stats() and find_busiest_group().

[...]


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Vincent Guittot June 2, 2014, 2:07 p.m. UTC | #11
On 1 June 2014 13:33, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> On 30/05/14 20:20, Vincent Guittot wrote:
>>
>> On 30 May 2014 11:50, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>>
>>> On 23/05/14 16:53, Vincent Guittot wrote:
>>>>
>>>> Monitor the activity level of each group of each sched_domain level. The
>>>> activity is the amount of cpu_power that is currently used on a CPU or
>>>> group
>>>> of CPUs. We use the runnable_avg_sum and _period to evaluate this
>>>> activity
>>>> level. In the special use case where the CPU is fully loaded by more
>>>> than 1
>>>> task, the activity level is set above the cpu_power in order to reflect
>>>> the
>>>> overload of the CPU
>>>>
>>>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>>>
>>>
>>> [...]
>>>
>>>>   /*
>>>>    * select_task_rq_fair: Select target runqueue for the waking task in
>>>> domains
>>>>    * that have the 'sd_flag' flag set. In practice, this is
>>>> SD_BALANCE_WAKE,
>>>> @@ -5518,6 +5535,7 @@ struct sg_lb_stats {
>>>>        unsigned long sum_weighted_load; /* Weighted load of group's
>>>> tasks */
>>>>        unsigned long load_per_task;
>>>>        unsigned long group_power;
>>>> +     unsigned long group_activity; /* Total activity of the group */
>>>>        unsigned int sum_nr_running; /* Nr tasks running in the group */
>>>>        unsigned int group_capacity;
>>>>        unsigned int idle_cpus;
>>>> @@ -5538,6 +5556,7 @@ struct sd_lb_stats {
>>>>        struct sched_group *busiest;    /* Busiest group in this sd */
>>>>        struct sched_group *local;      /* Local group in this sd */
>>>>        unsigned long total_load;       /* Total load of all groups in sd
>>>> */
>>>> +     unsigned long total_activity;  /* Total activity of all groups in
>>>> sd */
>>>>        unsigned long total_pwr;        /* Total power of all groups in
>>>> sd */
>>>>        unsigned long avg_load; /* Average load across all groups in sd
>>>> */
>>>>
>>>> @@ -5557,6 +5576,7 @@ static inline void init_sd_lb_stats(struct
>>>> sd_lb_stats *sds)
>>>>                .busiest = NULL,
>>>>                .local = NULL,
>>>>                .total_load = 0UL,
>>>> +             .total_activity = 0UL,
>>>
>>>
>>> AFAICS, total_activity is not used right now. Do you intend to use it to
>>> calculate something like avg_activity later (like total_load/avg_load)?
>>
>>
>> this patch only creates group_activity and total_activity. The use of
>> these statistics appears in the next patches. I have not planned to
>> compute such an avg_activity for the moment, mainly because I haven't
>> needed it yet.
>>
>> Do you have in mind any use of such avg_activity ?
>
>
> No, not really but I still don't see (taken all your patches) where
> sds->total_activity is used.
>
> $ find . -name "*.[ch]" | xargs grep total_activity
> ./kernel/sched/fair.c:  unsigned long total_activity;  /* Total activity of
> all groups in sd */
> ./kernel/sched/fair.c:          .total_activity = 0UL,
> ./kernel/sched/fair.c:          sds->total_activity += sgs->group_activity;
>
> OTOH, sgs->group_activity is used to set sgs->group_capacity which is then
> used in if conditions in update_sd_pick_busiest(), update_sd_lb_stats() and
> find_busiest_group().

Sorry, I misread your first email and hadn't understood that you were
specifically speaking about total_activity.
When I created the activity statistic, I added both the sg and sd
fields, but I finally only use the sg one. If I don't need the sd field
after updating my patches, I will remove it.

Thanks
Vincent

>
> [...]
>
>
Morten Rasmussen June 3, 2014, 12:03 p.m. UTC | #12
On Wed, May 28, 2014 at 05:39:10PM +0100, Vincent Guittot wrote:
> On 28 May 2014 17:47, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> > On Wed, May 28, 2014 at 02:15:03PM +0100, Vincent Guittot wrote:
> >> On 28 May 2014 14:10, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> >> > On Fri, May 23, 2014 at 04:53:02PM +0100, Vincent Guittot wrote:
> 
> [snip]
> 
> >
> >> This value is linked to the CPU on
> >> which it has run previously because of the time sharing with others
> >> tasks, so the unweighted load of a freshly migrated task will reflect
> >> its load on the previous CPU (with the time sharing with other tasks
> >> on prev CPU).
> >
> > I agree that the task runnable_avg_sum is always affected by the
> > circumstances on the cpu where it is running, and that it takes this
> > history with it. However, I think cfs.runnable_load_avg leads to less
> > problems than using the rq runnable_avg_sum. It would work nicely for
> > the two tasks on two cpus example I mentioned earlier. We don't need add
> 
> i would say that nr_running is an even better metrics for such
> situation as the load doesn't give any additional information.

I fail to understand how nr_running can be used. nr_running doesn't tell
you anything about the utilization of the cpu, just the number of tasks
that happen to be runnable at a point in time on a specific cpu. It
might be two small tasks that just happened to be running while you read
nr_running.

An unweighted version of cfs.runnable_load_avg gives you a metric that
captures cpu utilization to some extent, but not the number of tasks.
And it reflects task migrations immediately unlike the rq
runnable_avg_sum.

> Just to point that we can spent a lot of time listing which use case
> are better covered by which metrics :-)

Agreed, but I think it is quite important to discuss what we understand
by cpu utilization. It seems to be different depending on what you want
to use it for. I think it is also clear that none of the metrics that
have been proposed are perfect. We therefore have to be careful to only
use metrics in scenarios where they make sense. IMHO, both rq
runnable_avg_sum and unweighted cfs.runnable_load_avg capture cpu
utilization, but in different ways.

We have done experiments internally with rq runnable_avg_sum for
load-balancing decisions in the past and found it unsuitable due to its
slow response to task migrations. That is why I brought it up here.
AFAICT, you use rq runnable_avg_sum more like a flag than a quantitative
measure of cpu utilization. Viewing things from an energy-awareness
point of view I'm more interested in the latter for estimating the
implications of moving tasks around. I don't have any problems with
using rq runnable_avg_sum for other things as long as we are fully aware of
how this metric works.

> > something on top when the cpu is fully utilized by more than one task.
> > It comes more naturally with cfs.runnable_load_avg. If it is much larger
> > than 47742, it should be fairly safe to assume that you shouldn't stick
> > more tasks on that cpu.
> >
> >>
> >> I'm not saying that such metric is useless but it's not perfect as well.
> >
> > It comes with its own set of problems, agreed. Based on my current
> > understanding (or lack thereof) they just seem smaller :)
> 
> I think it's worth using the cpu utilization for some cases because it
> has got some information that are not available elsewhere. And the
> replacement of the current capacity computation is one example.
> As explained previously, I'm not against adding other metrics and i'm
> not sure to understand why you oppose these 2 metrics whereas they
> could be complementary

I think we more or less agree :) I'm fine with both metrics and I agree
that they complement each other. My concern is using the right metric
for the right job. If you choose to use rq runnable_avg_sum you have to
keep its slow reaction time in mind. I think that might be
difficult/not possible for some load-balancing decisions. That is
basically my point :)

Morten
Peter Zijlstra June 3, 2014, 3:40 p.m. UTC | #13
On Wed, May 28, 2014 at 01:10:01PM +0100, Morten Rasmussen wrote:
> The rq runnable_avg_{sum, period} give a very long term view of the cpu
> utilization (I will use the term utilization instead of activity as I
> think that is what we are talking about here). IMHO, it is too slow to
> be used as basis for load balancing decisions. I think that was also
> agreed upon in the last discussion related to this topic [1].
> 
> The basic problem is that worst case: sum starting from 0 and period
> already at LOAD_AVG_MAX = 47742, it takes LOAD_AVG_MAX_N = 345 periods
> (ms) for sum to reach 47742. In other words, the cpu might have been
> fully utilized for 345 ms before it is considered fully utilized.
> Periodic load-balancing happens much more frequently than that.

Like said earlier the 94% mark is actually hit much sooner, but yes,
likely still too slow.

50% at 32 ms, 75% at 64 ms, 87.5% at 96 ms, etc..

> Also, if load-balancing actually moves tasks around it may take quite a
> while before runnable_avg_sum actually reflects this change. The next
> periodic load-balance is likely to happen before runnable_avg_sum has
> reflected the result of the previous periodic load-balance.
> 
> To avoid these problems, we need to base utilization on a metric which
> is updated instantaneously when we add/remove tasks to a cpu (or a least
> fast enough that we don't see the above problems).

So the per-task-load-tracking stuff already does that. It updates the
per-cpu load metrics on migration. See {de,en}queue_entity_load_avg().

And keeping an unweighted per-cpu variant isn't that much more work.

> In the previous
> discussion [1] it was suggested that a sum of unweighted task
> runnable_avg_{sum,period} ratio instead. That is, an unweighted
> equivalent to weighted_cpuload(). That isn't a perfect solution either.
> It is fine as long as the cpus are not fully utilized, but when they are
> we need to use weighted_cpuload() to preserve smp_nice. What to do
> around the tipping point needs more thought, but I think that is
> currently the best proposal for a solution for task and cpu utilization.

I'm not too worried about the tipping point, per task runnable figures
of an overloaded cpu are higher, so migration between an overloaded cpu
and an underloaded cpu are going to be tricky no matter what we do.

> rq runnable_avg_sum is useful for decisions where we need a longer term
> view of the cpu utilization, but I don't see how we can use as cpu
> utilization metric for load-balancing decisions at wakeup or
> periodically.

So keeping one with a faster decay would add extra per-task storage. But
would be possible..
Peter Zijlstra June 3, 2014, 3:50 p.m. UTC | #14
On Wed, May 28, 2014 at 04:47:03PM +0100, Morten Rasmussen wrote:
> Since we may do periodic load-balance every 10 ms or so, we will perform
> a number of load-balances where runnable_avg_sum will mostly be
> reflecting the state of the world before a change (new task queued or
> moved a task to a different cpu). If you had have two tasks continuously
> on one cpu and your other cpu is idle, and you move one of the tasks to
> the other cpu, runnable_avg_sum will remain unchanged, 47742, on the
> first cpu while it starts from 0 on the other one. 10 ms later it will
> have increased a bit, 32 ms later it will be 47742/2, and 345 ms later
> it reaches 47742. In the mean time the cpu doesn't appear fully utilized
> and we might decide to put more tasks on it because we don't know if
> runnable_avg_sum represents a partially utilized cpu (for example a 50%
> task) or if it will continue to rise and eventually get to 47742.

Ah, no, since we track per task, and update the per-cpu ones when we
migrate tasks, the per-cpu values should be instantly updated.

If we were to increase per task storage, we might as well also track
running_avg not only runnable_avg.
Peter Zijlstra June 3, 2014, 3:59 p.m. UTC | #15
On Tue, Jun 03, 2014 at 01:03:54PM +0100, Morten Rasmussen wrote:
> On Wed, May 28, 2014 at 05:39:10PM +0100, Vincent Guittot wrote:
> > On 28 May 2014 17:47, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> > > On Wed, May 28, 2014 at 02:15:03PM +0100, Vincent Guittot wrote:
> > >> On 28 May 2014 14:10, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> > >> > On Fri, May 23, 2014 at 04:53:02PM +0100, Vincent Guittot wrote:

> > > I agree that the task runnable_avg_sum is always affected by the
> > > circumstances on the cpu where it is running, and that it takes this
> > > history with it. However, I think cfs.runnable_load_avg leads to less
> > > problems than using the rq runnable_avg_sum. It would work nicely for
> > > the two tasks on two cpus example I mentioned earlier. We don't need add
> > 
> > i would say that nr_running is an even better metrics for such
> > situation as the load doesn't give any additional information.
> 
> I fail to understand how nr_running can be used. nr_running doesn't tell
> you anything about the utilization of the cpu, just the number tasks
> that happen to be runnable at a point in time on a specific cpu. It
> might be two small tasks that just happened to be running while you read
> nr_running.

Agreed, I'm not at all seeing how nr_running is useful here.

> An unweighted version of cfs.runnable_load_avg gives you a metric that
> captures cpu utilization to some extend, but not the number of tasks.
> And it reflects task migrations immediately unlike the rq
> runnable_avg_sum.

So runnable_avg would be equal to the utilization as long as
there's idle time, as soon as we're over-loaded the metric shows how
much extra cpu is required.

That is, runnable_avg - running_avg >= 0 and the amount is the
exact amount of extra cpu required to make all tasks run but not have
idle time.

> Agreed, but I think it is quite important to discuss what we understand
> by cpu utilization. It seems to be different depending on what you want
> to use it for.

I understand utilization to be however much cpu is actually used, so I
would, per the existing naming, call running_avg to be the avg
utilization of a task/group/cpu whatever.

> We have done experiments internally with rq runnable_avg_sum for
> load-balancing decisions in the past and found it unsuitable due to its
> slow response to task migrations. That is why I brought it up here.

So I'm not entirely seeing that from the code (I've not traced this),
afaict we actually update the per-cpu values on migration based on the
task values.

old_rq->sum -= p->val;
new_rq->sum += p->val;

like,.. except of course totally obscured.
Morten Rasmussen June 3, 2014, 5:16 p.m. UTC | #16
On Tue, Jun 03, 2014 at 04:40:58PM +0100, Peter Zijlstra wrote:
> On Wed, May 28, 2014 at 01:10:01PM +0100, Morten Rasmussen wrote:
> > The rq runnable_avg_{sum, period} give a very long term view of the cpu
> > utilization (I will use the term utilization instead of activity as I
> > think that is what we are talking about here). IMHO, it is too slow to
> > be used as basis for load balancing decisions. I think that was also
> > agreed upon in the last discussion related to this topic [1].
> > 
> > The basic problem is that worst case: sum starting from 0 and period
> > already at LOAD_AVG_MAX = 47742, it takes LOAD_AVG_MAX_N = 345 periods
> > (ms) for sum to reach 47742. In other words, the cpu might have been
> > fully utilized for 345 ms before it is considered fully utilized.
> > Periodic load-balancing happens much more frequently than that.
> 
> Like said earlier the 94% mark is actually hit much sooner, but yes,
> likely still too slow.
> 
> 50% at 32 ms, 75% at 64 ms, 87.5% at 96 ms, etc..

Agreed.

> 
> > Also, if load-balancing actually moves tasks around it may take quite a
> > while before runnable_avg_sum actually reflects this change. The next
> > periodic load-balance is likely to happen before runnable_avg_sum has
> > reflected the result of the previous periodic load-balance.
> > 
> > To avoid these problems, we need to base utilization on a metric which
> > is updated instantaneously when we add/remove tasks to a cpu (or a least
> > fast enough that we don't see the above problems).
> 
> So the per-task-load-tracking stuff already does that. It updates the
> per-cpu load metrics on migration. See {de,en}queue_entity_load_avg().

I think there is some confusion here. There are two per-cpu load metrics
that tracks differently.

The cfs.runnable_load_avg is basically the sum of the load contributions
of the tasks on the cfs rq. The sum gets updated whenever tasks are
{en,de}queued by adding/subtracting the load contribution of the task
being added/removed. That is the one you are referring to.

The rq runnable_avg_sum (actually rq->avg.runnable_avg_{sum, period}) is
tracking whether the cpu has something to do or not. It doesn't matter
how many tasks are runnable or what their load is. It is updated in
update_rq_runnable_avg(). It increases when rq->nr_running > 0 and
decays if not. It also takes time spent running rt tasks into account in
idle_{enter, exit}_fair(). So if you remove tasks from the rq, this
metric will start decaying and eventually reach 0, unlike
cfs.runnable_load_avg, where the task load contribution is subtracted
every time a task is removed. The rq runnable_avg_sum is the one being used in
this patch set.

Ben, pjt, please correct me if I'm wrong.
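To make the distinction concrete, here is a deliberately simplified
user-space model of the two signals. The names mirror the kernel's, but
this is not the actual fair.c code, and the decay constant is only an
approximation (y*1024 ~= 1002 for y^32 = 0.5):

```c
#include <assert.h>

struct cpu_stats {
	unsigned long runnable_load_avg;  /* sum of task load contributions */
	unsigned long runnable_avg_sum;   /* geometric "cpu had work" series */
};

/* cfs.runnable_load_avg changes instantly on {en,de}queue ... */
static void enqueue_task(struct cpu_stats *s, unsigned long contrib)
{
	s->runnable_load_avg += contrib;
}

static void dequeue_task(struct cpu_stats *s, unsigned long contrib)
{
	s->runnable_load_avg -= contrib;
}

/*
 * ... while rq->avg.runnable_avg_sum only moves as time passes:
 * one step per period, the old sum decayed by y, plus a full
 * contribution if the cpu had anything to run.
 */
static void update_rq_runnable(struct cpu_stats *s, int had_work)
{
	s->runnable_avg_sum = s->runnable_avg_sum * 1002 / 1024
			    + (had_work ? 1024 : 0);
}
```

In this model, migrating the last task away zeroes runnable_load_avg
immediately, while runnable_avg_sum still reads close to its old value
many periods later.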

> And keeping an unweighted per-cpu variant isn't that much more work.

Agreed.

> 
> > In the previous
> > discussion [1] it was suggested that a sum of unweighted task
> > runnable_avg_{sum,period} ratio instead. That is, an unweighted
> > equivalent to weighted_cpuload(). That isn't a perfect solution either.
> > It is fine as long as the cpus are not fully utilized, but when they are
> > we need to use weighted_cpuload() to preserve smp_nice. What to do
> > around the tipping point needs more thought, but I think that is
> > currently the best proposal for a solution for task and cpu utilization.
> 
> I'm not too worried about the tipping point, per task runnable figures
> of an overloaded cpu are higher, so migration between an overloaded cpu
> and an underloaded cpu are going to be tricky no matter what we do.

Yes, agreed. I just got the impression that you were concerned about
smp_nice last time we discussed this.

> > rq runnable_avg_sum is useful for decisions where we need a longer term
> > view of the cpu utilization, but I don't see how we can use as cpu
> > utilization metric for load-balancing decisions at wakeup or
> > periodically.
> 
> So keeping one with a faster decay would add extra per-task storage. But
> would be possible..

I have had that thought when we discussed potential replacements for
cpu_load[]. It will require some messing around with the nicely
optimized load tracking maths if we want to have load tracking with a
different y-coefficient.

Morten Rasmussen June 3, 2014, 5:20 p.m. UTC | #17
On Tue, Jun 03, 2014 at 04:50:07PM +0100, Peter Zijlstra wrote:
> On Wed, May 28, 2014 at 04:47:03PM +0100, Morten Rasmussen wrote:
> > Since we may do periodic load-balance every 10 ms or so, we will perform
> > a number of load-balances where runnable_avg_sum will mostly be
> > reflecting the state of the world before a change (new task queued or
> > moved a task to a different cpu). If you had have two tasks continuously
> > on one cpu and your other cpu is idle, and you move one of the tasks to
> > the other cpu, runnable_avg_sum will remain unchanged, 47742, on the
> > first cpu while it starts from 0 on the other one. 10 ms later it will
> > have increased a bit, 32 ms later it will be 47742/2, and 345 ms later
> > it reaches 47742. In the mean time the cpu doesn't appear fully utilized
> > and we might decide to put more tasks on it because we don't know if
> > runnable_avg_sum represents a partially utilized cpu (for example a 50%
> > task) or if it will continue to rise and eventually get to 47742.
> 
> Ah, no, since we track per task, and update the per-cpu ones when we
> migrate tasks, the per-cpu values should be instantly updated.

No, not for this per-cpu tracking metric :) For cfs.runnable_load_avg
you are right, but this patch set is using rq->avg.runnable_avg_sum
which is different. See my other reply.

> If we were to increase per task storage, we might as well also track
> running_avg not only runnable_avg.

That could probably make sense. We had that in pjt's first proposal.
Peter Zijlstra June 3, 2014, 5:37 p.m. UTC | #18
On Tue, Jun 03, 2014 at 06:16:28PM +0100, Morten Rasmussen wrote:
> > I'm not too worried about the tipping point, per task runnable figures
> > of an overloaded cpu are higher, so migration between an overloaded cpu
> > and an underloaded cpu are going to be tricky no matter what we do.
> 
> Yes, agreed. I just got the impression that you were concerned about
> smp_nice last time we discussed this.

Well, yes, we need to keep that working, but the exact details around the
tipping point are near impossible to get right, so I'm not too bothered
there.

> > > rq runnable_avg_sum is useful for decisions where we need a longer term
> > > view of the cpu utilization, but I don't see how we can use as cpu
> > > utilization metric for load-balancing decisions at wakeup or
> > > periodically.
> > 
> > So keeping one with a faster decay would add extra per-task storage. But
> > would be possible..
> 
> I have had that thought when we discussed potential replacements for
> cpu_load[]. It will require some messing around with the nicely
> optimized load tracking maths if we want to have load tracking with a
> different y-coefficient.

My initial thought was a y=0.5, which is really >>=1. But yes, if we
want something else that'll get messy real fast methinks.
Peter Zijlstra June 3, 2014, 5:39 p.m. UTC | #19
On Tue, Jun 03, 2014 at 06:16:28PM +0100, Morten Rasmussen wrote:
> > So the per-task-load-tracking stuff already does that. It updates the
> > per-cpu load metrics on migration. See {de,en}queue_entity_load_avg().
> 
> I think there is some confusion here. There are two per-cpu load metrics
> that tracks differently.
> 
> The cfs.runnable_load_avg is basically the sum of the load contributions
> of the tasks on the cfs rq. The sum gets updated whenever tasks are
> {en,de}queued by adding/subtracting the load contribution of the task
> being added/removed. That is the one you are referring to.
> 
> The rq runnable_avg_sum (actually rq->avg.runnable_avg_{sum, period}) is
> tracking whether the cpu has something to do or not. It doesn't matter
> many tasks are runnable or what their load is. It is updated in
> update_rq_runnable_avg(). It increases when rq->nr_running > 0 and
> decays if not. It also takes time spent running rt tasks into account in
> idle_{enter, exit}_fair(). So if you remove tasks from the rq, this
> metric will start decaying and eventually get to 0, unlike the
> cfs.runnable_load_avg where the task load contribution subtracted every
> time a task is removed. The rq runnable_avg_sum is the one being used in
> this patch set.
> 
> Ben, pjt, please correct me if I'm wrong.

Argh, ok I completely missed that. I think the cfs.runnable_load_avg is
the sane number; I'm not entirely sure what rq->avg.runnable_avg is
good for, it seems a weird metric on first consideration.

Will have to ponder that a bit more.
Morten Rasmussen June 3, 2014, 5:41 p.m. UTC | #20
On Tue, Jun 03, 2014 at 04:59:39PM +0100, Peter Zijlstra wrote:
> On Tue, Jun 03, 2014 at 01:03:54PM +0100, Morten Rasmussen wrote:
> > An unweighted version of cfs.runnable_load_avg gives you a metric that
> > captures cpu utilization to some extend, but not the number of tasks.
> > And it reflects task migrations immediately unlike the rq
> > runnable_avg_sum.
> 
> So runnable_avg would be equal to the utilization as long as
> there's idle time, as soon as we're over-loaded the metric shows how
> much extra cpu is required.
> 
> That is, runnable_avg - running_avg >= 0 and the amount is the
> exact amount of extra cpu required to make all tasks run but not have
> idle time.

Yes, roughly. runnable_avg goes up quite steeply if you have many tasks
on a fully utilized cpu, so the actual amount of extra cpu required
might be somewhat lower. I can't come up with something better, so I
agree.

> 
> > Agreed, but I think it is quite important to discuss what we understand
> > by cpu utilization. It seems to be different depending on what you want
> > to use it for.
> 
> I understand utilization to be however much cpu is actually used, so I
> would, per the existing naming, call running_avg to be the avg
> utilization of a task/group/cpu whatever.

I see your point, but for load balancing purposes we are more interested
in the runnable_avg, as it tells us about the cpu capacity requirements.
I don't like to throw more terms into the mix, but you could call
runnable_avg the potential task/group/cpu utilization. This is an
estimate of how much utilization a task would cause if we moved it to an
idle cpu. That might be quite different from running_avg on an
over-utilized cpu.

> 
> > We have done experiments internally with rq runnable_avg_sum for
> > load-balancing decisions in the past and found it unsuitable due to its
> > slow response to task migrations. That is why I brought it up here.
> 
> So I'm not entirely seeing that from the code (I've not traced this),
> afaict we actually update the per-cpu values on migration based on the
> task values.
> 
> old_rq->sum -= p->val;
> new_rq->sum += p->val;
> 
> like,.. except of course totally obscured.

Yes, that holds for cfs.runnable_load_avg; rq->avg.runnable_avg_sum is
different. See the other reply.
Yuyang Du June 3, 2014, 11:11 p.m. UTC | #21
> > The basic problem is that worst case: sum starting from 0 and period
> > already at LOAD_AVG_MAX = 47742, it takes LOAD_AVG_MAX_N = 345 periods
> > (ms) for sum to reach 47742. In other words, the cpu might have been
> > fully utilized for 345 ms before it is considered fully utilized.
> > Periodic load-balancing happens much more frequently than that.
> 
> Like said earlier the 94% mark is actually hit much sooner, but yes,
> likely still too slow.
> 
> 50% at 32 ms, 75% at 64 ms, 87.5% at 96 ms, etc..
> 
> > In the previous
> > discussion [1] it was suggested that a sum of unweighted task
> > runnable_avg_{sum,period} ratio instead. That is, an unweighted
> > equivalent to weighted_cpuload(). That isn't a perfect solution either.
> > It is fine as long as the cpus are not fully utilized, but when they are
> > we need to use weighted_cpuload() to preserve smp_nice. What to do
> > around the tipping point needs more thought, but I think that is
> > currently the best proposal for a solution for task and cpu utilization.
> 
> I'm not too worried about the tipping point, per task runnable figures
> of an overloaded cpu are higher, so migration between an overloaded cpu
> and an underloaded cpu are going to be tricky no matter what we do.
> 
Hi,

Can I join this discussion late?

As I understand it, you are talking about a metric for cpu activity. The
issues with runnable_avg_sum are its sluggishness in reflecting the
latest changes, and the need for unweighted load averages.

You might be aware of my recent proposal on CPU ConCurrency (CC). It is
1) an average of nr_running, or 2) nr_running-weighted CPU utilization.
So it is a combination of CPU utilization and run queue length (both
factored in naturally). It meets the needs discussed here, I think. You
can take it as a candidate, or at least we can talk about it?
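To be concrete about reading 1), a rough sketch (the decay constant,
scale and function name here are placeholders for illustration, not the
actual CC implementation):

```c
#include <assert.h>

/*
 * CPU ConCurrency, reading 1): a decayed average of nr_running,
 * updated once per period.  DECAY_SHIFT = 1 (i.e. y = 0.5) is a
 * placeholder; a real implementation may use a different constant.
 * With a constant nr_running of n, cc converges toward n * CC_SCALE.
 */
#define CC_SCALE	1024
#define DECAY_SHIFT	1

static unsigned long cc_update(unsigned long cc, unsigned int nr_running)
{
	return (cc >> DECAY_SHIFT)
	     + (((unsigned long)nr_running * CC_SCALE) >> DECAY_SHIFT);
}
```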

Thanks,
Yuyang
Vincent Guittot June 4, 2014, 7:47 a.m. UTC | #22
On 3 June 2014 17:50, Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, May 28, 2014 at 04:47:03PM +0100, Morten Rasmussen wrote:
>> Since we may do periodic load-balance every 10 ms or so, we will perform
>> a number of load-balances where runnable_avg_sum will mostly be
>> reflecting the state of the world before a change (new task queued or
>> moved a task to a different cpu). If you had have two tasks continuously
>> on one cpu and your other cpu is idle, and you move one of the tasks to
>> the other cpu, runnable_avg_sum will remain unchanged, 47742, on the
>> first cpu while it starts from 0 on the other one. 10 ms later it will
>> have increased a bit, 32 ms later it will be 47742/2, and 345 ms later
>> it reaches 47742. In the mean time the cpu doesn't appear fully utilized
>> and we might decide to put more tasks on it because we don't know if
>> runnable_avg_sum represents a partially utilized cpu (for example a 50%
>> task) or if it will continue to rise and eventually get to 47742.
>
> Ah, no, since we track per task, and update the per-cpu ones when we
> migrate tasks, the per-cpu values should be instantly updated.
>
> If we were to increase per task storage, we might as well also track
> running_avg not only runnable_avg.

I agree that the removed running_avg should give more useful
information about the load of a CPU.

The main issue with running_avg is that it's disturbed by other tasks
(as pointed out previously). As a typical example, if we have 2 tasks
with a load of 25% on 1 CPU, the unweighted runnable_load_avg will be
in the range [50% - 100%] depending on how much the tasks' runtimes
overlap, whereas the reality is 50%, and the use of running_avg would
return this value.

Vincent
Peter Zijlstra June 4, 2014, 8:08 a.m. UTC | #23
On Wed, Jun 04, 2014 at 09:47:26AM +0200, Vincent Guittot wrote:
> On 3 June 2014 17:50, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Wed, May 28, 2014 at 04:47:03PM +0100, Morten Rasmussen wrote:
> >> Since we may do periodic load-balance every 10 ms or so, we will perform
> >> a number of load-balances where runnable_avg_sum will mostly be
> >> reflecting the state of the world before a change (new task queued or
> >> moved a task to a different cpu). If you had have two tasks continuously
> >> on one cpu and your other cpu is idle, and you move one of the tasks to
> >> the other cpu, runnable_avg_sum will remain unchanged, 47742, on the
> >> first cpu while it starts from 0 on the other one. 10 ms later it will
> >> have increased a bit, 32 ms later it will be 47742/2, and 345 ms later
> >> it reaches 47742. In the mean time the cpu doesn't appear fully utilized
> >> and we might decide to put more tasks on it because we don't know if
> >> runnable_avg_sum represents a partially utilized cpu (for example a 50%
> >> task) or if it will continue to rise and eventually get to 47742.
> >
> > Ah, no, since we track per task, and update the per-cpu ones when we
> > migrate tasks, the per-cpu values should be instantly updated.
> >
> > If we were to increase per task storage, we might as well also track
> > running_avg not only runnable_avg.
> 
> I agree that the removed running_avg should give more useful
> information about the the load of a CPU.
> 
> The main issue with running_avg is that it's disturbed by other tasks
> (as point out previously). As a typical example,  if we have 2 tasks
> with a load of 25% on 1 CPU, the unweighted runnable_load_avg will be
> in the range of [100% - 50%] depending of the parallelism of the
> runtime of the tasks whereas the reality is 50% and the use of
> running_avg will return this value

I'm not sure I see how 100% is possible, but yes I agree that runnable
can indeed be inflated due to this queueing effect.
Morten Rasmussen June 4, 2014, 8:55 a.m. UTC | #24
On Wed, Jun 04, 2014 at 09:08:09AM +0100, Peter Zijlstra wrote:
> On Wed, Jun 04, 2014 at 09:47:26AM +0200, Vincent Guittot wrote:
> > On 3 June 2014 17:50, Peter Zijlstra <peterz@infradead.org> wrote:
> > > On Wed, May 28, 2014 at 04:47:03PM +0100, Morten Rasmussen wrote:
> > >> Since we may do periodic load-balance every 10 ms or so, we will perform
> > >> a number of load-balances where runnable_avg_sum will mostly be
> > >> reflecting the state of the world before a change (new task queued or
> > >> moved a task to a different cpu). If you had have two tasks continuously
> > >> on one cpu and your other cpu is idle, and you move one of the tasks to
> > >> the other cpu, runnable_avg_sum will remain unchanged, 47742, on the
> > >> first cpu while it starts from 0 on the other one. 10 ms later it will
> > >> have increased a bit, 32 ms later it will be 47742/2, and 345 ms later
> > >> it reaches 47742. In the mean time the cpu doesn't appear fully utilized
> > >> and we might decide to put more tasks on it because we don't know if
> > >> runnable_avg_sum represents a partially utilized cpu (for example a 50%
> > >> task) or if it will continue to rise and eventually get to 47742.
> > >
> > > Ah, no, since we track per task, and update the per-cpu ones when we
> > > migrate tasks, the per-cpu values should be instantly updated.
> > >
> > > If we were to increase per task storage, we might as well also track
> > > running_avg not only runnable_avg.
> > 
> > I agree that the removed running_avg should give more useful
> > information about the the load of a CPU.
> > 
> > The main issue with running_avg is that it's disturbed by other tasks
> > (as point out previously). As a typical example,  if we have 2 tasks
> > with a load of 25% on 1 CPU, the unweighted runnable_load_avg will be
> > in the range of [100% - 50%] depending of the parallelism of the
> > runtime of the tasks whereas the reality is 50% and the use of
> > running_avg will return this value

Both running_avg and runnable_avg are affected by other tasks on the
same cpus, but in different ways. They are equal if you only have one
task on a cpu. If you have more, running_avg will give you the true
requirement of the tasks until the cpu is fully utilized, at which
point the task running_avg will drop if you add more tasks (the
unweighted sum of task running_avgs remains constant).

runnable_avg, on the other hand, might be affected as soon as you have
two tasks running on the same cpu if they are runnable at the same time.
That isn't necessarily a bad thing for load-balancing purposes, because
tasks that are runnable at the same time are likely to be run more
efficiently by placing them on different cpus. You might view it as a
sort of built-in concurrency factor, somewhat similar to what Yuyang is
proposing. runnable_avg increases rapidly when the cpu is over-utilized.

> I'm not sure I see how 100% is possible, but yes I agree that runnable
> can indeed be inflated due to this queueing effect.

You should only be able to get to 75% worst case for runnable_avg for
that example. The total running_avg is 50% no matter if the tasks
overlaps or not.

If you had five tasks on one cpu that each have a 25% requirement you
can get individual task runnable_avgs of up to 100% (cpu unweighted
runnable_load_avg can get up to 500%, I think), but the task
running_avgs would be 20% each (total of 100%).
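A rough numerical sketch of that five-task case (hypothetical 1 ms round-robin slices and ideal per-ms decay; the real scheduler slices differently): five tasks each wanting 25% of one cpu overcommit it, so every task is permanently runnable while running only a fifth of the time.

```python
# Five always-runnable tasks sharing one CPU in 1 ms round-robin slices.
Y = 0.5 ** (1.0 / 32)          # per-ms decay of the PELT sums
NTASKS, TOTAL_MS = 5, 12000

run_sum = [0.0] * NTASKS
rq_sum  = [0.0] * NTASKS
period  = 0.0
for ms in range(TOTAL_MS):
    current = ms % NTASKS                       # round-robin pick
    for t in range(NTASKS):
        run_sum[t] = run_sum[t] * Y + (t == current)
        rq_sum[t]  = rq_sum[t]  * Y + 1         # always runnable
    period = period * Y + 1

running_avg  = [s / period for s in run_sum]    # ~20% each, 100% summed
runnable_avg = [s / period for s in rq_sum]     # 100% each, 500% summed
```

The per-task running_avg wobbles a little with the sampling phase but stays near 20%, while every task's runnable_avg saturates at 100%, giving the 500% cpu sum mentioned above.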
Peter Zijlstra June 4, 2014, 9:23 a.m. UTC | #25
On Wed, Jun 04, 2014 at 09:55:42AM +0100, Morten Rasmussen wrote:
> Both running_avg and runnable_avg are affected by other tasks on the
> same cpus, but in different ways. They are equal if you only have one
> task on a cpu. If you have more, running_avg will give you the true
> requirement of the tasks until the cpu is fully utilized. At which point
> the task running_avg will drop if you add more tasks (the unweighted sum
> of task running_avgs remains constant).
> 
> runnable_avg on the other hand, might be affected as soon as you have
> two task running on the same cpu if they are runnable at the same time.
> That isn't necessarily a bad thing for load-balancing purposes, because
> tasks that are runnable at the same time are likely to be run more
> efficiently by placing them on different cpus. You might view as at sort
> of built in concurrency factor, somewhat similar to what Yuyang is
> proposing. runnable_avg increases rapidly when the cpu is over-utilized.

Agreed.

> > I'm not sure I see how 100% is possible, but yes I agree that runnable
> > can indeed be inflated due to this queueing effect.
> 
> You should only be able to get to 75% worst case for runnable_avg for
> that example. The total running_avg is 50% no matter if the tasks
> overlaps or not.

Yes, 75% is what I ended up with.

> f you had five tasks on one cpu that each have a 25% requirement you can
> get individual task runnable_avgs of up to 100% (cpu unweighted
> runnable_load_avg can get up 500%, I think), but the task running_avgs
> would be 20% each (total of 100%). 

Yeah, more or less so indeed. I had not considered the queueing effects
on runnable_avg yesterday, so good that that got raised.

That does indeed invalidate my: runnable - running := extra cpu required
thing. It ends up being the extra cpu required for 0 latency but gobs of
idle time, which is something else entirely.
Vincent Guittot June 4, 2014, 9:32 a.m. UTC | #26
On 4 June 2014 10:08, Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, Jun 04, 2014 at 09:47:26AM +0200, Vincent Guittot wrote:
>> On 3 June 2014 17:50, Peter Zijlstra <peterz@infradead.org> wrote:
>> > On Wed, May 28, 2014 at 04:47:03PM +0100, Morten Rasmussen wrote:
>> >> Since we may do periodic load-balance every 10 ms or so, we will perform
>> >> a number of load-balances where runnable_avg_sum will mostly be
>> >> reflecting the state of the world before a change (new task queued or
>> >> moved a task to a different cpu). If you had have two tasks continuously
>> >> on one cpu and your other cpu is idle, and you move one of the tasks to
>> >> the other cpu, runnable_avg_sum will remain unchanged, 47742, on the
>> >> first cpu while it starts from 0 on the other one. 10 ms later it will
>> >> have increased a bit, 32 ms later it will be 47742/2, and 345 ms later
>> >> it reaches 47742. In the mean time the cpu doesn't appear fully utilized
>> >> and we might decide to put more tasks on it because we don't know if
>> >> runnable_avg_sum represents a partially utilized cpu (for example a 50%
>> >> task) or if it will continue to rise and eventually get to 47742.
>> >
>> > Ah, no, since we track per task, and update the per-cpu ones when we
>> > migrate tasks, the per-cpu values should be instantly updated.
>> >
>> > If we were to increase per task storage, we might as well also track
>> > running_avg not only runnable_avg.
>>
>> I agree that the removed running_avg should give more useful
>> information about the the load of a CPU.
>>
>> The main issue with running_avg is that it's disturbed by other tasks
>> (as point out previously). As a typical example,  if we have 2 tasks
>> with a load of 25% on 1 CPU, the unweighted runnable_load_avg will be
>> in the range of [100% - 50%] depending of the parallelism of the
>> runtime of the tasks whereas the reality is 50% and the use of
>> running_avg will return this value
>
> I'm not sure I see how 100% is possible, but yes I agree that runnable
> can indeed be inflated due to this queueing effect.

In fact, it can be even worse than that, because I forgot to take into
account the geometric series effect, which implies that the result
depends on the runtime (and idle time) of the task.

Take 3 examples:

2 tasks that each need to run 10 ms simultaneously every 40 ms: if
they share the same CPU, they will be on the runqueue for 20 ms (in
fact a bit less for one of them). Their load
(runnable_avg_sum/runnable_avg_period) will be 33% each, so the
unweighted runnable_load_avg of the CPU will be 66%.

2 tasks that each need to run 25 ms simultaneously every 100 ms: if
they share the same CPU, they will be on the runqueue for 50 ms (in
fact a bit less for one of them). Their load
(runnable_avg_sum/runnable_avg_period) will be 74% each, so the
unweighted runnable_load_avg of the CPU will be 148%.

2 tasks that each need to run 50 ms simultaneously every 200 ms: if
they share the same CPU, they will be on the runqueue for 100 ms (in
fact a bit less for one of them). Their load
(runnable_avg_sum/runnable_avg_period) will be 89% each, so the
unweighted runnable_load_avg of the CPU will be 180%.
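A closed-form sketch of this geometric-series effect, for a task runnable r ms out of every 2r ms under an idealized per-ms decay (not the kernel's 1024-step fixed point, so the exact figures differ slightly): just after the busy phase, runnable_avg_sum/runnable_avg_period settles at 1/(1 + Y^r) once the period has saturated.

```python
Y = 0.5 ** (1.0 / 32)          # per-ms decay with a 32 ms half-life

def peak_ratio(r_ms):
    # Steady state: the busy phase accumulates (1 - Y**r)/(1 - Y), and the
    # 2r-ms cycle decay gives S = (1 - Y**r)/((1 - Y) * (1 - Y**(2*r)));
    # dividing by the saturated period 1/(1 - Y) leaves 1/(1 + Y**r).
    return 1.0 / (1.0 + Y ** r_ms)

for r in (20, 50, 100):
    print(f"runnable {r} ms of every {2 * r} ms -> peak {peak_ratio(r):.1%}")
# prints 60.7%, 74.7% and 89.7%
```

The two longer-running cases land close to the 74% and 89% figures quoted above; the point is that the same 50% duty cycle yields very different decayed averages depending on the absolute length of the busy period.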

Vincent
Vincent Guittot June 4, 2014, 9:35 a.m. UTC | #27
On 4 June 2014 11:23, Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, Jun 04, 2014 at 09:55:42AM +0100, Morten Rasmussen wrote:
>> Both running_avg and runnable_avg are affected by other tasks on the
>> same cpus, but in different ways. They are equal if you only have one
>> task on a cpu. If you have more, running_avg will give you the true
>> requirement of the tasks until the cpu is fully utilized. At which point
>> the task running_avg will drop if you add more tasks (the unweighted sum
>> of task running_avgs remains constant).
>>
>> runnable_avg on the other hand, might be affected as soon as you have
>> two task running on the same cpu if they are runnable at the same time.
>> That isn't necessarily a bad thing for load-balancing purposes, because
>> tasks that are runnable at the same time are likely to be run more
>> efficiently by placing them on different cpus. You might view as at sort
>> of built in concurrency factor, somewhat similar to what Yuyang is
>> proposing. runnable_avg increases rapidly when the cpu is over-utilized.
>
> Agreed.
>
>> > I'm not sure I see how 100% is possible, but yes I agree that runnable
>> > can indeed be inflated due to this queueing effect.
>>
>> You should only be able to get to 75% worst case for runnable_avg for
>> that example. The total running_avg is 50% no matter if the tasks
>> overlaps or not.
>
> Yes, 75% is what I ended up with.

Can you explain how you reach 75%? It depends on the runtime, and a
runtime longer than 345 ms will end up at a 100% load whatever the
idle time was previously.


>
>> f you had five tasks on one cpu that each have a 25% requirement you can
>> get individual task runnable_avgs of up to 100% (cpu unweighted
>> runnable_load_avg can get up 500%, I think), but the task running_avgs
>> would be 20% each (total of 100%).
>
> Yeah, more or less so indeed. I had not considered the queueing effects
> on runnable_avg yesterday, so good that that got raised.
>
> That does indeed invalidate my: runnable - running := extra cpu required
> thing. It ends up being the extra cpu required for 0 latency but gobs of
> idle time, which is something else entirely.
Morten Rasmussen June 4, 2014, 9:44 a.m. UTC | #28
On Wed, Jun 04, 2014 at 10:23:13AM +0100, Peter Zijlstra wrote:
> > f you had five tasks on one cpu that each have a 25% requirement you can
> > get individual task runnable_avgs of up to 100% (cpu unweighted
> > runnable_load_avg can get up 500%, I think), but the task running_avgs
> > would be 20% each (total of 100%). 
> 
> Yeah, more or less so indeed. I had not considered the queueing effects
> on runnable_avg yesterday, so good that that got raised.
> 
> That does indeed invalidate my: runnable - running := extra cpu required
> thing. It ends up being the extra cpu required for 0 latency but gobs of
> idle time, which is something else entirely.

Agreed, but I think it is still a useful estimate of the required
compute capacity. If there is a significant difference between runnable
and running on a cpu, the current mix of tasks is not good for latency.
However, we need to treat it as a worst case estimate and not necessarily
try to move exactly runnable-running worth of tasks to another cpu.

So far I haven't been able to come up with something better.
Morten Rasmussen June 4, 2014, 10 a.m. UTC | #29
On Wed, Jun 04, 2014 at 10:32:10AM +0100, Vincent Guittot wrote:
> On 4 June 2014 10:08, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Wed, Jun 04, 2014 at 09:47:26AM +0200, Vincent Guittot wrote:
> >> On 3 June 2014 17:50, Peter Zijlstra <peterz@infradead.org> wrote:
> >> > On Wed, May 28, 2014 at 04:47:03PM +0100, Morten Rasmussen wrote:
> >> >> Since we may do periodic load-balance every 10 ms or so, we will perform
> >> >> a number of load-balances where runnable_avg_sum will mostly be
> >> >> reflecting the state of the world before a change (new task queued or
> >> >> moved a task to a different cpu). If you had have two tasks continuously
> >> >> on one cpu and your other cpu is idle, and you move one of the tasks to
> >> >> the other cpu, runnable_avg_sum will remain unchanged, 47742, on the
> >> >> first cpu while it starts from 0 on the other one. 10 ms later it will
> >> >> have increased a bit, 32 ms later it will be 47742/2, and 345 ms later
> >> >> it reaches 47742. In the mean time the cpu doesn't appear fully utilized
> >> >> and we might decide to put more tasks on it because we don't know if
> >> >> runnable_avg_sum represents a partially utilized cpu (for example a 50%
> >> >> task) or if it will continue to rise and eventually get to 47742.
> >> >
> >> > Ah, no, since we track per task, and update the per-cpu ones when we
> >> > migrate tasks, the per-cpu values should be instantly updated.
> >> >
> >> > If we were to increase per task storage, we might as well also track
> >> > running_avg not only runnable_avg.
> >>
> >> I agree that the removed running_avg should give more useful
> >> information about the the load of a CPU.
> >>
> >> The main issue with running_avg is that it's disturbed by other tasks
> >> (as point out previously). As a typical example,  if we have 2 tasks
> >> with a load of 25% on 1 CPU, the unweighted runnable_load_avg will be
> >> in the range of [100% - 50%] depending of the parallelism of the
> >> runtime of the tasks whereas the reality is 50% and the use of
> >> running_avg will return this value
> >
> > I'm not sure I see how 100% is possible, but yes I agree that runnable
> > can indeed be inflated due to this queueing effect.
> 
> In fact, it can be even worse than that because i forgot to take into
> account the geometric series effect which implies that it depends of
> the runtime (idletime) of the task
> 
> Take 3 examples:
> 
> 2 tasks that need to run 10ms  simultaneously each 40ms. If they share
> the same CPU, they will be on the runqueue 20ms (in fact a bit less
> for one of them), Their load (runnable_avg_sum/runnable_avg_period)
> will be 33% each so the unweighted runnable_load_avg of the CPU will
> be 66%

Right, it actually depends on how often you switch between the tasks.
If your sched_tick happens every 10 ms, then in this example one task
will run for 10 ms and be done, while the other one waits for 10 ms and
then runs to completion in 10 ms. The result is that one task is
runnable for 10 ms and the other one is runnable for 20 ms. That gives
you 25% and 50%, for a total of 75%.

> 
> 2 tasks that need to run 25ms simultaneously each 100ms. If they share
> the same CPU, they will be on the runqueue 50ms (in fact a bit less
> for one of them), Their load (runnable_avg_sum/runnable_avg_period)
> will be 74% each so the unweighted runnable_load_avg of the CPU will
> be 148%
> 
> 2 tasks that need to run 50ms  simultaneously each 200ms. If they
> share the same CPU, they will be on the runqueue 100ms (in fact a bit
> less for one of them), Their load
> (runnable_avg_sum/runnable_avg_period) will be 89% each so the
> unweighted runnable_load_avg of the CPU will be 180%

In these cases you switch between the tasks before they complete, so
both tasks spend time waiting and runnable will get much higher than
75%. You are right. I was thinking about tasks where the busy period is
short enough that they run to completion when they are scheduled.
Peter Zijlstra June 4, 2014, 10:17 a.m. UTC | #30
On Wed, Jun 04, 2014 at 11:32:10AM +0200, Vincent Guittot wrote:
> On 4 June 2014 10:08, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Wed, Jun 04, 2014 at 09:47:26AM +0200, Vincent Guittot wrote:
> >> On 3 June 2014 17:50, Peter Zijlstra <peterz@infradead.org> wrote:
> >> > On Wed, May 28, 2014 at 04:47:03PM +0100, Morten Rasmussen wrote:
> >> >> Since we may do periodic load-balance every 10 ms or so, we will perform
> >> >> a number of load-balances where runnable_avg_sum will mostly be
> >> >> reflecting the state of the world before a change (new task queued or
> >> >> moved a task to a different cpu). If you had have two tasks continuously
> >> >> on one cpu and your other cpu is idle, and you move one of the tasks to
> >> >> the other cpu, runnable_avg_sum will remain unchanged, 47742, on the
> >> >> first cpu while it starts from 0 on the other one. 10 ms later it will
> >> >> have increased a bit, 32 ms later it will be 47742/2, and 345 ms later
> >> >> it reaches 47742. In the mean time the cpu doesn't appear fully utilized
> >> >> and we might decide to put more tasks on it because we don't know if
> >> >> runnable_avg_sum represents a partially utilized cpu (for example a 50%
> >> >> task) or if it will continue to rise and eventually get to 47742.
> >> >
> >> > Ah, no, since we track per task, and update the per-cpu ones when we
> >> > migrate tasks, the per-cpu values should be instantly updated.
> >> >
> >> > If we were to increase per task storage, we might as well also track
> >> > running_avg not only runnable_avg.
> >>
> >> I agree that the removed running_avg should give more useful
> >> information about the the load of a CPU.
> >>
> >> The main issue with running_avg is that it's disturbed by other tasks
> >> (as point out previously). As a typical example,  if we have 2 tasks
> >> with a load of 25% on 1 CPU, the unweighted runnable_load_avg will be
> >> in the range of [100% - 50%] depending of the parallelism of the
> >> runtime of the tasks whereas the reality is 50% and the use of
> >> running_avg will return this value
> >
> > I'm not sure I see how 100% is possible, but yes I agree that runnable
> > can indeed be inflated due to this queueing effect.
 
Let me explain the 75%. Take any one of the above scenarios, and let's
call the two tasks A and B; for a moment assume A always wins and runs
first, and then B.

So A will be runnable for 25%; B otoh will be runnable for the entire
time A is actually running plus its own running time, giving 50%.
Together that makes 75%.

If you release the assumption that A runs first, and instead assume
they equally often win the first execution, you get them averaging
37.5% each, which combined still gives 75%.

> In fact, it can be even worse than that because i forgot to take into
> account the geometric series effect which implies that it depends of
> the runtime (idletime) of the task
> 
> Take 3 examples:
> 
> 2 tasks that need to run 10ms  simultaneously each 40ms. If they share
> the same CPU, they will be on the runqueue 20ms (in fact a bit less
> for one of them), Their load (runnable_avg_sum/runnable_avg_period)
> will be 33% each so the unweighted runnable_load_avg of the CPU will
> be 66%
> 
> 2 tasks that need to run 25ms simultaneously each 100ms. If they share
> the same CPU, they will be on the runqueue 50ms (in fact a bit less
> for one of them), Their load (runnable_avg_sum/runnable_avg_period)
> will be 74% each so the unweighted runnable_load_avg of the CPU will
> be 148%
> 
> 2 tasks that need to run 50ms  simultaneously each 200ms. If they
> share the same CPU, they will be on the runqueue 100ms (in fact a bit
> less for one of them), Their load
> (runnable_avg_sum/runnable_avg_period) will be 89% each so the
> unweighted runnable_load_avg of the CPU will be 180%

And this is because the running time is 'large' compared to the decay
and we get hit by the weight of the recent state? Yes, I can see that,
the avg will fluctuate due to the nature of this thing.
Morten Rasmussen June 4, 2014, 10:25 a.m. UTC | #31
On Wed, Jun 04, 2014 at 10:35:15AM +0100, Vincent Guittot wrote:
> On 4 June 2014 11:23, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Wed, Jun 04, 2014 at 09:55:42AM +0100, Morten Rasmussen wrote:
> >> Both running_avg and runnable_avg are affected by other tasks on the
> >> same cpus, but in different ways. They are equal if you only have one
> >> task on a cpu. If you have more, running_avg will give you the true
> >> requirement of the tasks until the cpu is fully utilized. At which point
> >> the task running_avg will drop if you add more tasks (the unweighted sum
> >> of task running_avgs remains constant).
> >>
> >> runnable_avg on the other hand, might be affected as soon as you have
> >> two task running on the same cpu if they are runnable at the same time.
> >> That isn't necessarily a bad thing for load-balancing purposes, because
> >> tasks that are runnable at the same time are likely to be run more
> >> efficiently by placing them on different cpus. You might view as at sort
> >> of built in concurrency factor, somewhat similar to what Yuyang is
> >> proposing. runnable_avg increases rapidly when the cpu is over-utilized.
> >
> > Agreed.
> >
> >> > I'm not sure I see how 100% is possible, but yes I agree that runnable
> >> > can indeed be inflated due to this queueing effect.
> >>
> >> You should only be able to get to 75% worst case for runnable_avg for
> >> that example. The total running_avg is 50% no matter if the tasks
> >> overlaps or not.
> >
> > Yes, 75% is what I ended up with.
> 
> Can you explain how you reach 75% as it depends on the runtime and a
> runtime longer than 345ms will end to a 100% load whatever the
> idletime was previously ?

If the busy period of each task is long enough that the first one to
run runs to completion before the other task is scheduled, you get 25%
and 50%. But as I said in my other reply, you can get higher task
runnable if the tasks' busy period is long enough that you switch
between them before the first one completes.
Morten Rasmussen June 4, 2014, 10:36 a.m. UTC | #32
On Wed, Jun 04, 2014 at 11:17:24AM +0100, Peter Zijlstra wrote:
> On Wed, Jun 04, 2014 at 11:32:10AM +0200, Vincent Guittot wrote:
> > On 4 June 2014 10:08, Peter Zijlstra <peterz@infradead.org> wrote:
> > > On Wed, Jun 04, 2014 at 09:47:26AM +0200, Vincent Guittot wrote:
> > >> On 3 June 2014 17:50, Peter Zijlstra <peterz@infradead.org> wrote:
> > >> > On Wed, May 28, 2014 at 04:47:03PM +0100, Morten Rasmussen wrote:
> > >> >> Since we may do periodic load-balance every 10 ms or so, we will perform
> > >> >> a number of load-balances where runnable_avg_sum will mostly be
> > >> >> reflecting the state of the world before a change (new task queued or
> > >> >> moved a task to a different cpu). If you had have two tasks continuously
> > >> >> on one cpu and your other cpu is idle, and you move one of the tasks to
> > >> >> the other cpu, runnable_avg_sum will remain unchanged, 47742, on the
> > >> >> first cpu while it starts from 0 on the other one. 10 ms later it will
> > >> >> have increased a bit, 32 ms later it will be 47742/2, and 345 ms later
> > >> >> it reaches 47742. In the mean time the cpu doesn't appear fully utilized
> > >> >> and we might decide to put more tasks on it because we don't know if
> > >> >> runnable_avg_sum represents a partially utilized cpu (for example a 50%
> > >> >> task) or if it will continue to rise and eventually get to 47742.
> > >> >
> > >> > Ah, no, since we track per task, and update the per-cpu ones when we
> > >> > migrate tasks, the per-cpu values should be instantly updated.
> > >> >
> > >> > If we were to increase per task storage, we might as well also track
> > >> > running_avg not only runnable_avg.
> > >>
> > >> I agree that the removed running_avg should give more useful
> > >> information about the the load of a CPU.
> > >>
> > >> The main issue with running_avg is that it's disturbed by other tasks
> > >> (as point out previously). As a typical example,  if we have 2 tasks
> > >> with a load of 25% on 1 CPU, the unweighted runnable_load_avg will be
> > >> in the range of [100% - 50%] depending of the parallelism of the
> > >> runtime of the tasks whereas the reality is 50% and the use of
> > >> running_avg will return this value
> > >
> > > I'm not sure I see how 100% is possible, but yes I agree that runnable
> > > can indeed be inflated due to this queueing effect.
>  
> Let me explain the 75%, take any one of the above scenarios. Lets call
> the two tasks A and B, and let for a moment assume A always wins and
> runs first, and then B.
> 
> So A will be runnable for 25%, B otoh will be runnable the entire time A
> is actually running plus its own running time, giving 50%. Together that
> makes 75%.
> 
> If you release the assumption that A runs first, but instead assume they
> equally win the first execution, you get them averaging at 37.5% each,
> which combined will still give 75%.

But that is assuming that the first task gets to run to completion of
its busy period. If it uses up its sched_slice and we switch to the
other task, they both get to wait.

For example, if the sched_slice is 5 ms and the busy period is 10 ms,
the execution pattern would be: A, B, A, B, idle, ... In that case A is
runnable for 15 ms and B is for 20 ms. Assuming that the overall period
is 40 ms, the A runnable is 37.5% and B is 50%.
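A quick numerical sketch of that 5 ms slice / 10 ms busy period pattern (A, B, A, B, idle over a 40 ms cycle), again using an idealized per-ms decay rather than the kernel's fixed-point sums:

```python
Y = 0.5 ** (1.0 / 32)
CYCLE = 40
# per-ms membership over one 40 ms cycle
runs_a     = set(range(0, 5))  | set(range(10, 15))
runs_b     = set(range(5, 10)) | set(range(15, 20))
runnable_a = set(range(0, 15))            # waits ms 5-9 while B runs
runnable_b = set(range(0, 20))            # waits ms 0-4 and 10-14

def cycle_avg(member, cycles=300):
    """Average sum/period ratio over the final steady-state cycle."""
    s = period = 0.0
    samples = []
    for ms in range(cycles * CYCLE):
        s = s * Y + (ms % CYCLE in member)
        period = period * Y + 1
        if ms >= (cycles - 1) * CYCLE:
            samples.append(s / period)
    return sum(samples) / len(samples)

a_runnable = cycle_avg(runnable_a)        # 15/40 -> 37.5%
b_runnable = cycle_avg(runnable_b)        # 20/40 -> 50%
```

The cycle-averaged runnable ratios come out at 37.5% and 50% as above, while the running ratios stay at 25% each, so preemption inflates the runnable total well past 75%.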

Peter Zijlstra June 4, 2014, 10:55 a.m. UTC | #33
On Wed, Jun 04, 2014 at 11:36:19AM +0100, Morten Rasmussen wrote:
> On Wed, Jun 04, 2014 at 11:17:24AM +0100, Peter Zijlstra wrote:
> > Let me explain the 75%, take any one of the above scenarios. Lets call
> > the two tasks A and B, and let for a moment assume A always wins and
> > runs first, and then B.
> > 
> > So A will be runnable for 25%, B otoh will be runnable the entire time A
> > is actually running plus its own running time, giving 50%. Together that
> > makes 75%.
> > 
> > If you release the assumption that A runs first, but instead assume they
> > equally win the first execution, you get them averaging at 37.5% each,
> > which combined will still give 75%.
> 
> But that is assuming that the first task gets to run to completion of it
> busy period. If it uses up its sched_slice and we switch to the other
> tasks, they both get to wait.
> 
> For example, if the sched_slice is 5 ms and the busy period is 10 ms,
> the execution pattern would be: A, B, A, B, idle, ... In that case A is
> runnable for 15 ms and B is for 20 ms. Assuming that the overall period
> is 40 ms, the A runnable is 37.5% and B is 50%.

Indeed, with preemption added you can pull this out further. You can
then indeed get infinitesimally close to 100% with this.
Vincent Guittot June 4, 2014, 11:07 a.m. UTC | #34
On 4 June 2014 12:36, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> On Wed, Jun 04, 2014 at 11:17:24AM +0100, Peter Zijlstra wrote:
>> On Wed, Jun 04, 2014 at 11:32:10AM +0200, Vincent Guittot wrote:
>> > On 4 June 2014 10:08, Peter Zijlstra <peterz@infradead.org> wrote:
>> > > On Wed, Jun 04, 2014 at 09:47:26AM +0200, Vincent Guittot wrote:
>> > >> On 3 June 2014 17:50, Peter Zijlstra <peterz@infradead.org> wrote:
>> > >> > On Wed, May 28, 2014 at 04:47:03PM +0100, Morten Rasmussen wrote:
>> > >> >> Since we may do periodic load-balance every 10 ms or so, we will perform
>> > >> >> a number of load-balances where runnable_avg_sum will mostly be
>> > >> >> reflecting the state of the world before a change (new task queued or
>> > >> >> moved a task to a different cpu). If you had have two tasks continuously
>> > >> >> on one cpu and your other cpu is idle, and you move one of the tasks to
>> > >> >> the other cpu, runnable_avg_sum will remain unchanged, 47742, on the
>> > >> >> first cpu while it starts from 0 on the other one. 10 ms later it will
>> > >> >> have increased a bit, 32 ms later it will be 47742/2, and 345 ms later
>> > >> >> it reaches 47742. In the mean time the cpu doesn't appear fully utilized
>> > >> >> and we might decide to put more tasks on it because we don't know if
>> > >> >> runnable_avg_sum represents a partially utilized cpu (for example a 50%
>> > >> >> task) or if it will continue to rise and eventually get to 47742.
>> > >> >
>> > >> > Ah, no, since we track per task, and update the per-cpu ones when we
>> > >> > migrate tasks, the per-cpu values should be instantly updated.
>> > >> >
>> > >> > If we were to increase per task storage, we might as well also track
>> > >> > running_avg not only runnable_avg.
>> > >>
>> > >> I agree that the removed running_avg should give more useful
>> > >> information about the load of a CPU.
>> > >>
>> > >> The main issue with running_avg is that it's disturbed by other tasks
>> > >> (as pointed out previously). As a typical example, if we have 2 tasks
>> > >> with a load of 25% on 1 CPU, the unweighted runnable_load_avg will be
>> > >> in the range of [100% - 50%] depending on the parallelism of the
>> > >> runtime of the tasks whereas the reality is 50% and the use of
>> > >> running_avg will return this value
>> > >
>> > > I'm not sure I see how 100% is possible, but yes I agree that runnable
>> > > can indeed be inflated due to this queueing effect.
>>
>> Let me explain the 75%, take any one of the above scenarios. Lets call
>> the two tasks A and B, and let for a moment assume A always wins and
>> runs first, and then B.
>>
>> So A will be runnable for 25%, B otoh will be runnable the entire time A
>> is actually running plus its own running time, giving 50%. Together that
>> makes 75%.
>>
>> If you release the assumption that A runs first, but instead assume they
>> equally win the first execution, you get them averaging at 37.5% each,
>> which combined will still give 75%.
>
> But that is assuming that the first task gets to run to completion of its
> busy period. If it uses up its sched_slice and we switch to the other
> task, they both get to wait.
>
> For example, if the sched_slice is 5 ms and the busy period is 10 ms,
> the execution pattern would be: A, B, A, B, idle, ... In that case A is
> runnable for 15 ms and B is for 20 ms. Assuming that the overall period
> is 40 ms, the A runnable is 37.5% and B is 50%.

The exact values for your scheduling example above are:
A runnable will be 47% and B runnable will be 60% (unless i make a
mistake in my computation)
and CPU runnable will be 60% too

Vincent

>
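The gap between these 47%/60% figures and the raw 37.5%/50% time fractions comes from PELT's geometric decay (32 ms half-life). A user-space float sketch of the accounting reproduces the numbers to within a percent; note this is an approximation, since the kernel uses fixed-point math over 1024 us periods rather than 1 ms float steps.

```python
# Float approximation of the PELT runnable average: each 1 ms step the
# accumulated sum decays by y, where y^32 = 0.5 (32 ms half-life), and
# gains 1024 if the task was runnable during that ms.
Y = 0.5 ** (1.0 / 32.0)

def peak_ratio(runnable_ms, period_ms, cycles=50):
    """Steady-state runnable_avg_sum / runnable_avg_period, sampled at
    the end of the task's runnable window (its per-period peak)."""
    s = p = peak = 0.0
    for _ in range(cycles):
        for t in range(period_ms):
            s = s * Y + (1024.0 if t < runnable_ms else 0.0)
            p = p * Y + 1024.0
            if t == runnable_ms - 1:
                peak = s / p
    return peak

print(round(peak_ratio(15, 40) * 100))  # A: ~48, close to the 47% above
print(round(peak_ratio(20, 40) * 100))  # B: ~61, close to the 60% above
```

The small offset from 47%/60% is down to the float-vs-fixed-point approximation.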
Morten Rasmussen June 4, 2014, 11:23 a.m. UTC | #35
On Wed, Jun 04, 2014 at 12:07:29PM +0100, Vincent Guittot wrote:
> On 4 June 2014 12:36, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> > On Wed, Jun 04, 2014 at 11:17:24AM +0100, Peter Zijlstra wrote:
> >> On Wed, Jun 04, 2014 at 11:32:10AM +0200, Vincent Guittot wrote:
> >> > On 4 June 2014 10:08, Peter Zijlstra <peterz@infradead.org> wrote:
> >> > > On Wed, Jun 04, 2014 at 09:47:26AM +0200, Vincent Guittot wrote:
> >> > >> On 3 June 2014 17:50, Peter Zijlstra <peterz@infradead.org> wrote:
> >> > >> > On Wed, May 28, 2014 at 04:47:03PM +0100, Morten Rasmussen wrote:
> >> > >> >> Since we may do periodic load-balance every 10 ms or so, we will perform
> >> > >> >> a number of load-balances where runnable_avg_sum will mostly be
> >> > >> >> reflecting the state of the world before a change (new task queued or
> >> > >> >> moved a task to a different cpu). If you have two tasks continuously
> >> > >> >> on one cpu and your other cpu is idle, and you move one of the tasks to
> >> > >> >> the other cpu, runnable_avg_sum will remain unchanged, 47742, on the
> >> > >> >> first cpu while it starts from 0 on the other one. 10 ms later it will
> >> > >> >> have increased a bit, 32 ms later it will be 47742/2, and 345 ms later
> >> > >> >> it reaches 47742. In the mean time the cpu doesn't appear fully utilized
> >> > >> >> and we might decide to put more tasks on it because we don't know if
> >> > >> >> runnable_avg_sum represents a partially utilized cpu (for example a 50%
> >> > >> >> task) or if it will continue to rise and eventually get to 47742.
> >> > >> >
> >> > >> > Ah, no, since we track per task, and update the per-cpu ones when we
> >> > >> > migrate tasks, the per-cpu values should be instantly updated.
> >> > >> >
> >> > >> > If we were to increase per task storage, we might as well also track
> >> > >> > running_avg not only runnable_avg.
> >> > >>
> >> > >> I agree that the removed running_avg should give more useful
> >> > >> information about the load of a CPU.
> >> > >>
> >> > >> The main issue with running_avg is that it's disturbed by other tasks
> >> > >> (as pointed out previously). As a typical example, if we have 2 tasks
> >> > >> with a load of 25% on 1 CPU, the unweighted runnable_load_avg will be
> >> > >> in the range of [100% - 50%] depending on the parallelism of the
> >> > >> runtime of the tasks whereas the reality is 50% and the use of
> >> > >> running_avg will return this value
> >> > >
> >> > > I'm not sure I see how 100% is possible, but yes I agree that runnable
> >> > > can indeed be inflated due to this queueing effect.
> >>
> >> Let me explain the 75%, take any one of the above scenarios. Lets call
> >> the two tasks A and B, and let for a moment assume A always wins and
> >> runs first, and then B.
> >>
> >> So A will be runnable for 25%, B otoh will be runnable the entire time A
> >> is actually running plus its own running time, giving 50%. Together that
> >> makes 75%.
> >>
> >> If you release the assumption that A runs first, but instead assume they
> >> equally win the first execution, you get them averaging at 37.5% each,
> >> which combined will still give 75%.
> >
> > But that is assuming that the first task gets to run to completion of its
> > busy period. If it uses up its sched_slice and we switch to the other
> > task, they both get to wait.
> >
> > For example, if the sched_slice is 5 ms and the busy period is 10 ms,
> > the execution pattern would be: A, B, A, B, idle, ... In that case A is
> > runnable for 15 ms and B is for 20 ms. Assuming that the overall period
> > is 40 ms, the A runnable is 37.5% and B is 50%.
> 
> The exact value for your scheduling example above is:
> A runnable will be 47% and B runnable will be 60% (unless i make a
> mistake in my computation)

I get:

A: 15/40 ms = 37.5%
B: 20/40 ms = 50%

Schedule:

   | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms |
A:   run     rq     run  ----------- sleeping -------------  run
B:   rq      run    rq    run   ---- sleeping -------------  rq

> and CPU runnable will be 60% too

rq->avg.runnable_avg_sum should be 50%. You have two tasks running for
20 ms every 40 ms.

Right?

Morten
Vincent Guittot June 4, 2014, 11:52 a.m. UTC | #36
On 4 June 2014 13:23, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> On Wed, Jun 04, 2014 at 12:07:29PM +0100, Vincent Guittot wrote:
>> On 4 June 2014 12:36, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
>> > On Wed, Jun 04, 2014 at 11:17:24AM +0100, Peter Zijlstra wrote:
>> >> On Wed, Jun 04, 2014 at 11:32:10AM +0200, Vincent Guittot wrote:
>> >> > On 4 June 2014 10:08, Peter Zijlstra <peterz@infradead.org> wrote:
>> >> > > On Wed, Jun 04, 2014 at 09:47:26AM +0200, Vincent Guittot wrote:
>> >> > >> On 3 June 2014 17:50, Peter Zijlstra <peterz@infradead.org> wrote:
>> >> > >> > On Wed, May 28, 2014 at 04:47:03PM +0100, Morten Rasmussen wrote:
>> >> > >> >> Since we may do periodic load-balance every 10 ms or so, we will perform
>> >> > >> >> a number of load-balances where runnable_avg_sum will mostly be
>> >> > >> >> reflecting the state of the world before a change (new task queued or
>> >> > >> >> moved a task to a different cpu). If you have two tasks continuously
>> >> > >> >> on one cpu and your other cpu is idle, and you move one of the tasks to
>> >> > >> >> the other cpu, runnable_avg_sum will remain unchanged, 47742, on the
>> >> > >> >> first cpu while it starts from 0 on the other one. 10 ms later it will
>> >> > >> >> have increased a bit, 32 ms later it will be 47742/2, and 345 ms later
>> >> > >> >> it reaches 47742. In the mean time the cpu doesn't appear fully utilized
>> >> > >> >> and we might decide to put more tasks on it because we don't know if
>> >> > >> >> runnable_avg_sum represents a partially utilized cpu (for example a 50%
>> >> > >> >> task) or if it will continue to rise and eventually get to 47742.
>> >> > >> >
>> >> > >> > Ah, no, since we track per task, and update the per-cpu ones when we
>> >> > >> > migrate tasks, the per-cpu values should be instantly updated.
>> >> > >> >
>> >> > >> > If we were to increase per task storage, we might as well also track
>> >> > >> > running_avg not only runnable_avg.
>> >> > >>
>> >> > >> I agree that the removed running_avg should give more useful
>> >> > >> information about the load of a CPU.
>> >> > >>
>> >> > >> The main issue with running_avg is that it's disturbed by other tasks
>> >> > >> (as pointed out previously). As a typical example, if we have 2 tasks
>> >> > >> with a load of 25% on 1 CPU, the unweighted runnable_load_avg will be
>> >> > >> in the range of [100% - 50%] depending on the parallelism of the
>> >> > >> runtime of the tasks whereas the reality is 50% and the use of
>> >> > >> running_avg will return this value
>> >> > >
>> >> > > I'm not sure I see how 100% is possible, but yes I agree that runnable
>> >> > > can indeed be inflated due to this queueing effect.
>> >>
>> >> Let me explain the 75%, take any one of the above scenarios. Lets call
>> >> the two tasks A and B, and let for a moment assume A always wins and
>> >> runs first, and then B.
>> >>
>> >> So A will be runnable for 25%, B otoh will be runnable the entire time A
>> >> is actually running plus its own running time, giving 50%. Together that
>> >> makes 75%.
>> >>
>> >> If you release the assumption that A runs first, but instead assume they
>> >> equally win the first execution, you get them averaging at 37.5% each,
>> >> which combined will still give 75%.
>> >
>> > But that is assuming that the first task gets to run to completion of its
>> > busy period. If it uses up its sched_slice and we switch to the other
>> > task, they both get to wait.
>> >
>> > For example, if the sched_slice is 5 ms and the busy period is 10 ms,
>> > the execution pattern would be: A, B, A, B, idle, ... In that case A is
>> > runnable for 15 ms and B is for 20 ms. Assuming that the overall period
>> > is 40 ms, the A runnable is 37.5% and B is 50%.
>>
>> The exact value for your scheduling example above is:
>> A runnable will be 47% and B runnable will be 60% (unless i make a
>> mistake in my computation)
>
> I get:
>
> A: 15/40 ms = 37.5%
> B: 20/40 ms = 50%
>
> Schedule:
>
>    | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms |
> A:   run     rq     run  ----------- sleeping -------------  run
> B:   rq      run    rq    run   ---- sleeping -------------  rq
>
>> and CPU runnable will be 60% too
>
> rq->avg.runnable_avg_sum should be 50%. You have two tasks running for
> 20 ms every 40 ms.
>
> Right?

Ok, i see the misunderstanding.
It depends on what we mean by runnable. You take the % of time
whereas i take the runnable_avg_sum/period

So A is on_rq 15/40 ms = 37.5% of the time, which gives a
runnable_avg_sum/runnable_avg_period of 47%;
B is on_rq 20/40 ms = 50% of the time, which gives a
runnable_avg_sum/runnable_avg_period of 60%;
and the CPU has a task on its rq 20/40 ms = 50% of the time, which gives a
runnable_avg_sum/runnable_avg_period of 60%.

Vincent

>
> Morten
Morten Rasmussen June 4, 2014, 1:09 p.m. UTC | #37
On Wed, Jun 04, 2014 at 12:52:46PM +0100, Vincent Guittot wrote:
> On 4 June 2014 13:23, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> > I get:
> >
> > A: 15/40 ms = 37.5%
> > B: 20/40 ms = 50%
> >
> > Schedule:
> >
> >    | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms |
> > A:   run     rq     run  ----------- sleeping -------------  run
> > B:   rq      run    rq    run   ---- sleeping -------------  rq
> >
> >> and CPU runnable will be 60% too
> >
> > rq->avg.runnable_avg_sum should be 50%. You have two tasks running for
> > 20 ms every 40 ms.
> >
> > Right?
> 
> Ok, i see the misunderstanding.
> It depends on what we mean by runnable. You take the % of time
> whereas i take the runnable_avg_sum/period

Right. There is a difference.

> 
> so A is on_rq 15/40 ms = 37.5% of the time which gives a
> runnable_avg_sum/runnable_avg_period of 47%
> B is on_rq  20/40 ms = 50% of the time which gives a
> runnable_avg_sum/runnable_avg_period of 60%
> and CPU has a task on its rq 20/40ms = 50% of the time which gives a
> runnable_avg_sum/runnable_avg_period of 60%

Yes, that seems about right.
Morten Rasmussen June 4, 2014, 1:23 p.m. UTC | #38
On Wed, Jun 04, 2014 at 02:09:18PM +0100, Morten Rasmussen wrote:
> On Wed, Jun 04, 2014 at 12:52:46PM +0100, Vincent Guittot wrote:
> > On 4 June 2014 13:23, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> > > I get:
> > >
> > > A: 15/40 ms = 37.5%
> > > B: 20/40 ms = 50%
> > >
> > > Schedule:
> > >
> > >    | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms |
> > > A:   run     rq     run  ----------- sleeping -------------  run
> > > B:   rq      run    rq    run   ---- sleeping -------------  rq
> > >
> > >> and CPU runnable will be 60% too
> > >
> > > rq->avg.runnable_avg_sum should be 50%. You have two tasks running for
> > > 20 ms every 40 ms.
> > >
> > > Right?
> > 
> > Ok, i see the misunderstanding.
> > It depends on what we mean by runnable. You take the % of time
> > whereas i take the runnable_avg_sum/period
> 
> Right. There is a difference.
> 
> > 
> > so A is on_rq 15/40 ms = 37.5% of the time which gives a
> > runnable_avg_sum/runnable_avg_period of 47%
> > B is on_rq  20/40 ms = 50% of the time which gives a
> > runnable_avg_sum/runnable_avg_period of 60%
> > and CPU has a task on its rq 20/40ms = 50% of the time which gives a
> > runnable_avg_sum/runnable_avg_period of 60%
> 
> Yes, that seems about right.

If my calculations are right, cfs.runnable_load_avg would peak 15 ms
into the period at about 104%. After 20 ms A has decayed more than B has
gained, so the sum is slightly lower.
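The ~104% combined peak can be reproduced with a user-space float sketch of the PELT decay (1 ms steps with a per-step decay y where y^32 = 0.5; the kernel's fixed-point math over 1024 us periods will differ slightly). Both tasks' runnable sums are tracked over the 40 ms pattern, with A runnable for ms 0-15 of each period and B for ms 0-20, and their sum is sampled 15 ms in.

```python
# Track both tasks' PELT-style runnable sums over the A/B schedule and
# sample their combined ratio 15 ms into each period, where A's signal
# peaks while B is still accumulating.
Y = 0.5 ** (1.0 / 32.0)   # 32 ms half-life decay per 1 ms step

s_a = s_b = p = 0.0
combined_peak = 0.0
for _ in range(50):                    # 50 periods, well past convergence
    for t in range(40):
        s_a = s_a * Y + (1024.0 if t < 15 else 0.0)
        s_b = s_b * Y + (1024.0 if t < 20 else 0.0)
        p = p * Y + 1024.0
        if t == 14:                    # 15 ms into the period
            combined_peak = (s_a + s_b) / p

print(round(combined_peak * 100))      # ~104
```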
Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b7c51be..c01d8b6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4044,6 +4044,11 @@  static unsigned long power_of(int cpu)
 	return cpu_rq(cpu)->cpu_power;
 }
 
+static unsigned long power_orig_of(int cpu)
+{
+	return cpu_rq(cpu)->cpu_power_orig;
+}
+
 static unsigned long cpu_avg_load_per_task(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
@@ -4438,6 +4443,18 @@  done:
 	return target;
 }
 
+static int get_cpu_activity(int cpu)
+{
+	struct rq *rq = cpu_rq(cpu);
+	u32 sum = rq->avg.runnable_avg_sum;
+	u32 period = rq->avg.runnable_avg_period;
+
+	if (sum >= period)
+		return power_orig_of(cpu) + rq->nr_running - 1;
+
+	return (sum * power_orig_of(cpu)) / period;
+}
+
 /*
  * select_task_rq_fair: Select target runqueue for the waking task in domains
  * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
@@ -5518,6 +5535,7 @@  struct sg_lb_stats {
 	unsigned long sum_weighted_load; /* Weighted load of group's tasks */
 	unsigned long load_per_task;
 	unsigned long group_power;
+	unsigned long group_activity; /* Total activity of the group */
 	unsigned int sum_nr_running; /* Nr tasks running in the group */
 	unsigned int group_capacity;
 	unsigned int idle_cpus;
@@ -5538,6 +5556,7 @@  struct sd_lb_stats {
 	struct sched_group *busiest;	/* Busiest group in this sd */
 	struct sched_group *local;	/* Local group in this sd */
 	unsigned long total_load;	/* Total load of all groups in sd */
+	unsigned long total_activity;  /* Total activity of all groups in sd */
 	unsigned long total_pwr;	/* Total power of all groups in sd */
 	unsigned long avg_load;	/* Average load across all groups in sd */
 
@@ -5557,6 +5576,7 @@  static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
 		.busiest = NULL,
 		.local = NULL,
 		.total_load = 0UL,
+		.total_activity = 0UL,
 		.total_pwr = 0UL,
 		.busiest_stat = {
 			.avg_load = 0UL,
@@ -5876,6 +5896,7 @@  static inline void update_sg_lb_stats(struct lb_env *env,
 			load = source_load(i, load_idx);
 
 		sgs->group_load += load;
+		sgs->group_activity += get_cpu_activity(i);
 		sgs->sum_nr_running += rq->cfs.h_nr_running;
 #ifdef CONFIG_NUMA_BALANCING
 		sgs->nr_numa_running += rq->nr_numa_running;
@@ -6034,6 +6055,7 @@  static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 next_group:
 		/* Now, start updating sd_lb_stats */
 		sds->total_load += sgs->group_load;
+		sds->total_activity += sgs->group_activity;
 		sds->total_pwr += sgs->group_power;
 
 		sg = sg->next;