[v5,11/12] sched: replace capacity_factor by utilization

Message ID 1409051215-16788-12-git-send-email-vincent.guittot@linaro.org
State New
Headers show

Commit Message

Vincent Guittot Aug. 26, 2014, 11:06 a.m.
The scheduler tries to compute how many tasks a group of CPUs can handle by
assuming that a task's load is SCHED_LOAD_SCALE and a CPU capacity is
SCHED_CAPACITY_SCALE.
We can now have a better idea of the capacity of a group of CPUs and of the
utilization of this group thanks to the rework of group_capacity_orig and the
group_utilization. We can now deduct how many capacity is still available.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 121 ++++++++++++++++++++++------------------------------
 1 file changed, 51 insertions(+), 70 deletions(-)

Comments

Peter Zijlstra Sept. 11, 2014, 4:15 p.m. | #1
On Tue, Aug 26, 2014 at 01:06:54PM +0200, Vincent Guittot wrote:
> +static inline int group_has_free_capacity(struct sg_lb_stats *sgs,
> +			struct lb_env *env)
>  {
> +	if ((sgs->group_capacity_orig * 100) >
> +			(sgs->group_utilization * env->sd->imbalance_pct))
> +		return 1;
> +
> +	if (sgs->sum_nr_running < sgs->group_weight)
> +		return 1;
>  
> +	return 0;
> +}
>  
> +static inline int group_is_overloaded(struct sg_lb_stats *sgs,
> +			struct lb_env *env)
> +{
> +	if (sgs->sum_nr_running <= sgs->group_weight)
> +		return 0;
>  
> +	if ((sgs->group_capacity_orig * 100) <
> +			(sgs->group_utilization * env->sd->imbalance_pct))
> +		return 1;
>  
> +	return 0;
>  }

I'm confused about the utilization vs capacity_orig. I see how we should
maybe scale things with the capacity when comparing between CPUs/groups,
but not on the same CPU/group.

I would have expected something simple like:

static inline bool group_has_capacity()
{
	/* Is there a spare cycle? */
	if (sgs->group_utilization < sgs->group_weight * SCHED_LOAD_SCALE)
		return true;

	/* Are there less tasks than logical CPUs? */
	if (sgs->sum_nr_running < sgs->group_weight)
		return true;

	return false;
}

Where group_utilization a pure sum of running_avg.

Now this has a hole when there are RT tasks on the system, in that case
the utilization will never hit 1, but we could fix that another way. I
don't think the capacity_orig thing is right.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Vincent Guittot Sept. 11, 2014, 5:26 p.m. | #2
On 11 September 2014 18:15, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Aug 26, 2014 at 01:06:54PM +0200, Vincent Guittot wrote:
>> +static inline int group_has_free_capacity(struct sg_lb_stats *sgs,
>> +                     struct lb_env *env)
>>  {
>> +     if ((sgs->group_capacity_orig * 100) >
>> +                     (sgs->group_utilization * env->sd->imbalance_pct))
>> +             return 1;
>> +
>> +     if (sgs->sum_nr_running < sgs->group_weight)
>> +             return 1;
>>
>> +     return 0;
>> +}
>>
>> +static inline int group_is_overloaded(struct sg_lb_stats *sgs,
>> +                     struct lb_env *env)
>> +{
>> +     if (sgs->sum_nr_running <= sgs->group_weight)
>> +             return 0;
>>
>> +     if ((sgs->group_capacity_orig * 100) <
>> +                     (sgs->group_utilization * env->sd->imbalance_pct))
>> +             return 1;
>>
>> +     return 0;
>>  }
>
> I'm confused about the utilization vs capacity_orig. I see how we should

1st point is that I should compare utilization vs capacity and not
capacity_orig.
I should have replaced capacity_orig by capacity in the functions
above when i move the utilization statistic from
rq->avg.runnable_avg_sum to cfs.usage_load_avg.
rq->avg.runnable_avg_sum was measuring all activity on the cpu whereas
cfs.usage_load_avg integrates only cfs tasks

With this change, we don't need sgs->group_capacity_orig anymore but
only sgs->group_capacity. So sgs->group_capacity_orig can be removed
as it's no more used in the code as sg_capacity_factor has been
removed

> maybe scale things with the capacity when comparing between CPUs/groups,
> but not on the same CPU/group.
>
> I would have expected something simple like:
>
> static inline bool group_has_capacity()
> {
>         /* Is there a spare cycle? */
>         if (sgs->group_utilization < sgs->group_weight * SCHED_LOAD_SCALE)
>                 return true;
>
>         /* Are there less tasks than logical CPUs? */
>         if (sgs->sum_nr_running < sgs->group_weight)
>                 return true;
>
>         return false;
> }
>
> Where group_utilization a pure sum of running_avg.
>
> Now this has a hole when there are RT tasks on the system, in that case
> the utilization will never hit 1, but we could fix that another way. I
> don't think the capacity_orig thing is right.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Peter Zijlstra Sept. 14, 2014, 7:41 p.m. | #3
On Thu, Sep 11, 2014 at 07:26:48PM +0200, Vincent Guittot wrote:
> On 11 September 2014 18:15, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Tue, Aug 26, 2014 at 01:06:54PM +0200, Vincent Guittot wrote:
> >> +static inline int group_has_free_capacity(struct sg_lb_stats *sgs,
> >> +                     struct lb_env *env)
> >>  {
> >> +     if ((sgs->group_capacity_orig * 100) >
> >> +                     (sgs->group_utilization * env->sd->imbalance_pct))
> >> +             return 1;
> >> +
> >> +     if (sgs->sum_nr_running < sgs->group_weight)
> >> +             return 1;
> >>
> >> +     return 0;
> >> +}
> >>
> >> +static inline int group_is_overloaded(struct sg_lb_stats *sgs,
> >> +                     struct lb_env *env)
> >> +{
> >> +     if (sgs->sum_nr_running <= sgs->group_weight)
> >> +             return 0;
> >>
> >> +     if ((sgs->group_capacity_orig * 100) <
> >> +                     (sgs->group_utilization * env->sd->imbalance_pct))
> >> +             return 1;
> >>
> >> +     return 0;
> >>  }
> >
> > I'm confused about the utilization vs capacity_orig. I see how we should
> 
> 1st point is that I should compare utilization vs capacity and not
> capacity_orig.
> I should have replaced capacity_orig by capacity in the functions
> above when i move the utilization statistic from
> rq->avg.runnable_avg_sum to cfs.usage_load_avg.
> rq->avg.runnable_avg_sum was measuring all activity on the cpu whereas
> cfs.usage_load_avg integrates only cfs tasks
> 
> With this change, we don't need sgs->group_capacity_orig anymore but
> only sgs->group_capacity. So sgs->group_capacity_orig can be removed
> as it's no more used in the code as sg_capacity_factor has been
> removed

Yes, but.. so I suppose we need to add DVFS accounting and remove
cpufreq from the capacity thing. Otherwise I don't see it make sense.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Peter Zijlstra Sept. 14, 2014, 7:51 p.m. | #4
On Sun, Sep 14, 2014 at 09:41:56PM +0200, Peter Zijlstra wrote:
> On Thu, Sep 11, 2014 at 07:26:48PM +0200, Vincent Guittot wrote:
> > On 11 September 2014 18:15, Peter Zijlstra <peterz@infradead.org> wrote:
> > > On Tue, Aug 26, 2014 at 01:06:54PM +0200, Vincent Guittot wrote:
> > >> +static inline int group_has_free_capacity(struct sg_lb_stats *sgs,
> > >> +                     struct lb_env *env)
> > >>  {
> > >> +     if ((sgs->group_capacity_orig * 100) >
> > >> +                     (sgs->group_utilization * env->sd->imbalance_pct))
> > >> +             return 1;
> > >> +
> > >> +     if (sgs->sum_nr_running < sgs->group_weight)
> > >> +             return 1;
> > >>
> > >> +     return 0;
> > >> +}
> > >>
> > >> +static inline int group_is_overloaded(struct sg_lb_stats *sgs,
> > >> +                     struct lb_env *env)
> > >> +{
> > >> +     if (sgs->sum_nr_running <= sgs->group_weight)
> > >> +             return 0;
> > >>
> > >> +     if ((sgs->group_capacity_orig * 100) <
> > >> +                     (sgs->group_utilization * env->sd->imbalance_pct))
> > >> +             return 1;
> > >>
> > >> +     return 0;
> > >>  }
> > >
> > > I'm confused about the utilization vs capacity_orig. I see how we should
> > 
> > 1st point is that I should compare utilization vs capacity and not
> > capacity_orig.
> > I should have replaced capacity_orig by capacity in the functions
> > above when i move the utilization statistic from
> > rq->avg.runnable_avg_sum to cfs.usage_load_avg.
> > rq->avg.runnable_avg_sum was measuring all activity on the cpu whereas
> > cfs.usage_load_avg integrates only cfs tasks
> > 
> > With this change, we don't need sgs->group_capacity_orig anymore but
> > only sgs->group_capacity. So sgs->group_capacity_orig can be removed
> > as it's no more used in the code as sg_capacity_factor has been
> > removed
> 
> Yes, but.. so I suppose we need to add DVFS accounting and remove
> cpufreq from the capacity thing. Otherwise I don't see it make sense.

That is to say, please also explain these details in the changelog, and
preferably also in a XXX/FIXME/TODO comment near wherever so we don't
forget about it.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Peter Zijlstra Sept. 15, 2014, 11:42 a.m. | #5
On Sun, Sep 14, 2014 at 09:41:56PM +0200, Peter Zijlstra wrote:
> On Thu, Sep 11, 2014 at 07:26:48PM +0200, Vincent Guittot wrote:
> > On 11 September 2014 18:15, Peter Zijlstra <peterz@infradead.org> wrote:

> > > I'm confused about the utilization vs capacity_orig. I see how we should
> > 
> > 1st point is that I should compare utilization vs capacity and not
> > capacity_orig.
> > I should have replaced capacity_orig by capacity in the functions
> > above when i move the utilization statistic from
> > rq->avg.runnable_avg_sum to cfs.usage_load_avg.
> > rq->avg.runnable_avg_sum was measuring all activity on the cpu whereas
> > cfs.usage_load_avg integrates only cfs tasks
> > 
> > With this change, we don't need sgs->group_capacity_orig anymore but
> > only sgs->group_capacity. So sgs->group_capacity_orig can be removed
> > as it's no more used in the code as sg_capacity_factor has been
> > removed
> 
> Yes, but.. so I suppose we need to add DVFS accounting and remove
> cpufreq from the capacity thing. Otherwise I don't see it make sense.

OK, I've reconsidered _again_, I still don't get it.

So fundamentally I think its wrong to scale with the capacity; it just
doesn't make any sense. Consider big.little stuff, their CPUs are
inherently asymmetric in capacity, but that doesn't matter one whit for
utilization numbers. If a core is fully consumed its fully consumed, no
matter how much work it can or can not do.


So the only thing that needs correcting is the fact that these
statistics are based on clock_task and some of that time can end up in
other scheduling classes, at which point we'll never get 100% even
though we're 'saturated'. But correcting for that using capacity doesn't
'work'.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Nicolas Pitre Sept. 15, 2014, 7:07 p.m. | #6
On Mon, 15 Sep 2014, Peter Zijlstra wrote:

> On Sun, Sep 14, 2014 at 09:41:56PM +0200, Peter Zijlstra wrote:
> > On Thu, Sep 11, 2014 at 07:26:48PM +0200, Vincent Guittot wrote:
> > > On 11 September 2014 18:15, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > > > I'm confused about the utilization vs capacity_orig. I see how we should
> > > 
> > > 1st point is that I should compare utilization vs capacity and not
> > > capacity_orig.
> > > I should have replaced capacity_orig by capacity in the functions
> > > above when i move the utilization statistic from
> > > rq->avg.runnable_avg_sum to cfs.usage_load_avg.
> > > rq->avg.runnable_avg_sum was measuring all activity on the cpu whereas
> > > cfs.usage_load_avg integrates only cfs tasks
> > > 
> > > With this change, we don't need sgs->group_capacity_orig anymore but
> > > only sgs->group_capacity. So sgs->group_capacity_orig can be removed
> > > as it's no more used in the code as sg_capacity_factor has been
> > > removed
> > 
> > Yes, but.. so I suppose we need to add DVFS accounting and remove
> > cpufreq from the capacity thing. Otherwise I don't see it make sense.
> 
> OK, I've reconsidered _again_, I still don't get it.
> 
> So fundamentally I think its wrong to scale with the capacity; it just
> doesn't make any sense. Consider big.little stuff, their CPUs are
> inherently asymmetric in capacity, but that doesn't matter one whit for
> utilization numbers. If a core is fully consumed its fully consumed, no
> matter how much work it can or can not do.

Let's suppose a task running on a 1GHz CPU producing a load of 100.

The same task on a 100MHz CPU would produce a load of 1000 because that 
CPU is 10x slower.  So to properly evaluate the load of a task when 
moving it around, we want to normalize its load based on the CPU 
performance.  In this case the correction factor would be 0.1.

Given those normalized loads, we need to scale CPU capacity as well.  If 
the 1GHz CPU can handle 50 of those tasks it has a capacity of 5000.

In theory the 100MHz CPU could handle only 5 of those tasks, meaning it 
has a normalized capacity of 500, but only if the load metric is already 
normalized as well.

Or am I completely missing the point here?


Nicolas





> 
> 
> So the only thing that needs correcting is the fact that these
> statistics are based on clock_task and some of that time can end up in
> other scheduling classes, at which point we'll never get 100% even
> though we're 'saturated'. But correcting for that using capacity doesn't
> 'work'.
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Peter Zijlstra Sept. 15, 2014, 8:01 p.m. | #7
On Mon, Sep 15, 2014 at 03:07:44PM -0400, Nicolas Pitre wrote:
> On Mon, 15 Sep 2014, Peter Zijlstra wrote:

> Let's suppose a task running on a 1GHz CPU producing a load of 100.
> 
> The same task on a 100MHz CPU would produce a load of 1000 because that 
> CPU is 10x slower.  So to properly evaluate the load of a task when 
> moving it around, we want to normalize its load based on the CPU 
> performance.  In this case the correction factor would be 0.1.
> 
> Given those normalized loads, we need to scale CPU capacity as well.  If 
> the 1GHz CPU can handle 50 of those tasks it has a capacity of 5000.
> 
> In theory the 100MHz CPU could handle only 5 of those tasks, meaning it 
> has a normalized capacity of 500, but only if the load metric is already 
> normalized as well.
> 
> Or am I completely missing the point here?

So I was thinking of the usage as per the next patch; where we decide if
a cpu is 'full' or not based on the utilization measure. For this
measure we're not interested in inter CPU relations at all, and any use
of capacity scaling simply doesn't make sense.

But I think you asking this question shows a 'bigger' problem in that
the Changelogs are entirely failing at describing the actual problem and
proposed solution. Because if that were clear, I don't think we would be
having this particular discussion.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Vincent Guittot Sept. 15, 2014, 10:14 p.m. | #8
On 15 September 2014 13:42, Peter Zijlstra <peterz@infradead.org> wrote:
> On Sun, Sep 14, 2014 at 09:41:56PM +0200, Peter Zijlstra wrote:
>> On Thu, Sep 11, 2014 at 07:26:48PM +0200, Vincent Guittot wrote:
>> > On 11 September 2014 18:15, Peter Zijlstra <peterz@infradead.org> wrote:
>
>> > > I'm confused about the utilization vs capacity_orig. I see how we should
>> >
>> > 1st point is that I should compare utilization vs capacity and not
>> > capacity_orig.
>> > I should have replaced capacity_orig by capacity in the functions
>> > above when i move the utilization statistic from
>> > rq->avg.runnable_avg_sum to cfs.usage_load_avg.
>> > rq->avg.runnable_avg_sum was measuring all activity on the cpu whereas
>> > cfs.usage_load_avg integrates only cfs tasks
>> >
>> > With this change, we don't need sgs->group_capacity_orig anymore but
>> > only sgs->group_capacity. So sgs->group_capacity_orig can be removed
>> > as it's no more used in the code as sg_capacity_factor has been
>> > removed
>>
>> Yes, but.. so I suppose we need to add DVFS accounting and remove
>> cpufreq from the capacity thing. Otherwise I don't see it make sense.
>
> OK, I've reconsidered _again_, I still don't get it.
>
> So fundamentally I think its wrong to scale with the capacity; it just
> doesn't make any sense. Consider big.little stuff, their CPUs are
> inherently asymmetric in capacity, but that doesn't matter one whit for
> utilization numbers. If a core is fully consumed its fully consumed, no
> matter how much work it can or can not do.
>
>
> So the only thing that needs correcting is the fact that these
> statistics are based on clock_task and some of that time can end up in
> other scheduling classes, at which point we'll never get 100% even
> though we're 'saturated'. But correcting for that using capacity doesn't
> 'work'.

I'm not sure to catch your last point because the capacity is the only
figures that take into account the "time" consumed by other classes.
Have you got in mind another way to take into account the other
classes ?

So we have cpu_capacity that is the capacity that can be currently
used by cfs class
We have cfs.usage_load_avg that is the sum of running time of cfs
tasks on the CPU and reflect the % of usage of this CPU by CFS tasks
We have to use the same metrics to compare available capacity for CFS
and current cfs usage

Now we have to use the same unit so we can either weight the
cpu_capacity_orig with the cfs.usage_load_avg and compare it with
cpu_capacity
or with divide cpu_capacity by cpu_capacity_orig and scale it into the
SCHED_LOAD_SCALE range. Is It what you are proposing ?

Vincent
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Dietmar Eggemann Sept. 16, 2014, 5 p.m. | #9
On 14/09/14 12:41, Peter Zijlstra wrote:
> On Thu, Sep 11, 2014 at 07:26:48PM +0200, Vincent Guittot wrote:
>> On 11 September 2014 18:15, Peter Zijlstra <peterz@infradead.org> wrote:
>>> On Tue, Aug 26, 2014 at 01:06:54PM +0200, Vincent Guittot wrote:
>>>> +static inline int group_has_free_capacity(struct sg_lb_stats *sgs,
>>>> +                     struct lb_env *env)
>>>>   {
>>>> +     if ((sgs->group_capacity_orig * 100) >
>>>> +                     (sgs->group_utilization * env->sd->imbalance_pct))
>>>> +             return 1;
>>>> +
>>>> +     if (sgs->sum_nr_running < sgs->group_weight)
>>>> +             return 1;
>>>>
>>>> +     return 0;
>>>> +}
>>>>
>>>> +static inline int group_is_overloaded(struct sg_lb_stats *sgs,
>>>> +                     struct lb_env *env)
>>>> +{
>>>> +     if (sgs->sum_nr_running <= sgs->group_weight)
>>>> +             return 0;
>>>>
>>>> +     if ((sgs->group_capacity_orig * 100) <
>>>> +                     (sgs->group_utilization * env->sd->imbalance_pct))
>>>> +             return 1;
>>>>
>>>> +     return 0;
>>>>   }
>>>
>>> I'm confused about the utilization vs capacity_orig. I see how we should
>>
>> 1st point is that I should compare utilization vs capacity and not
>> capacity_orig.
>> I should have replaced capacity_orig by capacity in the functions
>> above when i move the utilization statistic from
>> rq->avg.runnable_avg_sum to cfs.usage_load_avg.
>> rq->avg.runnable_avg_sum was measuring all activity on the cpu whereas
>> cfs.usage_load_avg integrates only cfs tasks
>>
>> With this change, we don't need sgs->group_capacity_orig anymore but
>> only sgs->group_capacity. So sgs->group_capacity_orig can be removed
>> as it's no more used in the code as sg_capacity_factor has been
>> removed
>
> Yes, but.. so I suppose we need to add DVFS accounting and remove
> cpufreq from the capacity thing. Otherwise I don't see it make sense.

My understanding is that uArch scaling of capacacity_orig (therefore of 
capacity too) is done by calling arch_scale_cpu_capacity and frequency 
scaling for capacity is done by calling arch_scale_freq_capacity in 
update_cpu_capacity. I understand that this patch-set does not provide 
an implementation of arch_scale_freq_capacity though.

The uArch and frequency scaling of load & utilization will be added 
later. I know that Morten is currently working on a 'uArch and frequency 
invariant load & utilization tracking' patch-set.

Although I don't know exactly what you mean by DVFS accounting and 
remove cpufreq from the capacity here.

-- Dietmar

> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Morten Rasmussen Sept. 17, 2014, 6:45 p.m. | #10
On Mon, Sep 15, 2014 at 09:01:59PM +0100, Peter Zijlstra wrote:
> On Mon, Sep 15, 2014 at 03:07:44PM -0400, Nicolas Pitre wrote:
> > On Mon, 15 Sep 2014, Peter Zijlstra wrote:
> 
> > Let's suppose a task running on a 1GHz CPU producing a load of 100.
> > 
> > The same task on a 100MHz CPU would produce a load of 1000 because that 
> > CPU is 10x slower.  So to properly evaluate the load of a task when 
> > moving it around, we want to normalize its load based on the CPU 
> > performance.  In this case the correction factor would be 0.1.
> > 
> > Given those normalized loads, we need to scale CPU capacity as well.  If 
> > the 1GHz CPU can handle 50 of those tasks it has a capacity of 5000.
> > 
> > In theory the 100MHz CPU could handle only 5 of those tasks, meaning it 
> > has a normalized capacity of 500, but only if the load metric is already 
> > normalized as well.
> > 
> > Or am I completely missing the point here?
> 
> So I was thinking of the usage as per the next patch; where we decide if
> a cpu is 'full' or not based on the utilization measure. For this
> measure we're not interested in inter CPU relations at all, and any use
> of capacity scaling simply doesn't make sense.

Right. You don't need to scale capacity to determine whether a cpu is
full or not if you don't have DVFS, but I don't think it hurts if it is
done right. We need the scaling to figure out how much capacity is
available.

> But I think you asking this question shows a 'bigger' problem in that
> the Changelogs are entirely failing at describing the actual problem and
> proposed solution. Because if that were clear, I don't think we would be
> having this particular discussion.

Yes, the bigger problem of scaling things with DVFS and taking
big.LITTLE into account is not addressed in this patch set. This is the
scale-invariance problem that we discussed at Ksummit.

I will send a few patches shortly that provides the usage
scale-invariance. The last bit missing is then to scale capacity too (in
the right places), such that we can determine whether as cpu is full or
not, and how much spare capacity it has by comparing scale usage
(utilization) with scaled capacity. Like in Nico's example above.

Morten
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Morten Rasmussen Sept. 17, 2014, 6:58 p.m. | #11
On Wed, Sep 17, 2014 at 07:45:27PM +0100, Morten Rasmussen wrote:
> On Mon, Sep 15, 2014 at 09:01:59PM +0100, Peter Zijlstra wrote:
> > On Mon, Sep 15, 2014 at 03:07:44PM -0400, Nicolas Pitre wrote:
> > > On Mon, 15 Sep 2014, Peter Zijlstra wrote:
> > 
> > > Let's suppose a task running on a 1GHz CPU producing a load of 100.
> > > 
> > > The same task on a 100MHz CPU would produce a load of 1000 because that 
> > > CPU is 10x slower.  So to properly evaluate the load of a task when 
> > > moving it around, we want to normalize its load based on the CPU 
> > > performance.  In this case the correction factor would be 0.1.
> > > 
> > > Given those normalized loads, we need to scale CPU capacity as well.  If 
> > > the 1GHz CPU can handle 50 of those tasks it has a capacity of 5000.
> > > 
> > > In theory the 100MHz CPU could handle only 5 of those tasks, meaning it 
> > > has a normalized capacity of 500, but only if the load metric is already 
> > > normalized as well.
> > > 
> > > Or am I completely missing the point here?
> > 
> > So I was thinking of the usage as per the next patch; where we decide if
> > a cpu is 'full' or not based on the utilization measure. For this
> > measure we're not interested in inter CPU relations at all, and any use
> > of capacity scaling simply doesn't make sense.
> 
> Right. You don't need to scale capacity to determine whether a cpu is
> full or not if you don't have DVFS, but I don't think it hurts if it is
> done right. We need the scaling to figure out how much capacity is
> available.
> 
> > But I think you asking this question shows a 'bigger' problem in that
> > the Changelogs are entirely failing at describing the actual problem and
> > proposed solution. Because if that were clear, I don't think we would be
> > having this particular discussion.
> 
> Yes, the bigger problem of scaling things with DVFS and taking
> big.LITTLE into account is not addressed in this patch set. This is the
> scale-invariance problem that we discussed at Ksummit.

big.LITTLE is factored in this patch set, but DVFS is not. My bad.

Morten
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Peter Zijlstra Sept. 17, 2014, 10:25 p.m. | #12
On Tue, Sep 16, 2014 at 12:14:54AM +0200, Vincent Guittot wrote:
> On 15 September 2014 13:42, Peter Zijlstra <peterz@infradead.org> wrote:

> > OK, I've reconsidered _again_, I still don't get it.
> >
> > So fundamentally I think its wrong to scale with the capacity; it just
> > doesn't make any sense. Consider big.little stuff, their CPUs are
> > inherently asymmetric in capacity, but that doesn't matter one whit for
> > utilization numbers. If a core is fully consumed its fully consumed, no
> > matter how much work it can or can not do.
> >
> >
> > So the only thing that needs correcting is the fact that these
> > statistics are based on clock_task and some of that time can end up in
> > other scheduling classes, at which point we'll never get 100% even
> > though we're 'saturated'. But correcting for that using capacity doesn't
> > 'work'.
> 
> I'm not sure to catch your last point because the capacity is the only
> figures that take into account the "time" consumed by other classes.
> Have you got in mind another way to take into account the other
> classes ?

So that was the entire point of stuffing capacity in? Note that that
point was not at all clear.

This is very much like 'all we have is a hammer, and therefore
everything is a nail'. The rt fraction is a 'small' part of what the
capacity is.

> So we have cpu_capacity that is the capacity that can be currently
> used by cfs class
> We have cfs.usage_load_avg that is the sum of running time of cfs
> tasks on the CPU and reflect the % of usage of this CPU by CFS tasks
> We have to use the same metrics to compare available capacity for CFS
> and current cfs usage

-ENOPARSE

> Now we have to use the same unit so we can either weight the
> cpu_capacity_orig with the cfs.usage_load_avg and compare it with
> cpu_capacity
> or with divide cpu_capacity by cpu_capacity_orig and scale it into the
> SCHED_LOAD_SCALE range. Is It what you are proposing ?

I'm so not getting it; orig vs capacity still includes
arch_scale_freq_capacity(), so that is not enough to isolate the rt
fraction.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Peter Zijlstra Sept. 17, 2014, 11:03 p.m. | #13
On Wed, Sep 17, 2014 at 07:45:27PM +0100, Morten Rasmussen wrote:
> Right. You don't need to scale capacity to determine whether a cpu is
> full or not if you don't have DVFS, but I don't think it hurts if it is
> done right. We need the scaling to figure out how much capacity is
> available.

Maybe I'm particularly dense, but I'm not seeing how the proposed thing
is 'right' in any way.

There are no words on what specifically is addressed and how it is
achieved.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Vincent Guittot Sept. 18, 2014, 1:32 a.m. | #14
On 17 September 2014 15:25, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Sep 16, 2014 at 12:14:54AM +0200, Vincent Guittot wrote:
>> On 15 September 2014 13:42, Peter Zijlstra <peterz@infradead.org> wrote:
>
>> > OK, I've reconsidered _again_, I still don't get it.
>> >
>> > So fundamentally I think its wrong to scale with the capacity; it just
>> > doesn't make any sense. Consider big.little stuff, their CPUs are
>> > inherently asymmetric in capacity, but that doesn't matter one whit for
>> > utilization numbers. If a core is fully consumed its fully consumed, no
>> > matter how much work it can or can not do.
>> >
>> >
>> > So the only thing that needs correcting is the fact that these
>> > statistics are based on clock_task and some of that time can end up in
>> > other scheduling classes, at which point we'll never get 100% even
>> > though we're 'saturated'. But correcting for that using capacity doesn't
>> > 'work'.
>>
>> I'm not sure to catch your last point because the capacity is the only
>> figures that take into account the "time" consumed by other classes.
>> Have you got in mind another way to take into account the other
>> classes ?
>
> So that was the entire point of stuffing capacity in? Note that that
> point was not at all clear.
>
> This is very much like 'all we have is a hammer, and therefore
> everything is a nail'. The rt fraction is a 'small' part of what the
> capacity is.
>
>> So we have cpu_capacity that is the capacity that can be currently
>> used by cfs class
>> We have cfs.usage_load_avg that is the sum of running time of cfs
>> tasks on the CPU and reflect the % of usage of this CPU by CFS tasks
>> We have to use the same metrics to compare available capacity for CFS
>> and current cfs usage
>
> -ENOPARSE
>
>> Now we have to use the same unit so we can either weight the
>> cpu_capacity_orig with the cfs.usage_load_avg and compare it with
>> cpu_capacity
>> or with divide cpu_capacity by cpu_capacity_orig and scale it into the
>> SCHED_LOAD_SCALE range. Is It what you are proposing ?
>
> I'm so not getting it; orig vs capacity still includes
> arch_scale_freq_capacity(), so that is not enough to isolate the rt
> fraction.

This patch does not try to solve any scale invariance issue. This
patch removes capacity_factor because it rarely works correctly.
capacity_factor tries to compute how many tasks a group of CPUs can
handle at the time we are doing the load balance. The capacity_factor
is hardly working for SMT system: it sometimes works for big cores and
but fails to do the right thing for little cores.

Below are two examples to illustrate the problem that this patch solves:

capacity_factor makes the assumption that max capacity of a CPU is
SCHED_CAPACITY_SCALE and the load of a thread is always is
SCHED_LOAD_SCALE. It compares the output of these figures with the sum
of nr_running to decide if a group is overloaded or not.

But if the default capacity of a CPU is less than SCHED_CAPACITY_SCALE
(640 as an example), a group of 3 CPUS will have a max capacity_factor
of 2 ( div_round_closest(3x640/1024) = 2) which means that it will be
seen as overloaded if we have only one task per CPU.

Then, if the default capacity of a CPU is greater than
SCHED_CAPACITY_SCALE (1512 as an example), a group of 4 CPUs will have
a capacity_factor of 4 (at max and thanks to the fix[0] for SMT system
that prevent the apparition of ghost CPUs) but if one CPU is fully
used by a rt task (and its capacity is reduced to nearly nothing), the
capacity factor of the group will still be 4
(div_round_closest(3*1512/1024) = 5).

So, this patch tries to solve this issue by removing capacity_factor
and replacing it with the 2 following metrics :
-the available CPU capacity for CFS tasks which is the one currently
used by load_balance
-the capacity that are effectively used by CFS tasks on the CPU. For
that, i have re-introduced the usage_avg_contrib which is in the range
[0..SCHED_CPU_LOAD] whatever the capacity of the CPU on which the task
is running, is. This usage_avg_contrib doesn't solve the scaling
in-variance problem, so i have to scale the usage with original
capacity in get_cpu_utilization (that will become get_cpu_usage in the
next version) in order to compare it with available capacity.

Once the scaling invariance will have been added in usage_avg_contrib,
we can remove the scale by cpu_capacity_orig in get_cpu_utilization.
But the scaling invariance will come in another patchset.

Hope that this explanation makes the goal of this patchset clearer.
And I can add this explanation in the commit log if you found it clear
enough

Vincent

[0] https://lkml.org/lkml/2013/8/28/194
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2f95d1c..80bd64e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5673,13 +5673,13 @@  struct sg_lb_stats {
 	unsigned long sum_weighted_load; /* Weighted load of group's tasks */
 	unsigned long load_per_task;
 	unsigned long group_capacity;
+	unsigned long group_capacity_orig;
 	unsigned long group_utilization; /* Total utilization of the group */
 	unsigned int sum_nr_running; /* Nr tasks running in the group */
-	unsigned int group_capacity_factor;
 	unsigned int idle_cpus;
 	unsigned int group_weight;
 	enum group_type group_type;
-	int group_has_free_capacity;
+	int group_out_of_capacity;
 #ifdef CONFIG_NUMA_BALANCING
 	unsigned int nr_numa_running;
 	unsigned int nr_preferred_running;
@@ -5901,31 +5901,6 @@  void update_group_capacity(struct sched_domain *sd, int cpu)
 }
 
 /*
- * Try and fix up capacity for tiny siblings, this is needed when
- * things like SD_ASYM_PACKING need f_b_g to select another sibling
- * which on its own isn't powerful enough.
- *
- * See update_sd_pick_busiest() and check_asym_packing().
- */
-static inline int
-fix_small_capacity(struct sched_domain *sd, struct sched_group *group)
-{
-	/*
-	 * Only siblings can have significantly less than SCHED_CAPACITY_SCALE
-	 */
-	if (!(sd->flags & SD_SHARE_CPUCAPACITY))
-		return 0;
-
-	/*
-	 * If ~90% of the cpu_capacity is still there, we're good.
-	 */
-	if (group->sgc->capacity * 32 > group->sgc->capacity_orig * 29)
-		return 1;
-
-	return 0;
-}
-
-/*
  * Group imbalance indicates (and tries to solve) the problem where balancing
  * groups is inadequate due to tsk_cpus_allowed() constraints.
  *
@@ -5959,38 +5934,37 @@  static inline int sg_imbalanced(struct sched_group *group)
 	return group->sgc->imbalance;
 }
 
-/*
- * Compute the group capacity factor.
- *
- * Avoid the issue where N*frac(smt_capacity) >= 1 creates 'phantom' cores by
- * first dividing out the smt factor and computing the actual number of cores
- * and limit unit capacity with that.
- */
-static inline int sg_capacity_factor(struct lb_env *env, struct sched_group *group)
+static inline int group_has_free_capacity(struct sg_lb_stats *sgs,
+			struct lb_env *env)
 {
-	unsigned int capacity_factor, smt, cpus;
-	unsigned int capacity, capacity_orig;
+	if ((sgs->group_capacity_orig * 100) >
+			(sgs->group_utilization * env->sd->imbalance_pct))
+		return 1;
+
+	if (sgs->sum_nr_running < sgs->group_weight)
+		return 1;
 
-	capacity = group->sgc->capacity;
-	capacity_orig = group->sgc->capacity_orig;
-	cpus = group->group_weight;
+	return 0;
+}
 
-	/* smt := ceil(cpus / capacity), assumes: 1 < smt_capacity < 2 */
-	smt = DIV_ROUND_UP(SCHED_CAPACITY_SCALE * cpus, capacity_orig);
-	capacity_factor = cpus / smt; /* cores */
+static inline int group_is_overloaded(struct sg_lb_stats *sgs,
+			struct lb_env *env)
+{
+	if (sgs->sum_nr_running <= sgs->group_weight)
+		return 0;
 
-	capacity_factor = min_t(unsigned,
-		capacity_factor, DIV_ROUND_CLOSEST(capacity, SCHED_CAPACITY_SCALE));
-	if (!capacity_factor)
-		capacity_factor = fix_small_capacity(env->sd, group);
+	if ((sgs->group_capacity_orig * 100) <
+			(sgs->group_utilization * env->sd->imbalance_pct))
+		return 1;
 
-	return capacity_factor;
+	return 0;
 }
 
 static enum group_type
-group_classify(struct sched_group *group, struct sg_lb_stats *sgs)
+group_classify(struct sched_group *group, struct sg_lb_stats *sgs,
+			struct lb_env *env)
 {
-	if (sgs->sum_nr_running > sgs->group_capacity_factor)
+	if (group_is_overloaded(sgs, env))
 		return group_overloaded;
 
 	if (sg_imbalanced(group))
@@ -6043,6 +6017,7 @@  static inline void update_sg_lb_stats(struct lb_env *env,
 			sgs->idle_cpus++;
 	}
 
+	sgs->group_capacity_orig = group->sgc->capacity_orig;
 	/* Adjust by relative CPU capacity of the group */
 	sgs->group_capacity = group->sgc->capacity;
 	sgs->avg_load = (sgs->group_load*SCHED_CAPACITY_SCALE) / sgs->group_capacity;
@@ -6051,11 +6026,10 @@  static inline void update_sg_lb_stats(struct lb_env *env,
 		sgs->load_per_task = sgs->sum_weighted_load / sgs->sum_nr_running;
 
 	sgs->group_weight = group->group_weight;
-	sgs->group_capacity_factor = sg_capacity_factor(env, group);
-	sgs->group_type = group_classify(group, sgs);
 
-	if (sgs->group_capacity_factor > sgs->sum_nr_running)
-		sgs->group_has_free_capacity = 1;
+	sgs->group_type = group_classify(group, sgs, env);
+
+	sgs->group_out_of_capacity = group_is_overloaded(sgs, env);
 }
 
 /**
@@ -6185,17 +6159,21 @@  static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 
 		/*
 		 * In case the child domain prefers tasks go to siblings
-		 * first, lower the sg capacity factor to one so that we'll try
+		 * first, lower the sg capacity to one so that we'll try
 		 * and move all the excess tasks away. We lower the capacity
 		 * of a group only if the local group has the capacity to fit
-		 * these excess tasks, i.e. nr_running < group_capacity_factor. The
+		 * these excess tasks, i.e. group_capacity > 0. The
 		 * extra check prevents the case where you always pull from the
 		 * heaviest group when it is already under-utilized (possible
 		 * with a large weight task outweighs the tasks on the system).
 		 */
 		if (prefer_sibling && sds->local &&
-		    sds->local_stat.group_has_free_capacity)
-			sgs->group_capacity_factor = min(sgs->group_capacity_factor, 1U);
+		    group_has_free_capacity(&sds->local_stat, env)) {
+			if (sgs->sum_nr_running > 1)
+				sgs->group_out_of_capacity = 1;
+			sgs->group_capacity = min(sgs->group_capacity,
+						SCHED_CAPACITY_SCALE);
+		}
 
 		if (update_sd_pick_busiest(env, sds, sg, sgs)) {
 			sds->busiest = sg;
@@ -6373,11 +6351,12 @@  static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	 */
 	if (busiest->group_type == group_overloaded &&
 	    local->group_type   == group_overloaded) {
-		load_above_capacity =
-			(busiest->sum_nr_running - busiest->group_capacity_factor);
-
-		load_above_capacity *= (SCHED_LOAD_SCALE * SCHED_CAPACITY_SCALE);
-		load_above_capacity /= busiest->group_capacity;
+		load_above_capacity = busiest->sum_nr_running *
+					SCHED_LOAD_SCALE;
+		if (load_above_capacity > busiest->group_capacity)
+			load_above_capacity -= busiest->group_capacity;
+		else
+			load_above_capacity = ~0UL;
 	}
 
 	/*
@@ -6440,6 +6419,7 @@  static struct sched_group *find_busiest_group(struct lb_env *env)
 	local = &sds.local_stat;
 	busiest = &sds.busiest_stat;
 
+	/* ASYM feature bypasses nice load balance check */
 	if ((env->idle == CPU_IDLE || env->idle == CPU_NEWLY_IDLE) &&
 	    check_asym_packing(env, &sds))
 		return sds.busiest;
@@ -6460,8 +6440,9 @@  static struct sched_group *find_busiest_group(struct lb_env *env)
 		goto force_balance;
 
 	/* SD_BALANCE_NEWIDLE trumps SMP nice when underutilized */
-	if (env->idle == CPU_NEWLY_IDLE && local->group_has_free_capacity &&
-	    !busiest->group_has_free_capacity)
+	if (env->idle == CPU_NEWLY_IDLE &&
+			group_has_free_capacity(local, env) &&
+			busiest->group_out_of_capacity)
 		goto force_balance;
 
 	/*
@@ -6519,7 +6500,7 @@  static struct rq *find_busiest_queue(struct lb_env *env,
 	int i;
 
 	for_each_cpu_and(i, sched_group_cpus(group), env->cpus) {
-		unsigned long capacity, capacity_factor, wl;
+		unsigned long capacity, wl;
 		enum fbq_type rt;
 
 		rq = cpu_rq(i);
@@ -6548,9 +6529,6 @@  static struct rq *find_busiest_queue(struct lb_env *env,
 			continue;
 
 		capacity = capacity_of(i);
-		capacity_factor = DIV_ROUND_CLOSEST(capacity, SCHED_CAPACITY_SCALE);
-		if (!capacity_factor)
-			capacity_factor = fix_small_capacity(env->sd, group);
 
 		wl = weighted_cpuload(i);
 
@@ -6558,7 +6536,10 @@  static struct rq *find_busiest_queue(struct lb_env *env,
 		 * When comparing with imbalance, use weighted_cpuload()
 		 * which is not scaled with the cpu capacity.
 		 */
-		if (capacity_factor && rq->nr_running == 1 && wl > env->imbalance)
+
+		if (rq->nr_running == 1 && wl > env->imbalance &&
+		    ((capacity * env->sd->imbalance_pct) >=
+				(rq->cpu_capacity_orig * 100)))
 			continue;
 
 		/*