[v6,5/6] sched: replace capacity_factor by usage

Message ID 1411488485-10025-6-git-send-email-vincent.guittot@linaro.org
State New

Commit Message

Vincent Guittot Sept. 23, 2014, 4:08 p.m.
The scheduler tries to compute how many tasks a group of CPUs can handle by
assuming that a task's load is SCHED_LOAD_SCALE and a CPU's capacity is
SCHED_CAPACITY_SCALE, but the capacity_factor hardly works for SMT systems:
it sometimes works for big cores but fails to do the right thing for little
cores.

Below are two examples to illustrate the problem that this patch solves:

1 - capacity_factor makes the assumption that the max capacity of a CPU is
SCHED_CAPACITY_SCALE and that the load of a thread is always
SCHED_LOAD_SCALE. It compares the output of these figures with the sum
of nr_running to decide whether a group is overloaded or not.

But if the default capacity of a CPU is less than SCHED_CAPACITY_SCALE
(640 as an example), a group of 3 CPUs will have a max capacity_factor
of 2 (div_round_closest(3x640/1024) = 2), which means that it will be
seen as overloaded even if we have only one task per CPU.

2 - Then, if the default capacity of a CPU is greater than
SCHED_CAPACITY_SCALE (1512 as an example), a group of 4 CPUs will have
a capacity_factor of 4 (at max, thanks to the fix[0] for SMT systems
that prevents ghost CPUs from appearing), but if one CPU is fully
used by an rt task (and its capacity is reduced to nearly nothing), the
capacity_factor of the group will still be 4
(div_round_closest(3*1512/1024) = 4).

So, this patch tries to solve this issue by removing capacity_factor
and replacing it with the 2 following metrics:
-The available CPU capacity for CFS tasks, which is already used by
load_balance.
-The usage of the CPU by the CFS tasks. For the latter, I have
re-introduced utilization_avg_contrib, which is in the range
[0..SCHED_CPU_LOAD] whatever the capacity of the CPU is.

This implementation of utilization_avg_contrib doesn't solve the
scale-invariance problem, so I have to scale the utilization with the
original capacity of the CPU in order to get the CPU usage and compare it
with the capacity. Once scale invariance has been added to
utilization_avg_contrib, we will remove the scaling of
utilization_avg_contrib by cpu_capacity_orig in get_cpu_usage. But scale
invariance will come in another patchset.

Finally, the sched_group->sched_group_capacity->capacity_orig has been removed
because it's more used during load balance.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/core.c  |  12 -----
 kernel/sched/fair.c  | 131 ++++++++++++++++++++-------------------------------
 kernel/sched/sched.h |   2 +-
 3 files changed, 52 insertions(+), 93 deletions(-)

Comments

Dietmar Eggemann Sept. 29, 2014, 1:39 p.m. | #1
On 23/09/14 17:08, Vincent Guittot wrote:
> The scheduler tries to compute how many tasks a group of CPUs can handle by
> assuming that a task's load is SCHED_LOAD_SCALE and a CPU capacity is
> SCHED_CAPACITY_SCALE but the capacity_factor is hardly working for SMT system,
> it sometimes works for big cores but fails to do the right thing for little
> cores.
> 
> Below are two examples to illustrate the problem that this patch solves:
> 
> 1 - capacity_factor makes the assumption that max capacity of a CPU is
> SCHED_CAPACITY_SCALE and the load of a thread is always is
> SCHED_LOAD_SCALE. It compares the output of these figures with the sum
> of nr_running to decide if a group is overloaded or not.
> 
> But if the default capacity of a CPU is less than SCHED_CAPACITY_SCALE
> (640 as an example), a group of 3 CPUS will have a max capacity_factor
> of 2 ( div_round_closest(3x640/1024) = 2) which means that it will be
> seen as overloaded if we have only one task per CPU.
> 
> 2 - Then, if the default capacity of a CPU is greater than
> SCHED_CAPACITY_SCALE (1512 as an example), a group of 4 CPUs will have
> a capacity_factor of 4 (at max and thanks to the fix[0] for SMT system
> that prevent the apparition of ghost CPUs) but if one CPU is fully
> used by a rt task (and its capacity is reduced to nearly nothing), the
> capacity factor of the group will still be 4
> (div_round_closest(3*1512/1024) = 5).
> 
> So, this patch tries to solve this issue by removing capacity_factor
> and replacing it with the 2 following metrics :
> -The available CPU's capacity for CFS tasks which is the already used by
> load_balance.
> -The usage of the CPU by the CFS tasks. For the latter, I have
> re-introduced the utilization_avg_contrib which is in the range
> [0..SCHED_CPU_LOAD] whatever the capacity of the CPU is.

IMHO, this last sentence is misleading. The usage of a cpu can be
temporarily unbounded (in case a lot of tasks have just been spawned on
this cpu, testcase: hackbench), but it converges very quickly towards a
value in [0..1024]. Your implementation already handles this case by
capping usage to cpu_rq(cpu)->capacity_orig + 1.
BTW, I couldn't find the definition of SCHED_CPU_LOAD.

[...]

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Morten Rasmussen Oct. 2, 2014, 4:57 p.m. | #2
On Tue, Sep 23, 2014 at 05:08:04PM +0100, Vincent Guittot wrote:
> Below are two examples to illustrate the problem that this patch solves:
> 
> 1 - capacity_factor makes the assumption that max capacity of a CPU is
> SCHED_CAPACITY_SCALE and the load of a thread is always is
> SCHED_LOAD_SCALE. It compares the output of these figures with the sum
> of nr_running to decide if a group is overloaded or not.
> 
> But if the default capacity of a CPU is less than SCHED_CAPACITY_SCALE
> (640 as an example), a group of 3 CPUS will have a max capacity_factor
> of 2 ( div_round_closest(3x640/1024) = 2) which means that it will be
> seen as overloaded if we have only one task per CPU.

I did some testing on TC2, which has the setup you describe above on the
A7 cluster when the clock-frequency property is set in DT. The two A15s
have max capacities above 1024. When using sysbench with five threads, I
still get three tasks on the two A15s and two tasks on the three A7s,
leaving one cpu idle (on average).

Using cpu utilization (usage) does correctly identify the A15 cluster as
overloaded and it even gets as far as selecting the A15 running two
tasks as the source cpu in load_balance(). However, load_balance() bails
out without pulling the task due to calculate_imbalance() returning a
zero imbalance. calculate_imbalance() bails out just before the hunk you
changed due to comparison of the sched_group avg_loads. sgs->avg_load is
basically the sum of load in the group divided by its capacity. Since
load isn't scaled the avg_load of the overloaded A15 group is slightly
_lower_ than the partially idle A7 group. Hence calculate_imbalance()
bails out, which isn't what we want.

I think we need to have a closer look at the imbalance calculation code
and any other users of sgs->avg_load to get rid of all code making
assumptions about cpu capacity.

Morten
Vincent Guittot Oct. 3, 2014, 7:24 a.m. | #3
On 2 October 2014 18:57, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> On Tue, Sep 23, 2014 at 05:08:04PM +0100, Vincent Guittot wrote:
>> Below are two examples to illustrate the problem that this patch solves:
>>
>> 1 - capacity_factor makes the assumption that max capacity of a CPU is
>> SCHED_CAPACITY_SCALE and the load of a thread is always is
>> SCHED_LOAD_SCALE. It compares the output of these figures with the sum
>> of nr_running to decide if a group is overloaded or not.
>>
>> But if the default capacity of a CPU is less than SCHED_CAPACITY_SCALE
>> (640 as an example), a group of 3 CPUS will have a max capacity_factor
>> of 2 ( div_round_closest(3x640/1024) = 2) which means that it will be
>> seen as overloaded if we have only one task per CPU.
>
> I did some testing on TC2 which has the setup you describe above on the
> A7 cluster when the clock-frequency property is set in DT. The two A15s
> have max capacities above 1024. When using sysbench with five threads I
> still get three tasks on the two A15s and two tasks on the three A7
> leaving one cpu idle (on average).
>
> Using cpu utilization (usage) does correctly identify the A15 cluster as
> overloaded and it even gets as far as selecting the A15 running two
> tasks as the source cpu in load_balance(). However, load_balance() bails
> out without pulling the task due to calculate_imbalance() returning a
> zero imbalance. calculate_imbalance() bails out just before the hunk you
> changed due to comparison of the sched_group avg_loads. sgs->avg_load is
> basically the sum of load in the group divided by its capacity. Since
> load isn't scaled the avg_load of the overloaded A15 group is slightly
> _lower_ than the partially idle A7 group. Hence calculate_imbalance()
> bails out, which isn't what we want.
>
> I think we need to have a closer look at the imbalance calculation code
> and any other users of sgs->avg_load to get rid of all code making
> assumptions about cpu capacity.

We already had this discussion during the review of a previous version
https://lkml.org/lkml/2014/6/3/422
and my answer has not changed: this patch is necessary to solve the
1-task-per-CPU issue of HMP systems, but it is not enough. I have a patch
for solving the imbalance calculation issue, and I have planned to send
it once this patch is in good shape for being accepted by Peter.

I don't want to mix this patch and the next one because they are
addressing different problems: this one is about how to evaluate the
capacity of a system and detect when a group is overloaded, and the next
one will handle the case where the balance of the system can't rely on
the average load figures of the groups because we have a significant
capacity difference between groups. Note that I haven't specifically
mentioned HMP for the last patch because SMP systems can also take
advantage of it.

Regards,
Vincent

>
> Morten
Morten Rasmussen Oct. 3, 2014, 9:35 a.m. | #4
On Fri, Oct 03, 2014 at 08:24:23AM +0100, Vincent Guittot wrote:
> On 2 October 2014 18:57, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> > On Tue, Sep 23, 2014 at 05:08:04PM +0100, Vincent Guittot wrote:
> >> Below are two examples to illustrate the problem that this patch solves:
> >>
> >> 1 - capacity_factor makes the assumption that max capacity of a CPU is
> >> SCHED_CAPACITY_SCALE and the load of a thread is always is
> >> SCHED_LOAD_SCALE. It compares the output of these figures with the sum
> >> of nr_running to decide if a group is overloaded or not.
> >>
> >> But if the default capacity of a CPU is less than SCHED_CAPACITY_SCALE
> >> (640 as an example), a group of 3 CPUS will have a max capacity_factor
> >> of 2 ( div_round_closest(3x640/1024) = 2) which means that it will be
> >> seen as overloaded if we have only one task per CPU.
> >
> > I did some testing on TC2 which has the setup you describe above on the
> > A7 cluster when the clock-frequency property is set in DT. The two A15s
> > have max capacities above 1024. When using sysbench with five threads I
> > still get three tasks on the two A15s and two tasks on the three A7
> > leaving one cpu idle (on average).
> >
> > Using cpu utilization (usage) does correctly identify the A15 cluster as
> > overloaded and it even gets as far as selecting the A15 running two
> > tasks as the source cpu in load_balance(). However, load_balance() bails
> > out without pulling the task due to calculate_imbalance() returning a
> > zero imbalance. calculate_imbalance() bails out just before the hunk you
> > changed due to comparison of the sched_group avg_loads. sgs->avg_load is
> > basically the sum of load in the group divided by its capacity. Since
> > load isn't scaled the avg_load of the overloaded A15 group is slightly
> > _lower_ than the partially idle A7 group. Hence calculate_imbalance()
> > bails out, which isn't what we want.
> >
> > I think we need to have a closer look at the imbalance calculation code
> > and any other users of sgs->avg_load to get rid of all code making
> > assumptions about cpu capacity.
> 
> We already had this discussion during the review of a previous version
> https://lkml.org/lkml/2014/6/3/422
> and my answer has not changed; This patch is necessary to solve the 1
> task per CPU issue of the HMP system but is not enough. I have a patch
> for solving the imbalance calculation issue and i have planned to send
> it once this patch will be in a good shape for being accepted by
> Peter.
> 
> I don't want to mix this patch and the next one because there are
> addressing different problem: this one is how evaluate the capacity of
> a system and detect when a group is overloaded and the next one will
> handle the case when the balance of the system can't rely on the
> average load figures of the group because we have a significant
> capacity difference between groups. Not that i haven't specifically
> mentioned HMP for the last patch because SMP system can also take
> advantage of it

You do mention 'big' and 'little' cores in your commit message and quote
example numbers which are identical to the cpu capacities for TC2. That
led me to believe that this patch would address the issues we see for
HMP systems. But that is clearly wrong. I would suggest that you drop
mentioning big and little cores and stick to only describing cpu
capacity reductions due to rt tasks and irq, to avoid any confusion about
the purpose of the patch. Maybe explicitly say that non-default cpu
capacities (capacity_orig) aren't addressed yet.

I think the two problems you describe are very much related. This patch
set is half the solution of the HMP balancing problem. We just need the
last bit for avg_load and then we can add scale-invariance on top.

IMHO, it would be good to have all the bits and pieces for cpu capacity
scaling and scaling of per-entity load-tracking on the table so we fit
things together. We only have patches for parts of the solution posted
so far.

Morten
Vincent Guittot Oct. 3, 2014, 12:50 p.m. | #5
On 3 October 2014 11:35, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> On Fri, Oct 03, 2014 at 08:24:23AM +0100, Vincent Guittot wrote:
>> On 2 October 2014 18:57, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
>> > On Tue, Sep 23, 2014 at 05:08:04PM +0100, Vincent Guittot wrote:
>> >> Below are two examples to illustrate the problem that this patch solves:
>> >>
>> >> 1 - capacity_factor makes the assumption that max capacity of a CPU is
>> >> SCHED_CAPACITY_SCALE and the load of a thread is always is
>> >> SCHED_LOAD_SCALE. It compares the output of these figures with the sum
>> >> of nr_running to decide if a group is overloaded or not.
>> >>
>> >> But if the default capacity of a CPU is less than SCHED_CAPACITY_SCALE
>> >> (640 as an example), a group of 3 CPUS will have a max capacity_factor
>> >> of 2 ( div_round_closest(3x640/1024) = 2) which means that it will be
>> >> seen as overloaded if we have only one task per CPU.
>> >
>> > I did some testing on TC2 which has the setup you describe above on the
>> > A7 cluster when the clock-frequency property is set in DT. The two A15s
>> > have max capacities above 1024. When using sysbench with five threads I
>> > still get three tasks on the two A15s and two tasks on the three A7
>> > leaving one cpu idle (on average).
>> >
>> > Using cpu utilization (usage) does correctly identify the A15 cluster as
>> > overloaded and it even gets as far as selecting the A15 running two
>> > tasks as the source cpu in load_balance(). However, load_balance() bails
>> > out without pulling the task due to calculate_imbalance() returning a
>> > zero imbalance. calculate_imbalance() bails out just before the hunk you
>> > changed due to comparison of the sched_group avg_loads. sgs->avg_load is
>> > basically the sum of load in the group divided by its capacity. Since
>> > load isn't scaled the avg_load of the overloaded A15 group is slightly
>> > _lower_ than the partially idle A7 group. Hence calculate_imbalance()
>> > bails out, which isn't what we want.
>> >
>> > I think we need to have a closer look at the imbalance calculation code
>> > and any other users of sgs->avg_load to get rid of all code making
>> > assumptions about cpu capacity.
>>
>> We already had this discussion during the review of a previous version
>> https://lkml.org/lkml/2014/6/3/422
>> and my answer has not changed; This patch is necessary to solve the 1
>> task per CPU issue of the HMP system but is not enough. I have a patch
>> for solving the imbalance calculation issue and i have planned to send
>> it once this patch will be in a good shape for being accepted by
>> Peter.
>>
>> I don't want to mix this patch and the next one because there are
>> addressing different problem: this one is how evaluate the capacity of
>> a system and detect when a group is overloaded and the next one will
>> handle the case when the balance of the system can't rely on the
>> average load figures of the group because we have a significant
>> capacity difference between groups. Not that i haven't specifically
>> mentioned HMP for the last patch because SMP system can also take
>> advantage of it
>
> You do mention 'big' and 'little' cores in your commit message and quote
> example numbers with are identical to the cpu capacities for TC2. That

By "last patch", I mean the patch about imbalance that I haven't sent
yet; it's not related to this patch.

> lead me to believe that this patch would address the issues we see for
> HMP systems. But that is clearly wrong. I would suggest that you drop

This patch addresses one issue: correctly detect how much capacity we
have and correctly detect when a group is overloaded; HMP systems
clearly fall into this category, but not only them.
This is the only purpose of this patch, not the "1 task per CPU
issue" that you mentioned.

The second patch is about correctly balancing the tasks on a system where
the capacity of one group is significantly less than that of another. It
has nothing to do with capacity computation and overload detection, but
it will use these corrected/new metrics to make a better choice.

Then, there is the "1 task per CPU issue on HMP" that you mentioned,
and the latter will be solved thanks to these 2 patchsets, but it is
not the only/main target of these patchsets, so I don't want to reduce
them to "solve an HMP issue".

> mentioning big and little cores and stick to only describing cpu
> capacity reductions due to rt tasks and irq to avoid any confusion about
> the purpose of the patch. Maybe explicitly say that non-default cpu
> capacities (capacity_orig) isn't addressed yet.

But they are addressed by this patchset. You just mixed various goals
together (see above).

>
> I think the two problems you describe are very much related. This patch
> set is half the solution of the HMP balancing problem. We just need the
> last bit for avg_load and then we can add scale-invariance on top.

I don't see the link between scale invariance and a bad load-balancing
choice. The fact that the current load balancer puts more tasks than
CPUs in a group on an HMP system should not influence or break the
scale-invariant load tracking, should it?

Now, I could send the other patchset, but I'm afraid that this would
generate more confusion than help in the review process.

Regards,
Vincent

>
> IMHO, it would be good to have all the bits and pieces for cpu capacity
> scaling and scaling of per-entity load-tracking on the table so we fit
> things together. We only have patches for parts of the solution posted
> so far.
>
> Morten
Peter Zijlstra Oct. 3, 2014, 3:38 p.m. | #6
On Tue, Sep 23, 2014 at 06:08:04PM +0200, Vincent Guittot wrote:

> This implementation of utilization_avg_contrib doesn't solve the scaling
> in-variance problem, so i have to scale the utilization with original
> capacity of the CPU in order to get the CPU usage and compare it with the
> capacity. Once the scaling invariance will have been added in
> utilization_avg_contrib, we will remove the scale of utilization_avg_contrib
> by cpu_capacity_orig in get_cpu_usage. But the scaling invariance will come
> in another patchset.

I would have expected this in the previous patch that introduced that
lot. Including a few words on how/why the cpu_capacity is a 'good'
approximation etc..

> Finally, the sched_group->sched_group_capacity->capacity_orig has been removed
> because it's more used during load balance.

That sentence is a contradiction, I expect there's a negative gone
missing someplace.
Vincent Guittot Oct. 6, 2014, 8:55 a.m. | #7
On 3 October 2014 17:38, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Sep 23, 2014 at 06:08:04PM +0200, Vincent Guittot wrote:
>
>> This implementation of utilization_avg_contrib doesn't solve the scaling
>> in-variance problem, so i have to scale the utilization with original
>> capacity of the CPU in order to get the CPU usage and compare it with the
>> capacity. Once the scaling invariance will have been added in
>> utilization_avg_contrib, we will remove the scale of utilization_avg_contrib
>> by cpu_capacity_orig in get_cpu_usage. But the scaling invariance will come
>> in another patchset.
>
> I would have expected this in the previous patch that introduced that
> lot. Including a few words on how/why the cpu_capacity is a 'good'
> approximation etc..

That's fair, I can move the explanation.

>
>> Finally, the sched_group->sched_group_capacity->capacity_orig has been removed
>> because it's more used during load balance.
>
> That sentence is a contradiction, I expect there's a negative gone
> missing someplace.

Yes, the "no" of "because it is no more used during load balance" is missing.
Vincent Guittot Nov. 24, 2014, 8:26 a.m. | #8
On 23 November 2014 at 01:22, Wanpeng Li <kernellwp@gmail.com> wrote:
> Hi Vincent,
>
> On 10/3/14, 8:50 PM, Vincent Guittot wrote:
>>
>> On 3 October 2014 11:35, Morten Rasmussen <morten.rasmussen@arm.com>
>> wrote:
>>>
>>> On Fri, Oct 03, 2014 at 08:24:23AM +0100, Vincent Guittot wrote:
>>>>
>>>> On 2 October 2014 18:57, Morten Rasmussen <morten.rasmussen@arm.com>
>>>> wrote:
>>>>>
>>>>> On Tue, Sep 23, 2014 at 05:08:04PM +0100, Vincent Guittot wrote:
>>>>>>
>>>>>> Below are two examples to illustrate the problem that this patch
>>>>>> solves:
>>>>>>
>>>>>> 1 - capacity_factor makes the assumption that max capacity of a CPU is
>>>>>> SCHED_CAPACITY_SCALE and the load of a thread is always is
>>>>>> SCHED_LOAD_SCALE. It compares the output of these figures with the sum
>>>>>> of nr_running to decide if a group is overloaded or not.
>>>>>>
>>>>>> But if the default capacity of a CPU is less than SCHED_CAPACITY_SCALE
>>>>>> (640 as an example), a group of 3 CPUS will have a max capacity_factor
>>>>>> of 2 ( div_round_closest(3x640/1024) = 2) which means that it will be
>>>>>> seen as overloaded if we have only one task per CPU.
>>>>>
>>>>> I did some testing on TC2 which has the setup you describe above on the
>>>>> A7 cluster when the clock-frequency property is set in DT. The two A15s
>>>>> have max capacities above 1024. When using sysbench with five threads I
>>>>> still get three tasks on the two A15s and two tasks on the three A7
>>>>> leaving one cpu idle (on average).
>>>>>
>>>>> Using cpu utilization (usage) does correctly identify the A15 cluster
>>>>> as
>>>>> overloaded and it even gets as far as selecting the A15 running two
>>>>> tasks as the source cpu in load_balance(). However, load_balance()
>>>>> bails
>>>>> out without pulling the task due to calculate_imbalance() returning a
>>>>> zero imbalance. calculate_imbalance() bails out just before the hunk
>>>>> you
>>>>> changed due to comparison of the sched_group avg_loads. sgs->avg_load
>>>>> is
>>>>> basically the sum of load in the group divided by its capacity. Since
>>>>> load isn't scaled the avg_load of the overloaded A15 group is slightly
>>>>> _lower_ than the partially idle A7 group. Hence calculate_imbalance()
>>>>> bails out, which isn't what we want.
>>>>>
>>>>> I think we need to have a closer look at the imbalance calculation code
>>>>> and any other users of sgs->avg_load to get rid of all code making
>>>>> assumptions about cpu capacity.
>>>>
>>>> We already had this discussion during the review of a previous version
>>>> https://lkml.org/lkml/2014/6/3/422
>>>> and my answer has not changed; This patch is necessary to solve the 1
>>>> task per CPU issue of the HMP system but is not enough. I have a patch
>>>> for solving the imbalance calculation issue and i have planned to send
>>>> it once this patch will be in a good shape for being accepted by
>>>> Peter.
>>>>
>>>> I don't want to mix this patch and the next one because there are
>>>> addressing different problem: this one is how evaluate the capacity of
>>>> a system and detect when a group is overloaded and the next one will
>>>> handle the case when the balance of the system can't rely on the
>>>> average load figures of the group because we have a significant
>>>> capacity difference between groups. Not that i haven't specifically
>>>> mentioned HMP for the last patch because SMP system can also take
>>>> advantage of it
>>>
>>> You do mention 'big' and 'little' cores in your commit message and quote
>>> example numbers with are identical to the cpu capacities for TC2. That
>>
>> By last patch, i mean the patch about imbalance that i haven't sent
>> yet, but it's not related with this patch
>>
>>> lead me to believe that this patch would address the issues we see for
>>> HMP systems. But that is clearly wrong. I would suggest that you drop
>>
>> This patch addresses one issue: correctly detect how much capacity we
>> have and correctly detect when the group is overloaded; HMP system
>
>
> What's the meaning of "HMP system"?

Heterogeneous Multi-Processor (HMP) systems are systems that gather
processors with different micro-architectures and/or different max
frequencies. The main result is that the processors don't have the same
maximum capacity or the same DMIPS/W efficiency.

Regards,
Vincent
>
> Regards,
> Wanpeng Li
>
>> clearly falls in this category but not only.
>> This is the only purpose of this patch and not the "1 task per CPU
>> issue" that you mentioned.
>>
>> The second patch is about correctly balance the tasks on system where
>> the capacity of a group is significantly less than another group. It
>> has nothing to do in capacity computation and overload detection but
>> it will use these corrected/new metrics to make a better choice.
>>
>> Then, there is the "1 task per CPU issue on HMP" that you mentioned
>> and this latter will be solved thanks to these 2 patchsets but this is
>> not the only/main target of these patchsets so i don't want to reduce
>> them into: "solve an HMP issue"
>>
>>> mentioning big and little cores and stick to only describing cpu
>>> capacity reductions due to rt tasks and irq to avoid any confusion about
>>> the purpose of the patch. Maybe explicitly say that non-default cpu
>>> capacities (capacity_orig) isn't addressed yet.
>>
>> But they are addressed by this patchset. you just mixed various goal
>> together (see above)
>>
>>> I think the two problems you describe are very much related. This patch
>>> set is half the solution of the HMP balancing problem. We just need the
>>> last bit for avg_load and then we can add scale-invariance on top.
>>
>> i don't see the link between scale invariance and a bad load-balancing
>> choice. The fact that the current load balancer puts more tasks than
>> CPUs in a group on HMP system should not influence or break the scale
>> invariance load tracking. Isn't it ?
>>
>> Now, i could send the other patchset but i'm afraid that this will
>> generate more confusion than help in the process review.
>>
>> Regards,
>> Vincent
>>
>>> IMHO, it would be good to have all the bits and pieces for cpu capacity
>>> scaling and scaling of per-entity load-tracking on the table so we fit
>>> things together. We only have patches for parts of the solution posted
>>> so far.
>>>
>>> Morten
>>
>
>

Patch

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e20f203..c7c8ac4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5342,17 +5342,6 @@  static int sched_domain_debug_one(struct sched_domain *sd, int cpu, int level,
 			break;
 		}
 
-		/*
-		 * Even though we initialize ->capacity to something semi-sane,
-		 * we leave capacity_orig unset. This allows us to detect if
-		 * domain iteration is still funny without causing /0 traps.
-		 */
-		if (!group->sgc->capacity_orig) {
-			printk(KERN_CONT "\n");
-			printk(KERN_ERR "ERROR: domain->cpu_capacity not set\n");
-			break;
-		}
-
 		if (!cpumask_weight(sched_group_cpus(group))) {
 			printk(KERN_CONT "\n");
 			printk(KERN_ERR "ERROR: empty group\n");
@@ -5837,7 +5826,6 @@  build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 		 * die on a /0 trap.
 		 */
 		sg->sgc->capacity = SCHED_CAPACITY_SCALE * cpumask_weight(sg_span);
-		sg->sgc->capacity_orig = sg->sgc->capacity;
 
 		/*
 		 * Make sure the first group of this domain contains the
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4097e3f..2ba278c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5676,11 +5676,10 @@  struct sg_lb_stats {
 	unsigned long group_capacity;
 	unsigned long group_usage; /* Total usage of the group */
 	unsigned int sum_nr_running; /* Nr tasks running in the group */
-	unsigned int group_capacity_factor;
 	unsigned int idle_cpus;
 	unsigned int group_weight;
 	enum group_type group_type;
-	int group_has_free_capacity;
+	int group_out_of_capacity;
 #ifdef CONFIG_NUMA_BALANCING
 	unsigned int nr_numa_running;
 	unsigned int nr_preferred_running;
@@ -5821,7 +5820,6 @@  static void update_cpu_capacity(struct sched_domain *sd, int cpu)
 	capacity >>= SCHED_CAPACITY_SHIFT;
 
 	cpu_rq(cpu)->cpu_capacity_orig = capacity;
-	sdg->sgc->capacity_orig = capacity;
 
 	if (sched_feat(ARCH_CAPACITY))
 		capacity *= arch_scale_freq_capacity(sd, cpu);
@@ -5844,7 +5842,7 @@  void update_group_capacity(struct sched_domain *sd, int cpu)
 {
 	struct sched_domain *child = sd->child;
 	struct sched_group *group, *sdg = sd->groups;
-	unsigned long capacity, capacity_orig;
+	unsigned long capacity;
 	unsigned long interval;
 
 	interval = msecs_to_jiffies(sd->balance_interval);
@@ -5856,7 +5854,7 @@  void update_group_capacity(struct sched_domain *sd, int cpu)
 		return;
 	}
 
-	capacity_orig = capacity = 0;
+	capacity = 0;
 
 	if (child->flags & SD_OVERLAP) {
 		/*
@@ -5876,19 +5874,15 @@  void update_group_capacity(struct sched_domain *sd, int cpu)
 			 * Use capacity_of(), which is set irrespective of domains
 			 * in update_cpu_capacity().
 			 *
-			 * This avoids capacity/capacity_orig from being 0 and
+			 * This avoids capacity from being 0 and
 			 * causing divide-by-zero issues on boot.
-			 *
-			 * Runtime updates will correct capacity_orig.
 			 */
 			if (unlikely(!rq->sd)) {
-				capacity_orig += capacity_orig_of(cpu);
 				capacity += capacity_of(cpu);
 				continue;
 			}
 
 			sgc = rq->sd->groups->sgc;
-			capacity_orig += sgc->capacity_orig;
 			capacity += sgc->capacity;
 		}
 	} else  {
@@ -5899,42 +5893,15 @@  void update_group_capacity(struct sched_domain *sd, int cpu)
 
 		group = child->groups;
 		do {
-			capacity_orig += group->sgc->capacity_orig;
 			capacity += group->sgc->capacity;
 			group = group->next;
 		} while (group != child->groups);
 	}
 
-	sdg->sgc->capacity_orig = capacity_orig;
 	sdg->sgc->capacity = capacity;
 }
 
 /*
- * Try and fix up capacity for tiny siblings, this is needed when
- * things like SD_ASYM_PACKING need f_b_g to select another sibling
- * which on its own isn't powerful enough.
- *
- * See update_sd_pick_busiest() and check_asym_packing().
- */
-static inline int
-fix_small_capacity(struct sched_domain *sd, struct sched_group *group)
-{
-	/*
-	 * Only siblings can have significantly less than SCHED_CAPACITY_SCALE
-	 */
-	if (!(sd->flags & SD_SHARE_CPUCAPACITY))
-		return 0;
-
-	/*
-	 * If ~90% of the cpu_capacity is still there, we're good.
-	 */
-	if (group->sgc->capacity * 32 > group->sgc->capacity_orig * 29)
-		return 1;
-
-	return 0;
-}
-
-/*
  * Check whether the capacity of the rq has been noticeably reduced by side
  * activity. The imbalance_pct is used for the threshold.
  * Return true if the capacity is reduced
@@ -5980,38 +5947,37 @@  static inline int sg_imbalanced(struct sched_group *group)
 	return group->sgc->imbalance;
 }
 
-/*
- * Compute the group capacity factor.
- *
- * Avoid the issue where N*frac(smt_capacity) >= 1 creates 'phantom' cores by
- * first dividing out the smt factor and computing the actual number of cores
- * and limit unit capacity with that.
- */
-static inline int sg_capacity_factor(struct lb_env *env, struct sched_group *group)
+static inline int group_has_free_capacity(struct sg_lb_stats *sgs,
+			struct lb_env *env)
 {
-	unsigned int capacity_factor, smt, cpus;
-	unsigned int capacity, capacity_orig;
+	if ((sgs->group_capacity * 100) >
+			(sgs->group_usage * env->sd->imbalance_pct))
+		return true;
 
-	capacity = group->sgc->capacity;
-	capacity_orig = group->sgc->capacity_orig;
-	cpus = group->group_weight;
+	if (sgs->sum_nr_running < sgs->group_weight)
+		return true;
 
-	/* smt := ceil(cpus / capacity), assumes: 1 < smt_capacity < 2 */
-	smt = DIV_ROUND_UP(SCHED_CAPACITY_SCALE * cpus, capacity_orig);
-	capacity_factor = cpus / smt; /* cores */
+	return false;
+}
 
-	capacity_factor = min_t(unsigned,
-		capacity_factor, DIV_ROUND_CLOSEST(capacity, SCHED_CAPACITY_SCALE));
-	if (!capacity_factor)
-		capacity_factor = fix_small_capacity(env->sd, group);
+static inline int group_is_overloaded(struct sg_lb_stats *sgs,
+			struct lb_env *env)
+{
+	if (sgs->sum_nr_running <= sgs->group_weight)
+		return false;
 
-	return capacity_factor;
+	if ((sgs->group_capacity * 100) <
+			(sgs->group_usage * env->sd->imbalance_pct))
+		return true;
+
+	return false;
 }
 
 static enum group_type
-group_classify(struct sched_group *group, struct sg_lb_stats *sgs)
+group_classify(struct sched_group *group, struct sg_lb_stats *sgs,
+			struct lb_env *env)
 {
-	if (sgs->sum_nr_running > sgs->group_capacity_factor)
+	if (group_is_overloaded(sgs, env))
 		return group_overloaded;
 
 	if (sg_imbalanced(group))
@@ -6072,11 +6038,10 @@  static inline void update_sg_lb_stats(struct lb_env *env,
 		sgs->load_per_task = sgs->sum_weighted_load / sgs->sum_nr_running;
 
 	sgs->group_weight = group->group_weight;
-	sgs->group_capacity_factor = sg_capacity_factor(env, group);
-	sgs->group_type = group_classify(group, sgs);
 
-	if (sgs->group_capacity_factor > sgs->sum_nr_running)
-		sgs->group_has_free_capacity = 1;
+	sgs->group_type = group_classify(group, sgs, env);
+
+	sgs->group_out_of_capacity = group_is_overloaded(sgs, env);
 }
 
 /**
@@ -6198,17 +6163,21 @@  static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 
 		/*
 		 * In case the child domain prefers tasks go to siblings
-		 * first, lower the sg capacity factor to one so that we'll try
+		 * first, lower the sg capacity to one so that we'll try
 		 * and move all the excess tasks away. We lower the capacity
 		 * of a group only if the local group has the capacity to fit
-		 * these excess tasks, i.e. nr_running < group_capacity_factor. The
+		 * these excess tasks, i.e. group_capacity > 0. The
 		 * extra check prevents the case where you always pull from the
 		 * heaviest group when it is already under-utilized (possible
 		 * with a large weight task outweighs the tasks on the system).
 		 */
 		if (prefer_sibling && sds->local &&
-		    sds->local_stat.group_has_free_capacity)
-			sgs->group_capacity_factor = min(sgs->group_capacity_factor, 1U);
+		    group_has_free_capacity(&sds->local_stat, env)) {
+			if (sgs->sum_nr_running > 1)
+				sgs->group_out_of_capacity = 1;
+			sgs->group_capacity = min(sgs->group_capacity,
+						SCHED_CAPACITY_SCALE);
+		}
 
 		if (update_sd_pick_busiest(env, sds, sg, sgs)) {
 			sds->busiest = sg;
@@ -6387,11 +6356,12 @@  static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	 */
 	if (busiest->group_type == group_overloaded &&
 	    local->group_type   == group_overloaded) {
-		load_above_capacity =
-			(busiest->sum_nr_running - busiest->group_capacity_factor);
-
-		load_above_capacity *= (SCHED_LOAD_SCALE * SCHED_CAPACITY_SCALE);
-		load_above_capacity /= busiest->group_capacity;
+		load_above_capacity = busiest->sum_nr_running *
+					SCHED_LOAD_SCALE;
+		if (load_above_capacity > busiest->group_capacity)
+			load_above_capacity -= busiest->group_capacity;
+		else
+			load_above_capacity = ~0UL;
 	}
 
 	/*
@@ -6454,6 +6424,7 @@  static struct sched_group *find_busiest_group(struct lb_env *env)
 	local = &sds.local_stat;
 	busiest = &sds.busiest_stat;
 
+	/* ASYM feature bypasses nice load balance check */
 	if ((env->idle == CPU_IDLE || env->idle == CPU_NEWLY_IDLE) &&
 	    check_asym_packing(env, &sds))
 		return sds.busiest;
@@ -6474,8 +6445,9 @@  static struct sched_group *find_busiest_group(struct lb_env *env)
 		goto force_balance;
 
 	/* SD_BALANCE_NEWIDLE trumps SMP nice when underutilized */
-	if (env->idle == CPU_NEWLY_IDLE && local->group_has_free_capacity &&
-	    !busiest->group_has_free_capacity)
+	if (env->idle == CPU_NEWLY_IDLE &&
+			group_has_free_capacity(local, env) &&
+			busiest->group_out_of_capacity)
 		goto force_balance;
 
 	/*
@@ -6533,7 +6505,7 @@  static struct rq *find_busiest_queue(struct lb_env *env,
 	int i;
 
 	for_each_cpu_and(i, sched_group_cpus(group), env->cpus) {
-		unsigned long capacity, capacity_factor, wl;
+		unsigned long capacity, wl;
 		enum fbq_type rt;
 
 		rq = cpu_rq(i);
@@ -6562,9 +6534,6 @@  static struct rq *find_busiest_queue(struct lb_env *env,
 			continue;
 
 		capacity = capacity_of(i);
-		capacity_factor = DIV_ROUND_CLOSEST(capacity, SCHED_CAPACITY_SCALE);
-		if (!capacity_factor)
-			capacity_factor = fix_small_capacity(env->sd, group);
 
 		wl = weighted_cpuload(i);
 
@@ -6572,7 +6541,9 @@  static struct rq *find_busiest_queue(struct lb_env *env,
 		 * When comparing with imbalance, use weighted_cpuload()
 		 * which is not scaled with the cpu capacity.
 		 */
-		if (capacity_factor && rq->nr_running == 1 && wl > env->imbalance)
+
+		if (rq->nr_running == 1 && wl > env->imbalance &&
+		    !check_cpu_capacity(rq, env->sd))
 			continue;
 
 		/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3ccb136..8d224c5 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -750,7 +750,7 @@  struct sched_group_capacity {
 	 * CPU capacity of this group, SCHED_LOAD_SCALE being max capacity
 	 * for a single CPU.
 	 */
-	unsigned int capacity, capacity_orig;
+	unsigned int capacity;
 	unsigned long next_update;
 	int imbalance; /* XXX unrelated to capacity but shared group state */
 	/*