
[v3,01/12] sched: fix imbalance flag reset

Message ID 1404144343-18720-2-git-send-email-vincent.guittot@linaro.org
State New

Commit Message

Vincent Guittot June 30, 2014, 4:05 p.m. UTC
The imbalance flag can stay set even though there is no longer any imbalance.

Assume we have 3 tasks running on a dual-core/dual-cluster system. Some idle
load balancing will be triggered during the tick. Unfortunately, the tick is
also used to queue background work, so we can reach a situation where a
short-lived work item has been queued on a CPU that already runs a task. Load
balancing detects this imbalance (2 tasks on 1 CPU and an idle CPU) and tries
to pull the waiting task onto the idle CPU. The waiting task is a worker thread
that is pinned to its CPU, so an imbalance due to a pinned task is detected and
the imbalance flag is set.
From then on, the flag can never be cleared because there is at most 1 task on
each CPU, yet the stale imbalance flag keeps triggering useless active load
balancing between the idle CPU and the busy CPU.

We need to reset the imbalance flag as soon as we have reached a balanced
state.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

Comments

Preeti U Murthy July 8, 2014, 3:13 a.m. UTC | #1
On 06/30/2014 09:35 PM, Vincent Guittot wrote:
> The imbalance flag can stay set whereas there is no imbalance.
> 
> Let assume that we have 3 tasks that run on a dual cores /dual cluster system.
> We will have some idle load balance which are triggered during tick.
> Unfortunately, the tick is also used to queue background work so we can reach
> the situation where short work has been queued on a CPU which already runs a
> task. The load balance will detect this imbalance (2 tasks on 1 CPU and an idle
> CPU) and will try to pull the waiting task on the idle CPU. The waiting task is
> a worker thread that is pinned on a CPU so an imbalance due to pinned task is
> detected and the imbalance flag is set.
> Then, we will not be able to clear the flag because we have at most 1 task on
> each CPU but the imbalance flag will trig to useless active load balance
> between the idle CPU and the busy CPU.
> 
> We need to reset of the imbalance flag as soon as we have reached a balanced
> state.
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>  kernel/sched/fair.c | 14 +++++++++++---
>  1 file changed, 11 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d3c73122..0c48dff 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6615,10 +6615,8 @@ more_balance:
>  		if (sd_parent) {
>  			int *group_imbalance = &sd_parent->groups->sgc->imbalance;
> 
> -			if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0) {
> +			if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0)
>  				*group_imbalance = 1;
> -			} else if (*group_imbalance)
> -				*group_imbalance = 0;
>  		}
> 
>  		/* All tasks on this runqueue were pinned by CPU affinity */
> @@ -6703,6 +6701,16 @@ more_balance:
>  	goto out;
> 
>  out_balanced:
> +	/*
> +	 * We reach balance although we may have faced some affinity
> +	 * constraints. Clear the imbalance flag if it was set.
> +	 */
> +	if (sd_parent) {
> +		int *group_imbalance = &sd_parent->groups->sgc->imbalance;
> +		if (*group_imbalance)
> +			*group_imbalance = 0;
> +	}
> +
>  	schedstat_inc(sd, lb_balanced[idle]);
> 
>  	sd->nr_balance_failed = 0;
> 
I am not convinced that we can clear the imbalance flag here. Let's take
a simple example. Assume that at a particular level of sched_domain there
are two sched_groups with one cpu each. There are 2 tasks on the source
cpu, one of which is running (t1), while the other thread (t2) does not
have the dst_cpu in its tsk_allowed_mask. Now no task can be migrated to
the dst_cpu due to affinity constraints. Note that t2 is *not pinned, it
just cannot run on the dst_cpu*. In this scenario we also reach the
out_balanced tag, right? If we set the group_imbalance flag to 0, we are
ruling out the possibility of migrating t2 to any other cpu in a higher
level sched_domain by saying that all is well, there is no imbalance.
This is wrong, isn't it?

My point is that by clearing the imbalance flag in the out_balanced
case, you might be overlooking the fact that the tsk_cpus_allowed masks
of the tasks on the src_cpu may keep them from running on the dst_cpu at
*this* level of sched_domain, while they can potentially run on a cpu at
a higher level of sched_domain. By clearing the flag, we are not
encouraging load balancing at that level for t2.

Am I missing something?

Regards
Preeti U Murthy

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Vincent Guittot July 8, 2014, 10:12 a.m. UTC | #2
On 8 July 2014 05:13, Preeti U Murthy <preeti@linux.vnet.ibm.com> wrote:
> On 06/30/2014 09:35 PM, Vincent Guittot wrote:
>> The imbalance flag can stay set whereas there is no imbalance.
>>
>> Let assume that we have 3 tasks that run on a dual cores /dual cluster system.
>> We will have some idle load balance which are triggered during tick.
>> Unfortunately, the tick is also used to queue background work so we can reach
>> the situation where short work has been queued on a CPU which already runs a
>> task. The load balance will detect this imbalance (2 tasks on 1 CPU and an idle
>> CPU) and will try to pull the waiting task on the idle CPU. The waiting task is
>> a worker thread that is pinned on a CPU so an imbalance due to pinned task is
>> detected and the imbalance flag is set.
>> Then, we will not be able to clear the flag because we have at most 1 task on
>> each CPU but the imbalance flag will trig to useless active load balance
>> between the idle CPU and the busy CPU.
>>
>> We need to reset of the imbalance flag as soon as we have reached a balanced
>> state.
>>
>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>> ---
>>  kernel/sched/fair.c | 14 +++++++++++---
>>  1 file changed, 11 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index d3c73122..0c48dff 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6615,10 +6615,8 @@ more_balance:
>>               if (sd_parent) {
>>                       int *group_imbalance = &sd_parent->groups->sgc->imbalance;
>>
>> -                     if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0) {
>> +                     if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0)
>>                               *group_imbalance = 1;
>> -                     } else if (*group_imbalance)
>> -                             *group_imbalance = 0;
>>               }
>>
>>               /* All tasks on this runqueue were pinned by CPU affinity */
>> @@ -6703,6 +6701,16 @@ more_balance:
>>       goto out;
>>
>>  out_balanced:
>> +     /*
>> +      * We reach balance although we may have faced some affinity
>> +      * constraints. Clear the imbalance flag if it was set.
>> +      */
>> +     if (sd_parent) {
>> +             int *group_imbalance = &sd_parent->groups->sgc->imbalance;
>> +             if (*group_imbalance)
>> +                     *group_imbalance = 0;
>> +     }
>> +
>>       schedstat_inc(sd, lb_balanced[idle]);
>>
>>       sd->nr_balance_failed = 0;
>>
> I am not convinced that we can clear the imbalance flag here. Lets take
> a simple example. Assume at a particular level of sched_domain, there
> are two sched_groups with one cpu each. There are 2 tasks on the source
> cpu, one of which is running(t1) and the other thread(t2) does not have
> the dst_cpu in the tsk_allowed_mask. Now no task can be migrated to the
> dst_cpu due to affinity constraints. Note that t2 is *not pinned, it
> just cannot run on the dst_cpu*. In this scenario also we reach the
> out_balanced tag right? If we set the group_imbalance flag to 0, we are

No, we will not. If we have 2 tasks on 1 CPU in one sched_group and the
other group has an idle CPU, we are not balanced, so we will not go to
out_balanced, and group_imbalance will stay set until we reach a
balanced state (by migrating t1).

> ruling out the possibility of migrating t2 to any other cpu in a higher
> level sched_domain by saying that all is well, there is no imbalance.
> This is wrong, isn't it?
>
> My point is that by clearing the imbalance flag in the out_balanced
> case, you might be overlooking the fact that the tsk_cpus_allowed mask
> of the tasks on the src_cpu may not be able to run on the dst_cpu in
> *this* level of sched_domain, but can potentially run on a cpu at any
> higher level of sched_domain. By clearing the flag, we are not

The imbalance flag is per sched_domain level, so we will not clear the
group_imbalance flag of other levels. If the imbalance is also detected
at a higher level, t2 will be migrated there.

Regards,
Vincent

> encouraging load balance at that level for t2.
>
> Am I missing something?
>
> Regards
> Preeti U Murthy
>
Rik van Riel July 9, 2014, 3:05 a.m. UTC | #3
On 06/30/2014 12:05 PM, Vincent Guittot wrote:
> The imbalance flag can stay set whereas there is no imbalance.
> 
> Let assume that we have 3 tasks that run on a dual cores /dual
> cluster system. We will have some idle load balance which are
> triggered during tick. Unfortunately, the tick is also used to
> queue background work so we can reach the situation where short
> work has been queued on a CPU which already runs a task. The load
> balance will detect this imbalance (2 tasks on 1 CPU and an idle 
> CPU) and will try to pull the waiting task on the idle CPU. The
> waiting task is a worker thread that is pinned on a CPU so an
> imbalance due to pinned task is detected and the imbalance flag is
> set. Then, we will not be able to clear the flag because we have at
> most 1 task on each CPU but the imbalance flag will trig to useless
> active load balance between the idle CPU and the busy CPU.
> 
> We need to reset of the imbalance flag as soon as we have reached a
> balanced state.
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

Acked-by: Rik van Riel <riel@redhat.com>


--
All rights reversed
Rik van Riel July 9, 2014, 3:36 a.m. UTC | #4

On 07/08/2014 11:05 PM, Rik van Riel wrote:
> On 06/30/2014 12:05 PM, Vincent Guittot wrote:
>> The imbalance flag can stay set whereas there is no imbalance.
> 
>> Let assume that we have 3 tasks that run on a dual cores /dual 
>> cluster system. We will have some idle load balance which are 
>> triggered during tick. Unfortunately, the tick is also used to 
>> queue background work so we can reach the situation where short 
>> work has been queued on a CPU which already runs a task. The
>> load balance will detect this imbalance (2 tasks on 1 CPU and an
>> idle CPU) and will try to pull the waiting task on the idle CPU.
>> The waiting task is a worker thread that is pinned on a CPU so
>> an imbalance due to pinned task is detected and the imbalance
>> flag is set. Then, we will not be able to clear the flag because
>> we have at most 1 task on each CPU but the imbalance flag will
>> trig to useless active load balance between the idle CPU and the
>> busy CPU.
> 
>> We need to reset of the imbalance flag as soon as we have reached
>> a balanced state.
> 
>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> 
> Acked-by: Rik van Riel <riel@redhat.com>

Never mind that, Preeti explained the failure mode in more detail
on irc, and I am no longer convinced this change is a good idea.

--
All rights reversed
Preeti U Murthy July 9, 2014, 3:54 a.m. UTC | #5
Hi Vincent,

On 07/08/2014 03:42 PM, Vincent Guittot wrote:
> On 8 July 2014 05:13, Preeti U Murthy <preeti@linux.vnet.ibm.com> wrote:
>> On 06/30/2014 09:35 PM, Vincent Guittot wrote:
>>> The imbalance flag can stay set whereas there is no imbalance.
>>>
>>> Let assume that we have 3 tasks that run on a dual cores /dual cluster system.
>>> We will have some idle load balance which are triggered during tick.
>>> Unfortunately, the tick is also used to queue background work so we can reach
>>> the situation where short work has been queued on a CPU which already runs a
>>> task. The load balance will detect this imbalance (2 tasks on 1 CPU and an idle
>>> CPU) and will try to pull the waiting task on the idle CPU. The waiting task is
>>> a worker thread that is pinned on a CPU so an imbalance due to pinned task is
>>> detected and the imbalance flag is set.
>>> Then, we will not be able to clear the flag because we have at most 1 task on
>>> each CPU but the imbalance flag will trig to useless active load balance
>>> between the idle CPU and the busy CPU.
>>>
>>> We need to reset of the imbalance flag as soon as we have reached a balanced
>>> state.
>>>
>>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>>> ---
>>>  kernel/sched/fair.c | 14 +++++++++++---
>>>  1 file changed, 11 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index d3c73122..0c48dff 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -6615,10 +6615,8 @@ more_balance:
>>>               if (sd_parent) {
>>>                       int *group_imbalance = &sd_parent->groups->sgc->imbalance;
>>>
>>> -                     if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0) {
>>> +                     if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0)
>>>                               *group_imbalance = 1;
>>> -                     } else if (*group_imbalance)
>>> -                             *group_imbalance = 0;
>>>               }
>>>
>>>               /* All tasks on this runqueue were pinned by CPU affinity */
>>> @@ -6703,6 +6701,16 @@ more_balance:
>>>       goto out;
>>>
>>>  out_balanced:
>>> +     /*
>>> +      * We reach balance although we may have faced some affinity
>>> +      * constraints. Clear the imbalance flag if it was set.
>>> +      */
>>> +     if (sd_parent) {
>>> +             int *group_imbalance = &sd_parent->groups->sgc->imbalance;
>>> +             if (*group_imbalance)
>>> +                     *group_imbalance = 0;
>>> +     }
>>> +
>>>       schedstat_inc(sd, lb_balanced[idle]);
>>>
>>>       sd->nr_balance_failed = 0;
>>>
>> I am not convinced that we can clear the imbalance flag here. Lets take
>> a simple example. Assume at a particular level of sched_domain, there
>> are two sched_groups with one cpu each. There are 2 tasks on the source
>> cpu, one of which is running(t1) and the other thread(t2) does not have
>> the dst_cpu in the tsk_allowed_mask. Now no task can be migrated to the
>> dst_cpu due to affinity constraints. Note that t2 is *not pinned, it
>> just cannot run on the dst_cpu*. In this scenario also we reach the
>> out_balanced tag right? If we set the group_imbalance flag to 0, we are
> 
> No we will not. If we have 2 tasks on 1 CPU in one sched_group and the
> other group with an idle CPU,  we are not balanced so we will not go
> to out_balanced and the group_imbalance will staty set until we reach
> a balanced state (by migrating t1).

In the example that I mention above, t1 and t2 are on the rq of cpu0;
while t1 is running on cpu0, t2 is on the rq but does not have cpu1 in
its cpus allowed mask. So during load balance, cpu1 tries to pull t2,
cannot do so, and hence the LBF_ALL_PINNED flag is set and it jumps to
out_balanced. Note that there are only two sched_groups at this level of
the sched_domain: one with cpu0 and the other with cpu1. In this
scenario we do not try to do active load balancing, at least that's
what the code does now if the LBF_ALL_PINNED flag is set.

> 
>> ruling out the possibility of migrating t2 to any other cpu in a higher
>> level sched_domain by saying that all is well, there is no imbalance.
>> This is wrong, isn't it?
>>
>> My point is that by clearing the imbalance flag in the out_balanced
>> case, you might be overlooking the fact that the tsk_cpus_allowed mask
>> of the tasks on the src_cpu may not be able to run on the dst_cpu in
>> *this* level of sched_domain, but can potentially run on a cpu at any
>> higher level of sched_domain. By clearing the flag, we are not
> 
> The imbalance flag is per sched_domain level so we will not clear
> group_imbalance flag of other levels if the imbalance is also detected
> at a higher level it will migrate t2

Continuing with the above explanation: when the LBF_ALL_PINNED flag is
set and we jump to out_balanced, we clear the imbalance flag for the
sched_group comprising cpu0 and cpu1, although there is actually an
imbalance. t2 could still be migrated to, say, cpu2/cpu3 (t2 has them in
its cpus allowed mask) in another sched_group when load balancing is
done at the next sched_domain level.

Elaborating on this: when cpu2 in another socket, let's say, begins load
balancing and update_sd_pick_busiest() is called, the group with cpu0
and cpu1 may not be picked as a potential imbalanced group. Had we not
cleared the imbalance flag for this group, we could have balanced t2
out to cpu2/3.

Is the scenario I am describing clear?

Regards
Preeti U Murthy
> 
> Regards,
> Vincent
> 
>> encouraging load balance at that level for t2.
>>
>> Am I missing something?
>>
>> Regards
>> Preeti U Murthy
>>
> 

Vincent Guittot July 9, 2014, 8:27 a.m. UTC | #6
On 9 July 2014 05:54, Preeti U Murthy <preeti@linux.vnet.ibm.com> wrote:
> Hi Vincent,
>
> On 07/08/2014 03:42 PM, Vincent Guittot wrote:

[ snip]

>>>>  out_balanced:
>>>> +     /*
>>>> +      * We reach balance although we may have faced some affinity
>>>> +      * constraints. Clear the imbalance flag if it was set.
>>>> +      */
>>>> +     if (sd_parent) {
>>>> +             int *group_imbalance = &sd_parent->groups->sgc->imbalance;
>>>> +             if (*group_imbalance)
>>>> +                     *group_imbalance = 0;
>>>> +     }
>>>> +
>>>>       schedstat_inc(sd, lb_balanced[idle]);
>>>>
>>>>       sd->nr_balance_failed = 0;
>>>>
>>> I am not convinced that we can clear the imbalance flag here. Lets take
>>> a simple example. Assume at a particular level of sched_domain, there
>>> are two sched_groups with one cpu each. There are 2 tasks on the source
>>> cpu, one of which is running(t1) and the other thread(t2) does not have
>>> the dst_cpu in the tsk_allowed_mask. Now no task can be migrated to the
>>> dst_cpu due to affinity constraints. Note that t2 is *not pinned, it
>>> just cannot run on the dst_cpu*. In this scenario also we reach the
>>> out_balanced tag right? If we set the group_imbalance flag to 0, we are
>>
>> No we will not. If we have 2 tasks on 1 CPU in one sched_group and the
>> other group with an idle CPU,  we are not balanced so we will not go
>> to out_balanced and the group_imbalance will staty set until we reach
>> a balanced state (by migrating t1).
>
> In the example that I mention above, t1 and t2 are on the rq of cpu0;
> while t1 is running on cpu0, t2 is on the rq but does not have cpu1 in
> its cpus allowed mask. So during load balance, cpu1 tries to pull t2,
> cannot do so, and hence LBF_ALL_PINNED flag is set and it jumps to

That's where I disagree: my understanding of can_migrate_task() is that
LBF_ALL_PINNED will be cleared before it returns false when checking t1,
because we test all the tasks, even the running one.

> out_balanced. Note that there are only two sched groups at this level of
> sched domain.one with cpu0 and the other with cpu1. In this scenario we
> do not try to do active load balancing, atleast thats what the code does
> now if LBF_ALL_PINNED flag is set.
>
>>
>>> ruling out the possibility of migrating t2 to any other cpu in a higher
>>> level sched_domain by saying that all is well, there is no imbalance.
>>> This is wrong, isn't it?
>>>
>>> My point is that by clearing the imbalance flag in the out_balanced
>>> case, you might be overlooking the fact that the tsk_cpus_allowed mask
>>> of the tasks on the src_cpu may not be able to run on the dst_cpu in
>>> *this* level of sched_domain, but can potentially run on a cpu at any
>>> higher level of sched_domain. By clearing the flag, we are not
>>
>> The imbalance flag is per sched_domain level so we will not clear
>> group_imbalance flag of other levels if the imbalance is also detected
>> at a higher level it will migrate t2
>
> Continuing with the above explanation; when LBF_ALL_PINNED flag is
> set,and we jump to out_balanced, we clear the imbalance flag for the
> sched_group comprising of cpu0 and cpu1,although there is actually an
> imbalance. t2 could still be migrated to say cpu2/cpu3 (t2 has them in
> its cpus allowed mask) in another sched group when load balancing is
> done at the next sched domain level.

The imbalance flag is per sched_domain level, so it will not have any
side effect on the next level.

Regards,
Vincent

>
> Elaborating on this, when cpu2 in another socket,lets say, begins load
> balancing and update_sd_pick_busiest() is called, the group with cpu0
> and cpu1 may not be picked as a potential imbalanced group. Had we not
> cleared the imbalance flag for this group, we could have balanced out t2
> to cpu2/3.
>
> Is the scenario I am describing clear?
>
> Regards
> Preeti U Murthy
>>
>> Regards,
>> Vincent
>>
>>> encouraging load balance at that level for t2.
>>>
>>> Am I missing something?
>>>
>>> Regards
>>> Preeti U Murthy
>>>
>>
>
Peter Zijlstra July 9, 2014, 10:14 a.m. UTC | #7
On Mon, Jun 30, 2014 at 06:05:32PM +0200, Vincent Guittot wrote:
> +++ b/kernel/sched/fair.c
> @@ -6615,10 +6615,8 @@ more_balance:

In order to avoid labels being used as functions; you can use:

.quiltrc:
QUILT_DIFF_OPTS="-F ^[[:alpha:]\$_].*[^:]\$"

.gitconfig:
[diff "default"]
        xfuncname = "^[[:alpha:]$_].*[^:]$"
Vincent Guittot July 9, 2014, 10:30 a.m. UTC | #8
On 9 July 2014 12:14, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, Jun 30, 2014 at 06:05:32PM +0200, Vincent Guittot wrote:
>> +++ b/kernel/sched/fair.c
>> @@ -6615,10 +6615,8 @@ more_balance:
>
> In order to avoid labels being used as functions; you can use:
>
> .quiltrc:
> QUILT_DIFF_OPTS="-F ^[[:alpha:]\$_].*[^:]\$"
>
> .gitconfig:
> [diff "default"]
>         xfuncname = "^[[:alpha:]$_].*[^:]$"
>

Thanks, I'm going to add it to my .gitconfig.

>
Peter Zijlstra July 9, 2014, 10:43 a.m. UTC | #9
On Wed, Jul 09, 2014 at 09:24:54AM +0530, Preeti U Murthy wrote:
> In the example that I mention above, t1 and t2 are on the rq of cpu0;
> while t1 is running on cpu0, t2 is on the rq but does not have cpu1 in
> its cpus allowed mask. So during load balance, cpu1 tries to pull t2,
> cannot do so, and hence LBF_ALL_PINNED flag is set and it jumps to
> out_balanced. Note that there are only two sched groups at this level of
> sched domain.one with cpu0 and the other with cpu1. In this scenario we
> do not try to do active load balancing, atleast thats what the code does
> now if LBF_ALL_PINNED flag is set.

I think Vince is right in saying that in this scenario ALL_PINNED won't
be set. move_tasks() will iterate cfs_rq::cfs_tasks, and that list will
also include the currently running task.

And can_migrate_task() only checks for current after the pinning bits.

> Continuing with the above explanation; when LBF_ALL_PINNED flag is
> set,and we jump to out_balanced, we clear the imbalance flag for the
> sched_group comprising of cpu0 and cpu1,although there is actually an
> imbalance. t2 could still be migrated to say cpu2/cpu3 (t2 has them in
> its cpus allowed mask) in another sched group when load balancing is
> done at the next sched domain level.

And this is where Vince is wrong; note how
update_sg_lb_stats()/sg_imbalance() uses group->sgc->imbalance, but
load_balance() sets: sd_parent->groups->sgc->imbalance, so explicitly
one level up.

So what we can do I suppose is clear 'group->sgc->imbalance' at
out_balanced.

In any case, the entirety of this group imbalance crap is just that:
crap. It's a terribly difficult situation and the current bits more or
less fudge around some of the common cases. Also see the comment near
sg_imbalanced(). It's not a solid and 'correct' anything; it's a bunch
of hacks trying to deal with hard cases.

A 'good' solution would be prohibitively expensive, I fear.
Preeti U Murthy July 9, 2014, 11:41 a.m. UTC | #10
On 07/09/2014 04:13 PM, Peter Zijlstra wrote:
> On Wed, Jul 09, 2014 at 09:24:54AM +0530, Preeti U Murthy wrote:
>> In the example that I mention above, t1 and t2 are on the rq of cpu0;
>> while t1 is running on cpu0, t2 is on the rq but does not have cpu1 in
>> its cpus allowed mask. So during load balance, cpu1 tries to pull t2,
>> cannot do so, and hence LBF_ALL_PINNED flag is set and it jumps to
>> out_balanced. Note that there are only two sched groups at this level of
>> sched domain.one with cpu0 and the other with cpu1. In this scenario we
>> do not try to do active load balancing, atleast thats what the code does
>> now if LBF_ALL_PINNED flag is set.
> 
> I think Vince is right in saying that in this scenario ALL_PINNED won't
> be set. move_tasks() will iterate cfs_rq::cfs_tasks, that list will also
> include the current running task.

Hmm.. really? Because while dequeueing a task from the rq so as to
schedule it on a cpu, we delete its entry from the list of cfs_tasks on
the rq.

list_del_init(&se->group_node) in account_entity_dequeue() does that.

> 
> And can_migrate_task() only checks for current after the pinning bits.
> 
>> Continuing with the above explanation; when LBF_ALL_PINNED flag is
>> set,and we jump to out_balanced, we clear the imbalance flag for the
>> sched_group comprising of cpu0 and cpu1,although there is actually an
>> imbalance. t2 could still be migrated to say cpu2/cpu3 (t2 has them in
>> its cpus allowed mask) in another sched group when load balancing is
>> done at the next sched domain level.
> 
> And this is where Vince is wrong; note how
> update_sg_lb_stats()/sg_imbalance() uses group->sgc->imbalance, but
> load_balance() sets: sd_parent->groups->sgc->imbalance, so explicitly
> one level up.

One level up? The group->sgc->imbalance flag is checked during
update_sg_lb_stats(). This flag is *set during load balancing at a
lower-level sched_domain*. IOW, it is set when the 'group' formed the
sched_domain.

> 
> So what we can do I suppose is clear 'group->sgc->imbalance' at
> out_balanced.

You mean 'set'? If we clear it, we will have no clue about imbalances
at lower-level sched_domains due to pinning, specifically in the
LBF_ALL_PINNED case. This might prevent us from balancing these tasks
out to other groups at a higher-level domain. update_sd_pick_busiest()
specifically relies on this flag to choose the busiest group.

> 
> In any case, the entirely of this group imbalance crap is just that,
> crap. Its a terribly difficult situation and the current bits more or
> less fudge around some of the common cases. Also see the comment near
> sg_imbalanced(). Its not a solid and 'correct' anything. Its a bunch of
> hacks trying to deal with hard cases.
> 
> A 'good' solution would be prohibitively expensive I fear.

So the problem that Vincent is trying to bring to the fore with this
patchset is that when the busy group has cpus with only 1 running task
but the imbalance flag set due to affinity constraints, we
unnecessarily try to do active balancing on this group. Active load
balancing will not succeed when there is only one task. To solve this
issue, would a simple check on (busiest->nr_running > 1), in addition
to !(ld_moved), before doing active load balancing not work?


Thanks

Regards
Preeti U Murthy




> 

Peter Zijlstra July 9, 2014, 2:44 p.m. UTC | #11
On Wed, Jul 09, 2014 at 05:11:20PM +0530, Preeti U Murthy wrote:
> On 07/09/2014 04:13 PM, Peter Zijlstra wrote:
> > On Wed, Jul 09, 2014 at 09:24:54AM +0530, Preeti U Murthy wrote:
> >> In the example that I mention above, t1 and t2 are on the rq of cpu0;
> >> while t1 is running on cpu0, t2 is on the rq but does not have cpu1 in
> >> its cpus allowed mask. So during load balance, cpu1 tries to pull t2,
> >> cannot do so, and hence LBF_ALL_PINNED flag is set and it jumps to
> >> out_balanced. Note that there are only two sched groups at this level of
> >> sched domain.one with cpu0 and the other with cpu1. In this scenario we
> >> do not try to do active load balancing, atleast thats what the code does
> >> now if LBF_ALL_PINNED flag is set.
> > 
> > I think Vince is right in saying that in this scenario ALL_PINNED won't
> > be set. move_tasks() will iterate cfs_rq::cfs_tasks, that list will also
> > include the current running task.
> 
> Hmm.. really? Because while dequeueing a task from the rq so as to
> schedule it on a cpu, we delete its entry from the list of cfs_tasks on
> the rq.
> 
> list_del_init(&se->group_node) in account_entity_dequeue() does that.

But set_next_entity() doesn't call account_entity_dequeue(), only
__dequeue_entity() to take it out of the rb-tree.

> > And can_migrate_task() only checks for current after the pinning bits.
> > 
> >> Continuing with the above explanation; when LBF_ALL_PINNED flag is
> >> set, and we jump to out_balanced, we clear the imbalance flag for the
> >> sched_group comprising cpu0 and cpu1, although there is actually an
> >> imbalance. t2 could still be migrated to say cpu2/cpu3 (t2 has them in
> >> its cpus allowed mask) in another sched group when load balancing is
> >> done at the next sched domain level.
> > 
> > And this is where Vince is wrong; note how
> > update_sg_lb_stats()/sg_imbalance() uses group->sgc->imbalance, but
> > load_balance() sets: sd_parent->groups->sgc->imbalance, so explicitly
> > one level up.
> 
> One level up? The group->sgc->imbalance flag is checked during
> update_sg_lb_stats(). This flag is *set during load balancing at a
> lower-level sched domain*. IOW, it was set when the 'group' itself
> formed the sched domain.

sd_parent is one level up.

> > 
> > So what we can do I suppose is clear 'group->sgc->imbalance' at
> > out_balanced.
> 
> You mean 'set'? If we clear it we will have no clue about imbalances at
> lower level sched domains due to pinning. Specifically in LBF_ALL_PINNED
> case. This might prevent us from balancing out these tasks to other
> groups at a higher level domain. update_sd_pick_busiest() specifically
> relies on this flag to choose the busiest group.

No, clear, in load_balance. So set one level up, clear the current
level.
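
The set-one-level-up / clear-at-the-current-level plumbing described here can be modeled outside the kernel as follows. The struct layout and helper names are illustrative simplifications, not the kernel's actual types:

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the flag plumbing: load_balance() at one level sets the
 * imbalance flag one level up (sd->parent->groups->sgc->imbalance),
 * while update_sg_lb_stats() at the parent level reads it.
 * Illustrative only. */
struct sched_group_capacity { int imbalance; };
struct sched_group { struct sched_group_capacity *sgc; };
struct sched_domain {
	struct sched_domain *parent;
	struct sched_group *groups;
};

/* Pinned tasks blocked a pull: flag the group one level up. */
static void set_parent_group_imbalance(struct sched_domain *sd,
				       int some_pinned, long imbalance)
{
	if (sd->parent && some_pinned && imbalance > 0)
		sd->parent->groups->sgc->imbalance = 1;
}

/* Reached a balanced state: clear the same flag so it cannot keep
 * triggering useless active load balancing. */
static void clear_parent_group_imbalance(struct sched_domain *sd)
{
	if (sd->parent && sd->parent->groups->sgc->imbalance)
		sd->parent->groups->sgc->imbalance = 0;
}
```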
Vincent Guittot July 10, 2014, 9:14 a.m. UTC | #12
On 9 July 2014 12:43, Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, Jul 09, 2014 at 09:24:54AM +0530, Preeti U Murthy wrote:

[snip]

>
>> Continuing with the above explanation; when LBF_ALL_PINNED flag is
>> set, and we jump to out_balanced, we clear the imbalance flag for the
>> sched_group comprising cpu0 and cpu1, although there is actually an
>> imbalance. t2 could still be migrated to say cpu2/cpu3 (t2 has them in
>> its cpus allowed mask) in another sched group when load balancing is
>> done at the next sched domain level.
>
> And this is where Vince is wrong; note how
> update_sg_lb_stats()/sg_imbalance() uses group->sgc->imbalance, but
> load_balance() sets: sd_parent->groups->sgc->imbalance, so explicitly
> one level up.
>

I forgot about this behavior when studying Preeti's use case.

> So what we can do I suppose is clear 'group->sgc->imbalance' at
> out_balanced.
>
> In any case, the entirety of this group imbalance crap is just that,
> crap. It's a terribly difficult situation and the current bits more or
> less fudge around some of the common cases. Also see the comment near
> sg_imbalanced(). It's not a solid and 'correct' anything. It's a bunch of
> hacks trying to deal with hard cases.
>
> A 'good' solution would be prohibitively expensive I fear.

I have tried to summarize several use cases that have been discussed
for this patch.

The 1st use case is the one that I described in the commit message of
this patch: if a sporadic imbalance sets the imbalance flag, we never
clear it afterwards, and it generates spurious and useless active load
balancing.

Then Preeti came up with the following use case:
we have a sched_domain made of CPU0 and CPU1 in 2 different sched_groups.
2 tasks A and B are on CPU0, B can't run on CPU1, and A is the running task.
When CPU1's sched_group is doing load balance, the imbalance should be
set. That still happens with this patchset because the LBF_ALL_PINNED
flag will be cleared thanks to task A.

Preeti also explained to me the following use case on IRC:

If both tasks A and B can't run on CPU1, LBF_ALL_PINNED will stay
set. As we can't do anything, we conclude that we are balanced, go to
out_balanced and clear the imbalance flag. But we should not consider
that a balanced state but an all-tasks-pinned state instead, and we
should leave the imbalance flag set.
If we now have 2 additional CPUs which are in the cpumask of task A
and/or B at the parent sched_domain level, we should migrate one task
out of this group, but this will not happen (with this patch) because
the sched_group made of CPU0 and CPU1 is not overloaded (2 tasks for 2
CPUs) and the imbalance flag has been cleared as described previously.
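
The all-pinned scenario can be modeled with a small standalone sketch. Masks here are plain bitmaps and the helper is illustrative, not the kernel's can_migrate_task():

```c
#include <assert.h>

#define LBF_ALL_PINNED 0x01

/* Toy model: dst_cpu scans the tasks on the busiest runqueue and
 * clears LBF_ALL_PINNED as soon as one task's allowed-CPU mask
 * contains dst_cpu.  If every task is pinned elsewhere, the flag
 * stays set and the balance ends in out_balanced. */
static int scan_tasks(const unsigned int *allowed_masks, int nr_tasks,
		      int dst_cpu)
{
	int flags = LBF_ALL_PINNED;
	int i;

	for (i = 0; i < nr_tasks; i++)
		if (allowed_masks[i] & (1u << dst_cpu))
			flags &= ~LBF_ALL_PINNED; /* this task could move */
	return flags;
}
```

When both A and B allow only CPU0, the flag survives the scan, which is the case where treating out_balanced as a truly balanced state loses information.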

I'm going to send a new revision of the patchset with the correction

Vincent
Preeti U Murthy July 10, 2014, 11:04 a.m. UTC | #13
Hi Peter, Vincent,

On 07/10/2014 02:44 PM, Vincent Guittot wrote:
> On 9 July 2014 12:43, Peter Zijlstra <peterz@infradead.org> wrote:
>> On Wed, Jul 09, 2014 at 09:24:54AM +0530, Preeti U Murthy wrote:
> 
> [snip]
> 
>>
>>> Continuing with the above explanation; when LBF_ALL_PINNED flag is
>>> set, and we jump to out_balanced, we clear the imbalance flag for the
>>> sched_group comprising cpu0 and cpu1, although there is actually an
>>> imbalance. t2 could still be migrated to say cpu2/cpu3 (t2 has them in
>>> its cpus allowed mask) in another sched group when load balancing is
>>> done at the next sched domain level.
>>
>> And this is where Vince is wrong; note how
>> update_sg_lb_stats()/sg_imbalance() uses group->sgc->imbalance, but
>> load_balance() sets: sd_parent->groups->sgc->imbalance, so explicitly
>> one level up.
>>
> 
> I forgot about this behavior when studying Preeti's use case.
> 
>> So what we can do I suppose is clear 'group->sgc->imbalance' at
>> out_balanced.
>>
>> In any case, the entirety of this group imbalance crap is just that,
>> crap. It's a terribly difficult situation and the current bits more or
>> less fudge around some of the common cases. Also see the comment near
>> sg_imbalanced(). It's not a solid and 'correct' anything. It's a bunch of
>> hacks trying to deal with hard cases.
>>
>> A 'good' solution would be prohibitively expensive I fear.
> 
> I have tried to summarize several use cases that have been discussed
> for this patch.
> 
> The 1st use case is the one that I described in the commit message of
> this patch: if a sporadic imbalance sets the imbalance flag, we never
> clear it afterwards, and it generates spurious and useless active load
> balancing.
> 
> Then Preeti came up with the following use case:
> we have a sched_domain made of CPU0 and CPU1 in 2 different sched_groups.
> 2 tasks A and B are on CPU0, B can't run on CPU1, and A is the running task.
> When CPU1's sched_group is doing load balance, the imbalance should be
> set. That still happens with this patchset because the LBF_ALL_PINNED
> flag will be cleared thanks to task A.
> 
> Preeti also explained to me the following use case on IRC:
> 
> If both tasks A and B can't run on CPU1, LBF_ALL_PINNED will stay
> set. As we can't do anything, we conclude that we are balanced, go to
> out_balanced and clear the imbalance flag. But we should not consider
> that a balanced state but an all-tasks-pinned state instead, and we
> should leave the imbalance flag set.
> If we now have 2 additional CPUs which are in the cpumask of task A
> and/or B at the parent sched_domain level, we should migrate one task
> out of this group, but this will not happen (with this patch) because
> the sched_group made of CPU0 and CPU1 is not overloaded (2 tasks for 2
> CPUs) and the imbalance flag has been cleared as described previously.

Peter,

The above paragraph describes my concern, raised in the conversation
below, with regard to clearing the imbalance flag at a given level of
sched domain in the case of pinned tasks:
https://lkml.org/lkml/2014/7/9/454

You are right about iterating through all tasks including the current
task during load balancing.

Thanks

Regards
Preeti U Murthy
> 
> I'm going to send a new revision of the patchset with the correction
> 
> Vincent
> 


Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d3c73122..0c48dff 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6615,10 +6615,8 @@  more_balance:
 		if (sd_parent) {
 			int *group_imbalance = &sd_parent->groups->sgc->imbalance;
 
-			if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0) {
+			if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0)
 				*group_imbalance = 1;
-			} else if (*group_imbalance)
-				*group_imbalance = 0;
 		}
 
 		/* All tasks on this runqueue were pinned by CPU affinity */
@@ -6703,6 +6701,16 @@  more_balance:
 	goto out;
 
 out_balanced:
+	/*
+	 * We reach balance although we may have faced some affinity
+	 * constraints. Clear the imbalance flag if it was set.
+	 */
+	if (sd_parent) {
+		int *group_imbalance = &sd_parent->groups->sgc->imbalance;
+		if (*group_imbalance)
+			*group_imbalance = 0;
+	}
+
 	schedstat_inc(sd, lb_balanced[idle]);
 
 	sd->nr_balance_failed = 0;
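
The behavioral difference introduced by the patch can be sketched outside the kernel as follows. The two helpers are illustrative stand-ins for the out_balanced path, not the real load_balance():

```c
#include <assert.h>

/* Before the patch, reaching out_balanced left the parent group's
 * imbalance flag untouched, so a flag set by a sporadic imbalance
 * could stay set and keep triggering useless active load balancing.
 * After the patch, a balanced exit resets it.  Illustrative only. */
static void out_balanced_before(int *group_imbalance)
{
	(void)group_imbalance; /* flag never cleared on this path */
}

static void out_balanced_after(int *group_imbalance)
{
	if (*group_imbalance)
		*group_imbalance = 0; /* balanced state resets the flag */
}
```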