Message ID: 1404144343-18720-2-git-send-email-vincent.guittot@linaro.org
State: New
On 06/30/2014 09:35 PM, Vincent Guittot wrote: > The imbalance flag can stay set whereas there is no imbalance. > > Let assume that we have 3 tasks that run on a dual cores /dual cluster system. > We will have some idle load balance which are triggered during tick. > Unfortunately, the tick is also used to queue background work so we can reach > the situation where short work has been queued on a CPU which already runs a > task. The load balance will detect this imbalance (2 tasks on 1 CPU and an idle > CPU) and will try to pull the waiting task on the idle CPU. The waiting task is > a worker thread that is pinned on a CPU so an imbalance due to pinned task is > detected and the imbalance flag is set. > Then, we will not be able to clear the flag because we have at most 1 task on > each CPU but the imbalance flag will trig to useless active load balance > between the idle CPU and the busy CPU. > > We need to reset of the imbalance flag as soon as we have reached a balanced > state. > > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> > --- > kernel/sched/fair.c | 14 +++++++++++--- > 1 file changed, 11 insertions(+), 3 deletions(-) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index d3c73122..0c48dff 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -6615,10 +6615,8 @@ more_balance: > if (sd_parent) { > int *group_imbalance = &sd_parent->groups->sgc->imbalance; > > - if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0) { > + if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0) > *group_imbalance = 1; > - } else if (*group_imbalance) > - *group_imbalance = 0; > } > > /* All tasks on this runqueue were pinned by CPU affinity */ > @@ -6703,6 +6701,16 @@ more_balance: > goto out; > > out_balanced: > + /* > + * We reach balance although we may have faced some affinity > + * constraints. Clear the imbalance flag if it was set. > + */ > + if (sd_parent) { > + int *group_imbalance = &sd_parent->groups->sgc->imbalance; > + if (*group_imbalance) > + *group_imbalance = 0; > + } > + > schedstat_inc(sd, lb_balanced[idle]); > > sd->nr_balance_failed = 0; > I am not convinced that we can clear the imbalance flag here. Lets take a simple example. Assume at a particular level of sched_domain, there are two sched_groups with one cpu each. There are 2 tasks on the source cpu, one of which is running(t1) and the other thread(t2) does not have the dst_cpu in the tsk_allowed_mask. Now no task can be migrated to the dst_cpu due to affinity constraints. Note that t2 is *not pinned, it just cannot run on the dst_cpu*. In this scenario also we reach the out_balanced tag right? If we set the group_imbalance flag to 0, we are ruling out the possibility of migrating t2 to any other cpu in a higher level sched_domain by saying that all is well, there is no imbalance. This is wrong, isn't it? My point is that by clearing the imbalance flag in the out_balanced case, you might be overlooking the fact that the tsk_cpus_allowed mask of the tasks on the src_cpu may not be able to run on the dst_cpu in *this* level of sched_domain, but can potentially run on a cpu at any higher level of sched_domain. By clearing the flag, we are not encouraging load balance at that level for t2. Am I missing something? Regards Preeti U Murthy -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
On 8 July 2014 05:13, Preeti U Murthy <preeti@linux.vnet.ibm.com> wrote: > On 06/30/2014 09:35 PM, Vincent Guittot wrote: >> The imbalance flag can stay set whereas there is no imbalance. >> >> Let assume that we have 3 tasks that run on a dual cores /dual cluster system. >> We will have some idle load balance which are triggered during tick. >> Unfortunately, the tick is also used to queue background work so we can reach >> the situation where short work has been queued on a CPU which already runs a >> task. The load balance will detect this imbalance (2 tasks on 1 CPU and an idle >> CPU) and will try to pull the waiting task on the idle CPU. The waiting task is >> a worker thread that is pinned on a CPU so an imbalance due to pinned task is >> detected and the imbalance flag is set. >> Then, we will not be able to clear the flag because we have at most 1 task on >> each CPU but the imbalance flag will trig to useless active load balance >> between the idle CPU and the busy CPU. >> >> We need to reset of the imbalance flag as soon as we have reached a balanced >> state. >> >> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> >> --- >> kernel/sched/fair.c | 14 +++++++++++--- >> 1 file changed, 11 insertions(+), 3 deletions(-) >> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >> index d3c73122..0c48dff 100644 >> --- a/kernel/sched/fair.c >> +++ b/kernel/sched/fair.c >> @@ -6615,10 +6615,8 @@ more_balance: >> if (sd_parent) { >> int *group_imbalance = &sd_parent->groups->sgc->imbalance; >> >> - if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0) { >> + if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0) >> *group_imbalance = 1; >> - } else if (*group_imbalance) >> - *group_imbalance = 0; >> } >> >> /* All tasks on this runqueue were pinned by CPU affinity */ >> @@ -6703,6 +6701,16 @@ more_balance: >> goto out; >> >> out_balanced: >> + /* >> + * We reach balance although we may have faced some affinity >> + * constraints. Clear the imbalance flag if it was set. >> + */ >> + if (sd_parent) { >> + int *group_imbalance = &sd_parent->groups->sgc->imbalance; >> + if (*group_imbalance) >> + *group_imbalance = 0; >> + } >> + >> schedstat_inc(sd, lb_balanced[idle]); >> >> sd->nr_balance_failed = 0; >> > I am not convinced that we can clear the imbalance flag here. Lets take > a simple example. Assume at a particular level of sched_domain, there > are two sched_groups with one cpu each. There are 2 tasks on the source > cpu, one of which is running(t1) and the other thread(t2) does not have > the dst_cpu in the tsk_allowed_mask. Now no task can be migrated to the > dst_cpu due to affinity constraints. Note that t2 is *not pinned, it > just cannot run on the dst_cpu*. In this scenario also we reach the > out_balanced tag right? If we set the group_imbalance flag to 0, we are No we will not. If we have 2 tasks on 1 CPU in one sched_group and the other group with an idle CPU, we are not balanced so we will not go to out_balanced and the group_imbalance will staty set until we reach a balanced state (by migrating t1). > ruling out the possibility of migrating t2 to any other cpu in a higher > level sched_domain by saying that all is well, there is no imbalance. > This is wrong, isn't it? 
>
> My point is that by clearing the imbalance flag in the out_balanced
> case, you might be overlooking the fact that the tsk_cpus_allowed mask
> of the tasks on the src_cpu may not be able to run on the dst_cpu in
> *this* level of sched_domain, but can potentially run on a cpu at any
> higher level of sched_domain. By clearing the flag, we are not
> encouraging load balance at that level for t2.

The imbalance flag is per sched_domain level, so we will not clear the
group_imbalance flag of other levels. If the imbalance is also detected
at a higher level, it will migrate t2.

Regards,
Vincent

> Am I missing something?
>
> Regards
> Preeti U Murthy
On 06/30/2014 12:05 PM, Vincent Guittot wrote:
> The imbalance flag can stay set whereas there is no imbalance.
>
> Let assume that we have 3 tasks that run on a dual cores /dual
> cluster system. We will have some idle load balance which are
> triggered during tick. Unfortunately, the tick is also used to
> queue background work so we can reach the situation where short
> work has been queued on a CPU which already runs a task. The load
> balance will detect this imbalance (2 tasks on 1 CPU and an idle
> CPU) and will try to pull the waiting task on the idle CPU. The
> waiting task is a worker thread that is pinned on a CPU so an
> imbalance due to pinned task is detected and the imbalance flag is
> set. Then, we will not be able to clear the flag because we have at
> most 1 task on each CPU but the imbalance flag will trig to useless
> active load balance between the idle CPU and the busy CPU.
>
> We need to reset of the imbalance flag as soon as we have reached a
> balanced state.
>
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

Acked-by: Rik van Riel <riel@redhat.com>

--
All rights reversed
On 07/08/2014 11:05 PM, Rik van Riel wrote:
> On 06/30/2014 12:05 PM, Vincent Guittot wrote:
>> The imbalance flag can stay set whereas there is no imbalance.
>>
>> Let assume that we have 3 tasks that run on a dual cores /dual
>> cluster system. We will have some idle load balance which are
>> triggered during tick. Unfortunately, the tick is also used to
>> queue background work so we can reach the situation where short
>> work has been queued on a CPU which already runs a task. The
>> load balance will detect this imbalance (2 tasks on 1 CPU and an
>> idle CPU) and will try to pull the waiting task on the idle CPU.
>> The waiting task is a worker thread that is pinned on a CPU so
>> an imbalance due to pinned task is detected and the imbalance
>> flag is set. Then, we will not be able to clear the flag because
>> we have at most 1 task on each CPU but the imbalance flag will
>> trig to useless active load balance between the idle CPU and the
>> busy CPU.
>>
>> We need to reset of the imbalance flag as soon as we have reached
>> a balanced state.
>>
>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>
> Acked-by: Rik van Riel <riel@redhat.com>

Never mind that, Preeti explained the failure mode in more detail on
irc, and I am no longer convinced this change is a good idea.

--
All rights reversed
Hi Vincent, On 07/08/2014 03:42 PM, Vincent Guittot wrote: > On 8 July 2014 05:13, Preeti U Murthy <preeti@linux.vnet.ibm.com> wrote: >> On 06/30/2014 09:35 PM, Vincent Guittot wrote: >>> The imbalance flag can stay set whereas there is no imbalance. >>> >>> Let assume that we have 3 tasks that run on a dual cores /dual cluster system. >>> We will have some idle load balance which are triggered during tick. >>> Unfortunately, the tick is also used to queue background work so we can reach >>> the situation where short work has been queued on a CPU which already runs a >>> task. The load balance will detect this imbalance (2 tasks on 1 CPU and an idle >>> CPU) and will try to pull the waiting task on the idle CPU. The waiting task is >>> a worker thread that is pinned on a CPU so an imbalance due to pinned task is >>> detected and the imbalance flag is set. >>> Then, we will not be able to clear the flag because we have at most 1 task on >>> each CPU but the imbalance flag will trig to useless active load balance >>> between the idle CPU and the busy CPU. >>> >>> We need to reset of the imbalance flag as soon as we have reached a balanced >>> state. >>> >>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> >>> --- >>> kernel/sched/fair.c | 14 +++++++++++--- >>> 1 file changed, 11 insertions(+), 3 deletions(-) >>> >>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >>> index d3c73122..0c48dff 100644 >>> --- a/kernel/sched/fair.c >>> +++ b/kernel/sched/fair.c >>> @@ -6615,10 +6615,8 @@ more_balance: >>> if (sd_parent) { >>> int *group_imbalance = &sd_parent->groups->sgc->imbalance; >>> >>> - if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0) { >>> + if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0) >>> *group_imbalance = 1; >>> - } else if (*group_imbalance) >>> - *group_imbalance = 0; >>> } >>> >>> /* All tasks on this runqueue were pinned by CPU affinity */ >>> @@ -6703,6 +6701,16 @@ more_balance: >>> goto out; >>> >>> out_balanced: >>> + /* >>> + * We reach balance although we may have faced some affinity >>> + * constraints. Clear the imbalance flag if it was set. >>> + */ >>> + if (sd_parent) { >>> + int *group_imbalance = &sd_parent->groups->sgc->imbalance; >>> + if (*group_imbalance) >>> + *group_imbalance = 0; >>> + } >>> + >>> schedstat_inc(sd, lb_balanced[idle]); >>> >>> sd->nr_balance_failed = 0; >>> >> I am not convinced that we can clear the imbalance flag here. Lets take >> a simple example. Assume at a particular level of sched_domain, there >> are two sched_groups with one cpu each. There are 2 tasks on the source >> cpu, one of which is running(t1) and the other thread(t2) does not have >> the dst_cpu in the tsk_allowed_mask. Now no task can be migrated to the >> dst_cpu due to affinity constraints. Note that t2 is *not pinned, it >> just cannot run on the dst_cpu*. In this scenario also we reach the >> out_balanced tag right? If we set the group_imbalance flag to 0, we are > > No we will not. If we have 2 tasks on 1 CPU in one sched_group and the > other group with an idle CPU, we are not balanced so we will not go > to out_balanced and the group_imbalance will staty set until we reach > a balanced state (by migrating t1). In the example that I mention above, t1 and t2 are on the rq of cpu0; while t1 is running on cpu0, t2 is on the rq but does not have cpu1 in its cpus allowed mask. So during load balance, cpu1 tries to pull t2, cannot do so, and hence LBF_ALL_PINNED flag is set and it jumps to out_balanced. 
Note that there are only two sched groups at this level of sched domain,
one with cpu0 and the other with cpu1. In this scenario we do not try to
do active load balancing, at least that's what the code does now if the
LBF_ALL_PINNED flag is set.

>
>> ruling out the possibility of migrating t2 to any other cpu in a higher
>> level sched_domain by saying that all is well, there is no imbalance.
>> This is wrong, isn't it?
>>
>> My point is that by clearing the imbalance flag in the out_balanced
>> case, you might be overlooking the fact that the tsk_cpus_allowed mask
>> of the tasks on the src_cpu may not be able to run on the dst_cpu in
>> *this* level of sched_domain, but can potentially run on a cpu at any
>> higher level of sched_domain. By clearing the flag, we are not
>> encouraging load balance at that level for t2.
>
> The imbalance flag is per sched_domain level so we will not clear
> group_imbalance flag of other levels if the imbalance is also detected
> at a higher level it will migrate t2

Continuing with the above explanation: when the LBF_ALL_PINNED flag is
set and we jump to out_balanced, we clear the imbalance flag for the
sched_group comprising cpu0 and cpu1, although there is actually an
imbalance. t2 could still be migrated to, say, cpu2/cpu3 (t2 has them in
its cpus allowed mask) in another sched group when load balancing is
done at the next sched domain level.

Elaborating on this: when cpu2 in another socket, let's say, begins load
balancing and update_sd_pick_busiest() is called, the group with cpu0
and cpu1 may not be picked as a potential imbalanced group. Had we not
cleared the imbalance flag for this group, we could have balanced out t2
to cpu2/3.

Is the scenario I am describing clear?

Regards
Preeti U Murthy
On 9 July 2014 05:54, Preeti U Murthy <preeti@linux.vnet.ibm.com> wrote: > Hi Vincent, > > On 07/08/2014 03:42 PM, Vincent Guittot wrote: [ snip] >>>> out_balanced: >>>> + /* >>>> + * We reach balance although we may have faced some affinity >>>> + * constraints. Clear the imbalance flag if it was set. >>>> + */ >>>> + if (sd_parent) { >>>> + int *group_imbalance = &sd_parent->groups->sgc->imbalance; >>>> + if (*group_imbalance) >>>> + *group_imbalance = 0; >>>> + } >>>> + >>>> schedstat_inc(sd, lb_balanced[idle]); >>>> >>>> sd->nr_balance_failed = 0; >>>> >>> I am not convinced that we can clear the imbalance flag here. Lets take >>> a simple example. Assume at a particular level of sched_domain, there >>> are two sched_groups with one cpu each. There are 2 tasks on the source >>> cpu, one of which is running(t1) and the other thread(t2) does not have >>> the dst_cpu in the tsk_allowed_mask. Now no task can be migrated to the >>> dst_cpu due to affinity constraints. Note that t2 is *not pinned, it >>> just cannot run on the dst_cpu*. In this scenario also we reach the >>> out_balanced tag right? If we set the group_imbalance flag to 0, we are >> >> No we will not. If we have 2 tasks on 1 CPU in one sched_group and the >> other group with an idle CPU, we are not balanced so we will not go >> to out_balanced and the group_imbalance will staty set until we reach >> a balanced state (by migrating t1). > > In the example that I mention above, t1 and t2 are on the rq of cpu0; > while t1 is running on cpu0, t2 is on the rq but does not have cpu1 in > its cpus allowed mask. So during load balance, cpu1 tries to pull t2, > cannot do so, and hence LBF_ALL_PINNED flag is set and it jumps to That's where I disagree: my understanding of can_migrate_task is that the LBF_ALL_PINNED will be cleared before returning false when checking t1 because we are testing all tasks even the running task > out_balanced. Note that there are only two sched groups at this level of > sched domain.one with cpu0 and the other with cpu1. In this scenario we > do not try to do active load balancing, atleast thats what the code does > now if LBF_ALL_PINNED flag is set. > >> >>> ruling out the possibility of migrating t2 to any other cpu in a higher >>> level sched_domain by saying that all is well, there is no imbalance. >>> This is wrong, isn't it? >>> >>> My point is that by clearing the imbalance flag in the out_balanced >>> case, you might be overlooking the fact that the tsk_cpus_allowed mask >>> of the tasks on the src_cpu may not be able to run on the dst_cpu in >>> *this* level of sched_domain, but can potentially run on a cpu at any >>> higher level of sched_domain. By clearing the flag, we are not >> >> The imbalance flag is per sched_domain level so we will not clear >> group_imbalance flag of other levels if the imbalance is also detected >> at a higher level it will migrate t2 > > Continuing with the above explanation; when LBF_ALL_PINNED flag is > set,and we jump to out_balanced, we clear the imbalance flag for the > sched_group comprising of cpu0 and cpu1,although there is actually an > imbalance. t2 could still be migrated to say cpu2/cpu3 (t2 has them in > its cpus allowed mask) in another sched group when load balancing is > done at the next sched domain level. 
The imbalance is per sched_domain level, so it will not have any side
effect on the next level.

Regards,
Vincent

>
> Elaborating on this, when cpu2 in another socket, lets say, begins load
> balancing and update_sd_pick_busiest() is called, the group with cpu0
> and cpu1 may not be picked as a potential imbalanced group. Had we not
> cleared the imbalance flag for this group, we could have balanced out t2
> to cpu2/3.
>
> Is the scenario I am describing clear?
>
> Regards
> Preeti U Murthy
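For readers following along, here is a minimal, self-contained sketch of
the check ordering in can_migrate_task() that Vincent refers to above.
The stand-in types, field names and flag value are illustrative only,
not the kernel's own definitions:

#include <stdbool.h>

#define LBF_ALL_PINNED 0x01	/* value illustrative */

struct task { bool allowed_on_dst; bool running; };
struct lb_env { unsigned int flags; };

/*
 * Simplified sketch of the order Vincent describes: the affinity test
 * (which clears LBF_ALL_PINNED once any task may run on the destination)
 * comes before the "currently running" test, so a task like t1 that is
 * running but not pinned still clears the flag.
 */
static bool can_migrate_task_sketch(struct task *p, struct lb_env *env)
{
	/* 1) affinity: if p cannot run on the destination, keep ALL_PINNED */
	if (!p->allowed_on_dst)
		return false;

	/* p could run on the destination, so not *all* tasks are pinned */
	env->flags &= ~LBF_ALL_PINNED;

	/* 2) only afterwards reject the task because it is currently running */
	if (p->running)
		return false;

	return true;
}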
On Mon, Jun 30, 2014 at 06:05:32PM +0200, Vincent Guittot wrote:
> +++ b/kernel/sched/fair.c
> @@ -6615,10 +6615,8 @@ more_balance:

In order to avoid labels being used as function names in the hunk
headers, you can use:

.quiltrc:
QUILT_DIFF_OPTS="-F ^[[:alpha:]\$_].*[^:]\$"

.gitconfig:
[diff "default"]
	xfuncname = "^[[:alpha:]$_].*[^:]$"
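As an illustration of what the override changes (the exact signature
line shown is approximate, assuming the hunk falls inside
load_balance()): git's default C heuristic attributes the hunk to the
nearest column-0 label, while the xfuncname rule above skips lines
ending in ':' and attributes it to the enclosing function definition.

default heuristic:
@@ -6615,10 +6615,8 @@ more_balance:

with the xfuncname override:
@@ -6615,10 +6615,8 @@ static int load_balance(int this_cpu, struct rq *this_rq,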
On 9 July 2014 12:14, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, Jun 30, 2014 at 06:05:32PM +0200, Vincent Guittot wrote:
>> +++ b/kernel/sched/fair.c
>> @@ -6615,10 +6615,8 @@ more_balance:
>
> In order to avoid labels being used as function names in the hunk
> headers, you can use:
>
> .quiltrc:
> QUILT_DIFF_OPTS="-F ^[[:alpha:]\$_].*[^:]\$"
>
> .gitconfig:
> [diff "default"]
> 	xfuncname = "^[[:alpha:]$_].*[^:]$"

Thanks, I'm going to add it to my .gitconfig.
On Wed, Jul 09, 2014 at 09:24:54AM +0530, Preeti U Murthy wrote:
> In the example that I mention above, t1 and t2 are on the rq of cpu0;
> while t1 is running on cpu0, t2 is on the rq but does not have cpu1 in
> its cpus allowed mask. So during load balance, cpu1 tries to pull t2,
> cannot do so, and hence LBF_ALL_PINNED flag is set and it jumps to
> out_balanced. Note that there are only two sched groups at this level of
> sched domain, one with cpu0 and the other with cpu1. In this scenario we
> do not try to do active load balancing, at least that's what the code does
> now if LBF_ALL_PINNED flag is set.

I think Vince is right in saying that in this scenario ALL_PINNED won't
be set. move_tasks() will iterate cfs_rq::cfs_tasks, and that list will
also include the current running task.

And can_migrate_task() only checks for current after the pinning bits.

> Continuing with the above explanation; when LBF_ALL_PINNED flag is
> set, and we jump to out_balanced, we clear the imbalance flag for the
> sched_group comprising of cpu0 and cpu1, although there is actually an
> imbalance. t2 could still be migrated to say cpu2/cpu3 (t2 has them in
> its cpus allowed mask) in another sched group when load balancing is
> done at the next sched domain level.

And this is where Vince is wrong; note how
update_sg_lb_stats()/sg_imbalanced() uses group->sgc->imbalance, but
load_balance() sets sd_parent->groups->sgc->imbalance, so explicitly
one level up.

So what we can do I suppose is clear 'group->sgc->imbalance' at
out_balanced.

In any case, the entirety of this group imbalance crap is just that,
crap. It's a terribly difficult situation and the current bits more or
less fudge around some of the common cases. Also see the comment near
sg_imbalanced(). It's not a solid and 'correct' anything. It's a bunch
of hacks trying to deal with hard cases. A 'good' solution would be
prohibitively expensive I fear.
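To make the "one level up" observation concrete, here is a minimal,
self-contained sketch of the relationship Peter describes. The stub
types and the flag value are illustrative stand-ins for the kernel's
structures, simplified from the behaviour discussed in this thread:

#define LBF_SOME_PINNED 0x08	/* value illustrative */

struct sched_group_capacity { int imbalance; };
struct sched_group { struct sched_group_capacity *sgc; };
struct sched_domain {
	struct sched_domain *parent;
	struct sched_group *groups;	/* first group: the one containing this CPU */
};
struct lb_env { unsigned int flags; long imbalance; };

/*
 * Called while balancing at level 'sd': the pinned-induced imbalance is
 * recorded on the group, one level up, that spans this domain.
 */
static void record_pinned_imbalance(struct lb_env *env, struct sched_domain *sd)
{
	struct sched_domain *sd_parent = sd->parent;

	if (sd_parent && (env->flags & LBF_SOME_PINNED) && env->imbalance > 0)
		sd_parent->groups->sgc->imbalance = 1;
}

/*
 * Called while balancing at the *parent* level: update_sg_lb_stats()
 * consults each child group's flag via sg_imbalanced().
 */
static int sg_imbalanced_sketch(struct sched_group *group)
{
	return group->sgc->imbalance;	/* set by a child-level load_balance() */
}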
On 07/09/2014 04:13 PM, Peter Zijlstra wrote: > On Wed, Jul 09, 2014 at 09:24:54AM +0530, Preeti U Murthy wrote: >> In the example that I mention above, t1 and t2 are on the rq of cpu0; >> while t1 is running on cpu0, t2 is on the rq but does not have cpu1 in >> its cpus allowed mask. So during load balance, cpu1 tries to pull t2, >> cannot do so, and hence LBF_ALL_PINNED flag is set and it jumps to >> out_balanced. Note that there are only two sched groups at this level of >> sched domain.one with cpu0 and the other with cpu1. In this scenario we >> do not try to do active load balancing, atleast thats what the code does >> now if LBF_ALL_PINNED flag is set. > > I think Vince is right in saying that in this scenario ALL_PINNED won't > be set. move_tasks() will iterate cfs_rq::cfs_tasks, that list will also > include the current running task. Hmm.. really? Because while dequeueing a task from the rq so as to schedule it on a cpu, we delete its entry from the list of cfs_tasks on the rq. list_del_init(&se->group_node) in account_entity_dequeue() does that. > > And can_migrate_task() only checks for current after the pinning bits. > >> Continuing with the above explanation; when LBF_ALL_PINNED flag is >> set,and we jump to out_balanced, we clear the imbalance flag for the >> sched_group comprising of cpu0 and cpu1,although there is actually an >> imbalance. t2 could still be migrated to say cpu2/cpu3 (t2 has them in >> its cpus allowed mask) in another sched group when load balancing is >> done at the next sched domain level. > > And this is where Vince is wrong; note how > update_sg_lb_stats()/sg_imbalance() uses group->sgc->imbalance, but > load_balance() sets: sd_parent->groups->sgc->imbalance, so explicitly > one level up. One level up? The group->sgc->imbalance flag is checked during update_sg_lb_stats(). This flag is *set during the load balancing at a lower level sched domain*.IOW, when the 'group' formed the sched domain. > > So what we can do I suppose is clear 'group->sgc->imbalance' at > out_balanced. You mean 'set'? If we clear it we will have no clue about imbalances at lower level sched domains due to pinning. Specifically in LBF_ALL_PINNED case. This might prevent us from balancing out these tasks to other groups at a higher level domain. update_sd_pick_busiest() specifically relies on this flag to choose the busiest group. > > In any case, the entirely of this group imbalance crap is just that, > crap. Its a terribly difficult situation and the current bits more or > less fudge around some of the common cases. Also see the comment near > sg_imbalanced(). Its not a solid and 'correct' anything. Its a bunch of > hacks trying to deal with hard cases. > > A 'good' solution would be prohibitively expensive I fear. So the problem that Vincent is trying to bring to the fore with this patchset is that, when the busy group has cpus with only 1 running task but imbalance flag set due to affinity constraints, we unnecessarily try to do active balancing on this group. Active load balancing will not succeed when there is only one task. To solve this issue will a simple check on (busiest->nr_running > 1) in addition to !(ld_moved) before doing active load balancing not work? Thanks Regards Preeti U Murthy > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
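Preeti's suggestion at the end of the message above would amount to a
guard along the following lines. This is a sketch of the idea only,
with an illustrative helper name and stub type; whether such a check is
sufficient is exactly what the thread is debating:

#include <stdbool.h>

struct rq { unsigned int nr_running; };

/*
 * Only fall back to active load balancing when nothing was moved *and*
 * the busiest runqueue still has more than one task, since actively
 * balancing a lone running task cannot succeed.
 */
static bool should_try_active_balance(bool ld_moved, struct rq *busiest)
{
	return !ld_moved && busiest->nr_running > 1;
}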
On Wed, Jul 09, 2014 at 05:11:20PM +0530, Preeti U Murthy wrote: > On 07/09/2014 04:13 PM, Peter Zijlstra wrote: > > On Wed, Jul 09, 2014 at 09:24:54AM +0530, Preeti U Murthy wrote: > >> In the example that I mention above, t1 and t2 are on the rq of cpu0; > >> while t1 is running on cpu0, t2 is on the rq but does not have cpu1 in > >> its cpus allowed mask. So during load balance, cpu1 tries to pull t2, > >> cannot do so, and hence LBF_ALL_PINNED flag is set and it jumps to > >> out_balanced. Note that there are only two sched groups at this level of > >> sched domain.one with cpu0 and the other with cpu1. In this scenario we > >> do not try to do active load balancing, atleast thats what the code does > >> now if LBF_ALL_PINNED flag is set. > > > > I think Vince is right in saying that in this scenario ALL_PINNED won't > > be set. move_tasks() will iterate cfs_rq::cfs_tasks, that list will also > > include the current running task. > > Hmm.. really? Because while dequeueing a task from the rq so as to > schedule it on a cpu, we delete its entry from the list of cfs_tasks on > the rq. > > list_del_init(&se->group_node) in account_entity_dequeue() does that. But set_next_entity() doesn't call account_entity_dequeue(), only __dequeue_entity() to take it out of the rb-tree. > > And can_migrate_task() only checks for current after the pinning bits. > > > >> Continuing with the above explanation; when LBF_ALL_PINNED flag is > >> set,and we jump to out_balanced, we clear the imbalance flag for the > >> sched_group comprising of cpu0 and cpu1,although there is actually an > >> imbalance. t2 could still be migrated to say cpu2/cpu3 (t2 has them in > >> its cpus allowed mask) in another sched group when load balancing is > >> done at the next sched domain level. > > > > And this is where Vince is wrong; note how > > update_sg_lb_stats()/sg_imbalance() uses group->sgc->imbalance, but > > load_balance() sets: sd_parent->groups->sgc->imbalance, so explicitly > > one level up. > > One level up? The group->sgc->imbalance flag is checked during > update_sg_lb_stats(). This flag is *set during the load balancing at a > lower level sched domain*.IOW, when the 'group' formed the sched domain. sd_parent is one level up. > > > > So what we can do I suppose is clear 'group->sgc->imbalance' at > > out_balanced. > > You mean 'set'? If we clear it we will have no clue about imbalances at > lower level sched domains due to pinning. Specifically in LBF_ALL_PINNED > case. This might prevent us from balancing out these tasks to other > groups at a higher level domain. update_sd_pick_busiest() specifically > relies on this flag to choose the busiest group. No, clear, in load_balance. So set one level up, clear the current level.
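A minimal sketch of the distinction Peter draws between picking a task
to run and a full dequeue, using illustrative stand-in types rather
than the kernel's structures:

/* Stand-ins for the two containers involved (not the kernel types). */
struct entity {
	int on_timeline;	/* membership of the rb-tree */
	int on_cfs_tasks;	/* se->group_node being on cfs_rq->cfs_tasks */
};

/*
 * set_next_entity() only pulls the entity off the timeline via
 * __dequeue_entity(); it does NOT call account_entity_dequeue(), so the
 * entity stays on the cfs_tasks list and is still seen by move_tasks().
 */
static void set_next_entity_sketch(struct entity *se)
{
	se->on_timeline = 0;		/* __dequeue_entity() */
	/* se->on_cfs_tasks left untouched */
}

/* A full dequeue (e.g. the task sleeps) removes it from both. */
static void dequeue_entity_sketch(struct entity *se)
{
	se->on_timeline = 0;		/* __dequeue_entity() */
	se->on_cfs_tasks = 0;		/* account_entity_dequeue() -> list_del_init() */
}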
On 9 July 2014 12:43, Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, Jul 09, 2014 at 09:24:54AM +0530, Preeti U Murthy wrote:

[snip]

>> Continuing with the above explanation; when LBF_ALL_PINNED flag is
>> set, and we jump to out_balanced, we clear the imbalance flag for the
>> sched_group comprising of cpu0 and cpu1, although there is actually an
>> imbalance. t2 could still be migrated to say cpu2/cpu3 (t2 has them in
>> its cpus allowed mask) in another sched group when load balancing is
>> done at the next sched domain level.
>
> And this is where Vince is wrong; note how
> update_sg_lb_stats()/sg_imbalanced() uses group->sgc->imbalance, but
> load_balance() sets sd_parent->groups->sgc->imbalance, so explicitly
> one level up.

I forgot this behavior when studying Preeti's use case.

> So what we can do I suppose is clear 'group->sgc->imbalance' at
> out_balanced.
>
> In any case, the entirety of this group imbalance crap is just that,
> crap. It's a terribly difficult situation and the current bits more or
> less fudge around some of the common cases. Also see the comment near
> sg_imbalanced(). It's not a solid and 'correct' anything. It's a bunch
> of hacks trying to deal with hard cases.
>
> A 'good' solution would be prohibitively expensive I fear.

I have tried to summarize the several use cases that have been
discussed for this patch.

The 1st use case is the one that I described in the commit message of
this patch: if we have a sporadic imbalance that sets the imbalance
flag, we don't clear it afterwards and it generates spurious and
useless active load balances.

Then Preeti came with the following use case: we have a sched_domain
made of CPU0 and CPU1 in 2 different sched_groups. 2 tasks A and B are
on CPU0, B can't run on CPU1, and A is the running task. When CPU1's
sched_group is doing load balance, the imbalance should be set. That
still happens with this patchset because the LBF_ALL_PINNED flag will
be cleared thanks to task A.

Preeti also explained the following use case to me on irc: if both
tasks A and B can't run on CPU1, LBF_ALL_PINNED will stay set. As we
can't do anything, we conclude that we are balanced, we go to
out_balanced and we clear the imbalance flag. But we should not
consider that a balanced state but an all-tasks-pinned state instead,
and we should leave the imbalance flag set. If we now have 2 additional
CPUs which are in the cpumask of task A and/or B at the parent
sched_domain level, we should migrate one task out of this group, but
this will not happen (with this patch) because the sched_group made of
CPU0 and CPU1 is not overloaded (2 tasks for 2 CPUs) and the imbalance
flag has been cleared as described previously.

I'm going to send a new revision of the patchset with the correction.

Vincent
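Based on the correction Vincent outlines (keep the flag set when every
candidate task was pinned, clear it otherwise), the reworked
out_balanced path would presumably take a shape like the excerpt below.
This is an inference from the discussion, not the actual v2 patch:

out_balanced:
	/*
	 * We may have reached this point after facing affinity
	 * constraints. Only clear the parent-level imbalance flag when
	 * the balance is genuine, i.e. not when every candidate task
	 * was pinned to the source CPU (LBF_ALL_PINNED).
	 */
	if (sd_parent && !(env.flags & LBF_ALL_PINNED)) {
		int *group_imbalance = &sd_parent->groups->sgc->imbalance;

		if (*group_imbalance)
			*group_imbalance = 0;
	}

	schedstat_inc(sd, lb_balanced[idle]);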
Hi Peter, Vincent, On 07/10/2014 02:44 PM, Vincent Guittot wrote: > On 9 July 2014 12:43, Peter Zijlstra <peterz@infradead.org> wrote: >> On Wed, Jul 09, 2014 at 09:24:54AM +0530, Preeti U Murthy wrote: > > [snip] > >> >>> Continuing with the above explanation; when LBF_ALL_PINNED flag is >>> set,and we jump to out_balanced, we clear the imbalance flag for the >>> sched_group comprising of cpu0 and cpu1,although there is actually an >>> imbalance. t2 could still be migrated to say cpu2/cpu3 (t2 has them in >>> its cpus allowed mask) in another sched group when load balancing is >>> done at the next sched domain level. >> >> And this is where Vince is wrong; note how >> update_sg_lb_stats()/sg_imbalance() uses group->sgc->imbalance, but >> load_balance() sets: sd_parent->groups->sgc->imbalance, so explicitly >> one level up. >> > > I forgot this behavior when studying preeti use case > >> So what we can do I suppose is clear 'group->sgc->imbalance' at >> out_balanced. >> >> In any case, the entirely of this group imbalance crap is just that, >> crap. Its a terribly difficult situation and the current bits more or >> less fudge around some of the common cases. Also see the comment near >> sg_imbalanced(). Its not a solid and 'correct' anything. Its a bunch of >> hacks trying to deal with hard cases. >> >> A 'good' solution would be prohibitively expensive I fear. > > I have tried to summarized several use cases that have been discussed > for this patch > > The 1st use case is the one that i described in the commit message of > this patch: If we have a sporadic imbalance that set the imbalance > flag, we don't clear it after and it generates spurious and useless > active load balance > > Then preeti came with the following use case : > we have a sched_domain made of CPU0 and CPU1 in 2 different sched_groups > 2 tasks A and B are on CPU0, B can't run on CPU1, A is the running task. > When CPU1's sched_group is doing load balance, the imbalance should be > set. That's still happen with this patchset because the LBF_ALL_PINNED > flag will be cleared thanks to task A. > > Preeti also explained me the following use cases on irc: > > If we have both tasks A and B that can't run on CPU1, the > LBF_ALL_PINNED will stay set. As we can't do anything, we conclude > that we are balanced, we go to out_balanced and we clear the imbalance > flag. But we should not consider that as a balanced state but as a all > tasks pinned state instead and we should let the imbalance flag set. > If we now have 2 additional CPUs which are in the cpumask of task A > and/or B at the parent sched_domain level , we should migrate one task > in this group but this will not happen (with this patch) because the > sched_group made of CPU0 and CPU1 is not overloaded (2 tasks for 2 > CPUs) and the imbalance flag has been cleared as described previously. Peter, The above paragraph describes my concern with regard to clearing the imbalance flag at a given level of sched domain in case of pinned tasks in the below conversation. https://lkml.org/lkml/2014/7/9/454. You are right about iterating through all tasks including the current task during load balancing. Thanks Regards Preeti U Murthy > > I'm going to send a new revision of the patchset with the correction > > Vincent > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d3c73122..0c48dff 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6615,10 +6615,8 @@ more_balance:
 	if (sd_parent) {
 		int *group_imbalance = &sd_parent->groups->sgc->imbalance;
 
-		if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0) {
+		if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0)
 			*group_imbalance = 1;
-		} else if (*group_imbalance)
-			*group_imbalance = 0;
 	}
 
 	/* All tasks on this runqueue were pinned by CPU affinity */
@@ -6703,6 +6701,16 @@ more_balance:
 		goto out;
 
 out_balanced:
+	/*
+	 * We reach balance although we may have faced some affinity
+	 * constraints. Clear the imbalance flag if it was set.
+	 */
+	if (sd_parent) {
+		int *group_imbalance = &sd_parent->groups->sgc->imbalance;
+		if (*group_imbalance)
+			*group_imbalance = 0;
+	}
+
 	schedstat_inc(sd, lb_balanced[idle]);
 
 	sd->nr_balance_failed = 0;
The imbalance flag can stay set even when there is no imbalance.

Let's assume that we have 3 tasks that run on a dual-core / dual-cluster
system. We will have some idle load balancing which is triggered during
the tick. Unfortunately, the tick is also used to queue background work,
so we can reach the situation where short work has been queued on a CPU
which already runs a task. The load balance will detect this imbalance
(2 tasks on 1 CPU and an idle CPU) and will try to pull the waiting task
onto the idle CPU. The waiting task is a worker thread that is pinned to
a CPU, so an imbalance due to a pinned task is detected and the
imbalance flag is set. Then we will not be able to clear the flag
because we have at most 1 task on each CPU, but the imbalance flag will
trigger useless active load balancing between the idle CPU and the busy
CPU.

We need to reset the imbalance flag as soon as we have reached a
balanced state.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)