[v2,1/3] sched/fair: fix rounding issue for asym packing

Message ID 1544803317-7610-2-git-send-email-vincent.guittot@linaro.org
State New
Series
  • sched/fair: some fixes for asym_packing

Commit Message

Vincent Guittot Dec. 14, 2018, 4:01 p.m.
When check_asym_packing() is triggered, the imbalance is set to:
busiest_stat.avg_load * busiest_stat.group_capacity / SCHED_CAPACITY_SCALE
busiest_stat.avg_load also comes from a division, and the final rounding
can make the imbalance slightly lower than the weighted load of the cfs_rq.
This is enough to skip the rq in find_busiest_queue() and prevents asym
migration from happening.

Add 1 to avg_load to make sure that the targeted CPU will not be
skipped unexpectedly.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

---
 kernel/sched/fair.c | 6 ++++++
 1 file changed, 6 insertions(+)

-- 
2.7.4

Comments

Valentin Schneider Dec. 18, 2018, 5:32 p.m. | #1
On 14/12/2018 16:01, Vincent Guittot wrote:
> When check_asym_packing() is triggered, the imbalance is set to:
> busiest_stat.avg_load * busiest_stat.group_capacity / SCHED_CAPACITY_SCALE
> busiest_stat.avg_load also comes from a division, and the final rounding
> can make the imbalance slightly lower than the weighted load of the cfs_rq.
> This is enough to skip the rq in find_busiest_queue() and prevents asym
> migration from happening.
> 
> Add 1 to avg_load to make sure that the targeted CPU will not be
> skipped unexpectedly.
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>  kernel/sched/fair.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ca46964..c215f7a 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8215,6 +8215,12 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>  	/* Adjust by relative CPU capacity of the group */
>  	sgs->group_capacity = group->sgc->capacity;
>  	sgs->avg_load = (sgs->group_load*SCHED_CAPACITY_SCALE) / sgs->group_capacity;
> +	/*
> +	 * Prevent the division rounding from making the imbalance slightly
> +	 * lower than the original value, which would prevent the rq from
> +	 * being selected as the busiest queue
> +	 */
> +	sgs->avg_load += 1;

I've tried investigating this off-by-one issue by running lmbench, but it
seems to be a gnarly one.



The adventure starts in update_sg_lb_stats() where we compute
sgs->avg_load:

	sgs->avg_load = (sgs->group_load*SCHED_CAPACITY_SCALE) / sgs->group_capacity

Then we drop by find_busiest_group() and call check_asym_packing() which,
if we do need to pack, does:

	env->imbalance = DIV_ROUND_CLOSEST(
		sds->busiest_stat.avg_load * sds->busiest_stat.group_capacity,
		SCHED_CAPACITY_SCALE);

And finally the one check that seems to be failing, and that you're
addressing with a +1 is in find_busiest_queue():

	if (rq->nr_running == 1 && wl > env->imbalance &&
	    !check_cpu_capacity(rq, env->sd))
		continue;



Now, running lmbench on my HiKey960 hacked up to use asym packing, I
get an example where there's a task that should be migrated via asym
packing but isn't:

# This is a DIE level balance
update_sg_lb_stats(): 
	cpu=0 load=1
	cpu=1 load=1023
	cpu=2 load=0
	cpu=3 load=0

	avg_load=568

# Busiest group is [0-3]
find_busiest_group(): check_asym_packing():
	env->imbalance = 1022

find_busiest_queue():
	cpu=0 wl=1
	cpu=1 wl=1023
	cpu=2 wl=0
	cpu=3 wl=0

Only CPU1 has rq->nr_running > 0 (==1 actually), but wl > env->imbalance
so we skip it in find_busiest_queue().

Now, I thought it was due to IRQ/rt pressure - the group capacity of 1840
seen in this trace isn't the maximum for the LITTLE cluster, which should
be (463 * 4) == 1852.
I got curious and threw random group capacity values in that equation while
keeping the same avg_load [1], and it gets wild: you have a signal that
oscillates between 1024 and 1014!

[1]: https://gist.github.com/valschneider/f42252e83be943d276b901f57734b8ba

Adding +1 to avg_load doesn't solve those pesky rounding errors that get
worse as group_capacity grows, but it reverses the trend: the signal
oscillates above 1024.



So in the end a +1 *could* "fix" this, but
a) It'd make more sense to have it in the check_asym_packing() code
b) It's ugly and it makes me feel like this piece of code is flimsy AF.

>  
>  	if (sgs->sum_nr_running)
>  		sgs->load_per_task = sgs->sum_weighted_load / sgs->sum_nr_running;
>
Vincent Guittot Dec. 19, 2018, 8:32 a.m. | #2
On Tue, 18 Dec 2018 at 18:32, Valentin Schneider
<valentin.schneider@arm.com> wrote:
>
> On 14/12/2018 16:01, Vincent Guittot wrote:
> > When check_asym_packing() is triggered, the imbalance is set to:
> > busiest_stat.avg_load * busiest_stat.group_capacity / SCHED_CAPACITY_SCALE
> > busiest_stat.avg_load also comes from a division, and the final rounding
> > can make the imbalance slightly lower than the weighted load of the cfs_rq.
> > This is enough to skip the rq in find_busiest_queue() and prevents asym
> > migration from happening.
> >
> > Add 1 to avg_load to make sure that the targeted CPU will not be
> > skipped unexpectedly.
> >
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > ---
> >  kernel/sched/fair.c | 6 ++++++
> >  1 file changed, 6 insertions(+)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index ca46964..c215f7a 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -8215,6 +8215,12 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> >       /* Adjust by relative CPU capacity of the group */
> >       sgs->group_capacity = group->sgc->capacity;
> >       sgs->avg_load = (sgs->group_load*SCHED_CAPACITY_SCALE) / sgs->group_capacity;
> > +     /*
> > +      * Prevent the division rounding from making the imbalance slightly
> > +      * lower than the original value, which would prevent the rq from
> > +      * being selected as the busiest queue
> > +      */
> > +     sgs->avg_load += 1;
>
> I've tried investigating this off-by-one issue by running lmbench, but it
> seems to be a gnarly one.
>
> The adventure starts in update_sg_lb_stats() where we compute
> sgs->avg_load:
>
>         sgs->avg_load = (sgs->group_load*SCHED_CAPACITY_SCALE) / sgs->group_capacity
>
> Then we drop by find_busiest_group() and call check_asym_packing() which,
> if we do need to pack, does:
>
>         env->imbalance = DIV_ROUND_CLOSEST(
>                 sds->busiest_stat.avg_load * sds->busiest_stat.group_capacity,
>                 SCHED_CAPACITY_SCALE);
>
> And finally the one check that seems to be failing, and that you're
> addressing with a +1 is in find_busiest_queue():
>
>         if (rq->nr_running == 1 && wl > env->imbalance &&
>             !check_cpu_capacity(rq, env->sd))
>                 continue;
>
> Now, running lmbench on my HiKey960 hacked up to use asym packing, I
> get an example where there's a task that should be migrated via asym
> packing but isn't:
>
> # This is a DIE level balance
> update_sg_lb_stats():
>         cpu=0 load=1
>         cpu=1 load=1023
>         cpu=2 load=0
>         cpu=3 load=0
>
>         avg_load=568
>
> # Busiest group is [0-3]
> find_busiest_group(): check_asym_packing():
>         env->imbalance = 1022
>
> find_busiest_queue():
>         cpu=0 wl=1
>         cpu=1 wl=1023
>         cpu=2 wl=0
>         cpu=3 wl=0
>
> Only CPU1 has rq->nr_running > 0 (==1 actually), but wl > env->imbalance
> so we skip it in find_busiest_queue().
>
> Now, I thought it was due to IRQ/rt pressure - the group capacity of 1840
> seen in this trace isn't the maximum for the LITTLE cluster, which should
> be (463 * 4) == 1852.
> I got curious and threw random group capacity values in that equation while
> keeping the same avg_load [1], and it gets wild: you have a signal that
> oscillates between 1024 and 1014!

hmm... this doesn't seem related to $subject

>
> [1]: https://gist.github.com/valschneider/f42252e83be943d276b901f57734b8ba
>
> Adding +1 to avg_load doesn't solve those pesky rounding errors that get
> worse as group_capacity grows, but it reverses the trend: the signal
> oscillates above 1024.
>
> So in the end a +1 *could* "fix" this, but
> a) It'd make more sense to have it in the check_asym_packing() code
> b) It's ugly and it makes me feel like this piece of code is flimsy AF.

This is another UC: asym packing is only used at the SMT level for now,
where we don't face this kind of problem. It has also been tested on a
DynamIQ configuration, which is similar to SMT: 1 CPU per sched_group.
The legacy b.L one was not the main target, although it can be one.

The rounding error is there and should be fixed; your UC is another
case, for legacy b.L, that can be treated in another patch.

>
> >
> >       if (sgs->sum_nr_running)
> >               sgs->load_per_task = sgs->sum_weighted_load / sgs->sum_nr_running;
> >
Valentin Schneider Dec. 19, 2018, 11:58 a.m. | #3
On 19/12/2018 08:32, Vincent Guittot wrote:
[...]
> This is another UC: asym packing is only used at the SMT level for now,
> where we don't face this kind of problem. It has also been tested on a
> DynamIQ configuration, which is similar to SMT: 1 CPU per sched_group.
> The legacy b.L one was not the main target, although it can be one.
>
> The rounding error is there and should be fixed; your UC is another
> case, for legacy b.L, that can be treated in another patch.
>

I used that setup out of convenience for myself, but AFAICT that use-case
just stresses that issue.

In the end we still have an imbalance value that can vary with only
slight group_capacity changes. And that's true even if that group
contains a single CPU, as the imbalance value computed by
check_asym_packing() can vary by +/-1 with very tiny capacity changes 
(rt/IRQ pressure). That means that sometimes you're off by one and don't
do the packing because some CPU was pressured, but some other time you
might do it because that CPU was *more* pressured. It doesn't make sense.


In the end the problem is that in update_sg_lb_stats() we do:

	sgs->avg_load = (sgs->group_load*SCHED_CAPACITY_SCALE) / sgs->group_capacity;

and in check_asym_packing() we do:

	env->imbalance = DIV_ROUND_CLOSEST(
		sds->busiest_stat.avg_load * sds->busiest_stat.group_capacity,
		SCHED_CAPACITY_SCALE)

So we end up with something like:

		    group_load * SCHED_CAPACITY_SCALE * group_capacity
	imbalance = --------------------------------------------------
			    group_capacity * SCHED_CAPACITY_SCALE

Which you'd hope to reduce down to:

	imbalance = group_load

but you get division errors all over the place. So actually, the fix we'd
want would be:

-----8<-----
@@ -8463,9 +8463,7 @@ static int check_asym_packing(struct lb_env *env, struct sd_lb_stats *sds)
        if (sched_asym_prefer(busiest_cpu, env->dst_cpu))
                return 0;
 
-       env->imbalance = DIV_ROUND_CLOSEST(
-               sds->busiest_stat.avg_load * sds->busiest_stat.group_capacity,
-               SCHED_CAPACITY_SCALE);
+       env->imbalance = sds->busiest_stat.group_load;
 
        return 1;
 }
----->8-----
Vincent Guittot Dec. 19, 2018, 1:39 p.m. | #4
On Wed, 19 Dec 2018 at 12:58, Valentin Schneider
<valentin.schneider@arm.com> wrote:
>
> On 19/12/2018 08:32, Vincent Guittot wrote:
> [...]
> > This is another UC: asym packing is only used at the SMT level for now,
> > where we don't face this kind of problem. It has also been tested on a
> > DynamIQ configuration, which is similar to SMT: 1 CPU per sched_group.
> > The legacy b.L one was not the main target, although it can be one.
> >
> > The rounding error is there and should be fixed; your UC is another
> > case, for legacy b.L, that can be treated in another patch.
>
> I used that setup out of convenience for myself, but AFAICT that use-case
> just stresses that issue.


After looking at your UC in detail, your problem comes from the wl=1
for cpu0 whereas there is no running task.
But wl != 0 without a running task should not be possible since commit
fdf5f3 ("sched/fair: Disable LB_BIAS by default").

This explains the 1023 instead of 1022.

>
> In the end we still have an imbalance value that can vary with only
> slight group_capacity changes. And that's true even if that group
> contains a single CPU, as the imbalance value computed by
> check_asym_packing() can vary by +/-1 with very tiny capacity changes
> (rt/IRQ pressure). That means that sometimes you're off by one and don't
> do the packing because some CPU was pressured, but some other time you
> might do it because that CPU was *more* pressured. It doesn't make sense.
>
> In the end the problem is that in update_sg_lb_stats() we do:
>
>         sgs->avg_load = (sgs->group_load*SCHED_CAPACITY_SCALE) / sgs->group_capacity;
>
> and in check_asym_packing() we do:
>
>         env->imbalance = DIV_ROUND_CLOSEST(
>                 sds->busiest_stat.avg_load * sds->busiest_stat.group_capacity,
>                 SCHED_CAPACITY_SCALE)
>
> So we end up with something like:
>
>                     group_load * SCHED_CAPACITY_SCALE * group_capacity
>         imbalance = --------------------------------------------------
>                             group_capacity * SCHED_CAPACITY_SCALE
>
> Which you'd hope to reduce down to:
>
>         imbalance = group_load
>
> but you get division errors all over the place. So actually, the fix we'd
> want would be:
>
> -----8<-----
> @@ -8463,9 +8463,7 @@ static int check_asym_packing(struct lb_env *env, struct sd_lb_stats *sds)
>         if (sched_asym_prefer(busiest_cpu, env->dst_cpu))
>                 return 0;
>
> -       env->imbalance = DIV_ROUND_CLOSEST(
> -               sds->busiest_stat.avg_load * sds->busiest_stat.group_capacity,
> -               SCHED_CAPACITY_SCALE);
> +       env->imbalance = sds->busiest_stat.group_load;
>
>         return 1;
>  }
> ----->8-----
Valentin Schneider Dec. 19, 2018, 2:59 p.m. | #5
On 19/12/2018 13:39, Vincent Guittot wrote:
[...]
>> I used that setup out of convenience for myself, but AFAICT that use-case
>> just stresses that issue.
>
> After looking at your UC in detail, your problem comes from the wl=1
> for cpu0 whereas there is no running task.
> But wl != 0 without a running task should not be possible since commit
> fdf5f3 ("sched/fair: Disable LB_BIAS by default").
>
> This explains the 1023 instead of 1022.
>

True, I had a look at the trace and there doesn't seem to be any running
task on that CPU. That's a separate matter however - the rounding issues
can happen regardless of the wl values.
Vincent Guittot Dec. 19, 2018, 3:05 p.m. | #6
On Wed, 19 Dec 2018 at 15:59, Valentin Schneider
<valentin.schneider@arm.com> wrote:
>
> On 19/12/2018 13:39, Vincent Guittot wrote:
> [...]
> >> I used that setup out of convenience for myself, but AFAICT that use-case
> >> just stresses that issue.
> >
> > After looking at your UC in detail, your problem comes from the wl=1
> > for cpu0 whereas there is no running task.
> > But wl != 0 without a running task should not be possible since commit
> > fdf5f3 ("sched/fair: Disable LB_BIAS by default").
> >
> > This explains the 1023 instead of 1022.
>
> True, I had a look at the trace and there doesn't seem to be any running
> task on that CPU. That's a separate matter however - the rounding issues
> can happen regardless of the wl values.


But it means that the +1 rounding fix works, and your problem comes
from something else
Valentin Schneider Dec. 19, 2018, 3:11 p.m. | #7
On 19/12/2018 15:05, Vincent Guittot wrote:
[...]
>> True, I had a look at the trace and there doesn't seem to be any running
>> task on that CPU. That's a separate matter however - the rounding issues
>> can happen regardless of the wl values.
>
> But it means that the +1 rounding fix works, and your problem comes
> from something else


Oh yes, I never said it didn't work - I was doing some investigation on
the reason as to why we'd need this fix, because it wasn't explicit in
the commit message.

The rounding errors are countered by the +1, yes, but I'd rather remove
the errors altogether and go for the snippet I suggested in my previous
reply.
Vincent Guittot Dec. 19, 2018, 3:20 p.m. | #8
On Wed, 19 Dec 2018 at 16:11, Valentin Schneider
<valentin.schneider@arm.com> wrote:
>
> On 19/12/2018 15:05, Vincent Guittot wrote:
> [...]
> >> True, I had a look at the trace and there doesn't seem to be any running
> >> task on that CPU. That's a separate matter however - the rounding issues
> >> can happen regardless of the wl values.
> >
> > But it means that the +1 rounding fix works, and your problem comes
> > from something else
>
> Oh yes, I never said it didn't work - I was doing some investigation on
> the reason as to why we'd need this fix, because it wasn't explicit in
> the commit message.
>
> The rounding errors are countered by the +1, yes, but I'd rather remove
> the errors altogether and go for the snippet I suggested in my previous
> reply.


except that you don't always want to migrate the whole group load.
I prefer keeping the current algorithm and fixing it for now. Trying a
"new" thing can come in a 2nd step.
Valentin Schneider Dec. 19, 2018, 3:30 p.m. | #9
On 19/12/2018 15:20, Vincent Guittot wrote:
[...]
>> Oh yes, I never said it didn't work - I was doing some investigation on
>> the reason as to why we'd need this fix, because it wasn't explicit in
>> the commit message.
>>
>> The rounding errors are countered by the +1, yes, but I'd rather remove
>> the errors altogether and go for the snippet I suggested in my previous
>> reply.
>
> except that you don't always want to migrate the whole group load.
> I prefer keeping the current algorithm and fixing it for now. Trying a
> "new" thing can come in a 2nd step.


We already set the imbalance as the whole group load; we just introduce
rounding errors in between. As I've already said:

in update_sg_lb_stats() we do:

        sgs->avg_load = (sgs->group_load*SCHED_CAPACITY_SCALE) / sgs->group_capacity;

and in check_asym_packing() we do:

        env->imbalance = DIV_ROUND_CLOSEST(
                sds->busiest_stat.avg_load * sds->busiest_stat.group_capacity,
                SCHED_CAPACITY_SCALE)

So we end up with something like:

                    group_load * SCHED_CAPACITY_SCALE * group_capacity
        imbalance = --------------------------------------------------
                            group_capacity * SCHED_CAPACITY_SCALE

Which we could reduce down to:

        imbalance = group_load

and not get any rounding errors.
Vincent Guittot Dec. 19, 2018, 3:54 p.m. | #10
On Wed, 19 Dec 2018 at 16:30, Valentin Schneider
<valentin.schneider@arm.com> wrote:
>
> On 19/12/2018 15:20, Vincent Guittot wrote:
> [...]
> >> Oh yes, I never said it didn't work - I was doing some investigation on
> >> the reason as to why we'd need this fix, because it wasn't explicit in
> >> the commit message.
> >>
> >> The rounding errors are countered by the +1, yes, but I'd rather remove
> >> the errors altogether and go for the snippet I suggested in my previous
> >> reply.
> >
> > except that you don't always want to migrate the whole group load.
> > I prefer keeping the current algorithm and fixing it for now. Trying a
> > "new" thing can come in a 2nd step.
>
> We already set the imbalance as the whole group load; we just introduce
> rounding errors in between. As I've already said:
>
> in update_sg_lb_stats() we do:
>
>         sgs->avg_load = (sgs->group_load*SCHED_CAPACITY_SCALE) / sgs->group_capacity;
>
> and in check_asym_packing() we do:
>
>         env->imbalance = DIV_ROUND_CLOSEST(
>                 sds->busiest_stat.avg_load * sds->busiest_stat.group_capacity,
>                 SCHED_CAPACITY_SCALE)
>
> So we end up with something like:
>
>                     group_load * SCHED_CAPACITY_SCALE * group_capacity
>         imbalance = --------------------------------------------------
>                             group_capacity * SCHED_CAPACITY_SCALE
>
> Which we could reduce down to:
>
>         imbalance = group_load

ah yes, I thought nr_running was involved, but it's not the case.
This looks better indeed

>
> and not get any rounding errors.

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ca46964..c215f7a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8215,6 +8215,12 @@  static inline void update_sg_lb_stats(struct lb_env *env,
 	/* Adjust by relative CPU capacity of the group */
 	sgs->group_capacity = group->sgc->capacity;
 	sgs->avg_load = (sgs->group_load*SCHED_CAPACITY_SCALE) / sgs->group_capacity;
+	/*
+	 * Prevent the division rounding from making the imbalance slightly
+	 * lower than the original value, which would prevent the rq from
+	 * being selected as the busiest queue
+	 */
+	sgs->avg_load += 1;
 
 	if (sgs->sum_nr_running)
 		sgs->load_per_task = sgs->sum_weighted_load / sgs->sum_nr_running;