
[3/7] sched/fair: Correct unit of load_above_capacity

Message ID 1461958364-675-4-git-send-email-dietmar.eggemann@arm.com
State New

Commit Message

Dietmar Eggemann April 29, 2016, 7:32 p.m. UTC
From: Morten Rasmussen <morten.rasmussen@arm.com>


In calculate_imbalance() load_above_capacity currently has the unit
[load] while it is used as being [load/capacity]. Not only is it wrong,
it also makes it unlikely that load_above_capacity is ever used, as the
subsequent code picks the smaller of load_above_capacity and the avg_load
difference between the local and the busiest sched_groups.

This patch ensures that load_above_capacity has the right unit
[load/capacity].

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>

---
 kernel/sched/fair.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

-- 
1.9.1
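The conversion the patch performs can be sketched outside the kernel. This is a hypothetical Python model of the integer arithmetic only, assuming the usual SCHED_CAPACITY_SCALE = NICE_0_LOAD = 1024; it is not the kernel code itself:

```python
SCHED_CAPACITY_SCALE = 1024  # unit in which group_capacity is expressed
NICE_0_LOAD = 1024           # load contribution of one nice-0 task

def load_above_capacity(sum_nr_running, group_capacity):
    """Model of the patched calculate_imbalance() snippet: start in
    [capacity] units, then convert the excess to [load/capacity]."""
    lac = sum_nr_running * SCHED_CAPACITY_SCALE   # nominal demand, [capacity]
    if lac <= group_capacity:
        return (1 << 64) - 1                      # ~0UL: group not overloaded
    lac -= group_capacity                         # excess, still [capacity]
    # the fix: scale by NICE_0_LOAD and normalize by the group's capacity
    return lac * NICE_0_LOAD // group_capacity

# 4 CPUs of default capacity with 5 runnable tasks:
print(load_above_capacity(5, 4 * SCHED_CAPACITY_SCALE))  # -> 256
```

Without the final two lines of the function, the result would simply be the excess in [capacity] units; the division by group_capacity is what the changelog means by giving it the unit [load/capacity].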

Comments

Vincent Guittot May 13, 2016, 8:22 a.m. UTC | #1
On 12 May 2016 at 23:48, Yuyang Du <yuyang.du@intel.com> wrote:
> On Thu, May 12, 2016 at 03:31:51AM -0700, tip-bot for Morten Rasmussen wrote:
>> Commit-ID:  cfa10334318d8212d007da8c771187643c9cef35
>> Gitweb:     http://git.kernel.org/tip/cfa10334318d8212d007da8c771187643c9cef35
>> Author:     Morten Rasmussen <morten.rasmussen@arm.com>
>> AuthorDate: Fri, 29 Apr 2016 20:32:40 +0100
>> Committer:  Ingo Molnar <mingo@kernel.org>
>> CommitDate: Thu, 12 May 2016 09:55:33 +0200
>>
>> sched/fair: Correct unit of load_above_capacity
>>
>> In calculate_imbalance() load_above_capacity currently has the unit
>> [capacity] while it is used as being [load/capacity]. Not only is it
>> wrong it also makes it unlikely that load_above_capacity is ever used
>> as the subsequent code picks the smaller of load_above_capacity and
>> the avg_load difference between the local and the busiest sched_groups.
>>
>> This patch ensures that load_above_capacity has the right unit
>> [load/capacity].

load_above_capacity has the unit [load] and is then compared with other load.

>>
>> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
>> [ Changed changelog to note it was in capacity unit; +rebase. ]
>> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
>> Cc: Linus Torvalds <torvalds@linux-foundation.org>
>> Cc: Mike Galbraith <efault@gmx.de>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: linux-kernel@vger.kernel.org
>> Link: http://lkml.kernel.org/r/1461958364-675-4-git-send-email-dietmar.eggemann@arm.com
>> Signed-off-by: Ingo Molnar <mingo@kernel.org>
>> ---
>>  kernel/sched/fair.c | 6 ++++--
>>  1 file changed, 4 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 2338105..218f8e8 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -7067,9 +7067,11 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>>       if (busiest->group_type == group_overloaded &&
>>           local->group_type   == group_overloaded) {
>>               load_above_capacity = busiest->sum_nr_running * SCHED_CAPACITY_SCALE;
>> -             if (load_above_capacity > busiest->group_capacity)
>> +             if (load_above_capacity > busiest->group_capacity) {
>>                       load_above_capacity -= busiest->group_capacity;
>> -             else
>> +                     load_above_capacity *= NICE_0_LOAD;
>> +                     load_above_capacity /= busiest->group_capacity;


If I follow the sequence correctly,
load_above_capacity = (busiest->sum_nr_running * SCHED_CAPACITY_SCALE
- busiest->group_capacity) * NICE_0_LOAD / busiest->group_capacity
so
load_above_capacity = busiest->sum_nr_running * SCHED_CAPACITY_SCALE /
busiest->group_capacity * NICE_0_LOAD - NICE_0_LOAD

so the unit is [load].

Let's take the example of a group made of 2 cores with 2 threads per
core, where the capacity of each core is 1.5 * SCHED_CAPACITY_SCALE, so the
capacity of the group is 3 * SCHED_CAPACITY_SCALE.
If we have 6 tasks in this group:
load_above_capacity = 1 * NICE_0_LOAD, which means we want to remove no
more than 1 task from the group and leave 5 tasks in it, whereas
we don't expect to have more than 4 tasks, as we have 4 CPUs, and
probably fewer because the compute capacity of each CPU is less than
the default one.

So I would have expected
load_above_capacity = busiest->sum_nr_running * NICE_0_LOAD -
busiest->group_capacity * NICE_0_LOAD / SCHED_CAPACITY_SCALE
load_above_capacity = 3 * NICE_0_LOAD, which means we want to remove no
more than 3 tasks and leave 3 tasks in the group.

Regards,
Vincent


>> +             } else
>>                       load_above_capacity = ~0UL;
>>       }
>
> Hi Morten,
>
> I got the feeling this might be wrong, the NICE_0_LOAD should be scaled down.
> But I hope I am wrong.
>
> Vincent, could you take a look?
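
Vincent's example can be replayed numerically. A Python sketch (values assumed: SCHED_CAPACITY_SCALE = NICE_0_LOAD = 1024) comparing the tip commit's formula with the alternative Vincent proposes above:

```python
SCHED_CAPACITY_SCALE = 1024
NICE_0_LOAD = 1024

def lac_patch(nr_running, group_capacity):
    """load_above_capacity as computed by the tip commit."""
    lac = nr_running * SCHED_CAPACITY_SCALE
    if lac <= group_capacity:
        return (1 << 64) - 1          # ~0UL
    lac -= group_capacity
    return lac * NICE_0_LOAD // group_capacity

def lac_vincent(nr_running, group_capacity):
    """Vincent's proposed alternative, kept in [load] throughout."""
    return nr_running * NICE_0_LOAD \
        - group_capacity * NICE_0_LOAD // SCHED_CAPACITY_SCALE

# Vincent's example: 2 cores * 2 threads, group capacity
# 3*SCHED_CAPACITY_SCALE, 6 runnable tasks
cap = 3 * SCHED_CAPACITY_SCALE
print(lac_patch(6, cap) // NICE_0_LOAD)    # -> 1
print(lac_vincent(6, cap) // NICE_0_LOAD)  # -> 3
```

Whether the value 1 here is really too small is exactly what the rest of the thread settles: the later multiplication by the group capacity brings the patched value back up to 3*NICE_0_LOAD.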
Morten Rasmussen May 19, 2016, 3:36 p.m. UTC | #2
On Fri, May 13, 2016 at 10:22:19AM +0200, Vincent Guittot wrote:
> On 12 May 2016 at 23:48, Yuyang Du <yuyang.du@intel.com> wrote:
> > On Thu, May 12, 2016 at 03:31:51AM -0700, tip-bot for Morten Rasmussen wrote:
> >> Commit-ID:  cfa10334318d8212d007da8c771187643c9cef35
> >> Gitweb:     http://git.kernel.org/tip/cfa10334318d8212d007da8c771187643c9cef35
> >> Author:     Morten Rasmussen <morten.rasmussen@arm.com>
> >> AuthorDate: Fri, 29 Apr 2016 20:32:40 +0100
> >> Committer:  Ingo Molnar <mingo@kernel.org>
> >> CommitDate: Thu, 12 May 2016 09:55:33 +0200
> >>
> >> sched/fair: Correct unit of load_above_capacity
> >>
> >> In calculate_imbalance() load_above_capacity currently has the unit
> >> [capacity] while it is used as being [load/capacity]. Not only is it
> >> wrong it also makes it unlikely that load_above_capacity is ever used
> >> as the subsequent code picks the smaller of load_above_capacity and
> >> the avg_load difference between the local and the busiest sched_groups.
> >>
> >> This patch ensures that load_above_capacity has the right unit
> >> [load/capacity].
>
> load_above_capacity has the unit [load] and is then compared with other load.

I'm not sure if we're talking about the same code; I'm referring to:

max_pull = min(busiest->avg_load - sds->avg_load, load_above_capacity);

a bit further down. Here the first option has unit [load/capacity], and
the subsequent code multiplies the result with [capacity] to get back to
[load]:

/* How much load to actually move to equalise the imbalance */
env->imbalance = min(
        max_pull * busiest->group_capacity,
        (sds->avg_load - local->avg_load) * local->group_capacity
) / SCHED_CAPACITY_SCALE;

That led me to the conclusion that load_above_capacity would have to be
'per capacity', i.e. [load/capacity], as well for the code to make
sense.
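
A sketch of how that min()/multiply sequence composes, with made-up avg_load and capacity numbers chosen purely for illustration (SCHED_CAPACITY_SCALE assumed to be 1024):

```python
SCHED_CAPACITY_SCALE = 1024

def imbalance(busiest_avg, local_avg, sds_avg, load_above_capacity,
              busiest_cap, local_cap):
    # max_pull = min(busiest->avg_load - sds->avg_load, load_above_capacity);
    max_pull = min(busiest_avg - sds_avg, load_above_capacity)
    # the chosen value is then multiplied by a capacity and divided by
    # SCHED_CAPACITY_SCALE, which is what the unit argument hinges on
    return min(max_pull * busiest_cap,
               (sds_avg - local_avg) * local_cap) // SCHED_CAPACITY_SCALE

# when load_above_capacity is the smaller term, it bounds the pulled amount:
print(imbalance(busiest_avg=2048, local_avg=256, sds_avg=1024,
                load_above_capacity=512,
                busiest_cap=1024, local_cap=1024))  # -> 512
```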

>
> >>
> >> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
> >> [ Changed changelog to note it was in capacity unit; +rebase. ]
> >> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> >> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> >> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> >> Cc: Mike Galbraith <efault@gmx.de>
> >> Cc: Peter Zijlstra <peterz@infradead.org>
> >> Cc: Thomas Gleixner <tglx@linutronix.de>
> >> Cc: linux-kernel@vger.kernel.org
> >> Link: http://lkml.kernel.org/r/1461958364-675-4-git-send-email-dietmar.eggemann@arm.com
> >> Signed-off-by: Ingo Molnar <mingo@kernel.org>
> >> ---
> >>  kernel/sched/fair.c | 6 ++++--
> >>  1 file changed, 4 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index 2338105..218f8e8 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -7067,9 +7067,11 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> >>       if (busiest->group_type == group_overloaded &&
> >>           local->group_type   == group_overloaded) {
> >>               load_above_capacity = busiest->sum_nr_running * SCHED_CAPACITY_SCALE;
> >> -             if (load_above_capacity > busiest->group_capacity)
> >> +             if (load_above_capacity > busiest->group_capacity) {
> >>                       load_above_capacity -= busiest->group_capacity;
> >> -             else
> >> +                     load_above_capacity *= NICE_0_LOAD;
> >> +                     load_above_capacity /= busiest->group_capacity;
>
> If I follow the sequence correctly,
> load_above_capacity = (busiest->sum_nr_running * SCHED_CAPACITY_SCALE
> - busiest->group_capacity) * NICE_0_LOAD / busiest->group_capacity
> so
> load_above_capacity = busiest->sum_nr_running * SCHED_CAPACITY_SCALE /
> busiest->group_capacity * NICE_0_LOAD - NICE_0_LOAD
>
> so the unit is [load]

I'm with you for the equation, but not for the unit, and I find it hard
to convince myself what the unit really is :-(

I think it is down to the fact that we are comparing [load] directly
with [capacity] here, basically saying [load] == [capacity]. Most other
places we scale [load] by [capacity]; we don't compare the two or
otherwise imply that they are the same unit. Do we have any other
examples of this?

The name load_above_capacity suggests that it should have unit [load],
but we actually compute it by subtracting two values, where the latter
clearly has unit [capacity]. PeterZ's recent patch (1be0eb2a97
sched/fair: Clean up scale confusion) changes the former to be
[capacity].

load_above_capacity is later multiplied by [capacity] to determine the
imbalance which must have unit [load]. So working our way backwards,
load_above_capacity must have unit [load/capacity]. However, if [load] =
[capacity] then what we really have is just group-normalized [load].

As said in my other reply, the code only really makes sense under
special circumstances where [load] == [capacity]. The easy, and possibly
best, way out of this mess would be to replace the code with something
like PeterZ's suggestion that I tried to implement in the patch included
in my other reply.

> Let's take the example of a group made of 2 cores with 2 threads per
> core, where the capacity of each core is 1.5 * SCHED_CAPACITY_SCALE, so the
> capacity of the group is 3 * SCHED_CAPACITY_SCALE.
> If we have 6 tasks in this group:
> load_above_capacity = 1 * NICE_0_LOAD, which means we want to remove no
> more than 1 task from the group and leave 5 tasks in it, whereas
> we don't expect to have more than 4 tasks, as we have 4 CPUs, and
> probably fewer because the compute capacity of each CPU is less than
> the default one.
>
> So I would have expected
> load_above_capacity = busiest->sum_nr_running * NICE_0_LOAD -
> busiest->group_capacity * NICE_0_LOAD / SCHED_CAPACITY_SCALE
> load_above_capacity = 3 * NICE_0_LOAD, which means we want to remove no
> more than 3 tasks and leave 3 tasks in the group.

And this is exactly what you get with this patch :-) load_above_capacity
(through max_pull) is multiplied by the group capacity to compute the
actual amount of [load] to remove:

env->imbalance 	= load_above_capacity * busiest->group_capacity /
		 	SCHED_CAPACITY_SCALE

		= 1*NICE_0_LOAD * 3*SCHED_CAPACITY_SCALE /
			SCHED_CAPACITY_SCALE

		= 3*NICE_0_LOAD

I don't think we disagree on how it should work :-) Without the capacity
scaling in this patch you get:

env->imbalance = (6*SCHED_CAPACITY_SCALE - 3*SCHED_CAPACITY_SCALE) *
			3*SCHED_CAPACITY_SCALE / SCHED_CAPACITY_SCALE

		= 9*SCHED_CAPACITY_SCALE

Coming back to Yuyang's question. I think it should be NICE_0_LOAD to
ensure that the resulting imbalance has the proper unit [load].

I'm happy to scrap this patch and replace the whole thing with something
else that makes more sense, but IMHO it is needed if the current mess
is to make any sense at all.
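
The two calculations above can be replayed directly (a Python model of the integer arithmetic, assuming SCHED_CAPACITY_SCALE = NICE_0_LOAD = 1024):

```python
SCHED_CAPACITY_SCALE = 1024
NICE_0_LOAD = 1024

cap = 3 * SCHED_CAPACITY_SCALE   # Vincent's group: 4 HW threads, capacity 3*SCALE
nr_running = 6

# with the patch: the excess is normalized by the group capacity first
lac = (nr_running * SCHED_CAPACITY_SCALE - cap) * NICE_0_LOAD // cap
imbalance_patched = lac * cap // SCHED_CAPACITY_SCALE
print(imbalance_patched // NICE_0_LOAD)        # -> 3, i.e. 3*NICE_0_LOAD

# without the scaling: load_above_capacity stays in [capacity]
lac_raw = nr_running * SCHED_CAPACITY_SCALE - cap
imbalance_raw = lac_raw * cap // SCHED_CAPACITY_SCALE
print(imbalance_raw // SCHED_CAPACITY_SCALE)   # -> 9, i.e. 9*SCHED_CAPACITY_SCALE
```

The patched path lands on the 3*NICE_0_LOAD Vincent expected; the unpatched path overshoots to 9 units.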
Vincent Guittot May 20, 2016, 8:17 a.m. UTC | #3
Hi Morten,

On 19 May 2016 at 17:36, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> On Fri, May 13, 2016 at 10:22:19AM +0200, Vincent Guittot wrote:
>> On 12 May 2016 at 23:48, Yuyang Du <yuyang.du@intel.com> wrote:
>> > On Thu, May 12, 2016 at 03:31:51AM -0700, tip-bot for Morten Rasmussen wrote:
>> >> Commit-ID:  cfa10334318d8212d007da8c771187643c9cef35
>> >> Gitweb:     http://git.kernel.org/tip/cfa10334318d8212d007da8c771187643c9cef35
>> >> Author:     Morten Rasmussen <morten.rasmussen@arm.com>
>> >> AuthorDate: Fri, 29 Apr 2016 20:32:40 +0100
>> >> Committer:  Ingo Molnar <mingo@kernel.org>
>> >> CommitDate: Thu, 12 May 2016 09:55:33 +0200
>> >>
>> >> sched/fair: Correct unit of load_above_capacity
>> >>
>> >> In calculate_imbalance() load_above_capacity currently has the unit
>> >> [capacity] while it is used as being [load/capacity]. Not only is it
>> >> wrong it also makes it unlikely that load_above_capacity is ever used
>> >> as the subsequent code picks the smaller of load_above_capacity and
>> >> the avg_load difference between the local and the busiest sched_groups.
>> >>
>> >> This patch ensures that load_above_capacity has the right unit
>> >> [load/capacity].
>>
>> load_above_capacity has the unit [load] and is then compared with other load.
>
> I'm not sure if we're talking about the same code; I'm referring to:
>
> max_pull = min(busiest->avg_load - sds->avg_load, load_above_capacity);
>
> a bit further down. Here the first option has unit [load/capacity], and
> the subsequent code multiplies the result with [capacity] to get back to
> [load]:


My understanding is that it multiplies and divides by capacity,
so we select the minimum between
 max_pull * busiest->group_capacity / SCHED_CAPACITY_SCALE
and
(sds->avg_load - local->avg_load) * local->group_capacity /
SCHED_CAPACITY_SCALE

so the unit of imbalance is the same as that of max_pull, because we
multiply and divide by [capacity].
The imbalance's unit is [load], so max_pull also has unit [load],
and max_pull has the same unit as load_above_capacity.

>
> /* How much load to actually move to equalise the imbalance */
> env->imbalance = min(
>         max_pull * busiest->group_capacity,
>         (sds->avg_load - local->avg_load) * local->group_capacity
> ) / SCHED_CAPACITY_SCALE;
>
> That led me to the conclusion that load_above_capacity would have to be
> 'per capacity', i.e. [load/capacity], as well for the code to make
> sense.
>
>>
>> >>
>> >> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
>> >> [ Changed changelog to note it was in capacity unit; +rebase. ]
>> >> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>> >> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
>> >> Cc: Linus Torvalds <torvalds@linux-foundation.org>
>> >> Cc: Mike Galbraith <efault@gmx.de>
>> >> Cc: Peter Zijlstra <peterz@infradead.org>
>> >> Cc: Thomas Gleixner <tglx@linutronix.de>
>> >> Cc: linux-kernel@vger.kernel.org
>> >> Link: http://lkml.kernel.org/r/1461958364-675-4-git-send-email-dietmar.eggemann@arm.com
>> >> Signed-off-by: Ingo Molnar <mingo@kernel.org>
>> >> ---
>> >>  kernel/sched/fair.c | 6 ++++--
>> >>  1 file changed, 4 insertions(+), 2 deletions(-)
>> >>
>> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> >> index 2338105..218f8e8 100644
>> >> --- a/kernel/sched/fair.c
>> >> +++ b/kernel/sched/fair.c
>> >> @@ -7067,9 +7067,11 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>> >>       if (busiest->group_type == group_overloaded &&
>> >>           local->group_type   == group_overloaded) {
>> >>               load_above_capacity = busiest->sum_nr_running * SCHED_CAPACITY_SCALE;
>> >> -             if (load_above_capacity > busiest->group_capacity)
>> >> +             if (load_above_capacity > busiest->group_capacity) {
>> >>                       load_above_capacity -= busiest->group_capacity;
>> >> -             else
>> >> +                     load_above_capacity *= NICE_0_LOAD;
>> >> +                     load_above_capacity /= busiest->group_capacity;
>>
>> If I follow the sequence correctly,
>> load_above_capacity = (busiest->sum_nr_running * SCHED_CAPACITY_SCALE
>> - busiest->group_capacity) * NICE_0_LOAD / busiest->group_capacity
>> so
>> load_above_capacity = busiest->sum_nr_running * SCHED_CAPACITY_SCALE /
>> busiest->group_capacity * NICE_0_LOAD - NICE_0_LOAD
>>
>> so the unit is [load]
>
> I'm with you for the equation, but not for the unit, and I find it hard
> to convince myself what the unit really is :-(


The sole NICE_0_LOAD in the equation above tends to confirm that it's
only [load].

>
> I think it is down to the fact that we are comparing [load] directly
> with [capacity] here, basically saying [load] == [capacity]. Most other
> places we scale [load] by [capacity]; we don't compare the two or
> otherwise imply that they are the same unit. Do we have any other
> examples of this?


IIUC the goal of this part of calculate_imbalance: we make the
assumption that one task with NICE_0_LOAD should fit in one core with
SCHED_CAPACITY_SCALE capacity, and we want to ensure that we will not
make a CPU idle.

>
> The name load_above_capacity suggests that it should have unit [load],
> but we actually compute it by subtracting two values, where the latter
> clearly has unit [capacity]. PeterZ's recent patch (1be0eb2a97
> sched/fair: Clean up scale confusion) changes the former to be
> [capacity].


I have made a comment on this.
The original equation was
load_above_capacity -= busiest->group_capacity
which comes from the optimization of
load_above_capacity -= busiest->group_capacity * SCHED_LOAD_SCALE /
SCHED_CAPACITY_SCALE

The assumption of the optimization was that SCHED_LOAD_SCALE ==
SCHED_CAPACITY_SCALE and SCHED_LOAD_SCALE == NICE_0_LOAD, but that's not
always true if we enable the extra precision for 64-bit systems.
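
Vincent's point checks out numerically: the shortcut only holds while the load and capacity scales coincide. A sketch, where 1024 << 10 stands in for the extra load resolution some 64-bit configs enable (values assumed for illustration):

```python
SCHED_CAPACITY_SCALE = 1024

def lac_shortcut(nr_running, group_capacity, load_scale):
    # optimized form: subtract the capacity directly
    return nr_running * load_scale - group_capacity

def lac_scaled(nr_running, group_capacity, load_scale):
    # unoptimized form: convert the capacity into load units first
    return nr_running * load_scale \
        - group_capacity * load_scale // SCHED_CAPACITY_SCALE

# equal while the load and capacity scales match ...
assert lac_shortcut(6, 3 * 1024, 1024) == lac_scaled(6, 3 * 1024, 1024)

# ... but not once the load scale is shifted up
hi_res = 1024 << 10
print(lac_shortcut(6, 3 * 1024, hi_res) == lac_scaled(6, 3 * 1024, hi_res))  # -> False
```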

>
> load_above_capacity is later multiplied by [capacity] to determine the
> imbalance which must have unit [load]. So working our way backwards,

It's multiplied by xxx->group_capacity and divided by SCHED_CAPACITY_SCALE.

> load_above_capacity must have unit [load/capacity]. However, if [load] =
> [capacity] then what we really have is just group-normalized [load].

>
> As said in my other reply, the code only really makes sense under
> special circumstances where [load] == [capacity]. The easy, and possibly
> best, way out of this mess would be to replace the code with something
> like PeterZ's suggestion that I tried to implement in the patch included
> in my other reply.


I'm fine with replacing this part with something simpler.

>
>> Let's take the example of a group made of 2 cores with 2 threads per
>> core, where the capacity of each core is 1.5 * SCHED_CAPACITY_SCALE, so the
>> capacity of the group is 3 * SCHED_CAPACITY_SCALE.
>> If we have 6 tasks in this group:
>> load_above_capacity = 1 * NICE_0_LOAD, which means we want to remove no
>> more than 1 task from the group and leave 5 tasks in it, whereas
>> we don't expect to have more than 4 tasks, as we have 4 CPUs, and
>> probably fewer because the compute capacity of each CPU is less than
>> the default one.
>>
>> So I would have expected
>> load_above_capacity = busiest->sum_nr_running * NICE_0_LOAD -
>> busiest->group_capacity * NICE_0_LOAD / SCHED_CAPACITY_SCALE
>> load_above_capacity = 3 * NICE_0_LOAD, which means we want to remove no
>> more than 3 tasks and leave 3 tasks in the group.
>
> And this is exactly what you get with this patch :-) load_above_capacity
> (through max_pull) is multiplied by the group capacity to compute the
> actual amount of [load] to remove:


I forgot the additional weighting of the load by the group's
capacity/SCHED_CAPACITY_SCALE.

>
> env->imbalance  = load_above_capacity * busiest->group_capacity /
>                         SCHED_CAPACITY_SCALE
>
>                 = 1*NICE_0_LOAD * 3*SCHED_CAPACITY_SCALE /
>                         SCHED_CAPACITY_SCALE
>
>                 = 3*NICE_0_LOAD
>
> I don't think we disagree on how it should work :-) Without the capacity
> scaling in this patch you get:
>
> env->imbalance = (6*SCHED_CAPACITY_SCALE - 3*SCHED_CAPACITY_SCALE) *
>                         3*SCHED_CAPACITY_SCALE / SCHED_CAPACITY_SCALE
>
>                 = 9*SCHED_CAPACITY_SCALE
>
> Coming back to Yuyang's question. I think it should be NICE_0_LOAD to
> ensure that the resulting imbalance has the proper unit [load].


Yes, I agree it should be NICE_0_LOAD.

>
> I'm happy to scrap this patch and replace the whole thing with something
> else that makes more sense, but IMHO it is needed if the current mess
> is to make any sense at all.

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bc19fee94f39..c066574cff04 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7016,9 +7016,11 @@  static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	    local->group_type   == group_overloaded) {
 		load_above_capacity = busiest->sum_nr_running *
 					SCHED_LOAD_SCALE;
-		if (load_above_capacity > busiest->group_capacity)
+		if (load_above_capacity > busiest->group_capacity) {
 			load_above_capacity -= busiest->group_capacity;
-		else
+			load_above_capacity *= SCHED_LOAD_SCALE;
+			load_above_capacity /= busiest->group_capacity;
+		} else
 			load_above_capacity = ~0UL;
 	}