
[RFC,0/4] sched: Improve cpu load accounting with nohz

Message ID 569810C4.7090900@arm.com
State New

Commit Message

Dietmar Eggemann Jan. 14, 2016, 9:19 p.m. UTC
Hi Frederic,

On 13/01/16 16:01, Frederic Weisbecker wrote:
> Hi,
>
> Most of the remaining nohz full work is about making the scheduler
> resilient to full dynticks (such that we can remove the 1Hz one day).
> Byungchul Park started interesting work toward that with cpu load
> updates in nohz full. So here is one more step forward.
>
> Patches 1 and 2 concern both dyntick-idle and dyntick-full. The rest
> is rather about full dynticks. Note that this isn't complete support for
> cpu load in nohz full, we still need to think about a way to make
> target_load() able to return up to date values on full dynticks targets.
>
> git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
>       sched/core
>
> HEAD: f1d7d0f5382be3819490a859313f692f142dfb74
>
> Thanks,
>       Frederic
> ---
>
> Frederic Weisbecker (4):
>       sched: Don't account tickless CPU load on tick
>       sched: Consolidate nohz CPU load update code
>       sched: Move cpu load stats functions above fair queue callbacks
>       sched: Upload nohz full CPU load on task enqueue/dequeue
>
>  kernel/sched/fair.c | 306 ++++++++++++++++++++++++++++++----------------------
>  1 file changed, 175 insertions(+), 131 deletions(-)

While testing these patches on a NO_HZ_FULL cpu, I noticed that the
rq->cpu_load[] values can be abnormally high. This happens w/ and w/o your
patches, but it happens way more often w/ the patches applied.

It might be related to commit 59543275488d "sched/fair: Prepare
__update_cpu_load() to handle active tickless"; at least, the following
patch cures it for me.

-- Dietmar

-- >8 --

Subject: [PATCH] sched/fair: Avoid unsigned underflow in __update_cpu_load()

tickless_load, which is rq->cpu_load[0] in the active case, can be
greater than rq->cpu_load[i] (0 < i < CPU_LOAD_IDX_MAX), which then
leads to an unsigned underflow when calculating old_load.

In the NO_HZ_FULL case (tick_nohz_full_update_tick() ->
tick_nohz_restart_sched_tick() -> update_cpu_load_nohz() ->
__update_cpu_load()) with pending_updates > 1, rq->cpu_load[i]
(0 < i < CPU_LOAD_IDX_MAX) can end up with abnormally high values.

Fix this by assigning the difference between rq->cpu_load[i] and
tickless_load to old_load only if the former is greater; otherwise
set old_load to 0.
Also bail out of decay_load_missed() if it is called with load = 0.

E.g. running a pinned 50% task (w/ heavy tracing) on a cpu in
NO_HZ_FULL mode showed these max values for rq->cpu_load[] w/o
this patch:

max(rq->cpu_load[0]): 738
max(rq->cpu_load[1]): 626
max(rq->cpu_load[2]): 10133099161584019
max(rq->cpu_load[3]): 42361983994954023
max(rq->cpu_load[4]): 80220368362537147

W/ this patch, the values are back to normal:

max(rq->cpu_load[0]): 769
max(rq->cpu_load[1]): 618
max(rq->cpu_load[2]): 607
max(rq->cpu_load[3]): 602
max(rq->cpu_load[4]): 599
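For illustration only, the wrap-around and the proposed guard can be reproduced in a minimal standalone userspace sketch (hypothetical helper names, plain C; `unsigned long` stands in for the type of the rq->cpu_load[] elements, this is not kernel code):

```c
#include <assert.h>
#include <limits.h>

/*
 * Unguarded variant, mirroring the current __update_cpu_load() line:
 * when tickless_load > cpu_load_i the unsigned subtraction wraps
 * around to a huge value instead of going negative.
 */
static unsigned long old_load_unguarded(unsigned long cpu_load_i,
					unsigned long tickless_load)
{
	return cpu_load_i - tickless_load;
}

/* Guarded variant, as proposed by this patch: clamp to 0 instead. */
static unsigned long old_load_guarded(unsigned long cpu_load_i,
				      unsigned long tickless_load)
{
	if (cpu_load_i > tickless_load)
		return cpu_load_i - tickless_load;
	return 0;
}
```

With e.g. cpu_load[i] = 0x62 and tickless_load = 0xc2, the unguarded form yields ULONG_MAX - 0x5f on a 64-bit machine, which matches the order of magnitude of the abnormal values above.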

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>

---
 kernel/sched/fair.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

--
1.9.1

Comments

Dietmar Eggemann Jan. 14, 2016, 10:23 p.m. UTC | #1
On 01/14/2016 09:27 PM, Peter Zijlstra wrote:
> On Thu, Jan 14, 2016 at 09:19:00PM +0000, Dietmar Eggemann wrote:
>> @@ -4346,7 +4346,10 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
>>
>>                  /* scale is effectively 1 << i now, and >> i divides by scale */
>>
>> -               old_load = this_rq->cpu_load[i] - tickless_load;
>> +               if (this_rq->cpu_load[i] > tickless_load)
>> +                       old_load = this_rq->cpu_load[i] - tickless_load;
>> +               else
>> +                       old_load = 0;
>
> Yeah, yuck. That'd go bad quick.


... because I set it to 0? But after the decay function we add
tickless_load to old_load. Maybe in the case tickless_load >
this_rq->cpu_load[i] we should decay this_rq->cpu_load[i] and not add
tickless_load afterwards.


Dietmar Eggemann Jan. 15, 2016, 4:56 p.m. UTC | #2
Hi Byungchul,

On 15/01/16 07:07, Byungchul Park wrote:
> On Thu, Jan 14, 2016 at 10:23:46PM +0000, Dietmar Eggemann wrote:
>> On 01/14/2016 09:27 PM, Peter Zijlstra wrote:
>>> On Thu, Jan 14, 2016 at 09:19:00PM +0000, Dietmar Eggemann wrote:
>>>> @@ -4346,7 +4346,10 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,

[...]

>
> I re-checked the equation I expanded and fortunately found it had no
> problem. I think there are several ways to do it correctly. That is,
>
> option 1. decay the absolute value with decay_load_missed() and adjust
> the sign.
>
> option 2. make decay_load_missed() able to handle negative values.
>
> option 3. refer to the patch below. I think this option is the best.
>
> -----8<-----
>
> From ba3d3355fcce51c901376d268206f58a7d0e4214 Mon Sep 17 00:00:00 2001
> From: Byungchul Park <byungchul.park@lge.com>
> Date: Fri, 15 Jan 2016 15:58:09 +0900
> Subject: [PATCH] sched/fair: prevent using decay_load_missed() with a negative
>  value
>
> decay_load_missed() cannot handle negative values, so we need to prevent
> calling the function with a negative value.
>
> Signed-off-by: Byungchul Park <byungchul.park@lge.com>
> ---
>  kernel/sched/fair.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8dde8b6..3f08d75 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4443,8 +4443,14 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
>  
>  		/* scale is effectively 1 << i now, and >> i divides by scale */
>  
> -		old_load = this_rq->cpu_load[i] - tickless_load;
> +		old_load = this_rq->cpu_load[i];
>  		old_load = decay_load_missed(old_load, pending_updates - 1, i);
> +		old_load -= decay_load_missed(tickless_load, pending_updates - 1, i);
> +		/*
> +		 * old_load can never be a negative value because a decayed
> +		 * tickless_load cannot be greater than the original
> +		 * tickless_load.
> +		 */
> +		old_load += tickless_load;
>  		old_load += tickless_load;
>  		new_load = this_load;
>  		/*
>


So in this case you want to decay old_load and add (tickless_load -
decay(tickless_load)) on top. IMHO, this makes sense.
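The option 3 arithmetic can be checked in a standalone sketch. Here toy_decay() is a hypothetical stand-in for decay_load_missed() (simple halving per missed update rather than the kernel's degrade-factor tables), but the non-negativity argument only relies on decay(x) <= x:

```c
#include <assert.h>

/*
 * Hypothetical stand-in for decay_load_missed(): halve once per missed
 * update. The real kernel uses precomputed degrade factors, but any
 * monotone decay with decay(x) <= x gives the same non-negativity.
 */
static unsigned long toy_decay(unsigned long load, unsigned long missed)
{
	return load >> missed;
}

/*
 * Option 3: decay cpu_load[i] and tickless_load separately, then add
 * the original tickless_load back. The intermediate subtraction may
 * wrap in unsigned arithmetic, but the final sum is always
 * >= toy_decay(cpu_load_i), because tickless_load - toy_decay(tickless_load)
 * is >= 0 and modular arithmetic brings the wrapped value back.
 */
static unsigned long old_load_option3(unsigned long cpu_load_i,
				      unsigned long tickless_load,
				      unsigned long missed)
{
	unsigned long old_load = toy_decay(cpu_load_i, missed);

	old_load -= toy_decay(tickless_load, missed);
	old_load += tickless_load;
	return old_load;
}
```

With the values from the trace below (cpu_load[1]=0x62, tickless_load=0xc2, one missed update) the intermediate value wraps to 0xffffffffffffffd0, but the final result comes back to 0x92.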


(w/ your patch and cpu w/ NO_HZ_FULL)

update_cpu_load_nohz: cpu=3 jiffies=4294935491
this_rq->last_load_update_tick=4294935489 pending_updates=2 active=1
load=0x114
__update_cpu_load: cpu=3 this_load=0x114 pending_updates=2
tickless_load=0xc2
__update_cpu_load: cpu=3 this_rq->cpu_load[0 .. 4] = [0xc2 0x62 0x32
0x1d 0x14]
__update_cpu_load: cpu=3 1. old_load=0x62
__update_cpu_load: cpu=3 2.1 old_load=0x31
__update_cpu_load: cpu=3 2.2 old_load=0xffffffffffffffd0   <-- after
decaying tickless_load
__update_cpu_load: cpu=3 3. old_load=0x92                  <-- after
adding tickless_load
__update_cpu_load: cpu=3 1. new_load=0x92
__update_cpu_load: cpu=3 2. new_load=0x92
__update_cpu_load: cpu=3 this_rq->cpu_load[1]=0xd3
...
update_cpu_load_active: cpu=3 this_rq->last_load_update_tick=4294935491
pending_updates=1 active=1 load=0x13e
__update_cpu_load: cpu=3 this_load=0x13e pending_updates=1
tickless_load=0x114
__update_cpu_load: cpu=3 this_rq->cpu_load[0 .. 4] = [0x114 0xd3 0x86
0x4f 0x2f]
...

Another point: 'active=1' (the function header '@active: !0 for NOHZ_FULL'
is a little bit misleading) is also true when __update_cpu_load() is
called from update_cpu_load_active(). In this case tickless_load
wouldn't have to be set at all, since pending_updates is 1;
decay_load_missed() can handle that by bailing out when missed_updates = 0.

Couldn't we set tickless_load only in this case:

unsigned long tickless_load = (active && pending_updates > 1) ?
this_rq->cpu_load[0] : 0;

update_cpu_load_nohz() can still call with pending_updates=1 and
active=1, but then we don't have to decay.
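The proposed condition can be sketched as a tiny hypothetical helper (illustration only, not kernel code):

```c
#include <assert.h>

/*
 * Hypothetical helper: only pick up cpu_load[0] as tickless_load when
 * it will actually be decayed, i.e. on the active path with more than
 * one pending update. Otherwise return 0, so the subtract/add pair in
 * __update_cpu_load() becomes a no-op.
 */
static unsigned long pick_tickless_load(int active,
					unsigned long pending_updates,
					unsigned long cpu_load0)
{
	if (active && pending_updates > 1)
		return cpu_load0;
	return 0;
}
```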
Dietmar Eggemann Jan. 20, 2016, 1:04 p.m. UTC | #3
On 20/01/16 00:48, Byungchul Park wrote:
> On Tue, Jan 19, 2016 at 02:04:57PM +0100, Peter Zijlstra wrote:
>> On Fri, Jan 15, 2016 at 04:56:36PM +0000, Dietmar Eggemann wrote:
>>> Couldn't we set tickless_load only in case:
>>>
>>> unsigned long tickless_load = (active && pending_updates > 1) ?
>>> this_rq->cpu_load[0] : 0;
>>>
>>> Even though update_cpu_load_nohz() can call with pending_updates=1 and
>>> active=1 but then we don't have to decay.
>>
>> decay_load_missed() has an early bail for !missed, which will be tickled
>> with pending_updates == 1.
>
> I think the way for decay_load_missed() to get an early bail for
> *!load*, which Dietmar's proposal did, is also good. And
> peterz's proposal avoiding an unnecessary "add" operation is also
> good. Whatever..
>
>> What I was thinking of doing however is:
>>
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -4445,13 +4445,15 @@ static void __update_cpu_load(struct rq
>>  
>>  		old_load = this_rq->cpu_load[i];
>>  		old_load = decay_load_missed(old_load, pending_updates - 1, i);
>> -		old_load -= decay_load_missed(tickless_load, pending_updates - 1, i);
>> -		/*
>> -		 * old_load can never be a negative value because a decayed
>> -		 * tickless_load cannot be greater than the original
>> -		 * tickless_load.
>> -		 */
>> -		old_load += tickless_load;
>> +		if (tickless_load) {
>
> And additionally, in this approach, why don't you do like:
>
>                 if (tickless_load || pending_updates - 1)
>
>> +			old_load -= decay_load_missed(tickless_load, pending_updates - 1, i);
>> +			/*
>> +			 * old_load can never be a negative value because a
>> +			 * decayed tickless_load cannot be greater than the
>> +			 * original tickless_load.
>> +			 */
>> +			old_load += tickless_load;
>> +		}
>>  		new_load = this_load;
>>  		/*
>>  		 * Round up the averaging division if load is increasing. This
>>
>> Since regardless of the pending_updates, none of that makes sense if
>> !tickless_load.
>
> None of that makes sense if !(pending_updates - 1), either. In that case,
> it becomes,
>
>                         old_load -= tickless_load;
>                         old_load += tickless_load;
>
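That cancellation for pending_updates == 1 can be checked in a standalone sketch; decay_sketch() below is a hypothetical stand-in for decay_load_missed(), with the same early bail for !missed (halving per missed update otherwise, instead of the kernel's degrade-factor tables):

```c
#include <assert.h>

/*
 * Hypothetical stand-in for decay_load_missed(): early bail for
 * !missed, otherwise halve once per missed update.
 */
static unsigned long decay_sketch(unsigned long load, unsigned long missed)
{
	if (!missed)
		return load;
	return load >> missed;
}

/*
 * With pending_updates == 1 (missed == 0) the decayed tickless_load is
 * tickless_load itself, so the subtract and the add cancel exactly and
 * old_load is just cpu_load_i. Only with missed > 0 does the pair do
 * real work.
 */
static unsigned long old_load_sketch(unsigned long cpu_load_i,
				     unsigned long tickless_load,
				     unsigned long pending_updates)
{
	unsigned long old_load = decay_sketch(cpu_load_i, pending_updates - 1);

	old_load -= decay_sketch(tickless_load, pending_updates - 1);
	old_load += tickless_load;
	return old_load;
}
```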


tickless_load is potentially set to != 0 (rq->cpu_load[0]) in case
__update_cpu_load() is called by:

 tick_nohz_full_update_tick()->tick_nohz_restart_sched_tick()->
 update_cpu_load_nohz() [CONFIG_NO_HZ_FULL=y]

 scheduler_tick()->update_cpu_load_active()

The former can call with pending_updates=1 and the latter always calls
w/ pending_updates=1, which makes it impossible to set tickless_load
correctly in __update_cpu_load(). I guess an enum indicating the caller
(update_cpu_load_active(), update_cpu_load_nohz() or
update_idle_cpu_load()) would help if we want to do this super correctly.

I don't have a strong preference whether we do 'if (tickless_load)' or
let decay_load_missed() bail when load = 0 (the latter is not really
necessary because decay_load_missed() can handle load=0 already).
But if we do 'if (tickless_load)', why don't we also do 'if (old_load)'?
(old_load can be 0 as well.)

Anyway, the patch fixes the unsigned underflow issue I saw for the
cpu_load[] values in NO_HZ_FULL mode.

Maybe you could fix the __update_cpu_load() header as well:

'@active: !0 for NOHZ_FULL' (not really true, it also holds when called
by update_cpu_load_active())
s/decay_load_misses()/decay_load_missed()/
s/@active paramter/@active parameter/

Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>

Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9deda2ac319f..4bf8f7c2c8b7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4276,7 +4276,7 @@  decay_load_missed(unsigned long load, unsigned long missed_updates, int idx)
 {
        int j = 0;

-       if (!missed_updates)
+       if (!load || !missed_updates)
                return load;

        if (missed_updates >= degrade_zero_ticks[idx])
@@ -4346,7 +4346,10 @@  static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,

                /* scale is effectively 1 << i now, and >> i divides by scale */

-               old_load = this_rq->cpu_load[i] - tickless_load;
+               if (this_rq->cpu_load[i] > tickless_load)
+                       old_load = this_rq->cpu_load[i] - tickless_load;
+               else
+                       old_load = 0;
                old_load = decay_load_missed(old_load, pending_updates - 1, i);
                old_load += tickless_load;
                new_load = this_load;