diff mbox

[v3,03/12] sched/fair: Change the variable to hold the number of periods to 32bit

Message ID 20160505111308.GA22310@e105550-lin.cambridge.arm.com
State New
Headers show

Commit Message

Morten Rasmussen May 5, 2016, 11:13 a.m. UTC
On Wed, May 04, 2016 at 04:02:44AM +0800, Yuyang Du wrote:
> In sched average update, a period is about 1ms, so a 32-bit unsigned

> integer can approximately hold a maximum of 49 (=2^32/1000/3600/24)

> days.

> 

> For usual cases, 32bit is big enough and 64bit is needless. But if

> a task sleeps longer than it, there can be two outcomes:

> 

> Consider a task sleeps whatever m milliseconds, then n = (u32)m.


Maybe it crystal clear that you mean: For any m > UINT_MAX?

> 

> 1. If n >= 32*64, then the task's sched avgs will be surely decayed

>    to 0. In this case, it really doesn't matter that the 32bit is not

>    big enough to hold m. In other words, a task sleeps 2 secs or sleeps

>    50 days are the same from sched average point of view.

> 

> 2. If m < 32*64, the chance to be here is very very low, but if so,


Should that be: n < 32*64 ?

Talking about 32*64, I don't get why we don't use LOAD_AVG_MAX_N. I had
a patch ready to post for that:

From 5055e5f82c8d207880035c2ec4ecf1ac1e7f9e91 Mon Sep 17 00:00:00 2001
From: Morten Rasmussen <morten.rasmussen@arm.com>

Date: Mon, 11 Apr 2016 15:41:37 +0100
Subject: [PATCH] sched/fair: Fix decay to zero period in decay_load()

In __compute_runnable_contrib() we are happy with returning LOAD_AVG_MAX
when the update period n >= LOAD_AVG_MAX_N (=345), so we should be happy
with returning zero for n >= LOAD_AVG_MAX_N when decaying in
decay_load() as well instead of only returning zero for n >
LOAD_AVG_PERIOD * 63 (=2016).

As both conditions are unlikely() the impact is quite likely very small,
but at least it makes the rounding errors for load accumulation and
decay symmetrical.

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>

---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

-- 
1.9.1


>    the task's sched avgs MAY NOT be decayed to 0, depending on how

>    big its sums are, and the chance to 0 is still good as load_sum

>    is way less than ~0ULL and util_sum way less than ~0U.


I don't get the last bit about load_sum < ~0ULL and util_sum < ~0U.
Whether you get to zero depends on the sums (as you say) and the actual
value of 'n'. It is true that you might get to zero even if n <
LOAD_AVG_MAX_N if the sums are small.

> Nevertheless, what really maters is what happens in the worst-case

> scenario, which is when (u32)m = 0? So in that case, it would be like

> after so long a sleep, we treat the task as it never slept, and has the

> same sched averages as before it slept. That is actually not bad or

> nothing to worry about, and think of it as the same as how we treat

> a new born task.


There is subtle but important difference between not decaying a task
correctly and adding new task: The sleeping task is already accounted
for in the cfs_rq.avg.{load,util}_sum. The sleeping task's contribution
to cfs_rq.avg has been decayed correctly in the mean time which means
that by not guaranteeing a decay of the se.avg at wake-up you introduce
a discrepancy between the task's owen view of its contribution (se.avg)
and the cfs_rq view (cfs_rq.avg). That might lead to trouble if the task
is migrated at wake-up as we remove se.avg amount of contribution from
the previous cpu's cfs_rq.avg. In other words, you remove too much from
the cfs_rq.avg.

The discrepancy will decay and disappear after LOAD_AVG_MAX_N ms, which
might be acceptable, but it is not a totally harmless change IMHO.

Comments

Vincent Guittot May 5, 2016, 11:23 a.m. UTC | #1
Hi Morten,

On 5 May 2016 at 13:13, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> On Wed, May 04, 2016 at 04:02:44AM +0800, Yuyang Du wrote:

>> In sched average update, a period is about 1ms, so a 32-bit unsigned

>> integer can approximately hold a maximum of 49 (=2^32/1000/3600/24)

>> days.

>>

>> For usual cases, 32bit is big enough and 64bit is needless. But if

>> a task sleeps longer than it, there can be two outcomes:

>>

>> Consider a task sleeps whatever m milliseconds, then n = (u32)m.

>

> Maybe it crystal clear that you mean: For any m > UINT_MAX?

>

>>

>> 1. If n >= 32*64, then the task's sched avgs will be surely decayed

>>    to 0. In this case, it really doesn't matter that the 32bit is not

>>    big enough to hold m. In other words, a task sleeps 2 secs or sleeps

>>    50 days are the same from sched average point of view.

>>

>> 2. If m < 32*64, the chance to be here is very very low, but if so,

>

> Should that be: n < 32*64 ?

>

> Talking about 32*64, I don't get why we don't use LOAD_AVG_MAX_N. I had

> a patch ready to post for that:

>

> From 5055e5f82c8d207880035c2ec4ecf1ac1e7f9e91 Mon Sep 17 00:00:00 2001

> From: Morten Rasmussen <morten.rasmussen@arm.com>

> Date: Mon, 11 Apr 2016 15:41:37 +0100

> Subject: [PATCH] sched/fair: Fix decay to zero period in decay_load()

>

> In __compute_runnable_contrib() we are happy with returning LOAD_AVG_MAX

> when the update period n >= LOAD_AVG_MAX_N (=345), so we should be happy

> with returning zero for n >= LOAD_AVG_MAX_N when decaying in

> decay_load() as well instead of only returning zero for n >

> LOAD_AVG_PERIOD * 63 (=2016).

>

> As both conditions are unlikely() the impact is quite likely very small,

> but at least it makes the rounding errors for load accumulation and

> decay symmetrical.

>

> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>

> ---

>  kernel/sched/fair.c | 2 +-

>  1 file changed, 1 insertion(+), 1 deletion(-)

>

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c

> index 56b7d4b..42515b6 100644

> --- a/kernel/sched/fair.c

> +++ b/kernel/sched/fair.c

> @@ -2527,7 +2527,7 @@ static __always_inline u64 decay_load(u64 val, u64 n)

>

>         if (!n)

>                 return val;

> -       else if (unlikely(n > LOAD_AVG_PERIOD * 63))

> +       else if (unlikely(n > LOAD_AVG_MAX_N))


I had the same question about this 63 and IIUC, the 63 comes from 64bits-1.
The load can be that high that the 64bits are used. We shift right
every LOAD_AVG_PERIOD so we will be sure that the value will be 0
after shifting all 64 bits

>                 return 0;

>

>         /* after bounds checking we can collapse to 32-bit */

> --

> 1.9.1

>

>

>>    the task's sched avgs MAY NOT be decayed to 0, depending on how

>>    big its sums are, and the chance to 0 is still good as load_sum

>>    is way less than ~0ULL and util_sum way less than ~0U.

>

> I don't get the last bit about load_sum < ~0ULL and util_sum < ~0U.

> Whether you get to zero depends on the sums (as you say) and the actual

> value of 'n'. It is true that you might get to zero even if n <

> LOAD_AVG_MAX_N if the sums are small.

>

>> Nevertheless, what really maters is what happens in the worst-case

>> scenario, which is when (u32)m = 0? So in that case, it would be like

>> after so long a sleep, we treat the task as it never slept, and has the

>> same sched averages as before it slept. That is actually not bad or

>> nothing to worry about, and think of it as the same as how we treat

>> a new born task.

>

> There is subtle but important difference between not decaying a task

> correctly and adding new task: The sleeping task is already accounted

> for in the cfs_rq.avg.{load,util}_sum. The sleeping task's contribution

> to cfs_rq.avg has been decayed correctly in the mean time which means

> that by not guaranteeing a decay of the se.avg at wake-up you introduce

> a discrepancy between the task's owen view of its contribution (se.avg)

> and the cfs_rq view (cfs_rq.avg). That might lead to trouble if the task

> is migrated at wake-up as we remove se.avg amount of contribution from

> the previous cpu's cfs_rq.avg. In other words, you remove too much from

> the cfs_rq.avg.

>

> The discrepancy will decay and disappear after LOAD_AVG_MAX_N ms, which

> might be acceptable, but it is not a totally harmless change IMHO.
Morten Rasmussen May 10, 2016, 9:10 a.m. UTC | #2
On Fri, May 06, 2016 at 02:19:32AM +0800, Yuyang Du wrote:
> Hi Morten,

> 

> On Thu, May 05, 2016 at 12:13:10PM +0100, Morten Rasmussen wrote:

> > On Wed, May 04, 2016 at 04:02:44AM +0800, Yuyang Du wrote:

> > > In sched average update, a period is about 1ms, so a 32-bit unsigned

> > > integer can approximately hold a maximum of 49 (=2^32/1000/3600/24)

> > > days.

> > > 

> > > For usual cases, 32bit is big enough and 64bit is needless. But if

> > > a task sleeps longer than it, there can be two outcomes:

> > > 

> > > Consider a task sleeps whatever m milliseconds, then n = (u32)m.

> > 

> > Maybe it crystal clear that you mean: For any m > UINT_MAX?

> 

> En, literally whatever.


Your second case below is not correct for m < UINT_MAX. You are only
talking about cases where n != m which is m > UINT_MAX.

> 

> > > 1. If n >= 32*64, then the task's sched avgs will be surely decayed

> > >    to 0. In this case, it really doesn't matter that the 32bit is not

> > >    big enough to hold m. In other words, a task sleeps 2 secs or sleeps

> > >    50 days are the same from sched average point of view.

> > > 

> > > 2. If m < 32*64, the chance to be here is very very low, but if so,

> > 

> > Should that be: n < 32*64 ?

> > 

> > Talking about 32*64, I don't get why we don't use LOAD_AVG_MAX_N. I had

> > a patch ready to post for that:

> >

> > >From 5055e5f82c8d207880035c2ec4ecf1ac1e7f9e91 Mon Sep 17 00:00:00 2001

> > From: Morten Rasmussen <morten.rasmussen@arm.com>

> > Date: Mon, 11 Apr 2016 15:41:37 +0100

> > Subject: [PATCH] sched/fair: Fix decay to zero period in decay_load()

> > 

> > In __compute_runnable_contrib() we are happy with returning LOAD_AVG_MAX

> > when the update period n >= LOAD_AVG_MAX_N (=345), so we should be happy

> > with returning zero for n >= LOAD_AVG_MAX_N when decaying in

> > decay_load() as well instead of only returning zero for n >

> > LOAD_AVG_PERIOD * 63 (=2016).

> 

> So basically, you want to add another rule in addition to the exponential

> decay rule.


No, not at all. I want to make the 'rules' symmetrical for accumulation
and decay exactly like the patch does.

>  

> > >    the task's sched avgs MAY NOT be decayed to 0, depending on how

> > >    big its sums are, and the chance to 0 is still good as load_sum

> > >    is way less than ~0ULL and util_sum way less than ~0U.

> > 

> > I don't get the last bit about load_sum < ~0ULL and util_sum < ~0U.

> > Whether you get to zero depends on the sums (as you say) and the actual

> > value of 'n'. It is true that you might get to zero even if n <

> > LOAD_AVG_MAX_N if the sums are small.

>  

> Frankly, util Ben brought it up, I didn't think a task sleeping so long

> is even possible. But I do admit it may happen.

> 

> However, I will say this. A task sleeping so long is already very rare,

> and among all those long sleep cases, the chance that after waking up the

> avgs will not be decayed to zero is much less than 0.5 in a million

> (=32*64/2^32=1/2^21), assuming the sleeping time is uniformly distributed.


You can't just ignore cases because they have a low probability. Going
by that logic we could remove a lot of synchronization overhead in the
kernel.

My concern is whether we introduce any assumptions that might hit us
later when everybody has forgotten about them. This one would be
extremely hard to debug later.

> > > Nevertheless, what really maters is what happens in the worst-case

> > > scenario, which is when (u32)m = 0? So in that case, it would be like

> > > after so long a sleep, we treat the task as it never slept, and has the

> > > same sched averages as before it slept. That is actually not bad or

> > > nothing to worry about, and think of it as the same as how we treat

> > > a new born task.

> > 

> > There is subtle but important difference between not decaying a task

> > correctly and adding new task: The sleeping task is already accounted

> > for in the cfs_rq.avg.{load,util}_sum. The sleeping task's contribution

> > to cfs_rq.avg has been decayed correctly in the mean time which means

> > that by not guaranteeing a decay of the se.avg at wake-up you introduce

> > a discrepancy between the task's owen view of its contribution (se.avg)

> > and the cfs_rq view (cfs_rq.avg). That might lead to trouble if the task

> > is migrated at wake-up as we remove se.avg amount of contribution from

> > the previous cpu's cfs_rq.avg. In other words, you remove too much from

> > the cfs_rq.avg.

> > 

> > The discrepancy will decay and disappear after LOAD_AVG_MAX_N ms, which

> > might be acceptable, but it is not a totally harmless change IMHO.

> 

> That is just an explanation, :) nevermind, or I really don't think that is

> a deal.


I don't really follow. I analysed the implications of the overflow that
you are introducing. In my opinion this is what you should have done
before proposing this patch. I think it is essential to understand what
assumptions we make and introducing new ones should be carefully
considered. I think it is a big 'deal' if you are not more careful when you
are submitting patches and just ignore feedback. We spent a lot of time
reviewing them.

Morten
diff mbox

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 56b7d4b..42515b6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2527,7 +2527,7 @@  static __always_inline u64 decay_load(u64 val, u64 n)
 
 	if (!n)
 		return val;
-	else if (unlikely(n > LOAD_AVG_PERIOD * 63))
+	else if (unlikely(n > LOAD_AVG_MAX_N))
 		return 0;
 
 	/* after bounds checking we can collapse to 32-bit */