[v3,10/13] sched/fair: Compute task/cpu utilization at wake-up more correctly

Message ID 20160818102438.GA27873@e105550-lin.cambridge.arm.com
State New

Commit Message

Morten Rasmussen Aug. 18, 2016, 10:24 a.m. UTC
On Thu, Aug 18, 2016 at 09:40:55AM +0100, Morten Rasmussen wrote:
> On Mon, Aug 15, 2016 at 04:42:37PM +0100, Morten Rasmussen wrote:
> > On Mon, Aug 15, 2016 at 04:23:42PM +0200, Peter Zijlstra wrote:
> > > But unlike that function, it doesn't actually use __update_load_avg().
> > > Why not?
> > 
> > Fair question :)
> > 
> > We currently exploit the fact that the task utilization is _not_ updated
> > in wake-up balancing to make sure we don't under-estimate the capacity
> > requirements for tasks that have slept for a while. If we update it, we
> > lose the non-decayed 'peak' utilization, but I guess we could just
> > store it somewhere when we do the wake-up decay.
> > 
> > I thought there was a better reason when I wrote the patch, but I don't
> > recall right now. I will look into it again and see if we can use
> > __update_load_avg() to do a proper update instead of doing things twice.
> 
> AFAICT, we should be able to synchronize the task utilization to the
> previous rq utilization using __update_load_avg() as you suggest. The
> patch below should work as a replacement without any changes to
> subsequent patches. It doesn't solve the under-estimation issue, but I
> have another patch for that.

And here is a possible solution to the under-estimation issue. The patch
would have to go at the end of this set.

---8<---

From 5bc918995c6c589b833ba1f189a8b92fa22202ae Mon Sep 17 00:00:00 2001
From: Morten Rasmussen <morten.rasmussen@arm.com>
Date: Wed, 17 Aug 2016 15:30:43 +0100
Subject: [PATCH] sched/fair: Track peak per-entity utilization

When using PELT (per-entity load tracking) utilization to place tasks at
wake-up, the decayed utilization (decayed due to sleep) leads to
under-estimation of the task's true utilization. This could mean
putting the task on a cpu with less available capacity than is actually
needed. This issue can be mitigated by using 'peak' utilization instead
of the decayed utilization for placement decisions, e.g. at task
wake-up.

The 'peak' utilization metric, util_peak, tracks util_avg when the task
is running and retains its previous value while the task is
blocked/waiting on the rq. It is instantly updated to track util_avg
again as soon as the task is running again.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>

---
 include/linux/sched.h |  2 +-
 kernel/sched/fair.c   | 18 ++++++++++++++----
 2 files changed, 15 insertions(+), 5 deletions(-)

-- 
1.9.1
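
As an aside, the effect of the tracking rule above (util_peak follows
util_avg while the task is running and is otherwise only raised, never
decayed) can be seen with a small stand-alone C model. This is a toy
user-space sketch, not kernel code; the 32ms half-life and the 0..1024
scale are the usual PELT conventions, the 1ms stepping and the starting
value of 400 are made-up simplifications (build with gcc ... -lm):

/* Toy model of the util_peak rule above -- not kernel code. Assumes a
 * PELT-like geometric decay with a 32ms half-life, 1ms steps, and a task
 * that was dequeued with util_avg = 400 (0..1024 scale).
 */
#include <math.h>
#include <stdio.h>

int main(void)
{
        const double y = pow(0.5, 1.0 / 32.0); /* per-ms decay, y^32 = 0.5 */
        double util_avg = 400.0;
        double util_peak = util_avg;
        int running = 0;                       /* task is blocked */

        for (int ms = 1; ms <= 100; ms++) {
                util_avg *= y;                 /* sleeping: only decay */
                if (running || util_avg > util_peak)
                        util_peak = util_avg;  /* mirrors the patch's rule */
                if (ms % 25 == 0)
                        printf("blocked %3dms: util_avg=%6.1f util_peak=%6.1f\n",
                               ms, util_avg, util_peak);
        }
        return 0;
}

After 100ms of sleep util_avg has dropped to roughly 46 in this model,
while util_peak still reports 400, which is the value the placement code
gets to see once task_util_peak() is used instead of task_util().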

Comments

Morten Rasmussen Aug. 18, 2016, 1:45 p.m. UTC | #1
On Thu, Aug 18, 2016 at 07:46:44PM +0800, Wanpeng Li wrote:
> 2016-08-18 18:24 GMT+08:00 Morten Rasmussen <morten.rasmussen@arm.com>:
> > On Thu, Aug 18, 2016 at 09:40:55AM +0100, Morten Rasmussen wrote:
> >> On Mon, Aug 15, 2016 at 04:42:37PM +0100, Morten Rasmussen wrote:
> >> > On Mon, Aug 15, 2016 at 04:23:42PM +0200, Peter Zijlstra wrote:
> >> > > But unlike that function, it doesn't actually use __update_load_avg().
> >> > > Why not?
> >> >
> >> > Fair question :)
> >> >
> >> > We currently exploit the fact that the task utilization is _not_ updated
> >> > in wake-up balancing to make sure we don't under-estimate the capacity
> >> > requirements for tasks that have slept for a while. If we update it, we
> >> > lose the non-decayed 'peak' utilization, but I guess we could just
> >> > store it somewhere when we do the wake-up decay.
> >> >
> >> > I thought there was a better reason when I wrote the patch, but I don't
> >> > recall right now. I will look into it again and see if we can use
> >> > __update_load_avg() to do a proper update instead of doing things twice.
> >>
> >> AFAICT, we should be able to synchronize the task utilization to the
> >> previous rq utilization using __update_load_avg() as you suggest. The
> >> patch below should work as a replacement without any changes to
> >> subsequent patches. It doesn't solve the under-estimation issue, but I
> >> have another patch for that.
> >
> > And here is a possible solution to the under-estimation issue. The patch
> > would have to go at the end of this set.
> >
> > ---8<---
> >
> > From 5bc918995c6c589b833ba1f189a8b92fa22202ae Mon Sep 17 00:00:00 2001
> > From: Morten Rasmussen <morten.rasmussen@arm.com>
> > Date: Wed, 17 Aug 2016 15:30:43 +0100
> > Subject: [PATCH] sched/fair: Track peak per-entity utilization
> >
> > When using PELT (per-entity load tracking) utilization to place tasks at
> > wake-up, the decayed utilization (decayed due to sleep) leads to
> > under-estimation of the task's true utilization. This could mean
> > putting the task on a cpu with less available capacity than is actually
> > needed. This issue can be mitigated by using 'peak' utilization instead
> > of the decayed utilization for placement decisions, e.g. at task
> > wake-up.
> >
> > The 'peak' utilization metric, util_peak, tracks util_avg when the task
> > is running and retains its previous value while the task is
> > blocked/waiting on the rq. It is instantly updated to track util_avg
> > again as soon as the task is running again.
> 
> Maybe this will lead to disabling wake affine due to a spike in the peak
> value for a low average load task.

I assume you are referring to using task_util_peak() instead of
task_util() in wake_cap()?

The peak value should never exceed the util_avg accumulated by the task
last time it ran. So any spike has to be caused by the task accumulating
more utilization last time it ran. We don't know if it is a spike or a more
permanent change in behaviour, so we have to guess. So a spike on an
asymmetric system could cause us to disable wake affine in some
circumstances (either prev_cpu or waker cpu has to be low compute
capacity) for the following wake-up.

SMP should be unaffected as we should bail out on the previous
condition.

The counter-example is a task with a fairly long busy period and a much
longer period (cycle). Its util_avg might have decayed away since the
last activation so it appears very small at wake-up and we end up
putting it on a low capacity cpu every time even though it keeps the cpu
busy for a long time every time it wakes up.
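
To make the counter-example concrete, here is a toy user-space sketch of
the wake_cap()-style test from the patch, fed with a decayed utilization
and with the retained peak. The min_cap and utilization numbers are made
up, and capacity_margin = 1280 is an assumption about the margin used
elsewhere in this series; this is not kernel code:

/* Toy sketch, not kernel code: the wake_cap()-style test from the patch,
 * evaluated with a decayed utilization and with the retained peak.
 */
#include <stdio.h>

static int need_slow_path(long min_cap, long task_util)
{
        const long capacity_margin = 1280;      /* assumed ~25% headroom factor */

        /* mirrors: min_cap * 1024 < task_util(p) * capacity_margin */
        return min_cap * 1024 < task_util * capacity_margin;
}

int main(void)
{
        long min_cap = 430;     /* e.g. a little cpu on an asymmetric system */

        printf("decayed util  90: slow path? %d\n", need_slow_path(min_cap, 90));  /* 0 */
        printf("peak util    800: slow path? %d\n", need_slow_path(min_cap, 800)); /* 1 */
        return 0;
}

With the decayed value the check does not fire and the task keeps being
woken affine onto the low-capacity cpu; with the peak value the wake-up
falls back to the slow (find_idlest_group()) path instead.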

Did that answer your question?

Thanks,
Morten

> 
> Regards,
> Wanpeng Li
> 
> >
> > cc: Ingo Molnar <mingo@redhat.com>
> > cc: Peter Zijlstra <peterz@infradead.org>
> >
> > Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
> > ---
> >  include/linux/sched.h |  2 +-
> >  kernel/sched/fair.c   | 18 ++++++++++++++----
> >  2 files changed, 15 insertions(+), 5 deletions(-)
> >
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 4e0c47af9b05..40e427d1d378 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1281,7 +1281,7 @@ struct load_weight {
> >  struct sched_avg {
> >         u64 last_update_time, load_sum;
> >         u32 util_sum, period_contrib;
> > -       unsigned long load_avg, util_avg;
> > +       unsigned long load_avg, util_avg, util_peak;
> >  };
> >
> >  #ifdef CONFIG_SCHEDSTATS
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 11b250531ed4..8462a3d455ff 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -692,6 +692,7 @@ void init_entity_runnable_average(struct sched_entity *se)
> >          * At this point, util_avg won't be used in select_task_rq_fair anyway
> >          */
> >         sa->util_avg = 0;
> > +       sa->util_peak = 0;
> >         sa->util_sum = 0;
> >         /* when this task enqueue'ed, it will contribute to its cfs_rq's load_avg */
> >  }
> > @@ -744,6 +745,7 @@ void post_init_entity_util_avg(struct sched_entity *se)
> >                 } else {
> >                         sa->util_avg = cap;
> >                 }
> > +               sa->util_peak = sa->util_avg;
> >                 sa->util_sum = sa->util_avg * LOAD_AVG_MAX;
> >         }
> >
> > @@ -2806,6 +2808,9 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> >                 sa->util_avg = sa->util_sum / LOAD_AVG_MAX;
> >         }
> >
> > +       if (running || sa->util_avg > sa->util_peak)
> > +               sa->util_peak = sa->util_avg;
> > +
> >         return decayed;
> >  }
> >
> > @@ -5174,7 +5179,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
> >         return 1;
> >  }
> >
> > -static inline int task_util(struct task_struct *p);
> > +static inline int task_util_peak(struct task_struct *p);
> >  static int cpu_util_wake(int cpu, struct task_struct *p);
> >
> >  static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
> > @@ -5257,10 +5262,10 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
> >         } while (group = group->next, group != sd->groups);
> >
> >         /* Found a significant amount of spare capacity. */
> > -       if (this_spare > task_util(p) / 2 &&
> > +       if (this_spare > task_util_peak(p) / 2 &&
> >             imbalance*this_spare > 100*most_spare)
> >                 return NULL;
> > -       else if (most_spare > task_util(p) / 2)
> > +       else if (most_spare > task_util_peak(p) / 2)
> >                 return most_spare_sg;
> >
> >         if (!idlest || 100*this_load < imbalance*min_load)
> > @@ -5423,6 +5428,11 @@ static inline int task_util(struct task_struct *p)
> >         return p->se.avg.util_avg;
> >  }
> >
> > +static inline int task_util_peak(struct task_struct *p)
> > +{
> > +       return p->se.avg.util_peak;
> > +}
> > +
> >  /*
> >   * cpu_util_wake: Compute cpu utilization with any contributions from
> >   * the waking task p removed.
> > @@ -5455,7 +5465,7 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
> >         /* Bring task utilization in sync with prev_cpu */
> >         sync_entity_load_avg(&p->se);
> >
> > -       return min_cap * 1024 < task_util(p) * capacity_margin;
> > +       return min_cap * 1024 < task_util_peak(p) * capacity_margin;
> >  }
> >
> >  /*
> > --
> > 1.9.1
> >
> 
> 
> 
> -- 
> Regards,
> Wanpeng Li
Morten Rasmussen Aug. 19, 2016, 2:03 p.m. UTC | #2
On Fri, Aug 19, 2016 at 09:43:00AM +0800, Wanpeng Li wrote:
> 2016-08-18 21:45 GMT+08:00 Morten Rasmussen <morten.rasmussen@arm.com>:
> > I assume you are referring to using task_util_peak() instead of
> > task_util() in wake_cap()?
> 
> Yes.
> 
> >
> > The peak value should never exceed the util_avg accumulated by the task
> > last time it ran. So any spike has to be caused by the task accumulating
> > more utilization last time it ran. We don't know if it is a spike or a more
> 
> I see.
> 
> > permanent change in behaviour, so we have to guess. So a spike on an
> > asymmetric system could cause us to disable wake affine in some
> > circumstances (either prev_cpu or waker cpu has to be low compute
> > capacity) for the following wake-up.
> >
> > SMP should be unaffected as we should bail out on the previous
> > condition.
> 
> Why capacity_orig instead of capacity, since the check is done at each
> wakeup and the rt class/interrupts may already have taken up a lot of the
> cpu's capacity?

We could switch to capacity for this condition if we also change the
spare capacity evaluation in find_idlest_group() to do the same. It
would open up for SMP systems to take the find_idlest_group() route if
the SD_BALANCE_WAKE flag is set.

The reason why I have avoided capacity and used capacity_orig instead
is that in previous discussions about scheduling behaviour under
rt/dl/irq pressure it has never been clear to me whether we want to move
tasks away from cpus with capacity < capacity_orig or not. The choice
depends on the use-case.

In some cases taking rt/dl/irq pressure into account is more complicated
as we don't know the capacities available in a sched_group without
iterating over all the cpus. However, I don't think it would complicate
these patches. It is more a question of whether everyone is happy with
additional conditions in their wake-up path. I guess we could make it a
sched_feature if people are interested?

In short, I used capacity_orig to play it safe ;-)
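
For illustration, the difference being discussed in a toy user-space
sketch ('capacity' here stands for capacity_orig minus rt/dl/irq
pressure, in the spirit of the kernel's cpu_capacity vs
cpu_capacity_orig; all numbers and the capacity_margin = 1280 value are
assumptions for the example, and this is not kernel code):

/* Toy sketch, not kernel code: the same wake_cap()-style test evaluated
 * with capacity_orig and with a capacity reduced by rt/dl/irq pressure.
 */
#include <stdio.h>

int main(void)
{
        const long capacity_margin = 1280;
        long capacity_orig = 1024;              /* full capacity of the cpu */
        long rt_dl_irq_pressure = 300;          /* capacity eaten by rt/dl/irq */
        long capacity = capacity_orig - rt_dl_irq_pressure;
        long task_util = 600;

        printf("capacity_orig: slow path? %d\n",
               capacity_orig * 1024 < task_util * capacity_margin);    /* 0 */
        printf("capacity:      slow path? %d\n",
               capacity * 1024 < task_util * capacity_margin);         /* 1 */
        return 0;
}

With capacity_orig the fast path still looks fine, while the pressured
capacity would already push the wake-up to the slow path, which is the
behavioural difference being asked about.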

> > The counter-example is a task with a fairly long busy period and a much
> > longer period (cycle). Its util_avg might have decayed away since the
> > last activation so it appears very small at wake-up and we end up
> > putting it on a low capacity cpu every time even though it keeps the cpu
> > busy for a long time every time it wakes up.
> 
> Agreed, that's the reason for the under-estimation concern.
> 
> >
> > Did that answer your question?
> 
> Yeah, thanks for the clarification.

You are welcome.
Morten Rasmussen Aug. 22, 2016, 11:29 a.m. UTC | #3
On Mon, Aug 22, 2016 at 09:48:19AM +0800, Wanpeng Li wrote:
> 2016-08-19 22:03 GMT+08:00 Morten Rasmussen <morten.rasmussen@arm.com>:
> > On Fri, Aug 19, 2016 at 09:43:00AM +0800, Wanpeng Li wrote:
> >> 2016-08-18 21:45 GMT+08:00 Morten Rasmussen <morten.rasmussen@arm.com>:
> >> > I assume you are referring to using task_util_peak() instead of
> >> > task_util() in wake_cap()?
> >>
> >> Yes.
> >>
> >> >
> >> > The peak value should never exceed the util_avg accumulated by the task
> >> > last time it ran. So any spike has to be caused by the task accumulating
> >> > more utilization last time it ran. We don't know if it is a spike or a more
> >>
> >> I see.
> >>
> >> > permanent change in behaviour, so we have to guess. So a spike on an
> >> > asymmetric system could cause us to disable wake affine in some
> >> > circumstances (either prev_cpu or waker cpu has to be low compute
> >> > capacity) for the following wake-up.
> >> >
> >> > SMP should be unaffected as we should bail out on the previous
> >> > condition.
> >>
> >> Why capacity_orig instead of capacity, since the check is done at each
> >> wakeup and the rt class/interrupts may already have taken up a lot of the
> >> cpu's capacity?
> >
> > We could switch to capacity for this condition if we also change the
> > spare capacity evaluation in find_idlest_group() to do the same. It
> > would open up for SMP systems to take the find_idlest_group() route if
> > the SD_BALANCE_WAKE flag is set.
> >
> > The reason why I have avoided capacity and used capacity_orig instead
> > is that in previous discussions about scheduling behaviour under
> > rt/dl/irq pressure it has never been clear to me whether we want to move
> > tasks away from cpus with capacity < capacity_orig or not. The choice
> > depends on the use-case.
> >
> > In some cases taking rt/dl/irq pressure into account is more complicated
> > as we don't know the capacities available in a sched_group without
> > iterating over all the cpus. However, I don't think it would complicate
> > these patches. It is more a question of whether everyone is happy with
> > additional conditions in their wake-up path. I guess we could make it a
> > sched_feature if people are interested?
> >
> > In short, I used capacity_orig to play it safe ;-)
> 
> Actually you mixed capacity_orig and capacity when evaluating max spare cap.

Right, that is a mistake. Thanks for pointing that out :-)

Morten

Patch

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4e0c47af9b05..40e427d1d378 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1281,7 +1281,7 @@  struct load_weight {
 struct sched_avg {
 	u64 last_update_time, load_sum;
 	u32 util_sum, period_contrib;
-	unsigned long load_avg, util_avg;
+	unsigned long load_avg, util_avg, util_peak;
 };
 
 #ifdef CONFIG_SCHEDSTATS
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 11b250531ed4..8462a3d455ff 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -692,6 +692,7 @@  void init_entity_runnable_average(struct sched_entity *se)
 	 * At this point, util_avg won't be used in select_task_rq_fair anyway
 	 */
 	sa->util_avg = 0;
+	sa->util_peak = 0;
 	sa->util_sum = 0;
 	/* when this task enqueue'ed, it will contribute to its cfs_rq's load_avg */
 }
@@ -744,6 +745,7 @@  void post_init_entity_util_avg(struct sched_entity *se)
 		} else {
 			sa->util_avg = cap;
 		}
+		sa->util_peak = sa->util_avg;
 		sa->util_sum = sa->util_avg * LOAD_AVG_MAX;
 	}
 
@@ -2806,6 +2808,9 @@  __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 		sa->util_avg = sa->util_sum / LOAD_AVG_MAX;
 	}
 
+	if (running || sa->util_avg > sa->util_peak)
+		sa->util_peak = sa->util_avg;
+
 	return decayed;
 }
 
@@ -5174,7 +5179,7 @@  static int wake_affine(struct sched_domain *sd, struct task_struct *p,
 	return 1;
 }
 
-static inline int task_util(struct task_struct *p);
+static inline int task_util_peak(struct task_struct *p);
 static int cpu_util_wake(int cpu, struct task_struct *p);
 
 static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
@@ -5257,10 +5262,10 @@  find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 	} while (group = group->next, group != sd->groups);
 
 	/* Found a significant amount of spare capacity. */
-	if (this_spare > task_util(p) / 2 &&
+	if (this_spare > task_util_peak(p) / 2 &&
 	    imbalance*this_spare > 100*most_spare)
 		return NULL;
-	else if (most_spare > task_util(p) / 2)
+	else if (most_spare > task_util_peak(p) / 2)
 		return most_spare_sg;
 
 	if (!idlest || 100*this_load < imbalance*min_load)
@@ -5423,6 +5428,11 @@  static inline int task_util(struct task_struct *p)
 	return p->se.avg.util_avg;
 }
 
+static inline int task_util_peak(struct task_struct *p)
+{
+	return p->se.avg.util_peak;
+}
+
 /*
  * cpu_util_wake: Compute cpu utilization with any contributions from
  * the waking task p removed.
@@ -5455,7 +5465,7 @@  static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
 	/* Bring task utilization in sync with prev_cpu */
 	sync_entity_load_avg(&p->se);
 
-	return min_cap * 1024 < task_util(p) * capacity_margin;
+	return min_cap * 1024 < task_util_peak(p) * capacity_margin;
 }
 
 /*