
[v3,2/3] sched/fair: use util_est in LB and WU paths

Message ID 20180123180847.4477-3-patrick.bellasi@arm.com
State New
Series [v3,1/3] sched/fair: add util_est on top of PELT

Commit Message

Patrick Bellasi Jan. 23, 2018, 6:08 p.m. UTC
When the scheduler looks at a CPU's utilization, the current PELT value
of that CPU is returned straight away. In certain scenarios this can
have undesired side effects on task placement.

For example, since the task utilization is decayed at wakeup time, when
a long-sleeping big task is enqueued it does not immediately add a
significant contribution to the target CPU's utilization.
As a result, a race window opens in which other tasks can be placed on
the same CPU while it is still considered relatively empty.

To reduce this kind of race condition, this patch introduces the support
required to use the CPU's estimated utilization in cpu_util_wake() as
well as in update_sg_lb_stats().

The estimated utilization of a CPU is defined to be the maximum between
its PELT's utilization and the sum of the estimated utilization of the
tasks currently RUNNABLE on that CPU.
This allows us to properly represent the spare capacity of a CPU which,
for example, has just picked up a big task after a long sleep period.
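
In code terms this boils down to the following (a condensed view of the
cpu_util_est() helper added by this patch; the full version, including
the UTIL_EST sched_feat check, is in the diff below):

   static inline unsigned long cpu_util_est(int cpu)
   {
   	struct cfs_rq *cfs_rq = &cpu_rq(cpu)->cfs;
   	unsigned long util_est = max(cfs_rq->avg.util_avg,
   				     cfs_rq->util_est_runnable);

   	/* Never report more than the CPU's original capacity */
   	return min(util_est, capacity_orig_of(cpu));
   }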

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Paul Turner <pjt@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-pm@vger.kernel.org

---
Changes in v3:
 - rebased on today's tip/sched/core (commit 07881166a892)

Changes in v2:
 - rebase on top of v4.15-rc2
 - tested that the overhauled PELT code does not affect util_est
---
 kernel/sched/fair.c | 74 ++++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 68 insertions(+), 6 deletions(-)

-- 
2.15.1

Comments

Pavan Kondeti Jan. 24, 2018, 11:33 a.m. UTC | #1
Hi Patrick,

On Tue, Jan 23, 2018 at 06:08:46PM +0000, Patrick Bellasi wrote:
>  static unsigned long cpu_util_wake(int cpu, struct task_struct *p)
>  {
> -	unsigned long util, capacity;
> +	long util, util_est;
>  
>  	/* Task has no contribution or is new */
>  	if (cpu != task_cpu(p) || !p->se.avg.last_update_time)
> -		return cpu_util(cpu);
> +		return cpu_util_est(cpu);
>  
> -	capacity = capacity_orig_of(cpu);
> -	util = max_t(long, cpu_rq(cpu)->cfs.avg.util_avg - task_util(p), 0);
> +	/* Discount task's blocked util from CPU's util */
> +	util = cpu_util(cpu) - task_util(p);
> +	util = max(util, 0L);
>  
> -	return (util >= capacity) ? capacity : util;
> +	if (!sched_feat(UTIL_EST))
> +		return util;

At first, it is not clear to me why you are no longer clamping the
utilization to the CPU's original capacity. It looks like the clamping
is not needed any more after commit f453ae2200b0 ("sched/fair: Consider
RT/IRQ pressure in capacity_spare_wake()"). Maybe a separate patch to
remove the clamping part?

Thanks,
Pavan
-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.
Patrick Bellasi Jan. 31, 2018, 3:32 p.m. UTC | #2
On 25-Jan 20:03, Pavan Kondeti wrote:
> On Wed, Jan 24, 2018 at 07:31:38PM +0000, Patrick Bellasi wrote:
> > 
> > > > +       /*
> > > > +        * These are the main cases covered:
> > > > +        * - if *p is the only task sleeping on this CPU, then:
> > > > +        *      cpu_util (== task_util) > util_est (== 0)
> > > > +        *   and thus we return:
> > > > +        *      cpu_util_wake = (cpu_util - task_util) = 0
> > > > +        *
> > > > +        * - if other tasks are SLEEPING on the same CPU, which is just waking
> > > > +        *   up, then:
> > > > +        *      cpu_util >= task_util
> > > > +        *      cpu_util > util_est (== 0)
> > > > +        *   and thus we discount *p's blocked utilization to return:
> > > > +        *      cpu_util_wake = (cpu_util - task_util) >= 0
> > > > +        *
> > > > +        * - if other tasks are RUNNABLE on that CPU and
> > > > +        *      util_est > cpu_util
> > > > +        *   then we use util_est since it returns a more restrictive
> > > > +        *   estimation of the spare capacity on that CPU, by just considering
> > > > +        *   the expected utilization of tasks already runnable on that CPU.
> > > > +        */
> > > > +       util_est = cpu_rq(cpu)->cfs.util_est_runnable;
> > > > +       util = max(util, util_est);
> > > > +
> > > > +       return util;
> > 
> > I should instead clamp util before returning it! ;-)
> > 
> > > May be a separate patch to remove  the clamping part?
> > 
> > No, I think we should keep cpu_util_wake clamped to not affect the existing
> > call sites. I just need to remove it where not needed (done) and add it where
> > needed (will do on the next iteration).
> 
> cpu_util_wake() is called only from capacity_spare_wake(). There are no other
> callsites.

True...

> The capacity_spare_wake() is clamping the return value of
> cpu_util_wake() to CPU capacity.

... actually it's clamping negative numbers with:

   max_t(long, capacity_of(cpu) - cpu_util_wake(cpu, p), 0);

thus, by having cpu_util_wake() potentially return a value which is
bigger than capacity_of() or capacity_orig_of(), we should not have
issues from a capacity_spare_wake() usage standpoint.
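
For reference, the caller then looks roughly like this (a sketch based
on the expression above, not part of this patch):

   static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
   {
   	/* A negative difference is clamped to zero ... */
   	return max_t(long, capacity_of(cpu) - cpu_util_wake(cpu, p), 0);
   }

i.e. an over-large cpu_util_wake() value simply results in zero
reported spare capacity rather than anything harmful.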

> The clamping is not needed, I think.


However, we can still argue that cpu_util_wake() should never return
something bigger than the maximum possible capacity of a CPU; at least,
that has been its behaviour so far.

Thus, even just for the sake of consistency with the previous return
paths (e.g. when we bail out returning cpu_util), I would say it's worth
maintaining these semantics.

With the clamping, all these functions:
 - cpu_util
 - cpu_util_est
 - cpu_util_wake
will always return a signal which is never bigger than the maximum
possible CPU capacity.
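
A minimal sketch of what the clamped tail of cpu_util_wake() could look
like in the next iteration (hypothetical, just to illustrate the intent;
not what the current patch does):

   	util_est = cpu_rq(cpu)->cfs.util_est_runnable;
   	util = max(util, util_est);

   	/* Hypothetical clamp: never report more than the original capacity */
   	return min_t(long, util, capacity_orig_of(cpu));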

-- 
#include <best/regards.h>

Patrick Bellasi

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0bfe94f3176e..6f2e614fd79f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6332,6 +6332,41 @@  static unsigned long cpu_util(int cpu)
 	return (util >= capacity) ? capacity : util;
 }
 
+/**
+ * cpu_util_est: estimated utilization for the specified CPU
+ * @cpu: the CPU to get the estimated utilization for
+ *
+ * The estimated utilization of a CPU is defined to be the maximum between its
+ * PELT's utilization and the sum of the estimated utilization of the tasks
+ * currently RUNNABLE on that CPU.
+ *
+ * This allows to properly represent the expected utilization of a CPU which
+ * has just got a big task running since a long sleep period. At the same time
+ * however it preserves the benefits of the "blocked utilization" in
+ * describing the potential for other tasks waking up on the same CPU.
+ *
+ * Return: the estimated utilization for the specified CPU
+ */
+static inline unsigned long cpu_util_est(int cpu)
+{
+	unsigned long util, util_est;
+	unsigned long capacity;
+	struct cfs_rq *cfs_rq;
+
+	if (!sched_feat(UTIL_EST))
+		return cpu_util(cpu);
+
+	cfs_rq = &cpu_rq(cpu)->cfs;
+	util = cfs_rq->avg.util_avg;
+	util_est = cfs_rq->util_est_runnable;
+	util_est = max(util, util_est);
+
+	capacity = capacity_orig_of(cpu);
+	util_est = min(util_est, capacity);
+
+	return util_est;
+}
+
 static inline unsigned long task_util(struct task_struct *p)
 {
 	return p->se.avg.util_avg;
@@ -6348,16 +6383,43 @@  static inline unsigned long task_util_est(struct task_struct *p)
  */
 static unsigned long cpu_util_wake(int cpu, struct task_struct *p)
 {
-	unsigned long util, capacity;
+	long util, util_est;
 
 	/* Task has no contribution or is new */
 	if (cpu != task_cpu(p) || !p->se.avg.last_update_time)
-		return cpu_util(cpu);
+		return cpu_util_est(cpu);
 
-	capacity = capacity_orig_of(cpu);
-	util = max_t(long, cpu_rq(cpu)->cfs.avg.util_avg - task_util(p), 0);
+	/* Discount task's blocked util from CPU's util */
+	util = cpu_util(cpu) - task_util(p);
+	util = max(util, 0L);
 
-	return (util >= capacity) ? capacity : util;
+	if (!sched_feat(UTIL_EST))
+		return util;
+
+	/*
+	 * These are the main cases covered:
+	 * - if *p is the only task sleeping on this CPU, then:
+	 *      cpu_util (== task_util) > util_est (== 0)
+	 *   and thus we return:
+	 *      cpu_util_wake = (cpu_util - task_util) = 0
+	 *
+	 * - if other tasks are SLEEPING on the same CPU, which is just waking
+	 *   up, then:
+	 *      cpu_util >= task_util
+	 *      cpu_util > util_est (== 0)
+	 *   and thus we discount *p's blocked utilization to return:
+	 *      cpu_util_wake = (cpu_util - task_util) >= 0
+	 *
+	 * - if other tasks are RUNNABLE on that CPU and
+	 *      util_est > cpu_util
+	 *   then we use util_est since it returns a more restrictive
+	 *   estimation of the spare capacity on that CPU, by just considering
+	 *   the expected utilization of tasks already runnable on that CPU.
+	 */
+	util_est = cpu_rq(cpu)->cfs.util_est_runnable;
+	util = max(util, util_est);
+
+	return util;
 }
 
 /*
@@ -7882,7 +7944,7 @@  static inline void update_sg_lb_stats(struct lb_env *env,
 			load = source_load(i, load_idx);
 
 		sgs->group_load += load;
-		sgs->group_util += cpu_util(i);
+		sgs->group_util += cpu_util_est(i);
 		sgs->sum_nr_running += rq->cfs.h_nr_running;
 
 		nr_running = rq->nr_running;