From patchwork Fri Aug 25 10:20:06 2017
X-Patchwork-Submitter: Patrick Bellasi
X-Patchwork-Id: 110979
From: Patrick Bellasi
To: linux-kernel@vger.kernel.org, linux-pm@vger.kernel.org
Wysocki" , Paul Turner , Vincent Guittot , John Stultz , Morten Rasmussen , Dietmar Eggemann , Juri Lelli , Tim Murray , Todd Kjos , Andres Oportus , Joel Fernandes , Viresh Kumar Subject: [RFC 1/3] sched/fair: add util_est on top of PELT Date: Fri, 25 Aug 2017 11:20:06 +0100 Message-Id: <20170825102008.4626-2-patrick.bellasi@arm.com> X-Mailer: git-send-email 2.14.1 In-Reply-To: <20170825102008.4626-1-patrick.bellasi@arm.com> References: <20170825102008.4626-1-patrick.bellasi@arm.com> Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org The util_avg signal computed by PELT is too variable for some use-cases. For example, a big task waking up after a long sleep period will have its utilization almost completely decayed. This introduces some latency before schedutil will be able to pick the best frequency to run a task. The same issue can affect task placement. Indeed, since the task utilization is already decayed at wakeup, when the task is enqueued in a CPU, this can results in a CPU running a big task as being temporarily represented as being almost empty. This leads to a race condition where other tasks can be potentially allocated on a CPU which just started to run a big task which slept for a relatively long period. Moreover, the utilization of a task is, by PELT definition, a continuously changing metrics. This contributes in making almost instantly outdated some decisions based on the value of the PELT's utilization. For all these reasons, a more stable signal could probably do a better job of representing the expected/estimated utilization of a SE/RQ. Such a signal can be easily created on top of PELT by still using it as an estimator which produces values to be aggregated once meaningful events happens. This patch adds a simple implementation of util_est, a new signal built on top of PELT's util_avg where: util_est(se) = max(se::util_avg, f(se::util_avg@dequeue_times)) This allows to remember how big a task has been reported to be by PELT in its previous activations via the function: f(se::util_avg@dequeue_times). If a task should change it's behavior and it runs even longer in a new activation, after a certain time util_est will just track the original PELT signal (i.e. se::util_avg). For a (top-level) RQ, the estimated utilization is simply defined as: util_est(rq) = max(rq::util_est, rq::util_avg) where: rq::util_est = sum(se::util_est), for each RUNNABLE SE on that CPU It's worth to note that the estimated utilization is tracked only for entities of interests, specifically: - SE which are Tasks since we want to better support tasks placement decisions - top-level CPU's RQs since we want to better support frequencies selection for a CPU Signed-off-by: Patrick Bellasi Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Paul Turner Cc: Vincent Guittot Cc: Morten Rasmussen Cc: Rafael J. 
Signed-off-by: Patrick Bellasi
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Paul Turner
Cc: Vincent Guittot
Cc: Morten Rasmussen
Cc: Rafael J. Wysocki
Cc: linux-kernel@vger.kernel.org
Cc: linux-pm@vger.kernel.org
---
 include/linux/sched.h | 48 ++++++++++++++++++++++++++++++++++++
 kernel/sched/debug.c  |  8 ++++++
 kernel/sched/fair.c   | 68 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 123 insertions(+), 1 deletion(-)

--
2.14.1

diff --git a/include/linux/sched.h b/include/linux/sched.h
index c28b182c9833..8d7bc55f68d5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -26,6 +26,7 @@
 #include
 #include
 #include
+#include <linux/average.h>

 /* task_struct member predeclarations (sorted alphabetically): */
 struct audit_context;
@@ -277,6 +278,16 @@ struct load_weight {
 	u32 inv_weight;
 };

+/**
+ * Utilization's Exponential Weighted Moving Average (EWMA)
+ *
+ * Support functions to track an EWMA for the utilization of SEs and RQs. New
+ * samples will be added to the moving average each time a task completes an
+ * activation. Thus the weight is chosen so that the EWMA will be relatively
+ * insensitive to transient changes to the task's workload.
+ */
+DECLARE_EWMA(util, 0, 4);
+
 /*
  * The load_avg/util_avg accumulates an infinite geometric series
  * (see __update_load_avg() in kernel/sched/fair.c).
@@ -336,8 +347,45 @@ struct sched_avg {
 	u32 period_contrib;
 	unsigned long load_avg;
 	unsigned long util_avg;
+
+	/* Utilization estimation */
+	struct ewma_util util_ewma;
+	struct util_est {
+		unsigned long ewma;
+		unsigned long last;
+	} util_est;
 };

+/* Utilization estimation policies */
+#define UTIL_EST_MAX_EWMA_LAST	0 /* max(sa->util_est.ewma, sa->util_est.last) */
+#define UTIL_EST_EWMA		1 /* sa->util_est.ewma */
+#define UTIL_EST_LAST		2 /* sa->util_est.last */
+
+/* Default policy used by utilization estimation */
+#define UTIL_EST_POLICY	UTIL_EST_MAX_EWMA_LAST
+
+/**
+ * util_est: estimated utilization for a given entity (i.e. SE or RQ)
+ *
+ * Depending on the selected utilization estimation policy, the estimated
+ * utilization of a SE or RQ is returned by this function.
+ * Supported policies are:
+ * UTIL_EST_LAST: the value of the PELT signal the last time a SE has
+ *                completed an activation, i.e. it has been dequeued because
+ *                of a sleep
+ * UTIL_EST_EWMA: the exponential weighted moving average of all the past
+ *                UTIL_EST_LAST samples
+ * UTIL_EST_MAX_EWMA_LAST: the maximum among the previous two metrics
+ */
+static inline unsigned long util_est(struct sched_avg *sa, int policy)
+{
+	if (likely(policy == UTIL_EST_MAX_EWMA_LAST))
+		return max(sa->util_est.ewma, sa->util_est.last);
+	if (policy == UTIL_EST_EWMA)
+		return sa->util_est.ewma;
+	return sa->util_est.last;
+}
+
 struct sched_statistics {
 #ifdef CONFIG_SCHEDSTATS
 	u64 wait_start;
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index cfd84f79e075..17e293adb7f0 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -399,6 +399,8 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
 #ifdef CONFIG_SMP
 	P(se->avg.load_avg);
 	P(se->avg.util_avg);
+	P(se->avg.util_est.ewma);
+	P(se->avg.util_est.last);
 #endif

 #undef PN_SCHEDSTAT
@@ -521,6 +523,10 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 			cfs_rq->runnable_load_avg);
 	SEQ_printf(m, "  .%-30s: %lu\n", "util_avg",
 			cfs_rq->avg.util_avg);
+	SEQ_printf(m, "  .%-30s: %lu\n", "util_est.ewma",
+			cfs_rq->avg.util_est.ewma);
+	SEQ_printf(m, "  .%-30s: %lu\n", "util_est.last",
+			cfs_rq->avg.util_est.last);
 	SEQ_printf(m, "  .%-30s: %ld\n", "removed_load_avg",
 			atomic_long_read(&cfs_rq->removed_load_avg));
 	SEQ_printf(m, "  .%-30s: %ld\n", "removed_util_avg",
@@ -966,6 +972,8 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
 		P(se.avg.util_sum);
 		P(se.avg.load_avg);
 		P(se.avg.util_avg);
+		P(se.avg.util_est.ewma);
+		P(se.avg.util_est.last);
 		P(se.avg.last_update_time);
 #endif
 	P(policy);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8d5868771cb3..a4ec1b8c4755 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -751,6 +751,10 @@ void init_entity_runnable_average(struct sched_entity *se)
 	sa->util_avg = 0;
 	sa->util_sum = 0;
 	/* when this task enqueue'ed, it will contribute to its cfs_rq's load_avg */
+
+	ewma_util_init(&sa->util_ewma);
+	sa->util_est.ewma = 0;
+	sa->util_est.last = 0;
 }

 static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq);
@@ -4880,6 +4884,21 @@ static inline void hrtick_update(struct rq *rq)
 }
 #endif

+static inline int task_util(struct task_struct *p);
+static inline int task_util_est(struct task_struct *p);
+
+static inline void util_est_enqueue(struct task_struct *p)
+{
+	struct cfs_rq *cfs_rq = &task_rq(p)->cfs;
+
+	/*
+	 * Update (top level CFS) RQ estimated utilization.
+	 * NOTE: here we assume that we never change the
+	 * utilization estimation policy at run-time.
+	 */
+	cfs_rq->avg.util_est.last += task_util_est(p);
+}
+
 /*
  * The enqueue_task method is called before nr_running is
  * increased. Here we update the fair scheduling stats and
@@ -4932,9 +4951,49 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	if (!se)
 		add_nr_running(rq, 1);

+	util_est_enqueue(p);
 	hrtick_update(rq);
 }

+static inline void util_est_dequeue(struct task_struct *p, int flags)
+{
+	struct cfs_rq *cfs_rq = &task_rq(p)->cfs;
+	int task_sleep = flags & DEQUEUE_SLEEP;
+	long util_est;
+
+	/*
+	 * Update (top level CFS) RQ estimated utilization
+	 *
+	 * When *p is the last FAIR task then the RQ's estimated utilization
+	 * is 0 by its definition.
+	 *
+	 * Otherwise, in removing *p's util_est from the current RQ's util_est
+	 * we should account for cases where this last activation of *p was
In these cases as well we set to 0 + * the new estimated utilization for the CPU. + */ + util_est = (cfs_rq->nr_running > 1) + ? cfs_rq->avg.util_est.last - task_util_est(p) + : 0; + if (util_est < 0) + util_est = 0; + cfs_rq->avg.util_est.last = util_est; + + /* + * Update Task's estimated utilization + * + * When *p completes an activation we can consolidate another sample + * about the task size. This is done by storing the last PELT value + * for this task and using this value to load another sample in the + * EMWA for the task. + */ + if (task_sleep) { + p->se.avg.util_est.last = task_util(p); + ewma_util_add(&p->se.avg.util_ewma, task_util(p)); + p->se.avg.util_est.ewma = ewma_util_read(&p->se.avg.util_ewma); + } +} + static void set_next_buddy(struct sched_entity *se); /* @@ -4991,6 +5050,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags) if (!se) sub_nr_running(rq, 1); + util_est_dequeue(p, flags); hrtick_update(rq); } @@ -5486,7 +5546,6 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, return affine; } -static inline int task_util(struct task_struct *p); static int cpu_util_wake(int cpu, struct task_struct *p); static unsigned long capacity_spare_wake(int cpu, struct task_struct *p) @@ -5931,6 +5990,13 @@ static inline int task_util(struct task_struct *p) return p->se.avg.util_avg; } +static inline int task_util_est(struct task_struct *p) +{ + struct sched_avg *sa = &p->se.avg; + + return util_est(sa, UTIL_EST_POLICY); +} + /* * cpu_util_wake: Compute cpu utilization with any contributions from * the waking task p removed. From patchwork Fri Aug 25 10:20:07 2017 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Patrick Bellasi X-Patchwork-Id: 110981 Delivered-To: patch@linaro.org Received: by 10.140.95.78 with SMTP id h72csp777434qge; Fri, 25 Aug 2017 03:21:09 -0700 (PDT) X-Received: by 10.98.216.2 with SMTP id e2mr8649579pfg.296.1503656469583; Fri, 25 Aug 2017 03:21:09 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1503656469; cv=none; d=google.com; s=arc-20160816; b=ypinylq2fjvS6puPnxU8QtYgo3+F+1Q5X5+lFlCe5KxdYZ+ofz+P7Gi3O+xaJjZJIA WHPguWhpqxYi5zGDCJZmMfQc7yGR1pV8Lm1+QiGUnh5MlBV8MoPfgHx7smsUKypbbs6D UzTkLeTTZWqjEQ1HXu0bJHFFlRQnKXq3sktk+TV0Nmd4/UYb3acVoKpw5HWti50g1JbX 5vFXPI3IsLHuB/Q8KAVkH2FxYZ+QIB/Mlbzp8N727uR/pB780KmY8DJIXCgh3bL90NBG WmnlnPCEvtLfHNwo5HaB5fiDqAzGjJzl1OoU8B9MjImC7IMuMoFPPr4K+TYXvLOzYbA+ JAWg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:references:in-reply-to:message-id:date :subject:cc:to:from:arc-authentication-results; bh=S2oSX/KFg68aLUQB22v3gofmQ1GcZqM93AY0oWKo8Dg=; b=zyLCrzgCpFRabURKQUlXArcuMbOk11zJXtMFSFsuTQhaZnMqPZuC+KwPxmFRBJ7LjH 042+w9pM6z6hWTKGztM/GGBOxl4U1HNu/YlqtAvZ5xOBWz+7bpRI6wyA7Vr56AnoPthb I1qmov0I21TVMDdMSWkreVIK4yRkN2nVL9JdL4pc/iUxJYiOmWlks0+2eCiT6AKWbMZv Zvzx/DqeLixdP3ZHIPZBQTGGSMQ8s4TUamnNwKOsyVaicgtHdmAxMtmjfFKNCV4vVKSG ipyYaQVaaU6mCfDkoCVxTdt5rajtteFyFaw5rED/fRPEQR2fsHjt2MhPeRO/P+7cs1FB E92A== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. 
From patchwork Fri Aug 25 10:20:07 2017
X-Patchwork-Submitter: Patrick Bellasi
X-Patchwork-Id: 110981
From: Patrick Bellasi
To: linux-kernel@vger.kernel.org, linux-pm@vger.kernel.org
Cc: Ingo Molnar, Peter Zijlstra, Rafael J. Wysocki, Paul Turner,
    Vincent Guittot, John Stultz, Morten Rasmussen, Dietmar Eggemann,
    Juri Lelli, Tim Murray, Todd Kjos, Andres Oportus, Joel Fernandes,
    Viresh Kumar
Subject: [RFC 2/3] sched/fair: use util_est in LB
Date: Fri, 25 Aug 2017 11:20:07 +0100
Message-Id: <20170825102008.4626-3-patrick.bellasi@arm.com>
In-Reply-To: <20170825102008.4626-1-patrick.bellasi@arm.com>
References: <20170825102008.4626-1-patrick.bellasi@arm.com>

When the scheduler looks at the CPU utilization, the current PELT value
for that CPU is returned straight away. In certain scenarios this can
have undesired side effects on task placement.

For example, since the task utilization is decayed at wakeup time, a
long-sleeping big task does not immediately add a significant
contribution to the target CPU when it is enqueued. This opens a race
window in which other tasks can be placed on the same CPU, which is
still considered relatively empty.

In order to reduce this kind of race condition, this patch introduces the
required support to integrate the CPU's estimated utilization in some
load balancer paths. The estimated utilization of a CPU is defined to be
the maximum between its PELT utilization and the sum of the estimated
utilization of the tasks currently RUNNABLE on that CPU. This allows us
to properly represent the expected utilization of a CPU which, for
example, has just picked up a big task after a long sleep period.
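A hedged illustration of this definition (made-up numbers, user-space C,
names local to the example):

  #include <stdio.h>

  /* cpu_util_est = max(PELT utilization, sum of runnable tasks' estimates) */
  static long toy_cpu_util_est(long pelt_util, long runnable_util_est)
  {
          return pelt_util > runnable_util_est ? pelt_util : runnable_util_est;
  }

  int main(void)
  {
          /* A big task (estimate ~600) has just been enqueued after a long
           * sleep, so its PELT contribution is still mostly decayed. */
          long pelt_util = 120, runnable_est = 600;

          printf("cpu_util     = %ld\n", pelt_util);  /* looks almost empty */
          printf("cpu_util_est = %ld\n",
                 toy_cpu_util_est(pelt_util, runnable_est));  /* looks busy */
          return 0;
  }

Note that the max() also preserves the "blocked load" information: when
the PELT value is higher than the sum of the runnable estimates, the PELT
value still wins.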
Signed-off-by: Patrick Bellasi
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Paul Turner
Cc: Vincent Guittot
Cc: Morten Rasmussen
Cc: Rafael J. Wysocki
Cc: linux-kernel@vger.kernel.org
Cc: linux-pm@vger.kernel.org
---
 kernel/sched/fair.c     | 38 ++++++++++++++++++++++++++++++++++++--
 kernel/sched/features.h |  4 ++++
 2 files changed, 40 insertions(+), 2 deletions(-)

--
2.14.1

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a4ec1b8c4755..2593da1d1465 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5985,6 +5985,32 @@ static int cpu_util(int cpu)
 	return (util >= capacity) ? capacity : util;
 }

+/**
+ * cpu_util_est: estimated utilization for the specified CPU
+ * @cpu: the CPU to get the estimated utilization for
+ *
+ * The estimated utilization of a CPU is defined to be the maximum between its
+ * PELT's utilization and the sum of the estimated utilization of the tasks
+ * currently RUNNABLE on that CPU.
+ *
+ * This allows us to properly represent the expected utilization of a CPU
+ * which has just got a big task running after a long sleep period. At the
+ * same time however it preserves the benefits of the "blocked load" in
+ * describing the potential for other tasks waking up on the same CPU.
+ *
+ * Return: the estimated utilization for the specified CPU
+ */
+static inline unsigned long cpu_util_est(int cpu)
+{
+	struct sched_avg *sa = &cpu_rq(cpu)->cfs.avg;
+	unsigned long util = cpu_util(cpu);
+
+	if (!sched_feat(UTIL_EST))
+		return util;
+
+	return max(util, util_est(sa, UTIL_EST_LAST));
+}
+
 static inline int task_util(struct task_struct *p)
 {
 	return p->se.avg.util_avg;
@@ -6007,11 +6033,19 @@ static int cpu_util_wake(int cpu, struct task_struct *p)

 	/* Task has no contribution or is new */
 	if (cpu != task_cpu(p) || !p->se.avg.last_update_time)
-		return cpu_util(cpu);
+		return cpu_util_est(cpu);

 	capacity = capacity_orig_of(cpu);
 	util = max_t(long, cpu_rq(cpu)->cfs.avg.util_avg - task_util(p), 0);

+	/*
+	 * Estimated utilization tracks only tasks already enqueued, but still
+	 * sometimes can return a bigger value than PELT, for example when the
+	 * blocked load is negligible wrt the estimated utilization of the
+	 * already enqueued tasks.
+	 */
+	util = max_t(long, util, cpu_util_est(cpu));
+
 	return (util >= capacity) ? capacity : util;
 }

@@ -7539,7 +7573,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		load = source_load(i, load_idx);

 		sgs->group_load += load;
-		sgs->group_util += cpu_util_est(i);
+		sgs->group_util += cpu_util_est(i);
 		sgs->sum_nr_running += rq->cfs.h_nr_running;

 		nr_running = rq->nr_running;
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index d3fb15555291..dadae44be2ab 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -81,3 +81,7 @@ SCHED_FEAT(RT_RUNTIME_SHARE, true)
 SCHED_FEAT(LB_MIN, false)

 SCHED_FEAT(ATTACH_AGE_LOAD, true)
+/*
+ * UtilEstimation. Use estimated CPU utilization.
+ */
+SCHED_FEAT(UTIL_EST, false)
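Note that UTIL_EST is introduced here as default-disabled. On kernels
built with CONFIG_SCHED_DEBUG, sched_features can be toggled at run time
through debugfs, so the feature could presumably be enabled for testing
with:

    echo UTIL_EST > /sys/kernel/debug/sched_features

and disabled again by writing NO_UTIL_EST; this is the standard
sched_features interface rather than anything added by this series.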
From patchwork Fri Aug 25 10:20:08 2017
X-Patchwork-Submitter: Patrick Bellasi
X-Patchwork-Id: 110980
From: Patrick Bellasi
To: linux-kernel@vger.kernel.org, linux-pm@vger.kernel.org
Wysocki" , Paul Turner , Vincent Guittot , John Stultz , Morten Rasmussen , Dietmar Eggemann , Juri Lelli , Tim Murray , Todd Kjos , Andres Oportus , Joel Fernandes , Viresh Kumar Subject: [RFC 3/3] sched/cpufreq_schedutil: use util_est for OPP selection Date: Fri, 25 Aug 2017 11:20:08 +0100 Message-Id: <20170825102008.4626-4-patrick.bellasi@arm.com> X-Mailer: git-send-email 2.14.1 In-Reply-To: <20170825102008.4626-1-patrick.bellasi@arm.com> References: <20170825102008.4626-1-patrick.bellasi@arm.com> Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org When the schedutil looks at the CPU utlization, the current PELT value for that CPU is returned straight away. In certain scenarios this can have undesired side effects and delays on frequency selection. For example, since the task utilization is decayed at wakeup time, when a long sleeping big task is enqueued it does not add immediately a significant contribution to the target CPU. This introduces some latency before schedutil will be able to detect the best frequency required by that task. Moreover, the PELT signal building up time is function of the current frequency, becasue of the scale invariant load tracking support. Thus, starting from a lower frequency, the utilization build-up time will increase even more and further delays the selection of the actual frequency which better serves the task requirements. In order to reduce these kind of latencies, this patch integrates the usage of the CPU's estimated utlization in the sugov_get_util function. The estimated utilization of a CPU is defined to be the maximum between its PELT's utilization and the sum of the estimated utilization of each task currently RUNNABLE on that CPU. This allows to properly represent the expected utilization of a CPU which, for example, it has just got a big task running after a long sleep period, and ultimately it allows to select the best frequency to run a task right after it wakes up. Signed-off-by: Patrick Bellasi Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Paul Turner Cc: Vincent Guittot Cc: Morten Rasmussen Cc: Rafael J. Wysocki Cc: linux-kernel@vger.kernel.org Cc: linux-pm@vger.kernel.org --- kernel/sched/cpufreq_schedutil.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) -- 2.14.1 diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c index 910a915fef8b..aacba9f7202e 100644 --- a/kernel/sched/cpufreq_schedutil.c +++ b/kernel/sched/cpufreq_schedutil.c @@ -161,7 +161,12 @@ static void sugov_get_util(unsigned long *util, unsigned long *max) cfs_max = arch_scale_cpu_capacity(NULL, smp_processor_id()); - *util = min(rq->cfs.avg.util_avg, cfs_max); + *util = rq->cfs.avg.util_avg; + if (sched_feat(UTIL_EST)) + *util = max(*util, + util_est(&rq->cfs.avg, UTIL_EST_MAX_EWMA_LAST)); + *util = min(*util, cfs_max); + *max = cfs_max; }