From patchwork Tue May 24 09:55:54 2016
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Vincent Guittot
X-Patchwork-Id: 68471
From: Vincent Guittot
To: peterz@infradead.org, mingo@kernel.org, linux-kernel@vger.kernel.org,
	pjt@google.com
Cc: yuyang.du@intel.com, dietmar.eggemann@arm.com, bsegall@google.com,
	Morten.Rasmussen@arm.com, Vincent Guittot
Subject: [RFC PATCH v2] sched: reflect sched_entity movement into task_group's utilization
Date: Tue, 24 May 2016 11:55:54 +0200
Message-Id: <1464083754-4424-1-git-send-email-vincent.guittot@linaro.org>
X-Mailer: git-send-email 1.9.1
In-Reply-To: <1464080252-17209-1-git-send-email-vincent.guittot@linaro.org>
References: <1464080252-17209-1-git-send-email-vincent.guittot@linaro.org>
X-Mailing-List: linux-kernel@vger.kernel.org

Ensure that changes in the utilization of a sched_entity are reflected in
the task_group hierarchy. This patch takes a different approach from the
flat utilization hierarchy proposal to ensure that the changes are
propagated down to the root cfs_rq. The way the sched average metrics are
computed stays the same, so the utilization only needs to be synced with
the local cfs_rq timestamp.

Changes since v1:
- This patch needs the patch that fixes the issue with rq->leaf_cfs_rq_list,
  "sched: fix hierarchical order in rq->leaf_cfs_rq_list", in order to work
  correctly. I haven't sent them as a single patchset because that fix is
  independent of this one.
- Merge some functions that are always used together.
- During the update of blocked load, ensure that the sched_entity is synced
  with the cfs_rq to which the changes are applied.
- Fix an issue when a task changes its cpu affinity.

Signed-off-by: Vincent Guittot
---
Hi,

Same version but rebased on the latest sched/core.

Regards,
Vincent

 kernel/sched/fair.c  | 168 ++++++++++++++++++++++++++++++++++++++++++++-------
 kernel/sched/sched.h |   1 +
 2 files changed, 147 insertions(+), 22 deletions(-)

-- 
1.9.1

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6d3fbf2..19475e61 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2591,6 +2591,7 @@ static void update_cfs_shares(struct cfs_rq *cfs_rq)
 
 	reweight_entity(cfs_rq_of(se), se, shares);
 }
+
 #else /* CONFIG_FAIR_GROUP_SCHED */
 static inline void update_cfs_shares(struct cfs_rq *cfs_rq)
 {
@@ -2818,6 +2819,28 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 	return decayed;
 }
 
+#ifndef CONFIG_64BIT
+static inline u64 cfs_rq_last_update_time(struct cfs_rq *cfs_rq)
+{
+	u64 last_update_time_copy;
+	u64 last_update_time;
+
+	do {
+		last_update_time_copy = cfs_rq->load_last_update_time_copy;
+		smp_rmb();
+		last_update_time = cfs_rq->avg.last_update_time;
+	} while (last_update_time != last_update_time_copy);
+
+	return last_update_time;
+}
+#else
+static inline u64 cfs_rq_last_update_time(struct cfs_rq *cfs_rq)
+{
+	return cfs_rq->avg.last_update_time;
+}
+#endif
+
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 /*
  * Updating tg's load_avg is necessary before update_cfs_share (which is done)
@@ -2885,8 +2908,86 @@ void set_task_rq_fair(struct sched_entity *se,
 		se->avg.last_update_time = n_last_update_time;
 	}
 }
+
+/*
+ * Save how much utilization has just been added/removed on a cfs_rq so we can
+ * propagate it across the whole tg tree
+ */
+static void set_tg_cfs_rq_util(struct cfs_rq *cfs_rq, int delta)
+{
+	if (cfs_rq->tg == &root_task_group)
+		return;
+
+	cfs_rq->diff_util_avg += delta;
+}
+
+/* Take into account the change of the utilization of a child task group */
+static void update_tg_cfs_util(struct sched_entity *se, int blocked)
+{
+	int delta;
+	struct cfs_rq *cfs_rq;
+	long update_util_avg;
+	long last_update_time;
+	long old_util_avg;
+
+
+	/*
+	 * update_blocked_averages() will call this function for the root
+	 * cfs_rq, whose se is NULL. In this case just return.
+	 */
+	if (!se)
+		return;
+
+	if (entity_is_task(se))
+		return;
+
+	/* Get the cfs_rq owned by this sched_entity */
+	cfs_rq = group_cfs_rq(se);
+
+	update_util_avg = cfs_rq->diff_util_avg;
+
+	if (!update_util_avg)
+		return;
+
+	/* Clear pending changes */
+	cfs_rq->diff_util_avg = 0;
+
+	/* Add the changes to the sched_entity's utilization */
+	old_util_avg = se->avg.util_avg;
+	se->avg.util_avg = max_t(long, se->avg.util_avg + update_util_avg, 0);
+	se->avg.util_sum = se->avg.util_avg * LOAD_AVG_MAX;
+
+	/* Get parent cfs_rq */
+	cfs_rq = cfs_rq_of(se);
+
+	if (blocked) {
+		/*
+		 * blocked utilization has to be synchronized with its parent
+		 * cfs_rq's timestamp
+		 */
+		last_update_time = cfs_rq_last_update_time(cfs_rq);
+
+		__update_load_avg(last_update_time, cpu_of(rq_of(cfs_rq)),
+				  &se->avg,
+				  se->on_rq * scale_load_down(se->load.weight),
+				  cfs_rq->curr == se, NULL);
+	}
+
+	delta = se->avg.util_avg - old_util_avg;
+
+	cfs_rq->avg.util_avg = max_t(long, cfs_rq->avg.util_avg + delta, 0);
+	cfs_rq->avg.util_sum = cfs_rq->avg.util_avg * LOAD_AVG_MAX;
+
+	set_tg_cfs_rq_util(cfs_rq, delta);
+}
+
 #else /* CONFIG_FAIR_GROUP_SCHED */
 static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force) {}
+
+static inline void set_tg_cfs_rq_util(struct cfs_rq *cfs_rq, int delta) {}
+
+static inline void update_tg_cfs_util(struct sched_entity *se, int sync) {}
+
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
 static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq);
@@ -2926,6 +3027,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq)
 {
 	struct sched_avg *sa = &cfs_rq->avg;
 	int decayed, removed_load = 0, removed_util = 0;
+	int old_util_avg = sa->util_avg;
 
 	if (atomic_long_read(&cfs_rq->removed_load_avg)) {
 		s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
@@ -2948,6 +3050,8 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq)
 	smp_wmb();
 	cfs_rq->load_last_update_time_copy = sa->last_update_time;
 #endif
+	if (removed_util)
+		set_tg_cfs_rq_util(cfs_rq, sa->util_avg - old_util_avg);
 
 	if (update_freq && (decayed || removed_util))
 		cfs_rq_util_change(cfs_rq);
@@ -3001,6 +3105,8 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 	cfs_rq->avg.util_avg += se->avg.util_avg;
 	cfs_rq->avg.util_sum += se->avg.util_sum;
 
+	set_tg_cfs_rq_util(cfs_rq, se->avg.util_avg);
+
 	cfs_rq_util_change(cfs_rq);
 }
 
@@ -3009,12 +3115,13 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 	__update_load_avg(cfs_rq->avg.last_update_time, cpu_of(rq_of(cfs_rq)),
 			  &se->avg, se->on_rq * scale_load_down(se->load.weight),
 			  cfs_rq->curr == se, NULL);
-
 	cfs_rq->avg.load_avg = max_t(long, cfs_rq->avg.load_avg - se->avg.load_avg, 0);
 	cfs_rq->avg.load_sum = max_t(s64, cfs_rq->avg.load_sum - se->avg.load_sum, 0);
 	cfs_rq->avg.util_avg = max_t(long, cfs_rq->avg.util_avg - se->avg.util_avg, 0);
 	cfs_rq->avg.util_sum = max_t(s32, cfs_rq->avg.util_sum - se->avg.util_sum, 0);
 
+	set_tg_cfs_rq_util(cfs_rq, -se->avg.util_avg);
+
 	cfs_rq_util_change(cfs_rq);
 }
 
@@ -3057,27 +3164,6 @@ dequeue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 		max_t(s64, cfs_rq->runnable_load_sum - se->avg.load_sum, 0);
 }
 
-#ifndef CONFIG_64BIT
-static inline u64 cfs_rq_last_update_time(struct cfs_rq *cfs_rq)
-{
-	u64 last_update_time_copy;
-	u64 last_update_time;
-
-	do {
-		last_update_time_copy = cfs_rq->load_last_update_time_copy;
-		smp_rmb();
-		last_update_time = cfs_rq->avg.last_update_time;
-	} while (last_update_time != last_update_time_copy);
-
-	return last_update_time;
-}
-#else
-static inline u64 cfs_rq_last_update_time(struct cfs_rq *cfs_rq)
-{
-	return cfs_rq->avg.last_update_time;
-}
-#endif
-
 /*
  * Task first catches up with cfs_rq, and then subtract
  * itself from the cfs_rq (task must be off the queue now).
@@ -3328,6 +3414,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	enqueue_entity_load_avg(cfs_rq, se);
 	account_entity_enqueue(cfs_rq, se);
 	update_cfs_shares(cfs_rq);
+	update_tg_cfs_util(se, false);
 
 	if (flags & ENQUEUE_WAKEUP) {
 		place_entity(cfs_rq, se, 0);
@@ -3429,6 +3516,8 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 
 	update_min_vruntime(cfs_rq);
 	update_cfs_shares(cfs_rq);
+	update_tg_cfs_util(se, false);
+
 }
 
 /*
@@ -3606,6 +3695,7 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
 	 */
 	update_load_avg(curr, 1);
 	update_cfs_shares(cfs_rq);
+	update_tg_cfs_util(curr, false);
 
 #ifdef CONFIG_SCHED_HRTICK
 	/*
@@ -4479,6 +4569,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 
 		update_load_avg(se, 1);
 		update_cfs_shares(cfs_rq);
+		update_tg_cfs_util(se, false);
 	}
 
 	if (!se)
@@ -4539,6 +4630,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 
 		update_load_avg(se, 1);
 		update_cfs_shares(cfs_rq);
+		update_tg_cfs_util(se, false);
 	}
 
 	if (!se)
@@ -6328,6 +6420,8 @@ static void update_blocked_averages(int cpu)
 
 		if (update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq, true))
 			update_tg_load_avg(cfs_rq, 0);
+		/* Propagate pending util changes to the parent */
+		update_tg_cfs_util(cfs_rq->tg->se[cpu], true);
 	}
 	raw_spin_unlock_irqrestore(&rq->lock, flags);
 }
@@ -8405,6 +8499,20 @@ static void detach_task_cfs_rq(struct task_struct *p)
 
 	/* Catch up with the cfs_rq and remove our load when we leave */
 	detach_entity_load_avg(cfs_rq, se);
+
+	/*
+	 * Propagate the detach across the tg tree to make it visible to the
+	 * root
+	 */
+	for_each_sched_entity(se) {
+		cfs_rq = cfs_rq_of(se);
+
+		if (cfs_rq_throttled(cfs_rq))
+			break;
+
+		update_load_avg(se, 1);
+		update_tg_cfs_util(se, false);
+	}
 }
 
 static void attach_task_cfs_rq(struct task_struct *p)
@@ -8434,8 +8542,21 @@ static void switched_from_fair(struct rq *rq, struct task_struct *p)
 
 static void switched_to_fair(struct rq *rq, struct task_struct *p)
 {
+	struct sched_entity *se = &p->se;
+	struct cfs_rq *cfs_rq;
+
 	attach_task_cfs_rq(p);
 
+	for_each_sched_entity(se) {
+		cfs_rq = cfs_rq_of(se);
+
+		if (cfs_rq_throttled(cfs_rq))
+			break;
+
+		update_load_avg(se, 1);
+		update_tg_cfs_util(se, false);
+	}
+
 	if (task_on_rq_queued(p)) {
 		/*
 		 * We were most likely switched from sched_rt, so
@@ -8478,6 +8599,9 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
 	atomic_long_set(&cfs_rq->removed_load_avg, 0);
 	atomic_long_set(&cfs_rq->removed_util_avg, 0);
 #endif
+#ifdef CONFIG_FAIR_GROUP_SCHED
+	cfs_rq->diff_util_avg = 0;
+#endif
 }
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 30750dc..3235ae4 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -396,6 +396,7 @@ struct cfs_rq {
 	unsigned long runnable_load_avg;
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	unsigned long tg_load_avg_contrib;
+	long diff_util_avg;
 #endif
 	atomic_long_t removed_load_avg, removed_util_avg;
 #ifndef CONFIG_64BIT
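
To see the accumulate-and-propagate idea behind diff_util_avg in isolation,
here is a minimal standalone C sketch. It is a toy model, not kernel code:
the toy_cfs_rq/toy_se types and the toy_* helpers are invented for
illustration, and PELT decay, util_sum, locking and the root_task_group
special case are deliberately left out.

/*
 * Toy model of the diff_util_avg accumulate-and-propagate scheme.
 * Standalone userspace code, not part of the patch.
 */
#include <stdio.h>

struct toy_cfs_rq {
	long util_avg;
	long diff_util_avg;		/* pending change, not yet propagated */
};

struct toy_se {				/* group sched_entity */
	long util_avg;
	struct toy_cfs_rq *my_q;	/* cfs_rq owned by this entity */
	struct toy_cfs_rq *parent_q;	/* cfs_rq this entity is queued on */
};

/* Remember how much utilization was just added to/removed from a cfs_rq. */
static void toy_set_tg_cfs_rq_util(struct toy_cfs_rq *cfs_rq, long delta)
{
	cfs_rq->diff_util_avg += delta;
}

/* Fold the pending change of the owned cfs_rq into the parent level. */
static void toy_update_tg_cfs_util(struct toy_se *se)
{
	long pending = se->my_q->diff_util_avg;
	long old = se->util_avg;

	if (!pending)
		return;

	se->my_q->diff_util_avg = 0;
	se->util_avg = (old + pending > 0) ? old + pending : 0;

	/* Apply the delta to the parent and mark it pending there as well. */
	se->parent_q->util_avg += se->util_avg - old;
	toy_set_tg_cfs_rq_util(se->parent_q, se->util_avg - old);
}

int main(void)
{
	struct toy_cfs_rq child = { .util_avg = 300 };
	struct toy_cfs_rq root = { .util_avg = 300 };
	struct toy_se group_se = {
		.util_avg = 300, .my_q = &child, .parent_q = &root,
	};

	/* A task worth util_avg = 100 is attached to the child cfs_rq... */
	child.util_avg += 100;
	toy_set_tg_cfs_rq_util(&child, 100);

	/* ...and the next update of the group entity propagates it up. */
	toy_update_tg_cfs_util(&group_se);

	/* Prints: child=400 group_se=400 root=400 pending@root=100 */
	printf("child=%ld group_se=%ld root=%ld pending@root=%ld\n",
	       child.util_avg, group_se.util_avg, root.util_avg,
	       root.diff_util_avg);
	return 0;
}

In the patch itself this propagation runs one level at a time:
update_tg_cfs_util(se, false) from the enqueue/dequeue/tick paths, and
update_tg_cfs_util(cfs_rq->tg->se[cpu], true) from update_blocked_averages(),
where the blocked case first syncs the group entity with the parent cfs_rq's
timestamp through __update_load_avg() before applying the delta. The real
set_tg_cfs_rq_util() also skips the root_task_group, since there is no level
above it to propagate to.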