From patchwork Tue Feb 28 14:38:38 2017
X-Patchwork-Submitter: Patrick Bellasi
X-Patchwork-Id: 94621
From: Patrick Bellasi
To: linux-kernel@vger.kernel.org, linux-pm@vger.kernel.org
Cc: Ingo Molnar, Peter Zijlstra, Tejun Heo
Subject: [RFC v3 1/5] sched/core: add capacity constraints to CPU controller
Date: Tue, 28 Feb 2017 14:38:38 +0000
Message-Id: <1488292722-19410-2-git-send-email-patrick.bellasi@arm.com>
In-Reply-To: <1488292722-19410-1-git-send-email-patrick.bellasi@arm.com>
References: <1488292722-19410-1-git-send-email-patrick.bellasi@arm.com>

The CPU CGroup controller allows a specified (maximum) bandwidth to be
assigned to the tasks within a group; however, it does not enforce any
constraint on how such bandwidth can be consumed.
With the integration of schedutil, the scheduler now has the proper
information about a task to select the most suitable frequency to
satisfy a task's needs.

This patch extends the CPU controller by adding a couple of new
attributes, capacity_min and capacity_max, which can be used to enforce
bandwidth boosting and capping. More specifically:

- capacity_min: defines the minimum capacity which should be granted
                (by schedutil) when a task in this group is running,
                i.e. the task will run at least at that capacity

- capacity_max: defines the maximum capacity which can be granted
                (by schedutil) when a task in this group is running,
                i.e. the task can run up to that capacity

These attributes:
a) are tunable at all hierarchy levels, i.e. in the root group too
b) allow the creation of subgroups of tasks which do not violate the
   capacity constraints defined by the parent group.

Thus, tasks in a subgroup can only be further boosted and/or further
capped, which matches the "limits" schema of the "Resource Distribution
Model" (RDM) proposed by the cgroup v2 documentation
(Documentation/cgroup-v2.txt).
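For illustration only (the governor integration is not part of this
patch): a frequency-selection path such as schedutil could honour the
clamps by restricting the utilization it sees for a CPU. A minimal
sketch, where get_cpu_capacity_min(), get_cpu_capacity_max() and
map_util_to_freq() are hypothetical helpers, not functions from this
series:

/*
 * Sketch only: shows how a governor could apply the aggregated clamps.
 * All helpers marked "hypothetical" are placeholders for illustration.
 */
static unsigned long pick_next_freq(int cpu, unsigned long util)
{
	unsigned long cap_min = get_cpu_capacity_min(cpu);	/* hypothetical */
	unsigned long cap_max = get_cpu_capacity_max(cpu);	/* hypothetical */

	/* Boost at least up to capacity_min, never above capacity_max */
	util = clamp(util, cap_min, cap_max);

	return map_util_to_freq(cpu, util);			/* hypothetical */
}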
This patch provides the basic support to expose the two new attributes
and to validate their run-time updates based on the "limits" schema of
the RDM.

Signed-off-by: Patrick Bellasi
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Tejun Heo
Cc: linux-kernel@vger.kernel.org
---
 init/Kconfig         |  17 ++++++
 kernel/sched/core.c  | 145 +++++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h |   8 +++
 3 files changed, 170 insertions(+)

--
2.7.4

diff --git a/init/Kconfig b/init/Kconfig
index e1a93734..71e46ce 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1044,6 +1044,23 @@ menuconfig CGROUP_SCHED
 	  bandwidth allocation to such task groups. It uses cgroups to group
 	  tasks.
 
+config CAPACITY_CLAMPING
+	bool "Capacity clamping per group of tasks"
+	depends on CPU_FREQ_GOV_SCHEDUTIL
+	depends on CGROUP_SCHED
+	default n
+	help
+	  This feature allows the scheduler to enforce maximum and minimum
+	  capacity on each CPU based on RUNNABLE tasks currently scheduled
+	  on that CPU.
+	  Minimum capacity can be used for example to "boost" the performance
+	  of important tasks by running them on an OPP which can be higher than
+	  the minimum one eventually selected by the schedutil governor.
+	  Maximum capacity can be used for example to "restrict" the maximum
+	  OPP which can be requested by background tasks.
+
+	  If in doubt, say N.
+
 if CGROUP_SCHED
 config FAIR_GROUP_SCHED
 	bool "Group scheduling for SCHED_OTHER"

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 34e2291..a171d49 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6015,6 +6015,11 @@ void __init sched_init(void)
 	autogroup_init(&init_task);
 #endif /* CONFIG_CGROUP_SCHED */
 
+#ifdef CONFIG_CAPACITY_CLAMPING
+	root_task_group.cap_clamp[CAP_CLAMP_MIN] = 0;
+	root_task_group.cap_clamp[CAP_CLAMP_MAX] = SCHED_CAPACITY_SCALE;
+#endif /* CONFIG_CAPACITY_CLAMPING */
+
 	for_each_possible_cpu(i) {
 		struct rq *rq;
 
@@ -6310,6 +6315,11 @@ struct task_group *sched_create_group(struct task_group *parent)
 	if (!alloc_rt_sched_group(tg, parent))
 		goto err;
 
+#ifdef CONFIG_CAPACITY_CLAMPING
+	tg->cap_clamp[CAP_CLAMP_MIN] = parent->cap_clamp[CAP_CLAMP_MIN];
+	tg->cap_clamp[CAP_CLAMP_MAX] = parent->cap_clamp[CAP_CLAMP_MAX];
+#endif
+
 	return tg;
 
 err:
@@ -6899,6 +6909,129 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset)
 		sched_move_task(task);
 }
 
+#ifdef CONFIG_CAPACITY_CLAMPING
+
+static DEFINE_MUTEX(cap_clamp_mutex);
+
+static int cpu_capacity_min_write_u64(struct cgroup_subsys_state *css,
+				      struct cftype *cftype, u64 value)
+{
+	struct cgroup_subsys_state *pos;
+	unsigned int min_value;
+	struct task_group *tg;
+	int ret = -EINVAL;
+
+	min_value = min_t(unsigned int, value, SCHED_CAPACITY_SCALE);
+
+	mutex_lock(&cap_clamp_mutex);
+	rcu_read_lock();
+
+	tg = css_tg(css);
+
+	/* Already at the required value */
+	if (tg->cap_clamp[CAP_CLAMP_MIN] == min_value)
+		goto done;
+
+	/* Ensure to not exceed the maximum capacity */
+	if (tg->cap_clamp[CAP_CLAMP_MAX] < min_value)
+		goto out;
+
+	/* Ensure min cap fits within parent constraint */
+	if (tg->parent &&
+	    tg->parent->cap_clamp[CAP_CLAMP_MIN] > min_value)
+		goto out;
+
+	/* Each child must be a subset of us */
+	css_for_each_child(pos, css) {
+		if (css_tg(pos)->cap_clamp[CAP_CLAMP_MIN] < min_value)
+			goto out;
+	}
+
+	tg->cap_clamp[CAP_CLAMP_MIN] = min_value;
+
+done:
+	ret = 0;
+out:
+	rcu_read_unlock();
+	mutex_unlock(&cap_clamp_mutex);
+
+	return ret;
+}
+
+static int cpu_capacity_max_write_u64(struct cgroup_subsys_state *css,
+				      struct cftype *cftype, u64 value)
+{
+	struct cgroup_subsys_state *pos;
+	unsigned int max_value;
+	struct task_group *tg;
+	int ret = -EINVAL;
+
+	max_value = min_t(unsigned int, value, SCHED_CAPACITY_SCALE);
+
+	mutex_lock(&cap_clamp_mutex);
+	rcu_read_lock();
+
+	tg = css_tg(css);
+
+	/* Already at the required value */
+	if (tg->cap_clamp[CAP_CLAMP_MAX] == max_value)
+		goto done;
+
+	/* Ensure to not go below the minimum capacity */
+	if (tg->cap_clamp[CAP_CLAMP_MIN] > max_value)
+		goto out;
+
+	/* Ensure max cap fits within parent constraint */
+	if (tg->parent &&
+	    tg->parent->cap_clamp[CAP_CLAMP_MAX] < max_value)
+		goto out;
+
+	/* Each child must be a subset of us */
+	css_for_each_child(pos, css) {
+		if (css_tg(pos)->cap_clamp[CAP_CLAMP_MAX] > max_value)
+			goto out;
+	}
+
+	tg->cap_clamp[CAP_CLAMP_MAX] = max_value;
+
+done:
+	ret = 0;
+out:
+	rcu_read_unlock();
+	mutex_unlock(&cap_clamp_mutex);
+
+	return ret;
+}
+
+static u64 cpu_capacity_min_read_u64(struct cgroup_subsys_state *css,
+				     struct cftype *cft)
+{
+	struct task_group *tg;
+	u64 min_capacity;
+
+	rcu_read_lock();
+	tg = css_tg(css);
+	min_capacity = tg->cap_clamp[CAP_CLAMP_MIN];
+	rcu_read_unlock();
+
+	return min_capacity;
+}
+
+static u64 cpu_capacity_max_read_u64(struct cgroup_subsys_state *css,
+				     struct cftype *cft)
+{
+	struct task_group *tg;
+	u64 max_capacity;
+
+	rcu_read_lock();
+	tg = css_tg(css);
+	max_capacity = tg->cap_clamp[CAP_CLAMP_MAX];
+	rcu_read_unlock();
+
+	return max_capacity;
+}
+#endif /* CONFIG_CAPACITY_CLAMPING */
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 static int cpu_shares_write_u64(struct cgroup_subsys_state *css,
 				struct cftype *cftype, u64 shareval)
@@ -7193,6 +7326,18 @@ static struct cftype cpu_files[] = {
 		.write_u64 = cpu_shares_write_u64,
 	},
 #endif
+#ifdef CONFIG_CAPACITY_CLAMPING
+	{
+		.name = "capacity_min",
+		.read_u64 = cpu_capacity_min_read_u64,
+		.write_u64 = cpu_capacity_min_write_u64,
+	},
+	{
+		.name = "capacity_max",
+		.read_u64 = cpu_capacity_max_read_u64,
+		.write_u64 = cpu_capacity_max_write_u64,
+	},
+#endif
 #ifdef CONFIG_CFS_BANDWIDTH
 	{
 		.name = "cfs_quota_us",

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 71b10a9..05dae4a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -273,6 +273,14 @@ struct task_group {
 #endif
 #endif
 
+#ifdef CONFIG_CAPACITY_CLAMPING
+#define CAP_CLAMP_MIN 0
+#define CAP_CLAMP_MAX 1
+
+	/* Min and Max capacity constraints for tasks in this group */
+	unsigned int cap_clamp[2];
+#endif
+
 #ifdef CONFIG_RT_GROUP_SCHED
 	struct sched_rt_entity **rt_se;
 	struct rt_rq **rt_rq;
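As a usage sketch (not part of the patch), the new attributes can be
exercised from userspace once the cpu controller is mounted. The mount
point and the group names "fg" and "bg" below are assumptions; the file
names follow the cftype entries added above:

/* Sketch only: writes the assumed cgroup attribute files. */
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

static int write_attr(const char *path, const char *val)
{
	int fd = open(path, O_WRONLY);

	if (fd < 0)
		return -1;
	if (write(fd, val, strlen(val)) < 0) {
		close(fd);
		return -1;
	}
	return close(fd);
}

int main(void)
{
	/* Boost tasks in "fg" to at least half of SCHED_CAPACITY_SCALE (1024) */
	write_attr("/sys/fs/cgroup/cpu/fg/cpu.capacity_min", "512");

	/* Cap tasks in "bg" to at most a quarter of the capacity scale */
	write_attr("/sys/fs/cgroup/cpu/bg/cpu.capacity_max", "256");

	return 0;
}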
From patchwork Tue Feb 28 14:38:39 2017
X-Patchwork-Submitter: Patrick Bellasi
X-Patchwork-Id: 94624
From: Patrick Bellasi
To: linux-kernel@vger.kernel.org, linux-pm@vger.kernel.org
Cc: Ingo Molnar, Peter Zijlstra, Tejun Heo
Subject: [RFC v3 2/5] sched/core: track CPU's capacity_{min,max}
Date: Tue, 28 Feb 2017 14:38:39 +0000
Message-Id: <1488292722-19410-3-git-send-email-patrick.bellasi@arm.com>
In-Reply-To: <1488292722-19410-1-git-send-email-patrick.bellasi@arm.com>
References: <1488292722-19410-1-git-send-email-patrick.bellasi@arm.com>

When CAPACITY_CLAMPING is enabled, each task is subject to a capacity
constraint which is defined by the capacity_{min,max} attributes of the
task group it belongs to.

At run-time, the capacity constraints of the RUNNABLE tasks must be
aggregated to figure out the actual capacity constraints to enforce on
each CPU. This aggregation must meet two main goals:
 1) ensure the minimum capacity required by the most boosted
    RUNNABLE task on that CPU
 2) do not penalize the less capped RUNNABLE tasks on that CPU

Thus, the aggregation for both capacity constraints turns out to be a
MAX function over the min/max capacities of the RUNNABLE tasks:

   cpu_capacity_min := MAX(capacity_min_i), for each RUNNABLE task_i
   cpu_capacity_max := MAX(capacity_max_i), for each RUNNABLE task_i

The aggregation at CPU level is done by exploiting the task_struct.
Tasks are already enqueued, via fields embedded in their task_struct,
in many different lists and trees. This patch uses the same approach to
keep track of the capacity constraints enforced by every task on a CPU.

To this purpose:
 - each CPU's RQ has two RBTrees, which are used to track the minimum
   and maximum capacity constraints of all the tasks enqueued on that
   CPU
 - task_struct has two rb_nodes, which allow the task to be positioned
   in the minimum/maximum capacity tracking RBTree of the CPU on which
   the task is enqueued

This patch provides the RBTree support code; for the sake of clarity,
the synchronization between the fast path ({enqueue,dequeue}_task) and
the slow path (cpu_capacity_{min,max}_write_u64) is provided in a
dedicated patch.
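As a reference for the intended semantics, a naive O(n) form of this
aggregation is sketched below; for_each_runnable_task() is a
hypothetical iterator used only for illustration, while the patch
itself computes the same result incrementally with the RBTrees
described above:

/*
 * Sketch only: naive per-CPU aggregation equivalent to the RBTree-based
 * tracking added by this patch. for_each_runnable_task() is hypothetical.
 */
static void cpu_capacity_clamps(struct rq *rq, unsigned int *cap_min,
				unsigned int *cap_max)
{
	unsigned int max_seen = 0;
	bool has_tasks = false;
	struct task_struct *p;

	*cap_min = 0;

	for_each_runnable_task(rq, p) {
		struct task_group *tg = task_group(p);

		*cap_min = max(*cap_min, tg->cap_clamp[CAP_CLAMP_MIN]);
		max_seen = max(max_seen, tg->cap_clamp[CAP_CLAMP_MAX]);
		has_tasks = true;
	}

	/* With no RUNNABLE tasks, fall back to the unconstrained range */
	*cap_max = has_tasks ? max_seen : SCHED_CAPACITY_SCALE;
}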
Signed-off-by: Patrick Bellasi
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Tejun Heo
Cc: linux-kernel@vger.kernel.org
---
 include/linux/sched.h |   3 ++
 kernel/sched/core.c   | 129 ++++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h  |  23 +++++++++
 3 files changed, 155 insertions(+)

--
2.7.4

diff --git a/include/linux/sched.h b/include/linux/sched.h
index e2ed46d..5838570 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1531,6 +1531,9 @@ struct task_struct {
 	struct sched_rt_entity rt;
 #ifdef CONFIG_CGROUP_SCHED
 	struct task_group *sched_task_group;
+#ifdef CONFIG_CAPACITY_CLAMPING
+	struct rb_node cap_clamp_node[2];
+#endif
 #endif
 	struct sched_dl_entity dl;
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a171d49..8f509be 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -752,11 +752,128 @@ static void set_load_weight(struct task_struct *p)
 	load->inv_weight = sched_prio_to_wmult[prio];
 }
 
+#ifdef CONFIG_CAPACITY_CLAMPING
+
+static inline void
+cap_clamp_insert_capacity(struct rq *rq, struct task_struct *p,
+			  unsigned int cap_idx)
+{
+	struct cap_clamp_cpu *cgc = &rq->cap_clamp_cpu[cap_idx];
+	struct task_group *tg = task_group(p);
+	struct rb_node *parent = NULL;
+	struct task_struct *entry;
+	struct rb_node **link;
+	struct rb_root *root;
+	struct rb_node *node;
+	int update_cache = 1;
+	u64 capacity_new;
+	u64 capacity_cur;
+
+	node = &p->cap_clamp_node[cap_idx];
+	if (!RB_EMPTY_NODE(node)) {
+		WARN(1, "cap_clamp_insert_capacity() on non empty node\n");
+		return;
+	}
+
+	/*
+	 * The capacity_{min,max} the task is subject to is defined by the
+	 * current TG the task belongs to. The TG's capacity constraints are
+	 * thus used to place the task within the rbtree used to track
+	 * the capacity_{min,max} for the CPU.
+	 */
+	capacity_new = tg->cap_clamp[cap_idx];
+	root = &cgc->tree;
+	link = &root->rb_node;
+	while (*link) {
+		parent = *link;
+		entry = rb_entry(parent, struct task_struct,
+				 cap_clamp_node[cap_idx]);
+		capacity_cur = task_group(entry)->cap_clamp[cap_idx];
+		if (capacity_new <= capacity_cur) {
+			link = &parent->rb_left;
+			update_cache = 0;
+		} else {
+			link = &parent->rb_right;
+		}
+	}
+
+	/* Add task's capacity_{min,max} and rebalance the rbtree */
+	rb_link_node(node, parent, link);
+	rb_insert_color(node, root);
+
+	if (!update_cache)
+		return;
+
+	/* Update CPU's capacity cache pointer */
+	cgc->value = capacity_new;
+	cgc->node = node;
+}
+
+static inline void
+cap_clamp_remove_capacity(struct rq *rq, struct task_struct *p,
+			  unsigned int cap_idx)
+{
+	struct cap_clamp_cpu *cgc = &rq->cap_clamp_cpu[cap_idx];
+	struct rb_node *node = &p->cap_clamp_node[cap_idx];
+	struct rb_root *root = &cgc->tree;
+
+	if (RB_EMPTY_NODE(node)) {
+		WARN(1, "cap_clamp_remove_capacity on empty node\n");
+		return;
+	}
+
+	/* Update CPU's capacity_{min,max} cache pointer */
+	if (node == cgc->node) {
+		struct rb_node *prev_node = rb_prev(node);
+
+		/* Reset value in case this was the last task */
+		cgc->value = (cap_idx == CAP_CLAMP_MIN)
+			   ? 0 : SCHED_CAPACITY_SCALE;
+
+		/* Update node and value, if there is another task */
+		cgc->node = prev_node;
+		if (cgc->node) {
+			struct task_struct *entry;
+
+			entry = rb_entry(cgc->node, struct task_struct,
+					 cap_clamp_node[cap_idx]);
+			cgc->value = task_group(entry)->cap_clamp[cap_idx];
+		}
+	}
+
+	/* Remove task's capacity_{min,max} */
+	rb_erase(node, root);
+	RB_CLEAR_NODE(node);
+}
+
+static inline void
+cap_clamp_enqueue_task(struct rq *rq, struct task_struct *p, int flags)
+{
+	/* Track task's min/max capacities */
+	cap_clamp_insert_capacity(rq, p, CAP_CLAMP_MIN);
+	cap_clamp_insert_capacity(rq, p, CAP_CLAMP_MAX);
+}
+
+static inline void
+cap_clamp_dequeue_task(struct rq *rq, struct task_struct *p, int flags)
+{
+	/* Track task's min/max capacities */
+	cap_clamp_remove_capacity(rq, p, CAP_CLAMP_MIN);
+	cap_clamp_remove_capacity(rq, p, CAP_CLAMP_MAX);
+}
+#else
+static inline void
+cap_clamp_enqueue_task(struct rq *rq, struct task_struct *p, int flags) { }
+static inline void
+cap_clamp_dequeue_task(struct rq *rq, struct task_struct *p, int flags) { }
+#endif /* CONFIG_CAPACITY_CLAMPING */
+
 static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
 {
 	update_rq_clock(rq);
 	if (!(flags & ENQUEUE_RESTORE))
 		sched_info_queued(rq, p);
+	cap_clamp_enqueue_task(rq, p, flags);
 	p->sched_class->enqueue_task(rq, p, flags);
 }
 
@@ -765,6 +882,7 @@ static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
 	update_rq_clock(rq);
 	if (!(flags & DEQUEUE_SAVE))
 		sched_info_dequeued(rq, p);
+	cap_clamp_dequeue_task(rq, p, flags);
 	p->sched_class->dequeue_task(rq, p, flags);
 }
 
@@ -2412,6 +2530,10 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
 	plist_node_init(&p->pushable_tasks, MAX_PRIO);
 	RB_CLEAR_NODE(&p->pushable_dl_tasks);
 #endif
+#ifdef CONFIG_CAPACITY_CLAMPING
+	RB_CLEAR_NODE(&p->cap_clamp_node[CAP_CLAMP_MIN]);
+	RB_CLEAR_NODE(&p->cap_clamp_node[CAP_CLAMP_MAX]);
+#endif
 
 	put_cpu();
 	return 0;
@@ -6058,6 +6180,13 @@ void __init sched_init(void)
 		init_tg_cfs_entry(&root_task_group, &rq->cfs, NULL, i, NULL);
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
+#ifdef CONFIG_CAPACITY_CLAMPING
+		rq->cap_clamp_cpu[CAP_CLAMP_MIN].tree = RB_ROOT;
+		rq->cap_clamp_cpu[CAP_CLAMP_MIN].node = NULL;
+		rq->cap_clamp_cpu[CAP_CLAMP_MAX].tree = RB_ROOT;
+		rq->cap_clamp_cpu[CAP_CLAMP_MAX].node = NULL;
+#endif
+
 		rq->rt.rt_runtime = def_rt_bandwidth.rt_runtime;
 #ifdef CONFIG_RT_GROUP_SCHED
 		init_tg_rt_entry(&root_task_group, &rq->rt, NULL, i, NULL);

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 05dae4a..4a7d224 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -461,6 +461,24 @@ struct cfs_rq {
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 };
 
+/* Capacity clamping related fields in a runqueue */
+struct cap_clamp_cpu {
+	/*
+	 * RBTree to keep sorted capacity constraints
+	 * of currently RUNNABLE tasks on a CPU.
+	 */
+	struct rb_root tree;
+
+	/*
+	 * Pointer to the RUNNABLE task defining the current
+	 * capacity constraint for a CPU.
+	 */
+	struct rb_node *node;
+
+	/* Current CPU's capacity constraint */
+	unsigned int value;
+};
+
 static inline int rt_bandwidth_enabled(void)
 {
 	return sysctl_sched_rt_runtime >= 0;
@@ -648,6 +666,11 @@ struct rq {
 	struct list_head *tmp_alone_branch;
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
+#ifdef CONFIG_CAPACITY_CLAMPING
+	/* Min and Max capacity constraints */
+	struct cap_clamp_cpu cap_clamp_cpu[2];
+#endif /* CONFIG_CAPACITY_CLAMPING */
+
 	/*
 	 * This is part of a global counter where only the total sum
 	 * over all CPUs matters. A task can increase this counter on
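With the tracking above in place, the per-CPU clamps are available in
constant time from the cached value fields. A possible read-side sketch
(these accessors are not introduced by this patch and their names are
assumptions):

/* Sketch only: constant-time readers for the cached per-CPU clamps. */
static inline unsigned int cpu_capacity_min(int cpu)
{
	return cpu_rq(cpu)->cap_clamp_cpu[CAP_CLAMP_MIN].value;
}

static inline unsigned int cpu_capacity_max(int cpu)
{
	return cpu_rq(cpu)->cap_clamp_cpu[CAP_CLAMP_MAX].value;
}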
From patchwork Tue Feb 28 14:38:40 2017
X-Patchwork-Submitter: Patrick Bellasi
X-Patchwork-Id: 94623
From: Patrick Bellasi
To: linux-kernel@vger.kernel.org, linux-pm@vger.kernel.org
Cc: Ingo Molnar, Peter Zijlstra, Tejun Heo
Subject: [RFC v3 3/5] sched/core: sync capacity_{min,max} between slow and fast paths
Date: Tue, 28 Feb 2017 14:38:40 +0000
Message-Id: <1488292722-19410-4-git-send-email-patrick.bellasi@arm.com>
In-Reply-To: <1488292722-19410-1-git-send-email-patrick.bellasi@arm.com>
References: <1488292722-19410-1-git-send-email-patrick.bellasi@arm.com>

At enqueue/dequeue time a task needs to be placed in the CPU's rb_tree
according to the current capacity_{min,max} values of the cgroup it
belongs to. Thus, we need to guarantee that these values cannot be
changed while the task is inside these critical sections.

To this purpose, this patch uses the same locking schema already used
by __set_cpus_allowed_ptr(). We might uselessly lock the (previous) RQ
of a !RUNNABLE task, but that's the price to pay to safely serialize
capacity_{min,max} updates with enqueues, dequeues and migrations.
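The serialization pattern boils down to the sketch below: take
task_rq_lock() so that neither enqueue/dequeue nor migration can move
the task while its node is repositioned. This is a simplified and
hypothetically named form of the cap_clamp_update_capacity() helper
added by this patch, which additionally skips the repositioning when
the task's relative order in the tree is unchanged:

/* Simplified sketch of the slow-path serialization used below. */
static void cap_clamp_resync_task(struct task_struct *p, unsigned int cap_idx)
{
	struct rq_flags rf;
	struct rq *rq;

	/* Excludes concurrent enqueues, dequeues and migrations of p */
	rq = task_rq_lock(p, &rf);

	/* Only RUNNABLE tasks have a node in the CPU's RBTree */
	if (!RB_EMPTY_NODE(&p->cap_clamp_node[cap_idx])) {
		cap_clamp_remove_capacity(rq, p, cap_idx);
		cap_clamp_insert_capacity(rq, p, cap_idx);
	}

	task_rq_unlock(rq, p, &rf);
}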
This patch adds the synchronization calls required to guarantee that
each RUNNABLE task is always in the correct relative position within
the RBTree. Specifically, when a group's capacity_{min,max} value is
updated, each task in that group is re-positioned within the rb_tree,
provided it is currently RUNNABLE and its relative position has
changed. This operation is mutually exclusive with the task being
{en,de}queued or migrated, via task_rq_lock().

It is worth noticing that moving a task from one cgroup to another,
perhaps with different capacity_{min,max} values, is already covered by
the current locking schema. Indeed, this operation requires a dequeue
from the original cgroup's RQ followed by an enqueue in the new one.
The same argument holds for task migrations; thus, task migrations
between CPUs and cgroups are ultimately handled like task
wakeups/sleeps.

Signed-off-by: Patrick Bellasi
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Tejun Heo
Cc: linux-kernel@vger.kernel.org
---
 kernel/sched/core.c | 78 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 78 insertions(+)

--
2.7.4

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8f509be..d620bc4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -846,9 +846,68 @@ cap_clamp_remove_capacity(struct rq *rq, struct task_struct *p,
 	RB_CLEAR_NODE(node);
 }
 
+static void
+cap_clamp_update_capacity(struct task_struct *p, unsigned int cap_idx)
+{
+	struct task_group *tg = task_group(p);
+	unsigned int next_cap = SCHED_CAPACITY_SCALE;
+	unsigned int prev_cap = 0;
+	struct task_struct *entry;
+	struct rb_node *node;
+	struct rq_flags rf;
+	struct rq *rq;
+
+	/*
+	 * Lock the CPU's RBTree where the task is (possibly) queued.
+	 *
+	 * We might uselessly lock the (previous) RQ of a !RUNNABLE task, but
+	 * that's the price to pay to safely serialize capacity_{min,max}
+	 * updates with enqueues, dequeues and migration operations, which is
+	 * the same locking schema already in use by __set_cpus_allowed_ptr().
+	 */
+	rq = task_rq_lock(p, &rf);
+
+	/*
+	 * If the task has no node in the rbtree, it's either not yet RUNNABLE
+	 * or it's going to be enqueued with the proper value.
+	 * The setting of the cap_clamp_node is serialized by task_rq_lock().
+	 */
+	if (RB_EMPTY_NODE(&p->cap_clamp_node[cap_idx]))
+		goto done;
+
+	/* Check current position in the capacity rbtree */
+	node = rb_next(&p->cap_clamp_node[cap_idx]);
+	if (node) {
+		entry = rb_entry(node, struct task_struct,
+				 cap_clamp_node[cap_idx]);
+		next_cap = task_group(entry)->cap_clamp[cap_idx];
+	}
+	node = rb_prev(&p->cap_clamp_node[cap_idx]);
+	if (node) {
+		entry = rb_entry(node, struct task_struct,
+				 cap_clamp_node[cap_idx]);
+		prev_cap = task_group(entry)->cap_clamp[cap_idx];
+	}
+
+	/* If relative position has not changed: nothing to do */
+	if (prev_cap <= tg->cap_clamp[cap_idx] &&
+	    next_cap >= tg->cap_clamp[cap_idx])
+		goto done;
+
+	/* Reposition this node within the rbtree */
+	cap_clamp_remove_capacity(rq, p, cap_idx);
+	cap_clamp_insert_capacity(rq, p, cap_idx);
+
+done:
+	task_rq_unlock(rq, p, &rf);
+}
+
 static inline void
 cap_clamp_enqueue_task(struct rq *rq, struct task_struct *p, int flags)
 {
+	lockdep_assert_held(&p->pi_lock);
+	lockdep_assert_held(&rq->lock);
+
 	/* Track task's min/max capacities */
 	cap_clamp_insert_capacity(rq, p, CAP_CLAMP_MIN);
 	cap_clamp_insert_capacity(rq, p, CAP_CLAMP_MAX);
@@ -857,6 +916,9 @@ cap_clamp_enqueue_task(struct rq *rq, struct task_struct *p, int flags)
 static inline void
 cap_clamp_dequeue_task(struct rq *rq, struct task_struct *p, int flags)
 {
+	lockdep_assert_held(&p->pi_lock);
+	lockdep_assert_held(&rq->lock);
+
 	/* Track task's min/max capacities */
 	cap_clamp_remove_capacity(rq, p, CAP_CLAMP_MIN);
 	cap_clamp_remove_capacity(rq, p, CAP_CLAMP_MAX);
@@ -7046,8 +7108,10 @@ static int cpu_capacity_min_write_u64(struct cgroup_subsys_state *css,
 				      struct cftype *cftype, u64 value)
 {
 	struct cgroup_subsys_state *pos;
+	struct css_task_iter it;
 	unsigned int min_value;
 	struct task_group *tg;
+	struct task_struct *p;
 	int ret = -EINVAL;
 
 	min_value = min_t(unsigned int, value, SCHED_CAPACITY_SCALE);
@@ -7078,6 +7142,12 @@ static int cpu_capacity_min_write_u64(struct cgroup_subsys_state *css,
 
 	tg->cap_clamp[CAP_CLAMP_MIN] = min_value;
 
+	/* Update the capacity_min of RUNNABLE tasks */
+	css_task_iter_start(css, &it);
+	while ((p = css_task_iter_next(&it)))
+		cap_clamp_update_capacity(p, CAP_CLAMP_MIN);
+	css_task_iter_end(&it);
+
 done:
 	ret = 0;
 out:
@@ -7091,8 +7161,10 @@ static int cpu_capacity_max_write_u64(struct cgroup_subsys_state *css,
 				      struct cftype *cftype, u64 value)
 {
 	struct cgroup_subsys_state *pos;
+	struct css_task_iter it;
 	unsigned int max_value;
 	struct task_group *tg;
+	struct task_struct *p;
 	int ret = -EINVAL;
 
 	max_value = min_t(unsigned int, value, SCHED_CAPACITY_SCALE);
@@ -7123,6 +7195,12 @@ static int cpu_capacity_max_write_u64(struct cgroup_subsys_state *css,
 
 	tg->cap_clamp[CAP_CLAMP_MAX] = max_value;
 
+	/* Update the capacity_max of RUNNABLE tasks */
+	css_task_iter_start(css, &it);
+	while ((p = css_task_iter_next(&it)))
+		cap_clamp_update_capacity(p, CAP_CLAMP_MAX);
+	css_task_iter_end(&it);
+
 done:
 	ret = 0;
 out: