From patchwork Thu Oct 27 17:41:08 2016
X-Patchwork-Submitter: Patrick Bellasi
X-Patchwork-Id: 79768
From: Patrick Bellasi
To: linux-kernel@vger.kernel.org
Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Steve Muckle, Leo Yan,
    Viresh Kumar, "Rafael J. Wysocki", Todd Kjos, Srinath Sridharan,
    Andres Oportus, Juri Lelli, Morten Rasmussen, Dietmar Eggemann,
    Chris Redpath, Robin Randhawa, Patrick Bellasi, Jonathan Corbet,
    Ingo Molnar
Subject: [RFC v2 8/8] sched/{fair,tune}: add support for negative boosting
Date: Thu, 27 Oct 2016 18:41:08 +0100
Message-Id: <20161027174108.31139-9-patrick.bellasi@arm.com>
X-Mailer: git-send-email 2.10.1
In-Reply-To: <20161027174108.31139-1-patrick.bellasi@arm.com>
References: <20161027174108.31139-1-patrick.bellasi@arm.com>

Boosting support allows a signal to be inflated by a margin which is defined
to be proportional to its delta from its maximum possible value. Such a
mechanism allows a task to run on an OPP which is higher than the minimum
capacity which can satisfy its demands.

In certain use-cases we could be interested in the opposite goal, i.e.
running a task on an OPP which is lower than the minimum required. Currently
the only way to achieve such a goal is to use the "powersave" governor, thus
forcing all tasks to run at the lowest OPP, or the "userspace" governor,
which still forces all tasks to run at a given OPP. With the availability of
schedutil and the addition of SchedTune, we now have the support to tune the
way OPPs are selected depending on which tasks are active on a CPU.

This patch extends SchedTune to introduce support for negative boosting.
While boosting inflates a signal, with negative boosting we can artificially
reduce the value of a signal.
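For clarity, here is a minimal user-space sketch of the arithmetic described
above and detailed below. It is illustrative only and not the kernel code:
it assumes plain percentage arithmetic and a division by 100, while the
in-kernel schedtune_margin() uses a reciprocal divide:

  /*
   * Illustrative model of the boosting margins: a positive boost adds a
   * fraction of the signal's headroom, a negative boost removes a
   * fraction of the signal itself.
   */
  #include <stdio.h>

  #define SCHED_CAPACITY_SCALE 1024

  static long margin(long signal, int boost)
  {
          if (boost < 0)
                  return boost * signal / 100;                    /* M = B * S */
          return boost * (SCHED_CAPACITY_SCALE - signal) / 100;   /* M = B * (SCALE - S) */
  }

  int main(void)
  {
          long util = SCHED_CAPACITY_SCALE / 2;   /* a 50% utilization task */

          printf("boost  -50%%: %ld\n", util + margin(util, -50));  /* 256 -> ~25% capacity OPP */
          printf("boost -100%%: %ld\n", util + margin(util, -100)); /* 0   -> minimum OPP */
          printf("boost  +50%%: %ld\n", util + margin(util, 50));   /* 768 -> ~75% capacity OPP */
          return 0;
  }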
The boosting strategy used to reduce a signal is quite simple and extends the
concept of "margin" already used for positive boosting.

The Boost (B) value [%] is used to compute a Margin (M) which, in case of
negative boosting, is a fraction of the original Signal (S):

   M = B * S, when B is in [-100%, 0%)

Such a value of M is defined to be a negative quantity which, once added to
the original signal S, reduces the amount of that signal by a fraction of the
original signal.

With such a definition, a 50% utilization task will run at:
 - 25% capacity OPP when boosted -50%
 - minimum capacity OPP when boosted -100%

It's worth noticing that the boosting of all tasks on a CPU is aggregated to
figure out the maximum boost value currently required. Thus, for example, if
we have two tasks:
   T1 boosted @ -20%
   T2 boosted @ +30%
the CPU is boosted +30% whenever T2 is active, even if T1 is also active,
while it is "slowed down" by 20% when T1 is the only task active on that CPU.

Cc: Jonathan Corbet
Cc: Ingo Molnar
Cc: Peter Zijlstra
Suggested-by: Srinath Sridharan
Signed-off-by: Patrick Bellasi
---
 Documentation/scheduler/sched-tune.txt | 44 ++++++++++++++++++++++++++++++----
 include/linux/sched/sysctl.h           |  6 ++---
 kernel/sched/fair.c                    | 38 +++++++++++++++++++++--------
 kernel/sched/tune.c                    | 33 +++++++++++++++----------
 kernel/sysctl.c                        |  3 ++-
 5 files changed, 92 insertions(+), 32 deletions(-)

-- 
2.10.1

diff --git a/Documentation/scheduler/sched-tune.txt b/Documentation/scheduler/sched-tune.txt
index da7b3eb..5822f9f 100644
--- a/Documentation/scheduler/sched-tune.txt
+++ b/Documentation/scheduler/sched-tune.txt
@@ -100,12 +100,17 @@ This permits expressing a boost value as an integer in the range [0..100].
 
 A value of 0 (default) for a CFS task means that schedutil will attempt to
 match compute capacity of the CPU where the task is scheduled to
-match its current utilization with a few spare cycles left. A value of
-100 means that schedutil will select the highest available OPP.
+match its current utilization with a few spare cycles left.
 
-The range between 0 and 100 can be set to satisfy other scenarios suitably.
-For example to satisfy interactive response or depending on other system events
-(battery level, thermal status, etc).
+A value of 100 means that schedutil will select the highest available OPP,
+while a negative value means that schedutil will try to run tasks at lower
+OPPs. Together, positive and negative boost values allow schedutil to
+provide behaviors similar to those of the existing "performance" and
+"powersave" governors, but with a more fine-grained control.
+
+The range between -100 and 100 can be set to satisfy other scenarios suitably,
+for example to satisfy interactive response or to react to other system events
+(battery level, thermal status, etc.).
 
 A CGroup based extension is also provided, which permits further user-space
 defined task classification to tune the scheduler for different goals depending
@@ -227,6 +232,27 @@ corresponding to a 50% boost is midway from the original signal and the upper
 bound. Boosting by 100% generates a boosted signal which is always saturated to
 the upper bound.
 
+Negative boosting
+-----------------
+
+While positive boosting uses the SPC strategy to inflate a signal, with
+negative boosting we can artificially reduce the value of a signal. The
+boosting strategy used to reduce a signal is quite simple and extends the
+concept of "margin" already used for positive boosting.
+
+When sched_cfs_boost is defined in [-100%, 0%), the boost value [%] is used to
+compute a margin which is a fraction of the original signal:
+
+   margin := sched_cfs_boost * signal
+
+Such a margin is defined to be a negative quantity which, once added to the
+original signal, reduces the amount of that signal by a fraction of the
+original value.
+
+With such a definition, for example a 50% utilization task will run at:
+ - 25% capacity OPP when boosted -50%
+ - minimum capacity OPP when boosted -100%
+
 4. OPP selection using boosted CPU utilization
 ==============================================
 
@@ -304,6 +330,14 @@ main characteristics:
     which has to compute the per CPU boosting once there are multiple
     RUNNABLE tasks with different boost values.
 
+It's worth noticing that the boosting of all tasks on a CPU is aggregated to
+figure out the maximum boost value currently required. Thus, for example, if
+we have two tasks:
+  T1 boosted @ -20%
+  T2 boosted @ +30%
+the CPU is boosted +30% whenever T2 is active, even if T1 is also active,
+while it is "slowed down" by 20% when T1 is the only task active on that CPU.
+
 Such a simple design should allow servicing the main utilization scenarios
 identified so far. It provides a simple interface which can be used to manage
 the power-performance of all tasks or only selected tasks. Moreover, this
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 5bfbb14..fe878c9 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -56,16 +56,16 @@ extern unsigned int sysctl_sched_cfs_bandwidth_slice;
 #endif
 
 #ifdef CONFIG_SCHED_TUNE
-extern unsigned int sysctl_sched_cfs_boost;
+extern int sysctl_sched_cfs_boost;
 int sysctl_sched_cfs_boost_handler(struct ctl_table *table, int write,
                                    void __user *buffer, size_t *length,
                                    loff_t *ppos);
-static inline unsigned int get_sysctl_sched_cfs_boost(void)
+static inline int get_sysctl_sched_cfs_boost(void)
 {
         return sysctl_sched_cfs_boost;
 }
 #else
-static inline unsigned int get_sysctl_sched_cfs_boost(void)
+static inline int get_sysctl_sched_cfs_boost(void)
 {
         return 0;
 }
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f56953b..43a4989 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5580,17 +5580,34 @@ struct reciprocal_value schedtune_spc_rdiv;
  * schedtune_margin returns the "margin" to be added on top of
  * the original value of a "signal".
  *
- * The Boost (B) value [%] is used to compute a Margin (M) which
- * is proportional to the complement of the original Signal (S):
+ * The Boost (B) value [%] is used to compute a Margin (M) which, in case of
+ * positive boosting, is proportional to the complement of the original
+ * Signal (S):
  *
- *   M = B * (SCHED_CAPACITY_SCALE - S)
+ *   M = B * (SCHED_CAPACITY_SCALE - S), when B is in (0%, 100%]
+ *
+ * In case of negative boosting, the computed margin is a fraction of the
+ * original S:
+ *
+ *   M = B * S, when B is in [-100%, 0%)
  *
  * The obtained value M could be used by the caller to "boost" S.
  */
-static unsigned long
-schedtune_margin(unsigned long signal, unsigned int boost)
+static long
+schedtune_margin(unsigned long signal, int boost)
 {
-        unsigned long long margin = 0;
+        long long margin = 0;
+
+        /* A -100% boost nullifies the original signal */
+        if (unlikely(boost == -100))
+                return -signal;
+
+        /* A negative boost produces a proportional (negative) margin */
+        if (unlikely(boost < 0)) {
+                margin = -boost * signal;
+                margin = reciprocal_divide(margin, schedtune_spc_rdiv);
+                return -margin;
+        }
 
         /* Do not boost saturated signals */
         if (signal >= SCHED_CAPACITY_SCALE)
@@ -5606,10 +5623,10 @@ schedtune_margin(unsigned long signal, unsigned int boost)
         return margin;
 }
 
-static inline unsigned long
+static inline long
 schedtune_cpu_margin(unsigned long util, int cpu)
 {
-        unsigned int boost = schedtune_cpu_boost(cpu);
+        int boost = schedtune_cpu_boost(cpu);
 
         if (boost == 0)
                 return 0UL;
@@ -5619,7 +5636,7 @@ schedtune_cpu_margin(unsigned long util, int cpu)
 
 #else /* CONFIG_SCHED_TUNE */
 
-static inline unsigned long
+static inline long
 schedtune_cpu_margin(unsigned long util, int cpu)
 {
         return 0;
@@ -5665,9 +5682,10 @@ unsigned long boosted_cpu_util(int cpu)
 {
         unsigned long util = cpu_rq(cpu)->cfs.avg.util_avg;
         unsigned long capacity = capacity_orig_of(cpu);
+        int boost = schedtune_cpu_boost(cpu);
 
         /* Do not boost saturated utilizations */
-        if (util >= capacity)
+        if (boost >= 0 && util >= capacity)
                 return capacity;
 
         /* Add margin to current CPU's capacity */
diff --git a/kernel/sched/tune.c b/kernel/sched/tune.c
index 965a3e1..ed90830 100644
--- a/kernel/sched/tune.c
+++ b/kernel/sched/tune.c
@@ -13,7 +13,7 @@
 #include "sched.h"
 #include "tune.h"
 
-unsigned int sysctl_sched_cfs_boost __read_mostly;
+int sysctl_sched_cfs_boost __read_mostly;
 
 #ifdef CONFIG_CGROUP_SCHED_TUNE
 
@@ -32,7 +32,7 @@ struct schedtune {
         int idx;
 
         /* Boost value for tasks on that SchedTune CGroup */
-        unsigned int boost;
+        int boost;
 
 };
 
@@ -95,10 +95,10 @@ static struct schedtune *allocated_group[boostgroups_max] = {
  */
 struct boost_groups {
         /* Maximum boost value for all RUNNABLE tasks on a CPU */
-        unsigned int boost_max;
+        int boost_max;
         struct {
                 /* The boost for tasks on that boost group */
-                unsigned int boost;
+                int boost;
                 /* Count of RUNNABLE tasks on that boost group */
                 unsigned int tasks;
         } group[boostgroups_max];
@@ -112,15 +112,14 @@ DEFINE_PER_CPU(struct boost_groups, cpu_boost_groups);
 static void
 schedtune_cpu_update(int cpu)
 {
+        bool active_tasks = false;
         struct boost_groups *bg;
-        unsigned int boost_max;
+        int boost_max = -100;
         int idx;
 
         bg = &per_cpu(cpu_boost_groups, cpu);
 
-        /* The root boost group is always active */
-        boost_max = bg->group[0].boost;
-        for (idx = 1; idx < boostgroups_max; ++idx) {
+        for (idx = 0; idx < boostgroups_max; ++idx) {
                 /*
                  * A boost group affects a CPU only if it has
                  * RUNNABLE tasks on that CPU
@@ -128,8 +127,13 @@ schedtune_cpu_update(int cpu)
                 if (bg->group[idx].tasks == 0)
                         continue;
                 boost_max = max(boost_max, bg->group[idx].boost);
+                active_tasks = true;
         }
 
+        /* Reset boosting when there are no tasks in the system */
+        if (!active_tasks)
+                boost_max = 0;
+
         bg->boost_max = boost_max;
 }
 
@@ -383,7 +387,7 @@ void schedtune_exit_task(struct task_struct *tsk)
         task_rq_unlock(rq, tsk, &rq_flags);
 }
 
-static u64
+static s64
 boost_read(struct cgroup_subsys_state *css, struct cftype *cft)
 {
         struct schedtune *st = css_st(css);
@@ -393,15 +397,18 @@ boost_read(struct cgroup_subsys_state *css, struct cftype *cft)
 
 static int
 boost_write(struct cgroup_subsys_state *css, struct cftype *cft,
-            u64 boost)
+            s64 boost)
 {
         struct schedtune *st = css_st(css);
 
-        if (boost > 100)
+        if (boost < -100 || boost > 100)
                 return -EINVAL;
+
+        /* Update boostgroup and global boosting (if required) */
         st->boost = boost;
         if (css == &root_schedtune.css)
                 sysctl_sched_cfs_boost = boost;
 
+        /* Update CPU boost */
         schedtune_boostgroup_update(st->idx, st->boost);
 
@@ -411,8 +418,8 @@ boost_write(struct cgroup_subsys_state *css, struct cftype *cft,
 static struct cftype files[] = {
         {
                 .name = "boost",
-                .read_u64 = boost_read,
-                .write_u64 = boost_write,
+                .read_s64 = boost_read,
+                .write_s64 = boost_write,
         },
         { }     /* terminate */
 };
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 12c3432..3b412fb 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -127,6 +127,7 @@ static int __maybe_unused four = 4;
 static unsigned long one_ul = 1;
 static int one_hundred = 100;
 static int one_thousand = 1000;
+static int __maybe_unused one_hundred_neg = -100;
 #ifdef CONFIG_PRINTK
 static int ten_thousand = 10000;
 #endif
@@ -453,7 +454,7 @@ static struct ctl_table kern_table[] = {
                 .mode           = 0644,
 #endif
                 .proc_handler   = &sysctl_sched_cfs_boost_handler,
-                .extra1         = &zero,
+                .extra1         = &one_hundred_neg,
                 .extra2         = &one_hundred,
         },
 #endif
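As a quick user-space usage sketch of the new negative range (illustrative
only: the mount point "/sys/fs/cgroup/stune" and the attribute name
"schedtune.boost" are assumptions based on the SchedTune CGroup controller
introduced earlier in this series, not something defined by this patch):

  /*
   * Illustrative helper (not part of the patch): request a negative boost
   * for a SchedTune group. Both the mount point and the attribute name
   * below are assumptions.
   */
  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  static int set_boost(const char *group, int boost)
  {
          char path[256], val[8];
          int fd, len, ret;

          /* Same range check enforced by boost_write() above */
          if (boost < -100 || boost > 100)
                  return -1;

          snprintf(path, sizeof(path),
                   "/sys/fs/cgroup/stune/%s/schedtune.boost", group);
          fd = open(path, O_WRONLY);
          if (fd < 0)
                  return -1;

          len = snprintf(val, sizeof(val), "%d", boost);
          ret = (write(fd, val, len) == len) ? 0 : -1;
          close(fd);
          return ret;
  }

  int main(void)
  {
          /* Slow down tasks in a "background" group by 20%, as in the T1 example */
          return set_boost("background", -20) ? 1 : 0;
  }

Alternatively, the global boost can be set through the sysctl handled by
sysctl_sched_cfs_boost_handler(), which after this patch accepts values in
the [-100, 100] range.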