[RFC,v2,1/8] sched/tune: add detailed documentation

Message ID 20161027174108.31139-2-patrick.bellasi@arm.com
State New
Headers show

Commit Message

Patrick Bellasi Oct. 27, 2016, 5:41 p.m.
The topic of a single simple power-performance tunable, that is wholly
scheduler centric, and has well defined and predictable properties has
come up on several occasions in the past. With techniques such as a
scheduler driven DVFS, which is now provided mainline via the schedutil
governor, we now have a good framework for implementing such a tunable.

This patch provides a detailed description of the motivations and design
decisions behind the implementation of SchedTune.

Cc: Jonathan Corbet <corbet@lwn.net>
Cc: linux-doc@vger.kernel.org
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>

---
 Documentation/scheduler/sched-tune.txt | 392 +++++++++++++++++++++++++++++++++
 1 file changed, 392 insertions(+)
 create mode 100644 Documentation/scheduler/sched-tune.txt

-- 
2.10.1

Comments

Viresh Kumar Nov. 4, 2016, 9:46 a.m. | #1
On 27-10-16, 18:41, Patrick Bellasi wrote:
> +This last requirement is especially important if we consider that schedutil can

> +potentially replace all currently available CPUFreq policies. Since schedutil

> +is event based, as opposed to the sampling driven governors, it is already more

> +responsive at selecting the optimal OPP to run tasks allocated to a CPU.


I am not sure if I follow this paragraph. All the governors follow the same
basic rules now. They are all event driven (events from scheduler), but they
function only after a certain sampling period is finished. Isn't this the case ?

> +SchedTune exposes a simple user-space interface with a single power-performance

> +tunable:

> +

> +  /proc/sys/kernel/sched_cfs_boost

> +

> +This permits expressing a boost value as an integer in the range [0..100].

> +

> +A value of 0 (default) for a CFS task means that schedutil will attempt

> +to match compute capacity of the CPU where the task is scheduled to

> +match its current utilization with a few spare cycles left. A value of

> +100 means that schedutil will select the highest available OPP.

> +

> +The range between 0 and 100 can be set to satisfy other scenarios suitably.

> +For example to satisfy interactive response or depending on other system events

> +(battery level, thermal status, etc).


Earlier section said that schedutil+schedtune can replace all earlier governors.
How will schedutil behave like powersave governor with schedtune? I was
expecting the possible values of sched_cfs_boost to be in the range -100 to 100,
where -100 will make it powersave, +100 will make it performance and 0 will not
make any changes.

-- 
viresh
Patrick Bellasi Nov. 8, 2016, 10:53 a.m. | #2
On 04-Nov 15:16, Viresh Kumar wrote:
> On 27-10-16, 18:41, Patrick Bellasi wrote:

> > +This last requirement is especially important if we consider that schedutil can

> > +potentially replace all currently available CPUFreq policies. Since schedutil

> > +is event based, as opposed to the sampling driven governors, it is already more

> > +responsive at selecting the optimal OPP to run tasks allocated to a CPU.

> 

> I am not sure if I follow this paragraph. All the governors follow the same

> basic rules now. They are all event driven (events from scheduler), but they

> function only after a certain sampling period is finished. Isn't this the case ?


Right, the main difference from what I call "sample based" governors
(e.g. ondemand, interactive) is that they consider metrics which are
averaged across time (e.g. how long is idle in average a CPU).
To the contrary, with schedutil we have a direct input from the
scheduler about what is the required CPU bandwidth demand.

Thus, schedutil is not only event based but it can exploits a more
direct knowledge of what is the CPU bandwidth demand. Moreover,
depending on the CPUFreq driver latencies of a specific platform,
schedutil can be much more aggressive on triggering frequencies
transitions, e.g. on some ARM platforms we can easily have 1ms OPP
switches.
AFAIK, such fast transitions cannot be exploited by "sample based"
governors because they cannot collect sensible averages in such a
limited timeframe without the risk to be "unstable" (e.g. almost
always get a wrong decision).

> > +SchedTune exposes a simple user-space interface with a single power-performance

> > +tunable:

> > +

> > +  /proc/sys/kernel/sched_cfs_boost

> > +

> > +This permits expressing a boost value as an integer in the range [0..100].

> > +

> > +A value of 0 (default) for a CFS task means that schedutil will attempt

> > +to match compute capacity of the CPU where the task is scheduled to

> > +match its current utilization with a few spare cycles left. A value of

> > +100 means that schedutil will select the highest available OPP.

> > +

> > +The range between 0 and 100 can be set to satisfy other scenarios suitably.

> > +For example to satisfy interactive response or depending on other system events

> > +(battery level, thermal status, etc).

> 

> Earlier section said that schedutil+schedtune can replace all earlier governors.

> How will schedutil behave like powersave governor with schedtune? I was

> expecting the possible values of sched_cfs_boost to be in the range -100 to 100,

> where -100 will make it powersave, +100 will make it performance and 0 will not

> make any changes.


You right, however the negative values for the boost are introduced by
the last patch of this series. That patch updates also the
documentation to describe the meaning of negative boost values.

> --

> viresh


-- 
#include <best/regards.h>

Patrick Bellasi

Patch

diff --git a/Documentation/scheduler/sched-tune.txt b/Documentation/scheduler/sched-tune.txt
new file mode 100644
index 0000000..da7b3eb
--- /dev/null
+++ b/Documentation/scheduler/sched-tune.txt
@@ -0,0 +1,392 @@ 
+			SchedTune Support for CFS Tasks
+             central, scheduler-driven, power-performance control
+                               (EXPERIMENTAL)
+
+Abstract
+========
+
+The topic of a single simple power-performance tunable, that is wholly
+scheduler centric and has well defined and predictable properties, has come up
+on several occasions in the past [1,2]. With techniques such as a scheduler
+driven DVFS [3], we now have a good framework for implementing such a tunable.
+
+Scheduler driven DVFS provides the foundation mechanism on top of which it's
+possible to differentiate the performance levels based on the specific
+requirements of different use-cases. For example, on mobile systems it's likely
+to have demanding use-cases which may benefit from running on an higher OPP
+than the one currently chosen by schedutil.
+
+In the past this behavior was provided by changing the governor or by tuning
+its the many different parameters of non-mainline governors, now we can achieve
+similar behaviors by tuning schedutil. This approach allows also for a more
+fine grained control which can be extended to consider the requirement of
+specific tasks.
+
+This document introduces SchedTune and describes the overall ideas behind its
+design and implementation.
+
+Table of Contents
+=================
+
+1. Motivation
+2. Introduction
+3. Signal Boosting Strategy
+4. OPP selection using boosted CPU utilization
+5. Per task group boosting
+6. Question and Answers
+   - What about "auto" mode?
+   - What about boosting on a congested system?
+   - How CPUs are boosted when we have tasks with multiple boost values?
+7. References
+
+
+1. Motivation
+=============
+
+schedutil [3] is a new event-driven cpufreq governor which allows the scheduler
+to select the optimal DVFS Operating Performance Point (OPP) for running a task
+allocated to a CPU. The introduction of schedutil enables running workloads at
+the most efficient OPPs.
+
+However, sometimes it may be desired to intentionally boost the performance of
+a workload even if that could imply a reasonable increase in energy
+consumption. For example, in order to reduce the response time of a task, we
+may want to run the task at a higher OPP than the one actually required by its
+CPU bandwidth demand.
+
+This last requirement is especially important if we consider that schedutil can
+potentially replace all currently available CPUFreq policies. Since schedutil
+is event based, as opposed to the sampling driven governors, it is already more
+responsive at selecting the optimal OPP to run tasks allocated to a CPU.
+However, just tracking the actual task utilization demand may not be enough
+from a performance standpoint.  For example, it is not possible to get
+behaviors similar to those provided by the "performance" and "powersave"
+CPUFreq governors.
+
+This document describes an implementation of a tunable, stacked on top of
+schedutil, which extends its functionality to support task performance
+boosting.
+
+By "performance boosting" we mean the reduction of the time required to
+complete a task activation, i.e. the time elapsed from a task wakeup to its
+next deactivation (e.g. because it goes back to sleep or it terminates).
+For example, if we consider a simple periodic task which executes the same
+workload for 5[ms] every 20[ms] while running at a certain OPP, a boosted
+execution of that task should be able to complete each of its activations in
+less than 5[ms].
+
+Previous attempts to introduce such a boosting feature has not been successful,
+mainly because of the complexity of the proposed solution. The approach
+described in this document exposes a single simple interface to user-space.
+This single knob allows the tuning of system wide scheduler behaviours ranging
+from energy efficiency at one end through to incremental performance boosting
+at the other end. This tunable affects all tasks. A more advanced extension of
+this concept is also provided, which uses CGroups to boost the performance of
+selected tasks while using the default schedutil behaviors for all others.
+
+The rest of this document introduces in more details the proposed solution
+which has been named SchedTune.
+
+
+2. Introduction
+===============
+
+SchedTune exposes a simple user-space interface with a single power-performance
+tunable:
+
+  /proc/sys/kernel/sched_cfs_boost
+
+This permits expressing a boost value as an integer in the range [0..100].
+
+A value of 0 (default) for a CFS task means that schedutil will attempt
+to match compute capacity of the CPU where the task is scheduled to
+match its current utilization with a few spare cycles left. A value of
+100 means that schedutil will select the highest available OPP.
+
+The range between 0 and 100 can be set to satisfy other scenarios suitably.
+For example to satisfy interactive response or depending on other system events
+(battery level, thermal status, etc).
+
+A CGroup based extension is also provided, which permits further user-space
+defined task classification to tune the scheduler for different goals depending
+on the specific nature of the task, e.g. background vs interactive vs
+low-priority.
+
+The overall design of the SchedTune module is built on top of "Per-Entity Load
+Tracking" (PELT) signals and schedutil by introducing a bias on the Operating
+Performance Point (OPP) selection.  Each time a task is allocated on a CPU,
+schedutil has the opportunity to tune the operating frequency of that CPU to
+better match the workload demand. The selection of the actual OPP being
+activated is influenced by the global boost value, or the boost value for the
+task CGroup when in use.
+
+This simple biasing approach leverages existing frameworks, which means minimal
+modifications to the scheduler, and yet it allows to achieve a range of
+different behaviours all from a single simple tunable knob.  The only new
+concept introduced is that of signal boosting.
+
+
+3. Signal Boosting Strategy
+===========================
+
+The whole PELT machinery works based on the value of a few utilization tracking
+signals which basically track the CPU bandwidth requirements for tasks and the
+capacity of CPUs. The basic idea behind the SchedTune knob is to artificially
+inflate some of these utilization tracking signals to make a task or RQ appears
+more demanding than it actually is.
+
+Which signals have to be inflated depends on the specific "consumer".  However,
+independently from the specific (signal, consumer) pair, it is important to
+define a simple and possibly consistent strategy for the concept of boosting a
+signal.
+
+A boosting strategy defines how the "abstract" user-space defined
+sched_cfs_boost value is translated into an internal "margin" value to be added
+to a signal to get its inflated value:
+
+  margin         := boosting_strategy(sched_cfs_boost, signal)
+  boosted_signal := signal + margin
+
+Different boosting strategies were identified and analyzed before selecting the
+one found to be most effective. The next section describes the details of this
+boosting strategy.
+
+Signal Proportional Compensation (SPC)
+--------------------------------------
+
+In this boosting strategy the sched_cfs_boost value is used to compute a margin
+which is proportional to the complement of the original signal.  This
+complement is defined as the delta from the actual value of a signal and its
+possible maximum value.
+
+Since the tunable implementation uses signals which have SCHED_CAPACITY_SCALE
+as the maximum possible value, the margin becomes:
+
+	margin := sched_cfs_boost * (SCHED_CAPACITY_SCALE - signal)
+
+Using this boosting strategy:
+- a 100% sched_cfs_boost means that the signal is scaled to the maximum value
+- each value in the range of sched_cfs_boost effectively inflates the signal in
+  question by a quantity which is proportional to the maximum value.
+
+For example, by applying the SPC boosting strategy to the selection of the OPP
+to run a task it is possible to achieve these behaviors:
+
+-   0% boosting: run the task at the minimum OPP required by its workload
+- 100% boosting: run the task at the maximum OPP available for the CPU
+-  50% boosting: run at the half-way OPP between minimum and maximum
+
+Which means that, at 50% boosting, a task will be scheduled to run at half of
+the maximum theoretically achievable performance on the specific target
+platform.
+
+For example, assuming a 50% task which runs for 8ms every 16ms,
+ignoring any margins built into schedutil we should get:
+
+  Boost    OPP  Expected Completion Time
+     0%    50%                  16.00 ms
+    50%    75%       16*50/75 = 10.67 ms
+   100%   100%                   8.00 ms
+
+The reduction of the completion time of boosting 100% instead of 50% is
+only 25%.
+
+A graphical representation of an SPC boosted signal is represented in the
+following figure where:
+ a) "-" represents the original signal
+ b) "b" represents a  50% boosted signal
+ c) "p" represents a 100% boosted signal
+
+
+   |  SCHED_CAPACITY_SCALE
+   +-----------------------------------------------------------------+
+   |pppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppp
+   |
+   |                                             boosted_signal
+   |                                          bbbbbbbbbbbbbbbbbbbbbbbb
+   |
+   |                                            original signal
+   |                  bbbbbbbbbbbbbbbbbbbbbbbb+----------------------+
+   |                                          |
+   |bbbbbbbbbbbbbbbbbb                        |
+   |                                          |
+   |                                          |
+   |                                          |
+   |                  +-----------------------+
+   |                  |
+   |                  |
+   |                  |
+   |------------------+
+   |
+   |
+   +----------------------------------------------------------------------->
+
+The plot above shows a ramped utilization signal (titled 'original_signal') and
+its boosted equivalent. For each step of the original signal the boosted signal
+corresponding to a 50% boost is midway from the original signal and the upper
+bound. Boosting by 100% generates a boosted signal which is always saturated to
+the upper bound.
+
+
+4. OPP selection using boosted CPU utilization
+==============================================
+
+It is worth calling out that the implementation does not introduce any new
+utilization signals. Instead, it provides an API to tune existing signals. This
+tuning is done on demand and only in scheduler code paths where it is sensible
+to do so.  The new API calls are defined to return either the default signal or
+a boosted one, depending on the value of sched_cfs_boost. This is a clean and
+non invasive modification of the existing existing code paths.
+
+The signal representing a CPU's utilization is boosted according to the
+previously described SPC boosting strategy. To schedutil, this allows a CPU
+(ie CFS run-queue) to appear more used then it actually is.
+
+Thus, with the sched_cfs_boost enabled we have the following main functions to
+get the current utilization of a CPU:
+
+  cpu_util()
+  boosted_cpu_util()
+
+The new boosted_cpu_util() is similar to the first but returns a boosted
+utilization signal which is a function of the sched_cfs_boost value.
+
+This function is used in the CFS scheduler code paths where schedutil needs to
+decide the OPP to run a CPU at.
+For example, this allows selecting the highest OPP for a CPU which has
+the boost value set to 100%.
+
+
+5. Per task group boosting
+==========================
+
+The availability of a single knob which is used to boost all tasks in the
+system is certainly a simple solution but it quite likely doesn't fit many
+use-case, especially in the mobile device space.
+
+For example, on battery powered devices there usually are many background
+services which are long running and need energy efficient scheduling. On the
+other hand, some applications are more performance sensitive and require an
+interactive response and/or maximum performance, regardless of the energy cost.
+To better service such scenarios, the SchedTune implementation has an extension
+that provides a more fine grained boosting interface.
+
+A new CGroup controller, namely "schedtune", could be enabled which allows to
+defined and configure task groups with different boosting values.  Tasks that
+require special performance can be put into separate CGroups.  The value of the
+boost associated with the tasks in this group can be specified using a single
+knob exposed by the CGroup controller:
+
+   schedtune.boost
+
+This knob allows the definition of a boost value that is to be used for
+SPC boosting of all tasks attached to this group.
+
+The current schedtune controller implementation is really simple and has these
+main characteristics:
+
+  1) It is only possible to create 1 level depth hierarchies
+
+     The root control groups define the system-wide boost value to be applied
+     by default to all tasks. Its direct subgroups are named "boost groups" and
+     they define the boost value for specific set of tasks.
+     Further nested subgroups are not allowed since they do not have a sensible
+     meaning from a user-space standpoint.
+
+  2) It is possible to define only a limited number of "boost groups"
+
+     This number is defined at compile time and by default configured to 16.
+     This is a design decision motivated by two main reasons:
+     a) In a real system we do not expect use-cases with more then few
+	boost groups. For example, a reasonable collection of groups could be
+        just "background", "interactive" and "performance".
+     b) It simplifies the implementation considerably, especially for the code
+	which has to compute the per CPU boosting once there are multiple
+        RUNNABLE tasks with different boost values.
+
+Such a simple design should allow servicing the main utilization scenarios
+identified so far. It provides a simple interface which can be used to manage
+the power-performance of all tasks or only selected tasks.  Moreover, this
+interface can be easily integrated by user-space run-times (e.g.  Android,
+ChromeOS) to implement a QoS solution for task boosting based on tasks
+classification, which has been a long standing requirement.
+
+Setup and usage
+---------------
+
+0. Use a kernel with CGROUP_SCHED_TUNE support enabled
+
+1. Check that the "schedtune" CGroup controller is available:
+
+   root@derkdell:~# cat /proc/cgroups
+   #subsys_name	hierarchy	num_cgroups	enabled
+   cpuset  	0		1		1
+   cpu     	0		1		1
+   schedtune	0		1		1
+
+2. Mount a tmpfs to create the CGroups mount point (Optional)
+
+   root@derkdell:~# sudo mount -t tmpfs cgroups /sys/fs/cgroup
+
+3. Mount the "schedtune" controller
+
+   root@derkdell:~# mkdir /sys/fs/cgroup/stune
+   root@derkdell:~# sudo mount -t cgroup -o schedtune stune /sys/fs/cgroup/stune
+
+4. Setup the system-wide boost value (Optional)
+
+   If not configured the root control group has a 0% boost value, which
+   basically disables boosting for all tasks in the system thus running in
+   an energy-efficient mode. Let assume SYSBOOST defines the default boost
+   value to be used for all tasks:
+
+   root@derkdell:~# echo $SYSBOOST > /sys/fs/cgroup/stune/schedtune.boost
+
+5. Create task groups and configure their specific boost value (Optional)
+
+   For example here we create a "performance" boost group configure to boost
+   all its tasks to 100%
+
+   root@derkdell:~# mkdir /sys/fs/cgroup/stune/performance
+   root@derkdell:~# echo 100 > /sys/fs/cgroup/stune/performance/schedtune.boost
+
+6. Move tasks into the boost group
+
+   For example, the following moves the tasks with PID $TASKPID (and all its
+   tasks) into the "performance" boost group.
+
+   root@derkdell:~# echo "TASKPID > /sys/fs/cgroup/stune/performance/cgroup.procs
+
+This simple configuration allows only the tasks of the $TASKPID task to run,
+when needed, at the highest OPP in the most capable CPU of the system.
+
+
+6. Question and Answers
+=======================
+
+What about "auto" mode?
+-----------------------
+
+The 'auto' mode as described in previous propose approaches can be implemented
+by interfacing SchedTune with some suitable user-space element. This element
+could use the exposed system-wide or cgroup based interface.
+
+How are multiple groups of tasks with different boost values managed?
+---------------------------------------------------------------------
+
+The current SchedTune implementation keeps track of the boosted RUNNABLE tasks
+on a CPU. Once schedutil selects the OPP for a CPU, the utilization of that CPU
+is boosted with a value which is the maximum of the boost values of all the
+currently RUNNABLE tasks in that CPU.
+
+This allows schedutil to boost a CPU only while there are boosted tasks ready
+to run and switch back to the energy efficient mode as soon as the last boosted
+task is dequeued.
+
+
+7. References
+=============
+[1] http://lwn.net/Articles/552889
+[2] http://lkml.org/lkml/2012/5/18/91
+[3] http://lkml.org/lkml/2016/3/16/559
+