[RFC,v2,5/8] sched/tune: add initial support for CGroups based boosting

Message ID 20161027174108.31139-6-patrick.bellasi@arm.com
State New

Commit Message

Patrick Bellasi Oct. 27, 2016, 5:41 p.m. UTC
To support task performance boosting, a single knob has the advantage of
being a simple solution, both from the implementation and the usability
standpoint.  However, on a real system it can be difficult to identify a
single value for the knob which fits the needs of multiple different
tasks. For example, some kernel threads and/or user-space background
services are better managed the "standard" way, while we still want to
be able to boost the performance of specific workloads.

To improve the flexibility of the task boosting mechanism, this patch is
the first of a small series which extends the previous implementation to
introduce "per task group" support.

This first patch introduces just the basic CGroups support: a new
"schedtune" CGroups controller is added which allows configuring
different boost values for different groups of tasks.
To keep the implementation simple while still supporting an effective
boosting strategy, the new controller:
  1. allows only a two-layer hierarchy
  2. supports only a limited number of boost groups

A two-layer hierarchy allows placing each task either:
  a) in the root control group,
     thus being subject to a system-wide boosting value
  b) in a child of the root group,
     thus being subject to the specific boost value defined by that
     "boost group"

The limited number of "boost groups" supported is mainly motivated by
the observation that in a real system it could be useful to have only
a few classes of tasks which deserve different treatment.
For example, background vs foreground or interactive vs low-priority.

As an additional benefit, a limited number of boost groups also allows a
simpler implementation, especially for the code required to compute the
boost value for CPUs which have RUNNABLE tasks belonging to different
boost groups.

Cc: Tejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>

---
 include/linux/cgroup_subsys.h |   4 +
 init/Kconfig                  |  42 ++++++++
 kernel/sched/tune.c           | 233 ++++++++++++++++++++++++++++++++++++++++++
 kernel/sysctl.c               |   4 +
 4 files changed, 283 insertions(+)

-- 
2.10.1

Comments

Tejun Heo Oct. 27, 2016, 6:30 p.m. UTC | #1
Hello, Patrick.

On Thu, Oct 27, 2016 at 06:41:05PM +0100, Patrick Bellasi wrote:
> To support task performance boosting, a single knob has the advantage of
> being a simple solution, both from the implementation and the usability
> standpoint.  However, on a real system it can be difficult to identify a
> single value for the knob which fits the needs of multiple different
> tasks. For example, some kernel threads and/or user-space background
> services are better managed the "standard" way, while we still want to
> be able to boost the performance of specific workloads.
>
> To improve the flexibility of the task boosting mechanism, this patch is
> the first of a small series which extends the previous implementation to
> introduce "per task group" support.
>
> This first patch introduces just the basic CGroups support: a new
> "schedtune" CGroups controller is added which allows configuring
> different boost values for different groups of tasks.
> To keep the implementation simple while still supporting an effective
> boosting strategy, the new controller:
>   1. allows only a two-layer hierarchy
>   2. supports only a limited number of boost groups
>
> A two-layer hierarchy allows placing each task either:
>   a) in the root control group,
>      thus being subject to a system-wide boosting value
>   b) in a child of the root group,
>      thus being subject to the specific boost value defined by that
>      "boost group"
>
> The limited number of "boost groups" supported is mainly motivated by
> the observation that in a real system it could be useful to have only
> a few classes of tasks which deserve different treatment.
> For example, background vs foreground or interactive vs low-priority.
>
> As an additional benefit, a limited number of boost groups also allows a
> simpler implementation, especially for the code required to compute the
> boost value for CPUs which have RUNNABLE tasks belonging to different
> boost groups.


So, skipping over the actual details of the boosting mechanism, in terms
of cgroup support, it should be integrated into the existing cpu
controller and have proper support for hierarchy.  Note that hierarchy
support doesn't necessarily mean that boosting itself has to be
hierarchical.  It can be, for example, something along the lines of
"the descendants are allowed up to this level of boosting" so that the
hierarchy just serves to assign the appropriate boosting values to the
groups of tasks.

Thanks.

-- 
tejun
Patrick Bellasi Oct. 27, 2016, 8:14 p.m. UTC | #2
On 27-Oct 14:30, Tejun Heo wrote:
> Hello, Patrick.


Hi Tejun,

> On Thu, Oct 27, 2016 at 06:41:05PM +0100, Patrick Bellasi wrote:
> > To support task performance boosting, a single knob has the advantage of
> > being a simple solution, both from the implementation and the usability
> > standpoint.  However, on a real system it can be difficult to identify a
> > single value for the knob which fits the needs of multiple different
> > tasks. For example, some kernel threads and/or user-space background
> > services are better managed the "standard" way, while we still want to
> > be able to boost the performance of specific workloads.
> >
> > To improve the flexibility of the task boosting mechanism, this patch is
> > the first of a small series which extends the previous implementation to
> > introduce "per task group" support.
> >
> > This first patch introduces just the basic CGroups support: a new
> > "schedtune" CGroups controller is added which allows configuring
> > different boost values for different groups of tasks.
> > To keep the implementation simple while still supporting an effective
> > boosting strategy, the new controller:
> >   1. allows only a two-layer hierarchy
> >   2. supports only a limited number of boost groups
> >
> > A two-layer hierarchy allows placing each task either:
> >   a) in the root control group,
> >      thus being subject to a system-wide boosting value
> >   b) in a child of the root group,
> >      thus being subject to the specific boost value defined by that
> >      "boost group"
> >
> > The limited number of "boost groups" supported is mainly motivated by
> > the observation that in a real system it could be useful to have only
> > a few classes of tasks which deserve different treatment.
> > For example, background vs foreground or interactive vs low-priority.
> >
> > As an additional benefit, a limited number of boost groups also allows a
> > simpler implementation, especially for the code required to compute the
> > boost value for CPUs which have RUNNABLE tasks belonging to different
> > boost groups.
>
> So, skipping over the actual details of the boosting mechanism, in terms
> of cgroup support, it should be integrated into the existing cpu
> controller and have proper support for hierarchy.


I have a couple of concerns/questions about both of these points.

First, regarding the integration with the cpu controller,
don't we risk overloading the semantics of the cpu controller?

Right now this controller is devoted to tracking the bandwidth that a
group of tasks can consume and/or to repartitioning the available
bandwidth among the tasks in that group.
Boosting is a different concept: it's kind of related to CPU bandwidth
but it targets a completely different goal, i.e. biasing schedutil
and (in the future) scheduler decisions.

I'm also wondering how confusing and complex it can be to
configure a system where you have non-overlapping groups of tasks with
different bandwidth and boosting requirements.

For example, let's assume we have three tasks: A, B, and C and we want:

   Bandwidth:  10% for A and B,  20% for C
   Boost:      10% for A,         0% for B and C

IMO, configuring such a set of constraints would be quite complex if
we expose the boost value through the cpu controller.

> Note that hierarchy
> support doesn't necessarily mean that boosting itself has to be
> hierarchical.


Initially I've actually considered such a design, however...

> It can be, for example, something along the lines of
> "the descendants are allowed up to this level of boosting" so that the
> hierarchy just serves to assign the appropriate boosting values to the
> groups of tasks.


... the current "single layer hierarchy" has been proposed instead for
two main reasons.

First, we were not able to think of realistic use-cases where we
need this "up to this level" semantic.
For boosting purposes, tasks are grouped based on their role and/or
importance in the system. This property is usually defined in
"absolute" terms instead of "relative" terms.
Does it make sense to say that task A can be boosted only up to as
much as task B is? In our experience probably never.

The second reason is mainly related to the possibility of having an
efficient and low-overhead implementation. The currently defined
semantic for CPU boosting requires performing certain operations at
each task enqueue and dequeue event. Some of these operations are
part of the hot path in the scheduler code. The flat hierarchy allows
using per-cpu data structures and algorithms which aim at being
efficient in reducing the overheads incurred in doing the required
accounting.

As a final remark, I would like to say that Google is currently using
SchedTune in Android to classify tasks by "importance" and feed this
information into the scheduler. Doing this exercise, so far we did not
spot limitations related to the usage of a flat hierarchy.

However, I would like to have this discussion, which is actually the
main goal of this RFC. My suggestion is just that we should think about
use-cases first and then introduce a more complex solution, but only
if we convince ourselves that it can bring more benefits than burdens in
code maintainability.

Is your request for a "proper support for hierarchy" somehow related to
the requirements for the "unified hierarchy"? Or do you see also other
more functional/semantic aspects?


> Thanks.


If you are going to attend LPC next week, I hope we can have a chat on
these topics.

Cheers Patrick

-- 
#include <best/regards.h>

Patrick Bellasi
Tejun Heo Oct. 27, 2016, 8:39 p.m. UTC | #3
Hello, Patrick.

On Thu, Oct 27, 2016 at 09:14:39PM +0100, Patrick Bellasi wrote:
> I'm also wondering how confusing and complex it can be to
> configure a system where you have non-overlapping groups of tasks with
> different bandwidth and boosting requirements.
>
> For example, let's assume we have three tasks: A, B, and C and we want:
>
>    Bandwidth:  10% for A and B,  20% for C
>    Boost:      10% for A,         0% for B and C
>
> IMO, configuring such a set of constraints would be quite complex if
> we expose the boost value through the cpu controller.


Going back to your use case point, when would we realistically need
this?

> > Note that hierarchy
> > support doesn't necessarily mean that boosting itself has to be
> > hierarchical.
>
> Initially I've actually considered such a design, however...
>
> > It can be, for example, something along the lines of
> > "the descendants are allowed up to this level of boosting" so that the
> > hierarchy just serves to assign the appropriate boosting values to the
> > groups of tasks.
>
> ... the current "single layer hierarchy" has been proposed instead for
> two main reasons.
>
> First, we were not able to think of realistic use-cases where we
> need this "up to this level" semantic.
> For boosting purposes, tasks are grouped based on their role and/or
> importance in the system. This property is usually defined in
> "absolute" terms instead of "relative" terms.
> Does it make sense to say that task A can be boosted only up to as
> much as task B is? In our experience probably never.


There are basic semantics that people expect when they use cgroup for
resource control and it enables things like layering and delegating
configuration.

> The second reason is mainly related to the possibility of having an
> efficient and low-overhead implementation. The currently defined
> semantic for CPU boosting requires performing certain operations at
> each task enqueue and dequeue event. Some of these operations are
> part of the hot path in the scheduler code. The flat hierarchy allows
> using per-cpu data structures and algorithms which aim at being
> efficient in reducing the overheads incurred in doing the required
> accounting.


Unless I'm misunderstanding, the actually applied attributes should be
calculable during config changes or task migration, right?  The
hierarchy's function would be allowing layering and delegating
configurations and shouldn't get in the way of actual enforcement.
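
One way to picture the split between configuration and enforcement is the
sketch below. It is not taken from the patch: `struct boost_node`,
`boost_node_update()` and `boost_node_set()` are hypothetical names, and
the "limit" semantic (a group can be boosted at most as much as any of
its ancestors allows) is only an assumption based on the "up to this
level" idea discussed in this thread. The effective value is cached at
configuration time, so a hot path would only ever read `boost_eff`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical group node; this is NOT the structure used by the patch. */
struct boost_node {
	struct boost_node *parent;	/* NULL for the root group */
	unsigned int boost_limit;	/* configured limit, 0..100 */
	unsigned int boost_eff;		/* cached effective boost, 0..100 */
};

/* Recompute the cached effective boost as the minimum over all ancestors. */
void boost_node_update(struct boost_node *n)
{
	const struct boost_node *p;
	unsigned int eff;

	assert(n != NULL);
	eff = n->boost_limit;
	for (p = n->parent; p; p = p->parent)
		if (p->boost_limit < eff)
			eff = p->boost_limit;
	n->boost_eff = eff;
}

/*
 * Configuration-time write: validate the value, store it, then refresh the
 * node and the descendants supplied by the caller. Nothing here runs at
 * enqueue/dequeue time.
 */
int boost_node_set(struct boost_node *n, unsigned int limit,
		   struct boost_node **descendants, size_t count)
{
	size_t i;

	if (limit > 100)
		return -1;
	n->boost_limit = limit;
	boost_node_update(n);
	for (i = 0; i < count; i++)
		boost_node_update(descendants[i]);
	return 0;
}
```

With a root at 100, a middle group limited to 30 and a leaf configured at
60, the leaf's cached `boost_eff` becomes 30; raising the middle group to
80 brings the leaf back to 60 without touching any per-task state.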

> As a final remark, I would like to say that Google is currently using
> SchedTune in Android to classify tasks by "importance" and feed this
> information into the scheduler. Doing this exercise, so far we did not
> spot limitations related to the usage of a flat hierarchy.
>
> However, I would like to have this discussion, which is actually the
> main goal of this RFC. My suggestion is just that we should think about
> use-cases first and then introduce a more complex solution, but only
> if we convince ourselves that it can bring more benefits than burdens in
> code maintainability.
>
> Is your request for a "proper support for hierarchy" somehow related to
> the requirements for the "unified hierarchy"? Or do you see also other
> more functional/semantic aspects?


Not necessarily.  In general, all controllers, whether on v1 or v2,
should be fully hierarchical for reasons mentioned above.  I get that
flat was fine for android but flat hierarchy would be fine for most
controllers for android.  That's not the only use case we should be
considering, right?

> > Thanks.
>
> If you are going to attend LPC next week, I hope we can have a chat on
> these topics.


Yeah, sure, I'll be around till Thursday.  Let's chat there.

Thanks.

-- 
tejun
Patrick Bellasi Oct. 27, 2016, 10:34 p.m. UTC | #4
On 27-Oct 16:39, Tejun Heo wrote:
> Hello, Patrick.
>
> On Thu, Oct 27, 2016 at 09:14:39PM +0100, Patrick Bellasi wrote:
> > I'm also wondering how confusing and complex it can be to
> > configure a system where you have non-overlapping groups of tasks with
> > different bandwidth and boosting requirements.
> >
> > For example, let's assume we have three tasks: A, B, and C and we want:
> >
> >    Bandwidth:  10% for A and B,  20% for C
> >    Boost:      10% for A,         0% for B and C
> >
> > IMO, configuring such a set of constraints would be quite complex if
> > we expose the boost value through the cpu controller.
>
> Going back to your use case point, when would we realistically need
> this?


If we really want to be generic, we cannot exclude that this kind of
scenario could occur. What this toy example aims to show is just that,
in general, how much we want to boost a task can be decoupled from the
bandwidth reservation it shares with others.

> > > Note that hierarchy
> > > support doesn't necessarily mean that boosting itself has to be
> > > hierarchical.
> >
> > Initially I've actually considered such a design, however...
> >
> > > It can be, for example, something along the lines of
> > > "the descendants are allowed up to this level of boosting" so that the
> > > hierarchy just serves to assign the appropriate boosting values to the
> > > groups of tasks.
> >
> > ... the current "single layer hierarchy" has been proposed instead for
> > two main reasons.
> >
> > First, we were not able to think of realistic use-cases where we
> > need this "up to this level" semantic.
> > For boosting purposes, tasks are grouped based on their role and/or
> > importance in the system. This property is usually defined in
> > "absolute" terms instead of "relative" terms.
> > Does it make sense to say that task A can be boosted only up to as
> > much as task B is? In our experience probably never.
>
> There are basic semantics that people expect when they use cgroup for
> resource control and it enables things like layering and delegating
> configuration.


I see your point and I understand it; still, I'm not completely
convinced that these concepts (i.e. layering and delegation) are
really required for the specific topic of "task classification" for
the purposes of energy-vs-performance tuning.

Perhaps this boils down to the fact that, for the specific needs of
task boosting, from the "Control Group" framework we are less
interested in the "Control" component than in the "Group" one.

> > The second reason is mainly related to the possibility of having an
> > efficient and low-overhead implementation. The currently defined
> > semantic for CPU boosting requires performing certain operations at
> > each task enqueue and dequeue event. Some of these operations are
> > part of the hot path in the scheduler code. The flat hierarchy allows
> > using per-cpu data structures and algorithms which aim at being
> > efficient in reducing the overheads incurred in doing the required
> > accounting.
>
> Unless I'm misunderstanding, the actually applied attributes should be
> calculable during config changes or task migration, right?


Perhaps you are missing enqueue/dequeue operations, which are not
necessarily due only to task migrations.

For example, the semantic exposed by SchedTune is such that if we have
two tasks RUNNABLE on a CPU:
  T1 30% boosted
  T2 60% boosted
then the CPU will be boosted 60%, while T2 is running, and boosted
only 30% as soon as T2 goes to sleep and only T1 is still runnable.
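
The T1/T2 case can be sketched with a per-CPU structure shaped like the
`struct boost_groups` added by this patch. The helpers below
(`cpu_boost_update()`, `cpu_boost_enqueue()`, `cpu_boost_dequeue()`) are
hypothetical names used only for illustration; the patch itself does not
add the enqueue/dequeue accounting yet. The sketch also assumes that a
group contributes to the CPU boost only while it has RUNNABLE tasks.

```c
#include <assert.h>

#define BOOSTGROUPS_MAX 5	/* assumption: mirrors the Kconfig default */

/* Same shape as struct boost_groups in the patch, for a single CPU. */
struct cpu_boost {
	unsigned int boost_max;		/* max boost over active groups */
	struct {
		unsigned int boost;	/* boost value of this group */
		unsigned int tasks;	/* RUNNABLE tasks of this group */
	} group[BOOSTGROUPS_MAX];
};

/* Recompute the maximum boost among groups with RUNNABLE tasks. */
void cpu_boost_update(struct cpu_boost *bg)
{
	unsigned int max = 0;
	int idx;

	for (idx = 0; idx < BOOSTGROUPS_MAX; idx++) {
		if (bg->group[idx].tasks == 0)
			continue;
		if (bg->group[idx].boost > max)
			max = bg->group[idx].boost;
	}
	bg->boost_max = max;
}

/* A task of boost group idx becomes RUNNABLE on this CPU. */
void cpu_boost_enqueue(struct cpu_boost *bg, int idx)
{
	assert(idx >= 0 && idx < BOOSTGROUPS_MAX);
	bg->group[idx].tasks++;
	cpu_boost_update(bg);
}

/* A task of boost group idx stops being RUNNABLE on this CPU. */
void cpu_boost_dequeue(struct cpu_boost *bg, int idx)
{
	assert(idx >= 0 && idx < BOOSTGROUPS_MAX);
	assert(bg->group[idx].tasks > 0);
	bg->group[idx].tasks--;
	cpu_boost_update(bg);
}
```

With group 1 at 30% and group 2 at 60%, enqueueing one task from each
leaves `boost_max` at 60; dequeueing the group 2 task drops it to 30.
Because the number of groups is bounded, the recomputation is a short,
fixed-length scan.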

> The
> hierarchy's function would be allowing layering and delegating
> configurations and shouldn't get in the way of actual enforcement.


Ok, I should think more about this distinction between
layering/delegation and control enforcement.
 
> > As a final remark, I would like to say that Google is currently using
> > SchedTune in Android to classify tasks by "importance" and feed this
> > information into the scheduler. Doing this exercise, so far we did not
> > spot limitations related to the usage of a flat hierarchy.
> >
> > However, I would like to have this discussion, which is actually the
> > main goal of this RFC. My suggestion is just that we should think about
> > use-cases first and then introduce a more complex solution, but only
> > if we convince ourselves that it can bring more benefits than burdens in
> > code maintainability.
> >
> > Is your request for a "proper support for hierarchy" somehow related to
> > the requirements for the "unified hierarchy"? Or do you see also other
> > more functional/semantic aspects?
>
> Not necessarily.  In general, all controllers, whether on v1 or v2,
> should be fully hierarchical for reasons mentioned above.  I get that
> flat was fine for android but flat hierarchy would be fine for most
> controllers for android.  That's not the only use case we should be
> considering, right?


So far we have had experience mainly with Android and ChromeOS, which
expose a valuable and quite interesting set of realistic use-cases,
especially if we consider the possibility of collecting task
information from "informed runtimes".

However, I absolutely agree with you, it's worth considering all the
other use-cases we can think of.

> > > Thanks.
> >
> > If you are going to attend LPC next week, I hope we can have a chat on
> > these topics.
>
> Yeah, sure, I'll be around till Thursday.  Let's chat there.


Cool, thanks!

Cheers Patrick

-- 
#include <best/regards.h>

Patrick Bellasi
Patch

diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 0df0336a..4fd0f82 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -20,6 +20,10 @@  SUBSYS(cpu)
 SUBSYS(cpuacct)
 #endif
 
+#if IS_ENABLED(CONFIG_CGROUP_SCHED_TUNE)
+SUBSYS(schedtune)
+#endif
+
 #if IS_ENABLED(CONFIG_BLK_CGROUP)
 SUBSYS(io)
 #endif
diff --git a/init/Kconfig b/init/Kconfig
index 461e052..5bce1ef 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1074,6 +1074,48 @@  config RT_GROUP_SCHED
 
 endif #CGROUP_SCHED
 
+config CGROUP_SCHED_TUNE
+	bool "Tasks boosting controller"
+	depends on SCHED_TUNE
+	help
+	  This option allows users to define boost values for groups of
+	  SCHED_OTHER tasks. Once enabled, the utilization of a CPU is boosted
+	  by a factor proportional to the maximum boost value of all the tasks
+	  RUNNABLE on that CPU.
+
+	  This new controller:
+	  1. allows only a two-layer hierarchy, where the root defines the
+	     system-wide boost value and its children define a
+	     "boost group" whose tasks will be boosted with the configured
+	     value.
+	  2. supports only a limited number of different boost groups, each
+	     of which can be configured with a different boost value.
+
+	  Say N if unsure.
+
+config SCHED_TUNE_BOOSTGROUPS
+	int "Maximum number of SchedTune's boost groups"
+	range 2 16
+	default 5
+	depends on CGROUP_SCHED_TUNE
+
+	help
+	  When per-task boosting is used we still allow only a limited number of
+	  boost groups for two main reasons:
+	  1. on a real system we usually have only a few classes of workloads which
+	     make sense to boost with different values,
+	     e.g. background vs foreground tasks, interactive vs low-priority tasks
+	  2. a limited number allows for a simpler and more memory/time efficient
+	     implementation especially for the computation of the per-CPU boost
+	     value
+
+	  NOTE: The first boost group is reserved to define the global boosting to be
+	  applied to all tasks, thus the minimum number of boost groups is 2.
+	  Indeed, if only global boosting is required then per-task boosting is
+	  not required and this support can be disabled.
+
+	  Use the default value (5) if unsure.
+
 config CGROUP_PIDS
 	bool "PIDs controller"
 	help
diff --git a/kernel/sched/tune.c b/kernel/sched/tune.c
index c28a06f..4eaea1d 100644
--- a/kernel/sched/tune.c
+++ b/kernel/sched/tune.c
@@ -4,11 +4,239 @@ 
  * Copyright (C) 2016 ARM Ltd, Patrick Bellasi <patrick.bellasi@arm.com>
  */
 
+#include <linux/cgroup.h>
+#include <linux/err.h>
+#include <linux/percpu.h>
+#include <linux/slab.h>
+
 #include "sched.h"
 #include "tune.h"
 
 unsigned int sysctl_sched_cfs_boost __read_mostly;
 
+#ifdef CONFIG_CGROUP_SCHED_TUNE
+
+/*
+ * CFS Scheduler Tunables for Task Groups.
+ */
+
+/* SchedTune tunables for a group of tasks */
+struct schedtune {
+	/* SchedTune CGroup subsystem */
+	struct cgroup_subsys_state css;
+
+	/* Boost group allocated ID */
+	int idx;
+
+	/* Boost value for tasks on that SchedTune CGroup */
+	unsigned int boost;
+
+};
+
+static inline struct schedtune *css_st(struct cgroup_subsys_state *css)
+{
+	return css ? container_of(css, struct schedtune, css) : NULL;
+}
+
+static inline struct schedtune *task_schedtune(struct task_struct *tsk)
+{
+	return css_st(task_css(tsk, schedtune_cgrp_id));
+}
+
+static inline struct schedtune *parent_st(struct schedtune *st)
+{
+	return css_st(st->css.parent);
+}
+
+/*
+ * SchedTune root control group
+ * The root control group is used to define a system-wide boosting tuning,
+ * which is applied to all tasks in the system.
+ * Task specific boost tuning could be specified by creating and
+ * configuring a child control group under the root one.
+ * By default, system-wide boosting is disabled, i.e. no boosting is applied
+ * to tasks which are not in a child control group.
+ */
+static struct schedtune
+root_schedtune = {
+	.boost	= 0,
+};
+
+/*
+ * Maximum number of boost groups to support
+ * When per-task boosting is used we still allow only a limited number of
+ * boost groups for two main reasons:
+ * 1. on a real system we usually have only a few classes of workloads which
+ *    make sense to boost with different values (e.g. background vs foreground
+ *    tasks, interactive vs low-priority tasks)
+ * 2. a limited number allows for a simpler and more memory/time efficient
+ *    implementation especially for the computation of the per-CPU boost
+ *    value
+ */
+#define boostgroups_max CONFIG_SCHED_TUNE_BOOSTGROUPS
+
+/* Array of configured boostgroups */
+static struct schedtune *allocated_group[boostgroups_max] = {
+	&root_schedtune,
+	NULL,
+};
+
+/*
+ * SchedTune boost groups
+ * Keep track of all the boost groups which impact a CPU, for example when
+ * a CPU has two RUNNABLE tasks belonging to two different boost groups and
+ * thus likely with different boost values. Since the maximum number of
+ * boost groups is limited by CONFIG_SCHED_TUNE_BOOSTGROUPS, which is
+ * limited to 16, we use a simple array to keep track of the metrics
+ * required to compute the maximum per-CPU boosting value.
+ */
+struct boost_groups {
+	/* Maximum boost value for all RUNNABLE tasks on a CPU */
+	unsigned int boost_max;
+	struct {
+		/* The boost for tasks on that boost group */
+		unsigned int boost;
+		/* Count of RUNNABLE tasks on that boost group */
+		unsigned int tasks;
+	} group[boostgroups_max];
+};
+
+/* Boost groups affecting each CPU in the system */
+DEFINE_PER_CPU(struct boost_groups, cpu_boost_groups);
+
+static u64
+boost_read(struct cgroup_subsys_state *css, struct cftype *cft)
+{
+	struct schedtune *st = css_st(css);
+
+	return st->boost;
+}
+
+static int
+boost_write(struct cgroup_subsys_state *css, struct cftype *cft,
+	    u64 boost)
+{
+	struct schedtune *st = css_st(css);
+
+	if (boost > 100)
+		return -EINVAL;
+	st->boost = boost;
+	if (css == &root_schedtune.css)
+		sysctl_sched_cfs_boost = boost;
+	return 0;
+}
+
+static struct cftype files[] = {
+	{
+		.name = "boost",
+		.read_u64 = boost_read,
+		.write_u64 = boost_write,
+	},
+	{ }	/* terminate */
+};
+
+static int
+schedtune_boostgroup_init(struct schedtune *st)
+{
+	struct boost_groups *bg;
+	int cpu;
+
+	/* Keep track of allocated boost groups */
+	allocated_group[st->idx] = st;
+
+	/* Initialize the per CPU boost groups */
+	for_each_possible_cpu(cpu) {
+		bg = &per_cpu(cpu_boost_groups, cpu);
+		bg->group[st->idx].boost = 0;
+		bg->group[st->idx].tasks = 0;
+	}
+
+	return 0;
+}
+
+static struct cgroup_subsys_state *
+schedtune_css_alloc(struct cgroup_subsys_state *parent_css)
+{
+	struct schedtune *st;
+	int idx;
+
+	if (!parent_css)
+		return &root_schedtune.css;
+
+	/* Allow only single level hierarchies */
+	if (parent_css != &root_schedtune.css) {
+		pr_err("Nested SchedTune boosting groups not allowed\n");
+		return ERR_PTR(-EINVAL);
+	}
+
+	/* Allow only a limited number of boosting groups */
+	for (idx = 1; idx < boostgroups_max; ++idx)
+		if (!allocated_group[idx])
+			break;
+	if (idx == boostgroups_max) {
+		pr_err("Trying to create more than %d SchedTune boosting groups\n",
+		       boostgroups_max);
+		return ERR_PTR(-ENOSPC);
+	}
+
+	st = kzalloc(sizeof(*st), GFP_KERNEL);
+	if (!st)
+		goto out;
+
+	/* Initialize per-CPU boost group support */
+	st->idx = idx;
+	if (schedtune_boostgroup_init(st))
+		goto release;
+
+	return &st->css;
+
+release:
+	kfree(st);
+out:
+	return ERR_PTR(-ENOMEM);
+}
+
+static void
+schedtune_boostgroup_release(struct schedtune *st)
+{
+	/* Keep track of allocated boost groups */
+	allocated_group[st->idx] = NULL;
+}
+
+static void
+schedtune_css_free(struct cgroup_subsys_state *css)
+{
+	struct schedtune *st = css_st(css);
+
+	schedtune_boostgroup_release(st);
+	kfree(st);
+}
+
+struct cgroup_subsys schedtune_cgrp_subsys = {
+	.css_alloc	= schedtune_css_alloc,
+	.css_free	= schedtune_css_free,
+	.legacy_cftypes	= files,
+	.early_init	= 1,
+};
+
+static inline void
+schedtune_init_cgroups(void)
+{
+	struct boost_groups *bg;
+	int cpu;
+
+	/* Initialize the per CPU boost groups */
+	for_each_possible_cpu(cpu) {
+		bg = &per_cpu(cpu_boost_groups, cpu);
+		memset(bg, 0, sizeof(struct boost_groups));
+	}
+
+	pr_info("schedtune: configured to support %d boost groups\n",
+		boostgroups_max);
+}
+
+#endif /* CONFIG_CGROUP_SCHED_TUNE */
+
 int
 sysctl_sched_cfs_boost_handler(struct ctl_table *table, int write,
 			       void __user *buffer, size_t *lenp,
@@ -26,6 +254,11 @@  static int
 schedtune_init(void)
 {
 	schedtune_spc_rdiv = reciprocal_value(100);
+#ifdef CONFIG_CGROUP_SCHED_TUNE
+	schedtune_init_cgroups();
+#else
+	pr_info("schedtune: configured to support global boosting only\n");
+#endif
 	return 0;
 }
 late_initcall(schedtune_init);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 43b6d14..12c3432 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -447,7 +447,11 @@  static struct ctl_table kern_table[] = {
 		.procname	= "sched_cfs_boost",
 		.data		= &sysctl_sched_cfs_boost,
 		.maxlen		= sizeof(sysctl_sched_cfs_boost),
+#ifdef CONFIG_CGROUP_SCHED_TUNE
+		.mode		= 0444,
+#else
 		.mode		= 0644,
+#endif
 		.proc_handler	= &sysctl_sched_cfs_boost_handler,
 		.extra1		= &zero,
 		.extra2		= &one_hundred,
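
A possible user-space view of the new attribute, as a minimal sketch.
The controller is registered as "schedtune" and exposes a legacy cftype
named "boost", so on a cgroup v1 mount the file would normally appear as
`schedtune.boost`; the mount point and group name used in the usage note
are assumptions, and `schedtune_write_boost()` is a hypothetical helper
that is not part of the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/*
 * Write a boost percentage to a cgroup attribute file.
 * Returns 0 on success or a negative errno-style value.
 */
int schedtune_write_boost(const char *path, unsigned int boost)
{
	FILE *f;
	int ret = 0;

	assert(path != NULL);

	/* Same range check that boost_write() applies in the patch. */
	if (boost > 100)
		return -EINVAL;

	f = fopen(path, "w");
	if (!f)
		return -errno;
	if (fprintf(f, "%u\n", boost) < 0)
		ret = -EIO;
	if (fclose(f) != 0 && ret == 0)
		ret = -errno;
	return ret;
}
```

For example, `schedtune_write_boost("/sys/fs/cgroup/schedtune/foreground/schedtune.boost", 40)`
would request a 40% boost for a hypothetical "foreground" group. With
CONFIG_CGROUP_SCHED_TUNE enabled, the `sched_cfs_boost` sysctl becomes
read-only (mode 0444) and the system-wide value is set through the root
group's boost file instead.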