
[RFC,v3,1/5] sched/core: add capacity constraints to CPU controller

Message ID: 1488292722-19410-2-git-send-email-patrick.bellasi@arm.com
State: New
Series: Add capacity capping support to the CPU controller

Commit Message

Patrick Bellasi Feb. 28, 2017, 2:38 p.m. UTC
The CPU CGroup controller allows a specified (maximum) bandwidth to be
assigned to the tasks within a group; however, it does not enforce any
constraint on how such bandwidth can be consumed.
With the integration of schedutil, the scheduler now has the proper
information about a task to select the most suitable frequency to
satisfy the task's needs.

This patch extends the CPU controller by adding a couple of new
attributes, capacity_min and capacity_max, which can be used to enforce
bandwidth boosting and capping. More specifically:

- capacity_min: defines the minimum capacity which should be granted
                (by schedutil) when a task in this group is running,
                i.e. the task will run at least at that capacity

- capacity_max: defines the maximum capacity which can be granted
                (by schedutil) when a task in this group is running,
                i.e. the task can run up to that capacity

These attributes:
a) are tunable at all hierarchy levels, i.e. in the root group too
b) allow the creation of subgroups of tasks which do not violate the
   capacity constraints defined by the parent group.
   Thus, tasks in a subgroup can only be more boosted and/or more
   capped, which matches the "limits" schema of the
   "Resource Distribution Model (RDM)" described in the
   cgroups v2 documentation (Documentation/cgroup-v2.txt)

This patch provides the basic support to expose the two new attributes
and to validate their run-time updates against the "limits" schema of
the RDM.
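
As an illustration only (this is not part of this patch: later patches in
this series integrate the clamps with schedutil, and the helper below uses
hypothetical names), the intended effect on frequency selection is simply a
clamp of the utilization signal before it is turned into a frequency request:

/*
 * Illustrative sketch, not part of this patch: apply a group's capacity
 * clamps to a utilization value before converting it into a frequency
 * request (headroom margins omitted for simplicity).
 */
static unsigned long freq_for_util(unsigned long util, unsigned long max_freq,
				   unsigned int cap_min, unsigned int cap_max)
{
	/* Boost the request up to cap_min, cap it at cap_max */
	util = clamp_t(unsigned long, util, cap_min, cap_max);

	/* Map the clamped capacity onto a frequency */
	return max_freq * util / SCHED_CAPACITY_SCALE;
}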

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: linux-kernel@vger.kernel.org
---
 init/Kconfig         |  17 ++++++
 kernel/sched/core.c  | 145 +++++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h |   8 +++
 3 files changed, 170 insertions(+)

-- 
2.7.4

Comments

Patrick Bellasi March 15, 2017, 11:20 a.m. UTC | #1
On 13-Mar 03:46, Joel Fernandes (Google) wrote:
> On Tue, Feb 28, 2017 at 6:38 AM, Patrick Bellasi
> <patrick.bellasi@arm.com> wrote:
> > The CPU CGroup controller allows to assign a specified (maximum)
> > bandwidth to tasks within a group, however it does not enforce any
> > constraint on how such bandwidth can be consumed.
> > With the integration of schedutil, the scheduler has now the proper
> > information about a task to select  the most suitable frequency to
> > satisfy tasks needs.
> [..]
>
> > +static u64 cpu_capacity_min_read_u64(struct cgroup_subsys_state *css,
> > +                                    struct cftype *cft)
> > +{
> > +       struct task_group *tg;
> > +       u64 min_capacity;
> > +
> > +       rcu_read_lock();
> > +       tg = css_tg(css);
> > +       min_capacity = tg->cap_clamp[CAP_CLAMP_MIN];
>
> Shouldn't the cap_clamp be accessed with READ_ONCE (and WRITE_ONCE in
> the write path) to avoid load-tearing?

tg->cap_clamp is an "unsigned int", and thus I would expect a single
memory access to write/read it, wouldn't I? I mean: I do not expect the
compiler "to mess" with these accesses.

However, if your concern is more about overlapping reads/writes of the
same capacity from different threads, then perhaps we should use a
mutex to serialize these two functions... not entirely convinced...

> Thanks,
> Joel

-- 
#include <best/regards.h>

Patrick Bellasi
Joel Fernandes March 15, 2017, 1:20 p.m. UTC | #2
On Wed, Mar 15, 2017 at 4:20 AM, Patrick Bellasi
<patrick.bellasi@arm.com> wrote:
> On 13-Mar 03:46, Joel Fernandes (Google) wrote:
>> On Tue, Feb 28, 2017 at 6:38 AM, Patrick Bellasi
>> <patrick.bellasi@arm.com> wrote:
>> > The CPU CGroup controller allows to assign a specified (maximum)
>> > bandwidth to tasks within a group, however it does not enforce any
>> > constraint on how such bandwidth can be consumed.
>> > With the integration of schedutil, the scheduler has now the proper
>> > information about a task to select  the most suitable frequency to
>> > satisfy tasks needs.
>> [..]
>>
>> > +static u64 cpu_capacity_min_read_u64(struct cgroup_subsys_state *css,
>> > +                                    struct cftype *cft)
>> > +{
>> > +       struct task_group *tg;
>> > +       u64 min_capacity;
>> > +
>> > +       rcu_read_lock();
>> > +       tg = css_tg(css);
>> > +       min_capacity = tg->cap_clamp[CAP_CLAMP_MIN];
>>
>> Shouldn't the cap_clamp be accessed with READ_ONCE (and WRITE_ONCE in
>> the write path) to avoid load-tearing?
>
> tg->cap_clamp is an "unsigned int" and thus I would expect a single
> memory access to write/read it, isn't it? I mean: I do not expect the
> compiler "to mess" with these accesses.

This depends on the compiler and arch. I'm not sure whether it's an
issue in practice these days, but see the section on 'load tearing' in
Documentation/memory-barriers.txt. If the compiler decides to break
the access down into multiple accesses for some reason, it might be a
problem.

Adding Paul for his expert opinion on the matter ;)

Thanks,
Joel
Paul E. McKenney March 15, 2017, 4:10 p.m. UTC | #3
On Wed, Mar 15, 2017 at 06:20:28AM -0700, Joel Fernandes wrote:
> On Wed, Mar 15, 2017 at 4:20 AM, Patrick Bellasi
> <patrick.bellasi@arm.com> wrote:
> > On 13-Mar 03:46, Joel Fernandes (Google) wrote:
> >> On Tue, Feb 28, 2017 at 6:38 AM, Patrick Bellasi
> >> <patrick.bellasi@arm.com> wrote:
> >> > The CPU CGroup controller allows to assign a specified (maximum)
> >> > bandwidth to tasks within a group, however it does not enforce any
> >> > constraint on how such bandwidth can be consumed.
> >> > With the integration of schedutil, the scheduler has now the proper
> >> > information about a task to select  the most suitable frequency to
> >> > satisfy tasks needs.
> >> [..]
> >>
> >> > +static u64 cpu_capacity_min_read_u64(struct cgroup_subsys_state *css,
> >> > +                                    struct cftype *cft)
> >> > +{
> >> > +       struct task_group *tg;
> >> > +       u64 min_capacity;
> >> > +
> >> > +       rcu_read_lock();
> >> > +       tg = css_tg(css);
> >> > +       min_capacity = tg->cap_clamp[CAP_CLAMP_MIN];
> >>
> >> Shouldn't the cap_clamp be accessed with READ_ONCE (and WRITE_ONCE in
> >> the write path) to avoid load-tearing?
> >
> > tg->cap_clamp is an "unsigned int" and thus I would expect a single
> > memory access to write/read it, isn't it? I mean: I do not expect the
> > compiler "to mess" with these accesses.
>
> This depends on compiler and arch. I'm not sure if its in practice
> these days an issue, but see section on 'load tearing' in
> Documentation/memory-barriers.txt . If compiler decided to break down
> the access to multiple accesses due to some reason, then might be a
> problem.

The compiler might also be able to inline cpu_capacity_min_read_u64() and
fuse the load from tg->cap_clamp[CAP_CLAMP_MIN] with other accesses.
If min_capacity is used several times in the ensuing code, the compiler
could reload multiple times from tg->cap_clamp[CAP_CLAMP_MIN], which at
best might be a bit confusing.

> Adding Paul for his expert opinion on the matter ;)


My personal approach is to use READ_ONCE() and WRITE_ONCE() unless
I can absolutely prove that the compiler cannot do any destructive
optimizations.  And I not-infrequently find unsuspected opportunities
for destructive optimization in my own code.  Your mileage may vary.  ;-)

							Thanx, Paul
Patrick Bellasi March 15, 2017, 4:44 p.m. UTC | #4
On 15-Mar 09:10, Paul E. McKenney wrote:
> On Wed, Mar 15, 2017 at 06:20:28AM -0700, Joel Fernandes wrote:
> > On Wed, Mar 15, 2017 at 4:20 AM, Patrick Bellasi
> > <patrick.bellasi@arm.com> wrote:
> > > On 13-Mar 03:46, Joel Fernandes (Google) wrote:
> > >> On Tue, Feb 28, 2017 at 6:38 AM, Patrick Bellasi
> > >> <patrick.bellasi@arm.com> wrote:
> > >> > The CPU CGroup controller allows to assign a specified (maximum)
> > >> > bandwidth to tasks within a group, however it does not enforce any
> > >> > constraint on how such bandwidth can be consumed.
> > >> > With the integration of schedutil, the scheduler has now the proper
> > >> > information about a task to select  the most suitable frequency to
> > >> > satisfy tasks needs.
> > >> [..]
> > >>
> > >> > +static u64 cpu_capacity_min_read_u64(struct cgroup_subsys_state *css,
> > >> > +                                    struct cftype *cft)
> > >> > +{
> > >> > +       struct task_group *tg;
> > >> > +       u64 min_capacity;
> > >> > +
> > >> > +       rcu_read_lock();
> > >> > +       tg = css_tg(css);
> > >> > +       min_capacity = tg->cap_clamp[CAP_CLAMP_MIN];
> > >>
> > >> Shouldn't the cap_clamp be accessed with READ_ONCE (and WRITE_ONCE in
> > >> the write path) to avoid load-tearing?
> > >
> > > tg->cap_clamp is an "unsigned int" and thus I would expect a single
> > > memory access to write/read it, isn't it? I mean: I do not expect the
> > > compiler "to mess" with these accesses.
> >
> > This depends on compiler and arch. I'm not sure if its in practice
> > these days an issue, but see section on 'load tearing' in
> > Documentation/memory-barriers.txt . If compiler decided to break down
> > the access to multiple accesses due to some reason, then might be a
> > problem.
>
> The compiler might also be able to inline cpu_capacity_min_read_u64()
> fuse the load from tg->cap_clamp[CAP_CLAMP_MIN] with other accesses.
> If min_capacity is used several times in the ensuing code, the compiler
> could reload multiple times from tg->cap_clamp[CAP_CLAMP_MIN], which at
> best might be a bit confusing.

That's actually an interesting case; however, I don't think it applies
here, since cpu_capacity_min_read_u64() is called only via a function
pointer and thus it will never be inlined, will it?

> > Adding Paul for his expert opinion on the matter ;)
>
> My personal approach is to use READ_ONCE() and WRITE_ONCE() unless
> I can absolutely prove that the compiler cannot do any destructive
> optimizations.  And I not-infrequently find unsuspected opportunities
> for destructive optimization in my own code.  Your mileage may vary.  ;-)

I guess here the main concern from Joel is that such a pattern:

   u64 var = unsigned_int_value_from_memory;

can result in a couple of "load from memory" operations.

In that case, a similar:

  unsigned_int_left_value = new_unsigned_int_value;

executed on a different thread can overlap with the previous memory
read operations, ending up with "var" containing an inconsistent
value.

The question is: can this really happen, given the data types in use?


> 							Thanx, Paul


Thanks! ;-)

-- 
#include <best/regards.h>

Patrick Bellasi
Paul E. McKenney March 15, 2017, 5:24 p.m. UTC | #5
On Wed, Mar 15, 2017 at 04:44:39PM +0000, Patrick Bellasi wrote:
> On 15-Mar 09:10, Paul E. McKenney wrote:
> > On Wed, Mar 15, 2017 at 06:20:28AM -0700, Joel Fernandes wrote:
> > > On Wed, Mar 15, 2017 at 4:20 AM, Patrick Bellasi
> > > <patrick.bellasi@arm.com> wrote:
> > > > On 13-Mar 03:46, Joel Fernandes (Google) wrote:
> > > >> On Tue, Feb 28, 2017 at 6:38 AM, Patrick Bellasi
> > > >> <patrick.bellasi@arm.com> wrote:
> > > >> > The CPU CGroup controller allows to assign a specified (maximum)
> > > >> > bandwidth to tasks within a group, however it does not enforce any
> > > >> > constraint on how such bandwidth can be consumed.
> > > >> > With the integration of schedutil, the scheduler has now the proper
> > > >> > information about a task to select  the most suitable frequency to
> > > >> > satisfy tasks needs.
> > > >> [..]
> > > >>
> > > >> > +static u64 cpu_capacity_min_read_u64(struct cgroup_subsys_state *css,
> > > >> > +                                    struct cftype *cft)
> > > >> > +{
> > > >> > +       struct task_group *tg;
> > > >> > +       u64 min_capacity;
> > > >> > +
> > > >> > +       rcu_read_lock();
> > > >> > +       tg = css_tg(css);
> > > >> > +       min_capacity = tg->cap_clamp[CAP_CLAMP_MIN];
> > > >>
> > > >> Shouldn't the cap_clamp be accessed with READ_ONCE (and WRITE_ONCE in
> > > >> the write path) to avoid load-tearing?
> > > >
> > > > tg->cap_clamp is an "unsigned int" and thus I would expect a single
> > > > memory access to write/read it, isn't it? I mean: I do not expect the
> > > > compiler "to mess" with these accesses.
> > >
> > > This depends on compiler and arch. I'm not sure if its in practice
> > > these days an issue, but see section on 'load tearing' in
> > > Documentation/memory-barriers.txt . If compiler decided to break down
> > > the access to multiple accesses due to some reason, then might be a
> > > problem.
> >
> > The compiler might also be able to inline cpu_capacity_min_read_u64()
> > fuse the load from tg->cap_clamp[CAP_CLAMP_MIN] with other accesses.
> > If min_capacity is used several times in the ensuing code, the compiler
> > could reload multiple times from tg->cap_clamp[CAP_CLAMP_MIN], which at
> > best might be a bit confusing.
>
> That's actually an interesting case, however I don't think it applies
> in this case since cpu_capacity_min_read_u64() is called only via
> a function poninter and thus it will never be inlined. isn't it?
>
> > > Adding Paul for his expert opinion on the matter ;)
> >
> > My personal approach is to use READ_ONCE() and WRITE_ONCE() unless
> > I can absolutely prove that the compiler cannot do any destructive
> > optimizations.  And I not-infrequently find unsuspected opportunities
> > for destructive optimization in my own code.  Your mileage may vary.  ;-)
>
> I guess here the main concern from Joel is that such a pattern:
>
>    u64 var = unsigned_int_value_from_memory;
>
> can result is a couple of "load from memory" operations.

Indeed it can.  I first learned this the hard way in the early 1990s,
so 20-year-old compiler optimizations are quite capable of making this
sort of thing happen.

> In that case a similar:
>
>   unsigned_int_left_value = new_unsigned_int_value;
>
> executed on a different thread can overlap with the previous memory
> read operations and ending up in "var" containing a not consistent
> value.
>
> Question is: can this really happen, given the data types in use?

So we have an updater changing the value of unsigned_int_left_value,
while readers in other threads are accessing it, correct?  And you
are asking whether the compiler can optimize the updater so as to
mess up the readers, right?

One such optimization would be a byte-wise write, though I have no
idea why a compiler would do such a thing assuming that the variable
is reasonably sized and aligned.  Another is that the compiler could
use the variable as temporary storage just before the assignment.
(You haven't told the compiler that anyone else is reading it, though
I don't know of this being done by production compilers.)  A third is
that the compiler could fuse successive stores, which might or might
not be a problem, depending.

Probably more, but that should be a start.  ;-)
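
Applied to this patch, the write side of the earlier READ_ONCE() sketch
would then publish the new clamp value with a marked store, e.g. in
cpu_capacity_min_write_u64() (again a sketch of the suggestion, not the
posted code):

	/*
	 * Single, full-width store: the compiler may not tear it, fuse it
	 * with other stores, or reuse the location as scratch space.
	 */
	WRITE_ONCE(tg->cap_clamp[CAP_CLAMP_MIN], min_value);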

							Thanx, Paul

> Thanks! ;-)
>
> --
> #include <best/regards.h>
>
> Patrick Bellasi
>
Patrick Bellasi March 20, 2017, 6:08 p.m. UTC | #6
On 20-Mar 13:15, Tejun Heo wrote:
> Hello,
>
> On Tue, Feb 28, 2017 at 02:38:38PM +0000, Patrick Bellasi wrote:
> > This patch extends the CPU controller by adding a couple of new
> > attributes, capacity_min and capacity_max, which can be used to enforce
> > bandwidth boosting and capping. More specifically:
> >
> > - capacity_min: defines the minimum capacity which should be granted
> >                 (by schedutil) when a task in this group is running,
> >                 i.e. the task will run at least at that capacity
> >
> > - capacity_max: defines the maximum capacity which can be granted
> >                 (by schedutil) when a task in this group is running,
> >                 i.e. the task can run up to that capacity
>
> cpu.capacity.min and cpu.capacity.max are the more conventional names.

Ok, should be an easy renaming.

> I'm not sure about the name capacity as it doesn't encode what it does
> and is difficult to tell apart from cpu bandwidth limits.  I think
> it'd be better to represent what it controls more explicitly.

In scheduler jargon, capacity represents the amount of computation that
a CPU can provide; it is usually defined to be 1024 for the biggest CPU
(on non-SMP systems) running at the highest OPP (i.e. the maximum
frequency).

It's true that it kind of overlaps with the concept of "bandwidth".
However, the main difference here is that "bandwidth" is not frequency
(and architecture) scaled.
Thus, for example, assuming we have only one CPU with these two OPPs:

   OPP | Frequency | Capacity
     1 |    500MHz |      512
     2 |      1GHz |     1024

a task running 60% of the time on that CPU, while the CPU is configured
to run at 500MHz, is using 60% of the bandwidth from the bandwidth
standpoint but only 30% of the available capacity from the capacity
standpoint.

IOW, bandwidth is purely time based, while capacity factors in both
frequency and architectural differences.
Thus, while a "bandwidth" constraint limits the amount of time a task
can use a CPU, independently of the "actual computation" performed,
with the new "capacity" constraints we can constrain how much "actual
computation" a task can perform per "unit of time".

> > These attributes:
> > a) are tunable at all hierarchy levels, i.e. root group too
>
> This usually is problematic because there should be a non-cgroup way
> of configuring the feature in case cgroup isn't configured or used,
> and it becomes awkward to have two separate mechanisms configuring the
> same thing.  Maybe the feature is cgroup specific enough that it makes
> sense here but this needs more explanation / justification.

In the previous proposal I exposed global tunables under procfs, e.g.:

 /proc/sys/kernel/sched_capacity_min
 /proc/sys/kernel/sched_capacity_max

which can be used to define tunable root constraints when CGroups are
not available, and become read-only when CGroups are.

Could this eventually be an acceptable option?
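
For completeness, such knobs would presumably be wired up through a standard
sysctl table; a rough sketch under that assumption (not code from this
series, and the variable names are hypothetical):

static int sysctl_sched_capacity_min;
static int sysctl_sched_capacity_max = SCHED_CAPACITY_SCALE;

static int capacity_min_limit;	/* 0 */
static int capacity_max_limit = SCHED_CAPACITY_SCALE;

/* To be registered under "kernel/", e.g. via register_sysctl() */
static struct ctl_table sched_capacity_sysctls[] = {
	{
		.procname	= "sched_capacity_min",
		.data		= &sysctl_sched_capacity_min,
		.maxlen		= sizeof(int),
		.mode		= 0644,
		.proc_handler	= proc_dointvec_minmax,
		.extra1		= &capacity_min_limit,
		.extra2		= &capacity_max_limit,
	},
	{
		.procname	= "sched_capacity_max",
		.data		= &sysctl_sched_capacity_max,
		.maxlen		= sizeof(int),
		.mode		= 0644,
		.proc_handler	= proc_dointvec_minmax,
		.extra1		= &capacity_min_limit,
		.extra2		= &capacity_max_limit,
	},
	{ }
};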

In any case I think that this feature will be mainly targeting CGroup
based systems. Indeed, one of the main goals is to collect
"application specific" information from "informed run-times". Being
"application specific" means that we need a way to classify
applications depending on the runtime context... and that capability
in Linux is ultimately provided via the CGroup interface.

> > b) allow to create subgroups of tasks which are not violating the
> >    capacity constraints defined by the parent group.
> >    Thus, tasks on a subgroup can only be more boosted and/or more
>
> For both limits and protections, the parent caps the maximum the
> children can get.  At least that's what memcg does for memory.low.
> Doing that makes sense for memcg because for memory the parent can
> still do protections regardless of what its children are doing and it
> makes delegation safe by default.

Just to be clearer, the current proposal enforces:

- capacity_max_child <= capacity_max_parent

  Since, if a task is constrained to get only up to a certain amount
  of capacity, then its children cannot use more than that... eventually
  they can only be further constrained.

- capacity_min_child >= capacity_min_parent

  Since, if a task has been boosted to run at least that fast, then
  its children cannot be constrained to go slower without eventually
  impacting the parent's performance.
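
From user space, with the attribute names used by this RFC (and assuming
the usual cgroup-v1 mount point, which may differ on a given system), the
resulting behaviour would look like the sketch below: a child created after
the parent has been configured inherits the parent's clamps and can only
tighten them.

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

#define CPU_CG "/sys/fs/cgroup/cpu"	/* assumed mount point */

static int write_attr(const char *path, const char *val)
{
	int fd = open(path, O_WRONLY);
	int ret = 0;

	if (fd < 0)
		return -errno;
	if (write(fd, val, strlen(val)) < 0)
		ret = -errno;
	close(fd);
	return ret;
}

int main(void)
{
	/* Parent group boosted to at least half of the maximum capacity */
	mkdir(CPU_CG "/app", 0755);
	write_attr(CPU_CG "/app/cpu.capacity_min", "512");

	/* A child created now inherits the parent's clamps... */
	mkdir(CPU_CG "/app/fg", 0755);

	/* ...and may only boost further: raising the minimum is accepted */
	write_attr(CPU_CG "/app/fg/cpu.capacity_min", "768");

	/* Relaxing below the parent's minimum violates the "limits" schema */
	if (write_attr(CPU_CG "/app/fg/cpu.capacity_min", "256") == -EINVAL)
		puts("capacity_min below the parent's one: rejected");

	return 0;
}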

> I understand why you would want a property like capacity to be the
> other direction as that way you get more specific as you walk down the
> tree for both limits and protections;

Right, the protection schema is defined in such a way to never affect
parent constraints.

> however, I think we need to
> think a bit more about it and ensure that the resulting interface
> isn't confusing.

Sure.

> Would it work for capacity to behave the other
> direction - ie. a parent's min restricting the highest min that its
> descendants can get?  It's completely fine if that's weird.

I thought about that possibility, but it was not convincing from the
use-cases standpoint, at least for the ones I've considered.

The reason is that capacity_min is used to implement a concept of
"boosting" where, let's say, we want to "run a task faster than a minimum
frequency". This constraint is presumably defined because we know that
this task, and likely all its descendant threads, needs at least that
capacity level to perform according to expectations.

In that case, "refining down the hierarchy" can require boosting some
threads further, but likely not less.

Does this make sense?

To me this seems to match the Android/ChromeOS use-cases quite well, at
least. I'm not sure whether there are other, different use-cases in the
domain of, for example, managed containers.


> Thanks.
>
> --
> tejun

-- 
#include <best/regards.h>

Patrick Bellasi
Joel Fernandes March 24, 2017, 7:02 a.m. UTC | #7
Hi Patrick,

On Thu, Mar 23, 2017 at 3:32 AM, Patrick Bellasi
<patrick.bellasi@arm.com> wrote:
[..]
>> > which can be used to defined tunable root constraints when CGroups are
>> > not available, and becomes RO when CGroups are.
>> >
>> > Can this be eventually an acceptable option?
>> >
>> > In any case I think that this feature will be mainly targeting CGroup
>> > based systems. Indeed, one of the main goals is to collect
>> > "application specific" information from "informed run-times". Being
>> > "application specific" means that we need a way to classify
>> > applications depending on the runtime context... and that capability
>> > in Linux is ultimately provided via the CGroup interface.
>>
>> I think the concern raised is more about whether CGroups is the right
>> interface to use for attaching capacity constraints to task or groups
>> of tasks, or is there a better way to attach such constraints?
>
> Notice that CGroups based classification allows to easily enforce
> the concept of "delegation containment". I think this feature should
> be nice to have whatever interface we choose.
>
> However, potentially we can define a proper per-task API; are you
> thinking to something specifically?
>

I was thinking: how about adding per-task constraints to the resource
limits API, if that makes sense? There are already RLIMIT_CPU and
RLIMIT_NICE. An informed run-time could then modify the limits of tasks
using prlimit.
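
A sketch of what that could look like from an informed run-time, assuming a
hypothetical RLIMIT_CAPACITY_MIN resource (no such resource exists today,
so the call below would currently just fail with EINVAL):

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/resource.h>
#include <sys/types.h>

/* Hypothetical resource number: not defined by any current kernel */
#define RLIMIT_CAPACITY_MIN	16

int main(void)
{
	pid_t target = 1234;		/* some task picked by the run-time */
	struct rlimit cap = {
		.rlim_cur = 512,	/* boost to at least half capacity  */
		.rlim_max = 1024,	/* SCHED_CAPACITY_SCALE             */
	};

	if (prlimit(target, RLIMIT_CAPACITY_MIN, &cap, NULL))
		perror("prlimit");

	return 0;
}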

>> The other advantage of such interface is we don't have to
>> create a separate CGroup for every new constraint limit and can have
>> several tasks with different unique constraints.
>
> That's still possible using CGroups and IMO it will not be the "most
> common case".
> Don't you think that in general we will need to set constraints at
> applications level, thus group of tasks?

Some applications could be a single task; also, not all tasks in an
application may need constraints, right?

> As a general rule we should probably go for an interface which makes
> easy the most common case.

I agree.

Thanks,
Joel

Patch

diff --git a/init/Kconfig b/init/Kconfig
index e1a93734..71e46ce 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1044,6 +1044,23 @@  menuconfig CGROUP_SCHED
 	  bandwidth allocation to such task groups. It uses cgroups to group
 	  tasks.
 
+config CAPACITY_CLAMPING
+	bool "Capacity clamping per group of tasks"
+	depends on CPU_FREQ_GOV_SCHEDUTIL
+	depends on CGROUP_SCHED
+	default n
+	help
+	  This feature allows the scheduler to enforce maximum and minimum
+	  capacity on each CPU based on RUNNABLE tasks currently scheduled
+	  on that CPU.
+	  Minimum capacity can be used for example to "boost" the performance
+	  of important tasks by running them on an OPP which can be higher than
+	  the minimum one eventually selected by the schedutil governor.
+	  Maximum capacity can be used for example to "restrict" the maximum
+	  OPP which can be requested by background tasks.
+
+	  If in doubt, say N.
+
 if CGROUP_SCHED
 config FAIR_GROUP_SCHED
 	bool "Group scheduling for SCHED_OTHER"
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 34e2291..a171d49 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6015,6 +6015,11 @@  void __init sched_init(void)
 	autogroup_init(&init_task);
 #endif /* CONFIG_CGROUP_SCHED */
 
+#ifdef CONFIG_CAPACITY_CLAMPING
+	root_task_group.cap_clamp[CAP_CLAMP_MIN] = 0;
+	root_task_group.cap_clamp[CAP_CLAMP_MAX] = SCHED_CAPACITY_SCALE;
+#endif /* CONFIG_CAPACITY_CLAMPING */
+
 	for_each_possible_cpu(i) {
 		struct rq *rq;
 
@@ -6310,6 +6315,11 @@  struct task_group *sched_create_group(struct task_group *parent)
 	if (!alloc_rt_sched_group(tg, parent))
 		goto err;
 
+#ifdef CONFIG_CAPACITY_CLAMPING
+	tg->cap_clamp[CAP_CLAMP_MIN] = parent->cap_clamp[CAP_CLAMP_MIN];
+	tg->cap_clamp[CAP_CLAMP_MAX] = parent->cap_clamp[CAP_CLAMP_MAX];
+#endif
+
 	return tg;
 
 err:
@@ -6899,6 +6909,129 @@  static void cpu_cgroup_attach(struct cgroup_taskset *tset)
 		sched_move_task(task);
 }
 
+#ifdef CONFIG_CAPACITY_CLAMPING
+
+static DEFINE_MUTEX(cap_clamp_mutex);
+
+static int cpu_capacity_min_write_u64(struct cgroup_subsys_state *css,
+				      struct cftype *cftype, u64 value)
+{
+	struct cgroup_subsys_state *pos;
+	unsigned int min_value;
+	struct task_group *tg;
+	int ret = -EINVAL;
+
+	min_value = min_t(unsigned int, value, SCHED_CAPACITY_SCALE);
+
+	mutex_lock(&cap_clamp_mutex);
+	rcu_read_lock();
+
+	tg = css_tg(css);
+
+	/* Already at the required value */
+	if (tg->cap_clamp[CAP_CLAMP_MIN] == min_value)
+		goto done;
+
+	/* Ensure to not exceed the maximum capacity */
+	if (tg->cap_clamp[CAP_CLAMP_MAX] < min_value)
+		goto out;
+
+	/* Ensure min cap fits within parent constraint */
+	if (tg->parent &&
+	    tg->parent->cap_clamp[CAP_CLAMP_MIN] > min_value)
+		goto out;
+
+	/* Each child must be a subset of us */
+	css_for_each_child(pos, css) {
+		if (css_tg(pos)->cap_clamp[CAP_CLAMP_MIN] < min_value)
+			goto out;
+	}
+
+	tg->cap_clamp[CAP_CLAMP_MIN] = min_value;
+
+done:
+	ret = 0;
+out:
+	rcu_read_unlock();
+	mutex_unlock(&cap_clamp_mutex);
+
+	return ret;
+}
+
+static int cpu_capacity_max_write_u64(struct cgroup_subsys_state *css,
+				      struct cftype *cftype, u64 value)
+{
+	struct cgroup_subsys_state *pos;
+	unsigned int max_value;
+	struct task_group *tg;
+	int ret = -EINVAL;
+
+	max_value = min_t(unsigned int, value, SCHED_CAPACITY_SCALE);
+
+	mutex_lock(&cap_clamp_mutex);
+	rcu_read_lock();
+
+	tg = css_tg(css);
+
+	/* Already at the required value */
+	if (tg->cap_clamp[CAP_CLAMP_MAX] == max_value)
+		goto done;
+
+	/* Ensure to not go below the minimum capacity */
+	if (tg->cap_clamp[CAP_CLAMP_MIN] > max_value)
+		goto out;
+
+	/* Ensure max cap fits within parent constraint */
+	if (tg->parent &&
+	    tg->parent->cap_clamp[CAP_CLAMP_MAX] < max_value)
+		goto out;
+
+	/* Each child must be a subset of us */
+	css_for_each_child(pos, css) {
+		if (css_tg(pos)->cap_clamp[CAP_CLAMP_MAX] > max_value)
+			goto out;
+	}
+
+	tg->cap_clamp[CAP_CLAMP_MAX] = max_value;
+
+done:
+	ret = 0;
+out:
+	rcu_read_unlock();
+	mutex_unlock(&cap_clamp_mutex);
+
+	return ret;
+}
+
+static u64 cpu_capacity_min_read_u64(struct cgroup_subsys_state *css,
+				     struct cftype *cft)
+{
+	struct task_group *tg;
+	u64 min_capacity;
+
+	rcu_read_lock();
+	tg = css_tg(css);
+	min_capacity = tg->cap_clamp[CAP_CLAMP_MIN];
+	rcu_read_unlock();
+
+	return min_capacity;
+}
+
+static u64 cpu_capacity_max_read_u64(struct cgroup_subsys_state *css,
+				     struct cftype *cft)
+{
+	struct task_group *tg;
+	u64 max_capacity;
+
+	rcu_read_lock();
+	tg = css_tg(css);
+	max_capacity = tg->cap_clamp[CAP_CLAMP_MAX];
+	rcu_read_unlock();
+
+	return max_capacity;
+}
+#endif /* CONFIG_CAPACITY_CLAMPING */
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 static int cpu_shares_write_u64(struct cgroup_subsys_state *css,
 				struct cftype *cftype, u64 shareval)
@@ -7193,6 +7326,18 @@  static struct cftype cpu_files[] = {
 		.write_u64 = cpu_shares_write_u64,
 	},
 #endif
+#ifdef CONFIG_CAPACITY_CLAMPING
+	{
+		.name = "capacity_min",
+		.read_u64 = cpu_capacity_min_read_u64,
+		.write_u64 = cpu_capacity_min_write_u64,
+	},
+	{
+		.name = "capacity_max",
+		.read_u64 = cpu_capacity_max_read_u64,
+		.write_u64 = cpu_capacity_max_write_u64,
+	},
+#endif
 #ifdef CONFIG_CFS_BANDWIDTH
 	{
 		.name = "cfs_quota_us",
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 71b10a9..05dae4a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -273,6 +273,14 @@  struct task_group {
 #endif
 #endif
 
+#ifdef CONFIG_CAPACITY_CLAMPING
+#define CAP_CLAMP_MIN 0
+#define CAP_CLAMP_MAX 1
+
+	/* Min and Max capacity constraints for tasks in this group */
+	unsigned int cap_clamp[2];
+#endif
+
 #ifdef CONFIG_RT_GROUP_SCHED
 	struct sched_rt_entity **rt_se;
 	struct rt_rq **rt_rq;