[v5] cgroup: Add new capability to allow a process to migrate other tasks between cgroups

Message ID 1481593143-18756-1-git-send-email-john.stultz@linaro.org
State New

Commit Message

John Stultz Dec. 13, 2016, 1:39 a.m. UTC
This patch adds CAP_CGROUP_MIGRATE and the logic to allow a process
to migrate other tasks between cgroups.

In Android (where this feature originated), the ActivityManager
tracks various application states (TOP_APP, FOREGROUND,
BACKGROUND, SYSTEM, etc), and then as applications change
states, the SchedPolicy logic will migrate the application tasks
between different cgroups used to control the different
application states (for example, there is a background cpuset
cgroup which can limit background tasks to stay on one low-power
cpu, and the bg_non_interactive cpuctrl cgroup can then further
limit those background tasks to a small percentage of that one
cpu's cpu time).
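
For readers unfamiliar with how such a migration is requested from
userspace, here is a minimal sketch, assuming a cgroup directory that
exposes a cgroup.procs file. The helper name cgroup_migrate_pid() and
any path passed to it are illustrative assumptions, not taken from
Android's SchedPolicy code.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: ask the kernel to move `pid` into the cgroup
 * rooted at `cgroup_dir` by writing the pid to its cgroup.procs file.
 * Returns 0 on success, -1 on any failure (including a permission
 * error reported by the kernel on write).
 */
int cgroup_migrate_pid(const char *cgroup_dir, pid_t pid)
{
	char path[4096];
	char buf[32];
	int fd, len, n;
	ssize_t written;

	n = snprintf(path, sizeof(path), "%s/cgroup.procs", cgroup_dir);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;

	len = snprintf(buf, sizeof(buf), "%ld\n", (long)pid);
	if (len < 0 || (size_t)len >= sizeof(buf))
		return -1;

	/* No O_CREAT: the file must already exist in the cgroup fs. */
	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;

	written = write(fd, buf, (size_t)len);
	close(fd);

	return written == (ssize_t)len ? 0 : -1;
}
```

When the writer is neither root nor the owner of the target task, the
write is expected to fail, which is the situation the rest of this
message is about.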

However, for security reasons, Android doesn't want to make the
system_server (the process that runs the ActivityManager and
SchedPolicy logic), run as root. So in the Android common.git
kernel, they have some logic to allow cgroups to loosen their
permissions so CAP_SYS_NICE tasks can migrate other tasks between
cgroups.

I feel the approach taken there overloads CAP_SYS_NICE a bit much
for non-Android environments. Efforts to re-use CAP_SYS_RESOURCE
for this purpose (which Android has since adopted) were also
stymied by concerns about risks from future cgroups that could be
considered "dangerous" by how they might change system semantics.

So to avoid overlapping usage, this patch adds a brand new
process capability flag (CAP_CGROUP_MIGRATE), and uses it when
checking if a task can migrate other tasks between cgroups.
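
To make the intended rule concrete, here is a small userspace model of
the check. It is only a sketch: struct model_cred,
model_procs_write_permission() and the has_cgroup_migrate flag are
hypothetical stand-ins for the kernel's credentials and its
ns_capable() call, not kernel code.

```c
#include <errno.h>
#include <stdbool.h>
#include <sys/types.h>

/* Hypothetical, simplified stand-in for the kernel's struct cred. */
struct model_cred {
	uid_t uid;
	uid_t euid;
	uid_t suid;
	/* Models ns_capable(tcred->user_ns, CAP_CGROUP_MIGRATE) for the writer. */
	bool has_cgroup_migrate;
};

/*
 * Model of the permission rule: the writer (cred) may migrate the
 * target (tcred) if it is root, if its euid matches the target's uid
 * or suid, or if it holds the proposed capability.
 * Returns 0 when allowed, -EACCES otherwise.
 */
int model_procs_write_permission(const struct model_cred *cred,
				 const struct model_cred *tcred)
{
	if (cred->euid != 0 &&
	    cred->euid != tcred->uid &&
	    cred->euid != tcred->suid &&
	    !cred->has_cgroup_migrate)
		return -EACCES;

	return 0;
}
```

In this model an unprivileged writer with a different uid is refused
unless has_cgroup_migrate is set, which is the only behavioural change
the patch intends.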

I've tested this with AOSP master (though it's a bit hacked in as
I still need to properly make the SELinux bits aware of the new
capability bit) with SELinux set to permissive and it seems to be
working well.

Thoughts and feedback would be appreciated!

Cc: Tejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: cgroups@vger.kernel.org
Cc: Android Kernel Team <kernel-team@android.com>
Cc: Rom Lemarchand <romlem@android.com>
Cc: Colin Cross <ccross@android.com>
Cc: Dmitry Shmidt <dimitrysh@google.com>
Cc: Todd Kjos <tkjos@google.com>
Cc: Christian Poetzsch <christian.potzsch@imgtec.com>
Cc: Amit Pundir <amit.pundir@linaro.org>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Serge E. Hallyn <serge@hallyn.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: linux-api@vger.kernel.org
Acked-by: Serge Hallyn <serge@hallyn.com>

Signed-off-by: John Stultz <john.stultz@linaro.org>

---
v2: Renamed to just CAP_CGROUP_MIGRATE as recommended by Tejun
v3: Switched to just using CAP_SYS_RESOURCE as suggested by Michael
v4: Send out properly folded down version of the patch. :P
v5: Switch back to CAP_CGROUP_MIGRATE due to concerns from Andy
---
 include/uapi/linux/capability.h | 5 ++++-
 kernel/cgroup.c                 | 3 ++-
 2 files changed, 6 insertions(+), 2 deletions(-)

-- 
2.7.4

Comments

John Stultz Dec. 13, 2016, 1:40 a.m. UTC | #1
On Mon, Dec 12, 2016 at 5:39 PM, John Stultz <john.stultz@linaro.org> wrote:
> This patch adds CAP_GROUP_MIGRATE and logic to allows a process
> to migrate other tasks between cgroups.
>
> In Android (where this feature originated), the ActivityManager
> tracks various application states (TOP_APP, FOREGROUND,
> BACKGROUND, SYSTEM, etc), and then as applications change
> states, the SchedPolicy logic will migrate the application tasks
> between different cgroups used to control the different
> application states (for example, there is a background cpuset
> cgroup which can limit background tasks to stay on one low-power
> cpu, and the bg_non_interactive cpuctrl cgroup can then further
> limit those background tasks to a small percentage of that one
> cpu's cpu time).
>
> However, for security reasons, Android doesn't want to make the
> system_server (the process that runs the ActivityManager and
> SchedPolicy logic), run as root. So in the Android common.git
> kernel, they have some logic to allow cgroups to loosen their
> permissions so CAP_SYS_NICE tasks can migrate other tasks between
> cgroups.
>
> I feel the approach taken there overloads CAP_SYS_NICE a bit much
> for non-android environments. Efforts to re-use CAP_SYS_RESOURCE
> for this purpose (which Android has since adopted) was also
> stymied by concerns about risks from future cgroups that could be
> considered "dangerous" by how they might change system semantics.
>
> So to avoid overlapping usage, this patch adds a brand new
> process capability flag (CAP_CGROUP_MIGRATE), and uses it when
> checking if a task can migrate other tasks between cgroups.
>
> I've tested this with AOSP master (though its a bit hacked in as
> I still need to properly get the selinux bits aware of the new
> capability bit) with selinux set to permissive and it seems to be
> working well.
>
> Thoughts and feedback would be appreciated!
>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Li Zefan <lizefan@huawei.com>
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: cgroups@vger.kernel.org
> Cc: Android Kernel Team <kernel-team@android.com>
> Cc: Rom Lemarchand <romlem@android.com>
> Cc: Colin Cross <ccross@android.com>
> Cc: Dmitry Shmidt <dimitrysh@google.com>
> Cc: Todd Kjos <tkjos@google.com>
> Cc: Christian Poetzsch <christian.potzsch@imgtec.com>
> Cc: Amit Pundir <amit.pundir@linaro.org>
> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Serge E. Hallyn <serge@hallyn.com>
> Cc: Andy Lutomirski <luto@amacapital.net>
> Cc: linux-api@vger.kernel.org
> Acked-by: Serge Hallyn <serge@hallyn.com>


After sending this I just realized that the patch has changed enough
that I should probably remove Serge's Acked-by here. Apologies.

But otherwise feedback on this would be appreciated!

thanks
-john
Michael Kerrisk (man-pages) Dec. 13, 2016, 9:47 a.m. UTC | #2
Hi John,

On 13 December 2016 at 02:39, John Stultz <john.stultz@linaro.org> wrote:
> This patch adds CAP_GROUP_MIGRATE and logic to allows a process

s/CAP_GROUP_MIGRATE/CAP_CGROUP_MIGRATE/

> to migrate other tasks between cgroups.
>
> In Android (where this feature originated), the ActivityManager
> tracks various application states (TOP_APP, FOREGROUND,
> BACKGROUND, SYSTEM, etc), and then as applications change
> states, the SchedPolicy logic will migrate the application tasks
> between different cgroups used to control the different
> application states (for example, there is a background cpuset
> cgroup which can limit background tasks to stay on one low-power
> cpu, and the bg_non_interactive cpuctrl cgroup can then further
> limit those background tasks to a small percentage of that one
> cpu's cpu time).
>
> However, for security reasons, Android doesn't want to make the
> system_server (the process that runs the ActivityManager and
> SchedPolicy logic), run as root. So in the Android common.git
> kernel, they have some logic to allow cgroups to loosen their
> permissions so CAP_SYS_NICE tasks can migrate other tasks between
> cgroups.
>
> I feel the approach taken there overloads CAP_SYS_NICE a bit much
> for non-android environments. Efforts to re-use CAP_SYS_RESOURCE
> for this purpose (which Android has since adopted) was also
> stymied by concerns about risks from future cgroups that could be
> considered "dangerous" by how they might change system semantics.
>
> So to avoid overlapping usage, this patch adds a brand new
> process capability flag (CAP_CGROUP_MIGRATE), and uses it when
> checking if a task can migrate other tasks between cgroups.
>
> I've tested this with AOSP master (though its a bit hacked in as
> I still need to properly get the selinux bits aware of the new
> capability bit) with selinux set to permissive and it seems to be
> working well.
>
> Thoughts and feedback would be appreciated!


So, back to the discussion of silos. I understand the argument for
wanting a new silo. But, in that case can we at least try not to make
it a single-use silo?

How about CAP_CGROUP_CONTROL or some such, with the idea that this
might be a capability that allows the holder to step outside usual
cgroup rules? At the moment, that capability would allow only one such
step, but maybe there would be others in the future.

Cheers,

Michael


> Cc: Tejun Heo <tj@kernel.org>
> Cc: Li Zefan <lizefan@huawei.com>
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: cgroups@vger.kernel.org
> Cc: Android Kernel Team <kernel-team@android.com>
> Cc: Rom Lemarchand <romlem@android.com>
> Cc: Colin Cross <ccross@android.com>
> Cc: Dmitry Shmidt <dimitrysh@google.com>
> Cc: Todd Kjos <tkjos@google.com>
> Cc: Christian Poetzsch <christian.potzsch@imgtec.com>
> Cc: Amit Pundir <amit.pundir@linaro.org>
> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Serge E. Hallyn <serge@hallyn.com>
> Cc: Andy Lutomirski <luto@amacapital.net>
> Cc: linux-api@vger.kernel.org
> Acked-by: Serge Hallyn <serge@hallyn.com>
> Signed-off-by: John Stultz <john.stultz@linaro.org>
> ---
> v2: Renamed to just CAP_CGROUP_MIGRATE as reccomended by Tejun
> v3: Switched to just using CAP_SYS_RESOURCE as suggested by Michael
> v4: Send out properly folded down version of the patch. :P
> v5: Switch back to CAP_CGROUP_MIGRATE due to concerns from Andy
> ---
>  include/uapi/linux/capability.h | 5 ++++-
>  kernel/cgroup.c                 | 3 ++-
>  2 files changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
> index 49bc062..32d3829 100644
> --- a/include/uapi/linux/capability.h
> +++ b/include/uapi/linux/capability.h
> @@ -349,8 +349,11 @@ struct vfs_cap_data {
>
>  #define CAP_AUDIT_READ         37
>
> +/* Allow migration of other tasks between cgroups */
>
> -#define CAP_LAST_CAP         CAP_AUDIT_READ
> +#define CAP_CGROUP_MIGRATE     38
> +
> +#define CAP_LAST_CAP         CAP_CGROUP_MIGRATE
>
>  #define cap_valid(x) ((x) >= 0 && (x) <= CAP_LAST_CAP)
>
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 2ee9ec3..784f115 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -2856,7 +2856,8 @@ static int cgroup_procs_write_permission(struct task_struct *task,
>          */
>         if (!uid_eq(cred->euid, GLOBAL_ROOT_UID) &&
>             !uid_eq(cred->euid, tcred->uid) &&
> -           !uid_eq(cred->euid, tcred->suid))
> +           !uid_eq(cred->euid, tcred->suid) &&
> +           !ns_capable(tcred->user_ns, CAP_CGROUP_MIGRATE))
>                 ret = -EACCES;
>
>         if (!ret && cgroup_on_dfl(dst_cgrp)) {
> --
> 2.7.4





-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
John Stultz Dec. 13, 2016, 4:08 p.m. UTC | #3
On Tue, Dec 13, 2016 at 1:47 AM, Michael Kerrisk (man-pages)
<mtk.manpages@gmail.com> wrote:
> On 13 December 2016 at 02:39, John Stultz <john.stultz@linaro.org> wrote:
> So, back to the discussion of silos. I understand the argument for
> wanting a new silo. But, in that case can we at least try not to make
> it a single-use silo?
>
> How about CAP_CGROUP_CONTROL or some such, with the idea that this
> might be a capability that allows the holder to step outside usual
> cgroup rules? At the moment, that capability would allow only one such
> step, but maybe there would be others in the future.


This sounds reasonable to me. Tejun/Andy: Objections?

thanks
-john
Casey Schaufler Dec. 13, 2016, 4:39 p.m. UTC | #4
On 12/13/2016 1:47 AM, Michael Kerrisk (man-pages) wrote:
> Hi John,
>
> On 13 December 2016 at 02:39, John Stultz <john.stultz@linaro.org> wrote:
>> This patch adds CAP_GROUP_MIGRATE and logic to allows a process
> s/CAP_GROUP_MIGRATE/CAP_CGROUP_MIGRATE/
>
>> to migrate other tasks between cgroups.
>>
>> In Android (where this feature originated), the ActivityManager
>> tracks various application states (TOP_APP, FOREGROUND,
>> BACKGROUND, SYSTEM, etc), and then as applications change
>> states, the SchedPolicy logic will migrate the application tasks
>> between different cgroups used to control the different
>> application states (for example, there is a background cpuset
>> cgroup which can limit background tasks to stay on one low-power
>> cpu, and the bg_non_interactive cpuctrl cgroup can then further
>> limit those background tasks to a small percentage of that one
>> cpu's cpu time).
>>
>> However, for security reasons, Android doesn't want to make the
>> system_server (the process that runs the ActivityManager and
>> SchedPolicy logic), run as root. So in the Android common.git
>> kernel, they have some logic to allow cgroups to loosen their
>> permissions so CAP_SYS_NICE tasks can migrate other tasks between
>> cgroups.
>>
>> I feel the approach taken there overloads CAP_SYS_NICE a bit much
>> for non-android environments. Efforts to re-use CAP_SYS_RESOURCE
>> for this purpose (which Android has since adopted) was also
>> stymied by concerns about risks from future cgroups that could be
>> considered "dangerous" by how they might change system semantics.
>>
>> So to avoid overlapping usage, this patch adds a brand new
>> process capability flag (CAP_CGROUP_MIGRATE), and uses it when
>> checking if a task can migrate other tasks between cgroups.
>>
>> I've tested this with AOSP master (though its a bit hacked in as
>> I still need to properly get the selinux bits aware of the new
>> capability bit) with selinux set to permissive and it seems to be
>> working well.
>>
>> Thoughts and feedback would be appreciated!
> So, back to the discussion of silos. I understand the argument for
> wanting a new silo. But, in that case can we at least try not to make
> it a single-use silo?
>
> How about CAP_CGROUP_CONTROL or some such, with the idea that this
> might be a capability that allows the holder to step outside usual
> cgroup rules? At the moment, that capability would allow only one such
> step, but maybe there would be others in the future.


I agree, but want to put it more strongly. The granularity of
capabilities can never be fine enough for some people, and this
is an example of a case where you're going a bit too far. If the
use case is Android as you say, you don't need this. As my friends
on the far side of the aisle would say, "just write SELinux policy"
to correctly control access as required.

John Stultz Dec. 13, 2016, 4:49 p.m. UTC | #5
On Tue, Dec 13, 2016 at 8:39 AM, Casey Schaufler <casey@schaufler-ca.com> wrote:
> On 12/13/2016 1:47 AM, Michael Kerrisk (man-pages) wrote:
>> Hi John,
>>
>> On 13 December 2016 at 02:39, John Stultz <john.stultz@linaro.org> wrote:
>>> This patch adds CAP_GROUP_MIGRATE and logic to allows a process
>> s/CAP_GROUP_MIGRATE/CAP_CGROUP_MIGRATE/
>>
>>> to migrate other tasks between cgroups.
>>>
>>> In Android (where this feature originated), the ActivityManager
>>> tracks various application states (TOP_APP, FOREGROUND,
>>> BACKGROUND, SYSTEM, etc), and then as applications change
>>> states, the SchedPolicy logic will migrate the application tasks
>>> between different cgroups used to control the different
>>> application states (for example, there is a background cpuset
>>> cgroup which can limit background tasks to stay on one low-power
>>> cpu, and the bg_non_interactive cpuctrl cgroup can then further
>>> limit those background tasks to a small percentage of that one
>>> cpu's cpu time).
>>>
>>> However, for security reasons, Android doesn't want to make the
>>> system_server (the process that runs the ActivityManager and
>>> SchedPolicy logic), run as root. So in the Android common.git
>>> kernel, they have some logic to allow cgroups to loosen their
>>> permissions so CAP_SYS_NICE tasks can migrate other tasks between
>>> cgroups.
>>>
>>> I feel the approach taken there overloads CAP_SYS_NICE a bit much
>>> for non-android environments. Efforts to re-use CAP_SYS_RESOURCE
>>> for this purpose (which Android has since adopted) was also
>>> stymied by concerns about risks from future cgroups that could be
>>> considered "dangerous" by how they might change system semantics.
>>>
>>> So to avoid overlapping usage, this patch adds a brand new
>>> process capability flag (CAP_CGROUP_MIGRATE), and uses it when
>>> checking if a task can migrate other tasks between cgroups.
>>>
>>> I've tested this with AOSP master (though its a bit hacked in as
>>> I still need to properly get the selinux bits aware of the new
>>> capability bit) with selinux set to permissive and it seems to be
>>> working well.
>>>
>>> Thoughts and feedback would be appreciated!
>> So, back to the discussion of silos. I understand the argument for
>> wanting a new silo. But, in that case can we at least try not to make
>> it a single-use silo?
>>
>> How about CAP_CGROUP_CONTROL or some such, with the idea that this
>> might be a capability that allows the holder to step outside usual
>> cgroup rules? At the moment, that capability would allow only one such
>> step, but maybe there would be others in the future.
>
> I agree, but want to put it more strongly. The granularity of
> capabilities can never be fine enough for some people, and this
> is an example of a case where you're going a bit too far. If the
> use case is Android as you say, you don't need this. As my friends
> on the far side of the aisle would say, "just write SELinux policy"
> to correctly control access as required.


So.. The trouble is that while selinux is good for restricting
permissions, the in-kernel permission checks here are already too
restrictive. It seems one must first loosen things up before we can
tighten it with selinux rules. Or are you suggesting the system_server
run as root + further selinux limitations? I worry that the Android
developers may still be hesitant to do that.

thanks
-john
Casey Schaufler Dec. 13, 2016, 5:17 p.m. UTC | #6
On 12/13/2016 8:49 AM, John Stultz wrote:
> On Tue, Dec 13, 2016 at 8:39 AM, Casey Schaufler <casey@schaufler-ca.com> wrote:
>> On 12/13/2016 1:47 AM, Michael Kerrisk (man-pages) wrote:
>>> Hi John,
>>>
>>> On 13 December 2016 at 02:39, John Stultz <john.stultz@linaro.org> wrote:
>>>> This patch adds CAP_GROUP_MIGRATE and logic to allows a process
>>> s/CAP_GROUP_MIGRATE/CAP_CGROUP_MIGRATE/
>>>
>>>> to migrate other tasks between cgroups.
>>>>
>>>> In Android (where this feature originated), the ActivityManager
>>>> tracks various application states (TOP_APP, FOREGROUND,
>>>> BACKGROUND, SYSTEM, etc), and then as applications change
>>>> states, the SchedPolicy logic will migrate the application tasks
>>>> between different cgroups used to control the different
>>>> application states (for example, there is a background cpuset
>>>> cgroup which can limit background tasks to stay on one low-power
>>>> cpu, and the bg_non_interactive cpuctrl cgroup can then further
>>>> limit those background tasks to a small percentage of that one
>>>> cpu's cpu time).
>>>>
>>>> However, for security reasons, Android doesn't want to make the
>>>> system_server (the process that runs the ActivityManager and
>>>> SchedPolicy logic), run as root. So in the Android common.git
>>>> kernel, they have some logic to allow cgroups to loosen their
>>>> permissions so CAP_SYS_NICE tasks can migrate other tasks between
>>>> cgroups.
>>>>
>>>> I feel the approach taken there overloads CAP_SYS_NICE a bit much
>>>> for non-android environments. Efforts to re-use CAP_SYS_RESOURCE
>>>> for this purpose (which Android has since adopted) was also
>>>> stymied by concerns about risks from future cgroups that could be
>>>> considered "dangerous" by how they might change system semantics.
>>>>
>>>> So to avoid overlapping usage, this patch adds a brand new
>>>> process capability flag (CAP_CGROUP_MIGRATE), and uses it when
>>>> checking if a task can migrate other tasks between cgroups.
>>>>
>>>> I've tested this with AOSP master (though its a bit hacked in as
>>>> I still need to properly get the selinux bits aware of the new
>>>> capability bit) with selinux set to permissive and it seems to be
>>>> working well.
>>>>
>>>> Thoughts and feedback would be appreciated!
>>> So, back to the discussion of silos. I understand the argument for
>>> wanting a new silo. But, in that case can we at least try not to make
>>> it a single-use silo?
>>>
>>> How about CAP_CGROUP_CONTROL or some such, with the idea that this
>>> might be a capability that allows the holder to step outside usual
>>> cgroup rules? At the moment, that capability would allow only one such
>>> step, but maybe there would be others in the future.
>> I agree, but want to put it more strongly. The granularity of
>> capabilities can never be fine enough for some people, and this
>> is an example of a case where you're going a bit too far. If the
>> use case is Android as you say, you don't need this. As my friends
>> on the far side of the aisle would say, "just write SELinux policy"
>> to correctly control access as required.
> So.. The trouble is that while selinux is good for restricting
> permissions, the in-kernel permission checks here are already too
> restrictive.


Why did the original authors of cgroups make it that restrictive?
If there isn't a good reason, loosen it up. If there is a good
reason, then pay heed to it.

> It seems one must first loosen things up before we can
> tighten it with selinux rules.


You're looking at splitting the granularity hair. Is your
userspace code really so delicate that it can't handle the
existing, "coarse" privilege and needs to protect at the
"fine" granularity you're proposing?

> Or are you suggesting the system_server
> run as root + further selinux limitations? I worry, the Android
> developers may still be hesitant to do that.


Unlike many of my peers, I am not afraid of running good
solid services with privilege. A proper implementation
of system_server ought to be able to run completely
unconstrained without causing anyone the least concern.
I understand all the arguments against that, and am
disinclined to get into the religious debates that ensue.
So no, I am not going to suggest running system server
as root, but I am going to suggest giving it the capability
currently required and clamping it down with SELinux policy.

>
> thanks
> -john
>
John Stultz Dec. 13, 2016, 5:24 p.m. UTC | #7
On Tue, Dec 13, 2016 at 9:17 AM, Casey Schaufler <casey@schaufler-ca.com> wrote:
> On 12/13/2016 8:49 AM, John Stultz wrote:
>> On Tue, Dec 13, 2016 at 8:39 AM, Casey Schaufler <casey@schaufler-ca.com> wrote:
>>> On 12/13/2016 1:47 AM, Michael Kerrisk (man-pages) wrote:
>>>> How about CAP_CGROUP_CONTROL or some such, with the idea that this
>>>> might be a capability that allows the holder to step outside usual
>>>> cgroup rules? At the moment, that capability would allow only one such
>>>> step, but maybe there would be others in the future.
>>> I agree, but want to put it more strongly. The granularity of
>>> capabilities can never be fine enough for some people, and this
>>> is an example of a case where you're going a bit too far. If the
>>> use case is Android as you say, you don't need this. As my friends
>>> on the far side of the aisle would say, "just write SELinux policy"
>>> to correctly control access as required.
>> So.. The trouble is that while selinux is good for restricting
>> permissions, the in-kernel permission checks here are already too
>> restrictive.
>
> Why did the original authors of cgroups make it that restrictive?
> If there isn't a good reason, loosen it up. If there is a good
> reason, then pay heed to it.


That's what this patch is proposing. And I agree with Michael that the
newly proposed cap was a bit too narrowly focused on my immediate use
case, and broadening it to CGROUP_CONTROL is smart. Then that
capability could be further restricted w/ selinux policy, as you
suggest.

thanks
-john
Casey Schaufler Dec. 13, 2016, 5:48 p.m. UTC | #8
On 12/13/2016 9:24 AM, John Stultz wrote:
> On Tue, Dec 13, 2016 at 9:17 AM, Casey Schaufler <casey@schaufler-ca.com> wrote:
>> On 12/13/2016 8:49 AM, John Stultz wrote:
>>> On Tue, Dec 13, 2016 at 8:39 AM, Casey Schaufler <casey@schaufler-ca.com> wrote:
>>>> On 12/13/2016 1:47 AM, Michael Kerrisk (man-pages) wrote:
>>>>> How about CAP_CGROUP_CONTROL or some such, with the idea that this
>>>>> might be a capability that allows the holder to step outside usual
>>>>> cgroup rules? At the moment, that capability would allow only one such
>>>>> step, but maybe there would be others in the future.
>>>> I agree, but want to put it more strongly. The granularity of
>>>> capabilities can never be fine enough for some people, and this
>>>> is an example of a case where you're going a bit too far. If the
>>>> use case is Android as you say, you don't need this. As my friends
>>>> on the far side of the aisle would say, "just write SELinux policy"
>>>> to correctly control access as required.
>>> So.. The trouble is that while selinux is good for restricting
>>> permissions, the in-kernel permission checks here are already too
>>> restrictive.
>> Why did the original authors of cgroups make it that restrictive?
>> If there isn't a good reason, loosen it up. If there is a good
>> reason, then pay heed to it.
> That's what this patch is proposing. And I agree with Michael that the
> newly proposed cap was a bit to narrowly focused on my immediate use
> case, and broadening it to CGROUP_CONTROL is smart. Then that
> capability could be further restricted w/ selinux policy, as you
> suggest.


Adding a new capability is unnecessary. The current use of CAP_SYS_NICE,
while arguably obscure, provides as much "security" as a new capability
does. While cgroups are a wonderful thing, they don't need a separate
capability.

>
> thanks
> -john
>
John Stultz Dec. 13, 2016, 6:13 p.m. UTC | #9
On Tue, Dec 13, 2016 at 9:48 AM, Casey Schaufler <casey@schaufler-ca.com> wrote:
> On 12/13/2016 9:24 AM, John Stultz wrote:
>> On Tue, Dec 13, 2016 at 9:17 AM, Casey Schaufler <casey@schaufler-ca.com> wrote:
>>> On 12/13/2016 8:49 AM, John Stultz wrote:
>>>> On Tue, Dec 13, 2016 at 8:39 AM, Casey Schaufler <casey@schaufler-ca.com> wrote:
>>>>> On 12/13/2016 1:47 AM, Michael Kerrisk (man-pages) wrote:
>>>>>> How about CAP_CGROUP_CONTROL or some such, with the idea that this
>>>>>> might be a capability that allows the holder to step outside usual
>>>>>> cgroup rules? At the moment, that capability would allow only one such
>>>>>> step, but maybe there would be others in the future.
>>>>> I agree, but want to put it more strongly. The granularity of
>>>>> capabilities can never be fine enough for some people, and this
>>>>> is an example of a case where you're going a bit too far. If the
>>>>> use case is Android as you say, you don't need this. As my friends
>>>>> on the far side of the aisle would say, "just write SELinux policy"
>>>>> to correctly control access as required.
>>>> So.. The trouble is that while selinux is good for restricting
>>>> permissions, the in-kernel permission checks here are already too
>>>> restrictive.
>>> Why did the original authors of cgroups make it that restrictive?
>>> If there isn't a good reason, loosen it up. If there is a good
>>> reason, then pay heed to it.
>> That's what this patch is proposing. And I agree with Michael that the
>> newly proposed cap was a bit to narrowly focused on my immediate use
>> case, and broadening it to CGROUP_CONTROL is smart. Then that
>> capability could be further restricted w/ selinux policy, as you
>> suggest.
>
> Adding a new capability is unnecessary. The current use of CAP_SYS_NICE,
> while arguably obscure, provides as much "security" as a new capability
> does. While cgroups are a wonderful thing, they don't need a separate
> capability.


The trouble is that CAP_SYS_NICE or _RESOURCE (which was tried in an
earlier version of this patch) aren't necessarily appropriate for
non-android systems. See Andy's objection here:
https://lkml.org/lkml/2016/11/8/946

thanks
-john
Casey Schaufler Dec. 13, 2016, 6:32 p.m. UTC | #10
On 12/13/2016 10:13 AM, John Stultz wrote:
> On Tue, Dec 13, 2016 at 9:48 AM, Casey Schaufler <casey@schaufler-ca.com> wrote:
>> On 12/13/2016 9:24 AM, John Stultz wrote:
>>> On Tue, Dec 13, 2016 at 9:17 AM, Casey Schaufler <casey@schaufler-ca.com> wrote:
>>>> On 12/13/2016 8:49 AM, John Stultz wrote:
>>>>> On Tue, Dec 13, 2016 at 8:39 AM, Casey Schaufler <casey@schaufler-ca.com> wrote:
>>>>>> On 12/13/2016 1:47 AM, Michael Kerrisk (man-pages) wrote:
>>>>>>> How about CAP_CGROUP_CONTROL or some such, with the idea that this
>>>>>>> might be a capability that allows the holder to step outside usual
>>>>>>> cgroup rules? At the moment, that capability would allow only one such
>>>>>>> step, but maybe there would be others in the future.
>>>>>> I agree, but want to put it more strongly. The granularity of
>>>>>> capabilities can never be fine enough for some people, and this
>>>>>> is an example of a case where you're going a bit too far. If the
>>>>>> use case is Android as you say, you don't need this. As my friends
>>>>>> on the far side of the aisle would say, "just write SELinux policy"
>>>>>> to correctly control access as required.
>>>>> So.. The trouble is that while selinux is good for restricting
>>>>> permissions, the in-kernel permission checks here are already too
>>>>> restrictive.
>>>> Why did the original authors of cgroups make it that restrictive?
>>>> If there isn't a good reason, loosen it up. If there is a good
>>>> reason, then pay heed to it.
>>> That's what this patch is proposing. And I agree with Michael that the
>>> newly proposed cap was a bit to narrowly focused on my immediate use
>>> case, and broadening it to CGROUP_CONTROL is smart. Then that
>>> capability could be further restricted w/ selinux policy, as you
>>> suggest.
>> Adding a new capability is unnecessary. The current use of CAP_SYS_NICE,
>> while arguably obscure, provides as much "security" as a new capability
>> does. While cgroups are a wonderful thing, they don't need a separate
>> capability.
> The trouble is that CAP_SYS_NICE or _RESOURCE (which was tried in an
> earlier version of this patch) aren't necessarily appropriate for
> non-android systems. See Andy's objection here:
> https://lkml.org/lkml/2016/11/8/946


Then we need to see what those as-yet-unimplemented systems
require and how to address them. I don't think that taking
the "someone might want it" approach is really appropriate.

>
> thanks
> -john
>
Tejun Heo Dec. 13, 2016, 6:40 p.m. UTC | #11
Hello,

On Tue, Dec 13, 2016 at 08:08:16AM -0800, John Stultz wrote:
> On Tue, Dec 13, 2016 at 1:47 AM, Michael Kerrisk (man-pages)
> <mtk.manpages@gmail.com> wrote:
> > On 13 December 2016 at 02:39, John Stultz <john.stultz@linaro.org> wrote:
> > So, back to the discussion of silos. I understand the argument for
> > wanting a new silo. But, in that case can we at least try not to make
> > it a single-use silo?
> >
> > How about CAP_CGROUP_CONTROL or some such, with the idea that this
> > might be a capability that allows the holder to step outside usual
> > cgroup rules? At the moment, that capability would allow only one such
> > step, but maybe there would be others in the future.
>
> This sounds reasonable to me. Tejun/Andy: Objections?

Control group control?  The word control has a specific meaning for
cgroups and that second control doesn't make much sense to me.  Given
how this is mostly to patch up a hole in v1's delegation model and how
migration operations are different from others, I doubt that we will
end up overloading it.  Maybe just CAP_CGROUP?

Thanks.

-- 
tejun
Tejun Heo Dec. 13, 2016, 6:47 p.m. UTC | #12
Hello, Casey.

On Tue, Dec 13, 2016 at 10:32:14AM -0800, Casey Schaufler wrote:
> > The trouble is that CAP_SYS_NICE or _RESOURCE (which was tried in an
> > earlier version of this patch) aren't necessarily appropriate for
> > non-android systems. See Andy's objection here:
> > https://lkml.org/lkml/2016/11/8/946
>
> Then we need to see what those as-yet-unimplemented systems
> require and how to address them. I don't think that taking
> the "someone might want it" approach is really appropriate.

I understand that there can be reservations regarding adding a new
CAP but this isn't about someone possibly wanting it in the future.
It's more about overloading existing CAPs leading to permitting
unintended operations.  e.g. ppl who've been delegating
CAP_SYS_RESOURCE would automatically end up delegating cgroup
organization without intending to.  Using an existing cap would have
been nice but it just doesn't look like we have a good one to
overload.

Thanks.

-- 
tejun
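
The overloading concern raised above can be illustrated with a small
userspace model. This is only a sketch under stated assumptions, not kernel
code: the effective capability set is reduced to a 64-bit mask, and the
helpers has_cap(), may_migrate_overloaded() and may_migrate_dedicated() are
hypothetical names invented for the illustration. CAP_SYS_RESOURCE (24) is
the value from include/uapi/linux/capability.h; CAP_CGROUP_MIGRATE (38) is
the number proposed by this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Existing capability number from include/uapi/linux/capability.h. */
#define CAP_SYS_RESOURCE	24
/* Number proposed by this patch; an assumption outside this thread. */
#define CAP_CGROUP_MIGRATE	38

/* Hypothetical helper: test one bit of a simplified effective-set mask. */
static bool has_cap(uint64_t effective, int cap)
{
	assert(cap >= 0 && cap < 64);
	return (effective >> cap) & 1ULL;
}

/* Option A: gate cgroup migration on the existing CAP_SYS_RESOURCE. */
static bool may_migrate_overloaded(uint64_t effective)
{
	return has_cap(effective, CAP_SYS_RESOURCE);
}

/* Option B: gate cgroup migration on a dedicated capability bit. */
static bool may_migrate_dedicated(uint64_t effective)
{
	return has_cap(effective, CAP_CGROUP_MIGRATE);
}
```

In this model, a process that was granted only CAP_SYS_RESOURCE before the
change passes the option A check without anyone having decided to let it
migrate tasks, while under option B it is refused until the new bit is
granted explicitly.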
John Stultz Dec. 13, 2016, 6:47 p.m. UTC | #13
On Tue, Dec 13, 2016 at 10:40 AM, Tejun Heo <tj@kernel.org> wrote:
> Hello,
>
> On Tue, Dec 13, 2016 at 08:08:16AM -0800, John Stultz wrote:
>> On Tue, Dec 13, 2016 at 1:47 AM, Michael Kerrisk (man-pages)
>> <mtk.manpages@gmail.com> wrote:
>> > On 13 December 2016 at 02:39, John Stultz <john.stultz@linaro.org> wrote:
>> > So, back to the discussion of silos. I understand the argument for
>> > wanting a new silo. But, in that case can we at least try not to make
>> > it a single-use silo?
>> >
>> > How about CAP_CGROUP_CONTROL or some such, with the idea that this
>> > might be a capability that allows the holder to step outside usual
>> > cgroup rules? At the moment, that capability would allow only one such
>> > step, but maybe there would be others in the future.
>>
>> This sounds reasonable to me. Tejun/Andy: Objections?
>
> Control group control?  The word control has a specific meaning for
> cgroups and that second control doesn't make much sense to me.

But this would go against the long tradition of RAS syndrome and
things like "struct task_struct".  :)

>  Given
> how this is mostly to patch up a hole in v1's delegation model and how
> migration operations are different from others, I doubt that we will
> end up overloading it.  Maybe just CAP_CGROUP?

Sounds ok to me.

thanks
-john
Tejun Heo Dec. 13, 2016, 6:53 p.m. UTC | #14
On Tue, Dec 13, 2016 at 10:47:19AM -0800, John Stultz wrote:
> > Control group control?  The word control has a specific meaning for
> > cgroups and that second control doesn't make much sense to me.
>
> But this would go against the long tradition of RAS syndrome and
> things like "struct task_struct".  :)

Well, now that you put it that way, it's starting to look good. :)
But, let's just go for CAP_CGROUP if everyone is okay with it.

Thanks.

-- 
tejun
Patch

diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
index 49bc062..32d3829 100644
--- a/include/uapi/linux/capability.h
+++ b/include/uapi/linux/capability.h
@@ -349,8 +349,11 @@  struct vfs_cap_data {
 
 #define CAP_AUDIT_READ		37
 
+/* Allow migration of other tasks between cgroups */
 
-#define CAP_LAST_CAP         CAP_AUDIT_READ
+#define CAP_CGROUP_MIGRATE	38
+
+#define CAP_LAST_CAP         CAP_CGROUP_MIGRATE
 
 #define cap_valid(x) ((x) >= 0 && (x) <= CAP_LAST_CAP)
 
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 2ee9ec3..784f115 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2856,7 +2856,8 @@  static int cgroup_procs_write_permission(struct task_struct *task,
 	 */
 	if (!uid_eq(cred->euid, GLOBAL_ROOT_UID) &&
 	    !uid_eq(cred->euid, tcred->uid) &&
-	    !uid_eq(cred->euid, tcred->suid))
+	    !uid_eq(cred->euid, tcred->suid) &&
+	    !ns_capable(tcred->user_ns, CAP_CGROUP_MIGRATE))
 		ret = -EACCES;
 
 	if (!ret && cgroup_on_dfl(dst_cgrp)) {
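
The effect of the cgroup.c hunk above can be modelled in plain userspace C.
This is a simplified sketch, not the kernel function: the types writer_cred
and target_cred and the function procs_write_permission_model() are
hypothetical names, UIDs are plain integers rather than kuid_t, and the
boolean cap_cgroup_migrate stands in for the result of
ns_capable(tcred->user_ns, CAP_CGROUP_MIGRATE). The additional
default-hierarchy check that follows in the real function is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define MODEL_ROOT_UID 0u

/* Credentials of the task writing to the cgroup procs file (model only). */
struct writer_cred {
	unsigned int euid;
	/* Stands in for ns_capable(tcred->user_ns, CAP_CGROUP_MIGRATE). */
	bool cap_cgroup_migrate;
};

/* Credentials of the task being migrated (model only). */
struct target_cred {
	unsigned int uid;
	unsigned int suid;
};

/*
 * Mirrors the patched condition: access is denied only when the writer
 * is not root, does not match the target's uid or suid, and also lacks
 * the new capability.
 */
static int procs_write_permission_model(const struct writer_cred *cred,
					const struct target_cred *tcred)
{
	if (cred->euid != MODEL_ROOT_UID &&
	    cred->euid != tcred->uid &&
	    cred->euid != tcred->suid &&
	    !cred->cap_cgroup_migrate)
		return -EACCES;

	return 0;
}
```

Under this model, a non-root writer whose euid differs from the target's uid
and suid gets -EACCES without the capability and 0 with it, which is the
case the patch description gives for Android's system_server.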