diff mbox

[RESEND,v4] cgroup: Use CAP_SYS_RESOURCE to allow a process to migrate other tasks between cgroups

Message ID 1478647728-30357-1-git-send-email-john.stultz@linaro.org
State New
Headers show

Commit Message

John Stultz Nov. 8, 2016, 11:28 p.m. UTC
This patch adds logic to allows a process to migrate other tasks
between cgroups if they have CAP_SYS_RESOURCE.

In Android (where this feature originated), the ActivityManager tracks
various application states (TOP_APP, FOREGROUND, BACKGROUND, SYSTEM,
etc), and then as applications change states, the SchedPolicy logic
will migrate the application tasks between different cgroups used
to control the different application states (for example, there is a
background cpuset cgroup which can limit background tasks to stay
on one low-power cpu, and the bg_non_interactive cpuctrl cgroup can
then further limit those background tasks to a small percentage of
that one cpu's cpu time).

However, for security reasons, Android doesn't want to make the
system_server (the process that runs the ActivityManager and
SchedPolicy logic), run as root. So in the Android common.git
kernel, they have some logic to allow cgroups to loosen their
permissions so CAP_SYS_NICE tasks can migrate other tasks between
cgroups.

I feel the approach taken there overloads CAP_SYS_NICE a bit much
for non-android environments.

So this patch, as suggested by Michael Kerrisk, simply adds a
check for CAP_SYS_RESOURCE.

I've tested this with AOSP master, and this seems to work well
as Zygote and system_server already use CAP_SYS_RESOURCE. I've
also submitted patches against the android-4.4 kernel to change
it to use CAP_SYS_RESOURCE, and the Android developers just merged
it.

Cc: Tejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: cgroups@vger.kernel.org
Cc: Android Kernel Team <kernel-team@android.com>
Cc: Rom Lemarchand <romlem@android.com>
Cc: Colin Cross <ccross@android.com>
Cc: Dmitry Shmidt <dimitrysh@google.com>
Cc: Todd Kjos <tkjos@google.com>
Cc: Christian Poetzsch <christian.potzsch@imgtec.com>
Cc: Amit Pundir <amit.pundir@linaro.org>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Serge E. Hallyn <serge@hallyn.com>
Cc: linux-api@vger.kernel.org
Acked-by: Serge Hallyn <serge@hallyn.com>

Signed-off-by: John Stultz <john.stultz@linaro.org>

---
v2: Renamed to just CAP_CGROUP_MIGRATE as recommended by Tejun
v3: Switched to just using CAP_SYS_RESOURCE as suggested by Michael
v4: Send out properly folded down version of the patch. :P
---
 kernel/cgroup.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

-- 
2.7.4

Comments

Kees Cook Nov. 8, 2016, 11:41 p.m. UTC | #1
On Tue, Nov 8, 2016 at 3:28 PM, John Stultz <john.stultz@linaro.org> wrote:
> This patch adds logic to allows a process to migrate other tasks

> between cgroups if they have CAP_SYS_RESOURCE.

>

> In Android (where this feature originated), the ActivityManager tracks

> various application states (TOP_APP, FOREGROUND, BACKGROUND, SYSTEM,

> etc), and then as applications change states, the SchedPolicy logic

> will migrate the application tasks between different cgroups used

> to control the different application states (for example, there is a

> background cpuset cgroup which can limit background tasks to stay

> on one low-power cpu, and the bg_non_interactive cpuctrl cgroup can

> then further limit those background tasks to a small percentage of

> that one cpu's cpu time).

>

> However, for security reasons, Android doesn't want to make the

> system_server (the process that runs the ActivityManager and

> SchedPolicy logic), run as root. So in the Android common.git

> kernel, they have some logic to allow cgroups to loosen their

> permissions so CAP_SYS_NICE tasks can migrate other tasks between

> cgroups.

>

> I feel the approach taken there overloads CAP_SYS_NICE a bit much

> for non-android environments.

>

> So this patch, as suggested by Michael Kerrisk, simply adds a

> check for CAP_SYS_RESOURCE.

>

> I've tested this with AOSP master, and this seems to work well

> as Zygote and system_server already use CAP_SYS_RESOURCE. I've

> also submitted patches against the android-4.4 kernel to change

> it to use CAP_SYS_RESOURCE, and the Android developers just merged

> it.

>

> Cc: Tejun Heo <tj@kernel.org>

> Cc: Li Zefan <lizefan@huawei.com>

> Cc: Jonathan Corbet <corbet@lwn.net>

> Cc: cgroups@vger.kernel.org

> Cc: Android Kernel Team <kernel-team@android.com>

> Cc: Rom Lemarchand <romlem@android.com>

> Cc: Colin Cross <ccross@android.com>

> Cc: Dmitry Shmidt <dimitrysh@google.com>

> Cc: Todd Kjos <tkjos@google.com>

> Cc: Christian Poetzsch <christian.potzsch@imgtec.com>

> Cc: Amit Pundir <amit.pundir@linaro.org>

> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>

> Cc: Kees Cook <keescook@chromium.org>

> Cc: Serge E. Hallyn <serge@hallyn.com>

> Cc: linux-api@vger.kernel.org

> Acked-by: Serge Hallyn <serge@hallyn.com>

> Signed-off-by: John Stultz <john.stultz@linaro.org>

> ---

> v2: Renamed to just CAP_CGROUP_MIGRATE as recommended by Tejun

> v3: Switched to just using CAP_SYS_RESOURCE as suggested by Michael

> v4: Send out properly folded down version of the patch. :P

> ---

>  kernel/cgroup.c | 3 ++-

>  1 file changed, 2 insertions(+), 1 deletion(-)

>

> diff --git a/kernel/cgroup.c b/kernel/cgroup.c

> index 85bc9be..866059a 100644

> --- a/kernel/cgroup.c

> +++ b/kernel/cgroup.c

> @@ -2856,7 +2856,8 @@ static int cgroup_procs_write_permission(struct task_struct *task,

>          */

>         if (!uid_eq(cred->euid, GLOBAL_ROOT_UID) &&

>             !uid_eq(cred->euid, tcred->uid) &&

> -           !uid_eq(cred->euid, tcred->suid))

> +           !uid_eq(cred->euid, tcred->suid) &&

> +           !ns_capable(tcred->user_ns, CAP_SYS_RESOURCE))

>                 ret = -EACCES;

>

>         if (!ret && cgroup_on_dfl(dst_cgrp)) {

> --

> 2.7.4

>


Reviewed-by: Kees Cook <keescook@chromium.org>


-Kees

-- 
Kees Cook
Nexus Security
Alexei Starovoitov Nov. 9, 2016, 12:03 a.m. UTC | #2
On Tue, Nov 08, 2016 at 03:51:40PM -0800, Andy Lutomirski wrote:
> On Tue, Nov 8, 2016 at 3:28 PM, John Stultz <john.stultz@linaro.org> wrote:

> > This patch adds logic to allows a process to migrate other tasks

> > between cgroups if they have CAP_SYS_RESOURCE.

> >

> > In Android (where this feature originated), the ActivityManager tracks

> > various application states (TOP_APP, FOREGROUND, BACKGROUND, SYSTEM,

> > etc), and then as applications change states, the SchedPolicy logic

> > will migrate the application tasks between different cgroups used

> > to control the different application states (for example, there is a

> > background cpuset cgroup which can limit background tasks to stay

> > on one low-power cpu, and the bg_non_interactive cpuctrl cgroup can

> > then further limit those background tasks to a small percentage of

> > that one cpu's cpu time).

> >

> > However, for security reasons, Android doesn't want to make the

> > system_server (the process that runs the ActivityManager and

> > SchedPolicy logic), run as root. So in the Android common.git

> > kernel, they have some logic to allow cgroups to loosen their

> > permissions so CAP_SYS_NICE tasks can migrate other tasks between

> > cgroups.

> >

> > I feel the approach taken there overloads CAP_SYS_NICE a bit much

> > for non-android environments.

> >

> > So this patch, as suggested by Michael Kerrisk, simply adds a

> > check for CAP_SYS_RESOURCE.

> >

> > I've tested this with AOSP master, and this seems to work well

> > as Zygote and system_server already use CAP_SYS_RESOURCE. I've

> > also submitted patches against the android-4.4 kernel to change

> > it to use CAP_SYS_RESOURCE, and the Android developers just merged

> > it.

> >

> 

> I hate to say it, but I think I may see a problem.  Current

> developments are afoot to make cgroups do more than resource control.

> For example, there's Landlock and there's Daniel's ingress/egress

> filter thing.  Current cgroup controllers can mostly just DoS their

> controlled processes.  These new controllers (or controller-like

> things) can exfiltrate data and change semantics.

> 

> Does anyone have a security model in mind for these controllers and

> the cgroups that they're attached to?  I'm reasonably confident that

> CAP_SYS_RESOURCE is not the answer...


and specifically the answer is... ?
Also would be great if you start with specifying the question first
and the problem you're trying to solve.
Andy Lutomirski Nov. 9, 2016, 12:12 a.m. UTC | #3
On Tue, Nov 8, 2016 at 4:03 PM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Tue, Nov 08, 2016 at 03:51:40PM -0800, Andy Lutomirski wrote:

>> On Tue, Nov 8, 2016 at 3:28 PM, John Stultz <john.stultz@linaro.org> wrote:

>> > This patch adds logic to allows a process to migrate other tasks

>> > between cgroups if they have CAP_SYS_RESOURCE.

>> >

>> > In Android (where this feature originated), the ActivityManager tracks

>> > various application states (TOP_APP, FOREGROUND, BACKGROUND, SYSTEM,

>> > etc), and then as applications change states, the SchedPolicy logic

>> > will migrate the application tasks between different cgroups used

>> > to control the different application states (for example, there is a

>> > background cpuset cgroup which can limit background tasks to stay

>> > on one low-power cpu, and the bg_non_interactive cpuctrl cgroup can

>> > then further limit those background tasks to a small percentage of

>> > that one cpu's cpu time).

>> >

>> > However, for security reasons, Android doesn't want to make the

>> > system_server (the process that runs the ActivityManager and

>> > SchedPolicy logic), run as root. So in the Android common.git

>> > kernel, they have some logic to allow cgroups to loosen their

>> > permissions so CAP_SYS_NICE tasks can migrate other tasks between

>> > cgroups.

>> >

>> > I feel the approach taken there overloads CAP_SYS_NICE a bit much

>> > for non-android environments.

>> >

>> > So this patch, as suggested by Michael Kerrisk, simply adds a

>> > check for CAP_SYS_RESOURCE.

>> >

>> > I've tested this with AOSP master, and this seems to work well

>> > as Zygote and system_server already use CAP_SYS_RESOURCE. I've

>> > also submitted patches against the android-4.4 kernel to change

>> > it to use CAP_SYS_RESOURCE, and the Android developers just merged

>> > it.

>> >

>>

>> I hate to say it, but I think I may see a problem.  Current

>> developments are afoot to make cgroups do more than resource control.

>> For example, there's Landlock and there's Daniel's ingress/egress

>> filter thing.  Current cgroup controllers can mostly just DoS their

>> controlled processes.  These new controllers (or controller-like

>> things) can exfiltrate data and change semantics.

>>

>> Does anyone have a security model in mind for these controllers and

>> the cgroups that they're attached to?  I'm reasonably confident that

>> CAP_SYS_RESOURCE is not the answer...

>

> and specifically the answer is... ?

> Also would be great if you start with specifying the question first

> and the problem you're trying to solve.

>


I don't have a good answer right now.  Here are some constraints, though:

1. An insufficiently privileged process should not be able to move a
victim into a dangerous cgroup.

2. An insufficiently privileged process should not be able to move
itself into a dangerous cgroup and then use execve to gain privilege
such that the execve'd program can be compromised.

3. An insufficiently privileged process should not be able to make an
existing cgroup dangerous in a way that could compromise a victim in
that cgroup.

4. An insufficiently privileged process should not be able to make a
cgroup dangerous in a way that bypasses protections that would
otherwise protect execve() as used by itself or some other process in
that cgroup.

Keep in mind that "dangerous" may apply to a cgroup's descendents in
addition to the cgroup being controlled.
John Stultz Nov. 23, 2016, 12:57 a.m. UTC | #4
On Tue, Nov 8, 2016 at 4:12 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Tue, Nov 8, 2016 at 4:03 PM, Alexei Starovoitov

> <alexei.starovoitov@gmail.com> wrote:

>> On Tue, Nov 08, 2016 at 03:51:40PM -0800, Andy Lutomirski wrote:

>>>

>>> I hate to say it, but I think I may see a problem.  Current

>>> developments are afoot to make cgroups do more than resource control.

>>> For example, there's Landlock and there's Daniel's ingress/egress

>>> filter thing.  Current cgroup controllers can mostly just DoS their

>>> controlled processes.  These new controllers (or controller-like

>>> things) can exfiltrate data and change semantics.

>>>

>>> Does anyone have a security model in mind for these controllers and

>>> the cgroups that they're attached to?  I'm reasonably confident that

>>> CAP_SYS_RESOURCE is not the answer...

>>

>> and specifically the answer is... ?

>> Also would be great if you start with specifying the question first

>> and the problem you're trying to solve.

>>

>

> I don't have a good answer right now.  Here are some constraints, though:

>

> 1. An insufficiently privileged process should not be able to move a

> victim into a dangerous cgroup.

>

> 2. An insufficiently privileged process should not be able to move

> itself into a dangerous cgroup and then use execve to gain privilege

> such that the execve'd program can be compromised.

>

> 3. An insufficiently privileged process should not be able to make an

> existing cgroup dangerous in a way that could compromise a victim in

> that cgroup.

>

> 4. An insufficiently privileged process should not be able to make a

> cgroup dangerous in a way that bypasses protections that would

> otherwise protect execve() as used by itself or some other process in

> that cgroup.

>

> Keep in mind that "dangerous" may apply to a cgroup's descendents in

> addition to the cgroup being controlled.


Sorry for taking awhile to get back to you here.  I'm a little
befuddled as to what next steps I should consider (and honestly, I'm
not totally sure I really grok your concern here, particularly what
you mean with "dangrous cgroups").

So is going back to the CAP_CGROUP_MIGRATE approach (to properly
separate "sufficiently" from "insufficiently privileged") better?

Or something closer to the original method Android used of each cgroup
having an allow_attach() check which could determine what is
sufficiently privledged for the respective level of danger the cgroup
might poise?

Or just stepping back, what method would you imagine to be reasonable
to allow a specified task to migrate other tasks between cgroups
without it having to be root/suid?

thanks
-john
John Stultz Dec. 6, 2016, 12:28 a.m. UTC | #5
On Tue, Nov 22, 2016 at 4:57 PM, John Stultz <john.stultz@linaro.org> wrote:
> On Tue, Nov 8, 2016 at 4:12 PM, Andy Lutomirski <luto@amacapital.net> wrote:

>> On Tue, Nov 8, 2016 at 4:03 PM, Alexei Starovoitov

>> <alexei.starovoitov@gmail.com> wrote:

>>> On Tue, Nov 08, 2016 at 03:51:40PM -0800, Andy Lutomirski wrote:

>>>>

>>>> I hate to say it, but I think I may see a problem.  Current

>>>> developments are afoot to make cgroups do more than resource control.

>>>> For example, there's Landlock and there's Daniel's ingress/egress

>>>> filter thing.  Current cgroup controllers can mostly just DoS their

>>>> controlled processes.  These new controllers (or controller-like

>>>> things) can exfiltrate data and change semantics.

>>>>

>>>> Does anyone have a security model in mind for these controllers and

>>>> the cgroups that they're attached to?  I'm reasonably confident that

>>>> CAP_SYS_RESOURCE is not the answer...

>>>

>>> and specifically the answer is... ?

>>> Also would be great if you start with specifying the question first

>>> and the problem you're trying to solve.

>>>

>>

>> I don't have a good answer right now.  Here are some constraints, though:

>>

>> 1. An insufficiently privileged process should not be able to move a

>> victim into a dangerous cgroup.

>>

>> 2. An insufficiently privileged process should not be able to move

>> itself into a dangerous cgroup and then use execve to gain privilege

>> such that the execve'd program can be compromised.

>>

>> 3. An insufficiently privileged process should not be able to make an

>> existing cgroup dangerous in a way that could compromise a victim in

>> that cgroup.

>>

>> 4. An insufficiently privileged process should not be able to make a

>> cgroup dangerous in a way that bypasses protections that would

>> otherwise protect execve() as used by itself or some other process in

>> that cgroup.

>>

>> Keep in mind that "dangerous" may apply to a cgroup's descendents in

>> addition to the cgroup being controlled.

>

> Sorry for taking awhile to get back to you here.  I'm a little

> befuddled as to what next steps I should consider (and honestly, I'm

> not totally sure I really grok your concern here, particularly what

> you mean with "dangrous cgroups").

>

> So is going back to the CAP_CGROUP_MIGRATE approach (to properly

> separate "sufficiently" from "insufficiently privileged") better?

>

> Or something closer to the original method Android used of each cgroup

> having an allow_attach() check which could determine what is

> sufficiently privledged for the respective level of danger the cgroup

> might poise?

>

> Or just stepping back, what method would you imagine to be reasonable

> to allow a specified task to migrate other tasks between cgroups

> without it having to be root/suid?


Any suggested feedback here?

thanks
-john
Andy Lutomirski Dec. 6, 2016, 12:36 a.m. UTC | #6
On Mon, Dec 5, 2016 at 4:28 PM, John Stultz <john.stultz@linaro.org> wrote:
> On Tue, Nov 22, 2016 at 4:57 PM, John Stultz <john.stultz@linaro.org> wrote:

>> On Tue, Nov 8, 2016 at 4:12 PM, Andy Lutomirski <luto@amacapital.net> wrote:

>>> On Tue, Nov 8, 2016 at 4:03 PM, Alexei Starovoitov

>>> <alexei.starovoitov@gmail.com> wrote:

>>>> On Tue, Nov 08, 2016 at 03:51:40PM -0800, Andy Lutomirski wrote:

>>>>>

>>>>> I hate to say it, but I think I may see a problem.  Current

>>>>> developments are afoot to make cgroups do more than resource control.

>>>>> For example, there's Landlock and there's Daniel's ingress/egress

>>>>> filter thing.  Current cgroup controllers can mostly just DoS their

>>>>> controlled processes.  These new controllers (or controller-like

>>>>> things) can exfiltrate data and change semantics.

>>>>>

>>>>> Does anyone have a security model in mind for these controllers and

>>>>> the cgroups that they're attached to?  I'm reasonably confident that

>>>>> CAP_SYS_RESOURCE is not the answer...

>>>>

>>>> and specifically the answer is... ?

>>>> Also would be great if you start with specifying the question first

>>>> and the problem you're trying to solve.

>>>>

>>>

>>> I don't have a good answer right now.  Here are some constraints, though:

>>>

>>> 1. An insufficiently privileged process should not be able to move a

>>> victim into a dangerous cgroup.

>>>

>>> 2. An insufficiently privileged process should not be able to move

>>> itself into a dangerous cgroup and then use execve to gain privilege

>>> such that the execve'd program can be compromised.

>>>

>>> 3. An insufficiently privileged process should not be able to make an

>>> existing cgroup dangerous in a way that could compromise a victim in

>>> that cgroup.

>>>

>>> 4. An insufficiently privileged process should not be able to make a

>>> cgroup dangerous in a way that bypasses protections that would

>>> otherwise protect execve() as used by itself or some other process in

>>> that cgroup.

>>>

>>> Keep in mind that "dangerous" may apply to a cgroup's descendents in

>>> addition to the cgroup being controlled.

>>

>> Sorry for taking awhile to get back to you here.  I'm a little

>> befuddled as to what next steps I should consider (and honestly, I'm

>> not totally sure I really grok your concern here, particularly what

>> you mean with "dangrous cgroups").

>>

>> So is going back to the CAP_CGROUP_MIGRATE approach (to properly

>> separate "sufficiently" from "insufficiently privileged") better?

>>

>> Or something closer to the original method Android used of each cgroup

>> having an allow_attach() check which could determine what is

>> sufficiently privledged for the respective level of danger the cgroup

>> might poise?

>>

>> Or just stepping back, what method would you imagine to be reasonable

>> to allow a specified task to migrate other tasks between cgroups

>> without it having to be root/suid?

>

> Any suggested feedback here?


I really don't know.  The cgroupfs interface is a bit unfortunate in
that it doesn't really express the constraints.  To safely migrate a
task, ISTM you ought to have some form of privilege over the task
*and* some form of privilege over the cgroup.  cgroupfs only handles
the latter.

CAP_CGROUP_MIGRATE ought to be okay.  Or maybe cgroupfs needs to gain
a concept of "dangerous" cgroups and further restrict them and
CAP_SYS_RESOURCE should be fine for non-dangerous cgroups?  I think I
favor the latter, but it might be nice to hear from Tejun first.

--Andy
Serge E. Hallyn Dec. 6, 2016, 2 a.m. UTC | #7
On Mon, Dec 05, 2016 at 04:36:51PM -0800, Andy Lutomirski wrote:
> On Mon, Dec 5, 2016 at 4:28 PM, John Stultz <john.stultz@linaro.org> wrote:

> > On Tue, Nov 22, 2016 at 4:57 PM, John Stultz <john.stultz@linaro.org> wrote:

> >> On Tue, Nov 8, 2016 at 4:12 PM, Andy Lutomirski <luto@amacapital.net> wrote:

> >>> On Tue, Nov 8, 2016 at 4:03 PM, Alexei Starovoitov

> >>> <alexei.starovoitov@gmail.com> wrote:

> >>>> On Tue, Nov 08, 2016 at 03:51:40PM -0800, Andy Lutomirski wrote:

> >>>>>

> >>>>> I hate to say it, but I think I may see a problem.  Current

> >>>>> developments are afoot to make cgroups do more than resource control.

> >>>>> For example, there's Landlock and there's Daniel's ingress/egress

> >>>>> filter thing.  Current cgroup controllers can mostly just DoS their

> >>>>> controlled processes.  These new controllers (or controller-like

> >>>>> things) can exfiltrate data and change semantics.

> >>>>>

> >>>>> Does anyone have a security model in mind for these controllers and

> >>>>> the cgroups that they're attached to?  I'm reasonably confident that

> >>>>> CAP_SYS_RESOURCE is not the answer...

> >>>>

> >>>> and specifically the answer is... ?

> >>>> Also would be great if you start with specifying the question first

> >>>> and the problem you're trying to solve.

> >>>>

> >>>

> >>> I don't have a good answer right now.  Here are some constraints, though:

> >>>

> >>> 1. An insufficiently privileged process should not be able to move a

> >>> victim into a dangerous cgroup.

> >>>

> >>> 2. An insufficiently privileged process should not be able to move

> >>> itself into a dangerous cgroup and then use execve to gain privilege

> >>> such that the execve'd program can be compromised.

> >>>

> >>> 3. An insufficiently privileged process should not be able to make an

> >>> existing cgroup dangerous in a way that could compromise a victim in

> >>> that cgroup.

> >>>

> >>> 4. An insufficiently privileged process should not be able to make a

> >>> cgroup dangerous in a way that bypasses protections that would

> >>> otherwise protect execve() as used by itself or some other process in

> >>> that cgroup.

> >>>

> >>> Keep in mind that "dangerous" may apply to a cgroup's descendents in

> >>> addition to the cgroup being controlled.

> >>

> >> Sorry for taking awhile to get back to you here.  I'm a little

> >> befuddled as to what next steps I should consider (and honestly, I'm

> >> not totally sure I really grok your concern here, particularly what

> >> you mean with "dangrous cgroups").

> >>

> >> So is going back to the CAP_CGROUP_MIGRATE approach (to properly

> >> separate "sufficiently" from "insufficiently privileged") better?

> >>

> >> Or something closer to the original method Android used of each cgroup

> >> having an allow_attach() check which could determine what is

> >> sufficiently privledged for the respective level of danger the cgroup

> >> might poise?

> >>

> >> Or just stepping back, what method would you imagine to be reasonable

> >> to allow a specified task to migrate other tasks between cgroups

> >> without it having to be root/suid?

> >

> > Any suggested feedback here?

> 

> I really don't know.  The cgroupfs interface is a bit unfortunate in

> that it doesn't really express the constraints.  To safely migrate a

> task, ISTM you ought to have some form of privilege over the task

> *and* some form of privilege over the cgroup.


Agreed.  The problem is that the privilege required should depend on
the controller (I guess).  For memory and cpuset, CAP_SYS_NICE seems
right.  Perhaps CAP_SYS_RESOURCE would be needed for some..  but then,
as I look through the lists (capabilities(7) and the list of controllers),
it seems like CAP_SYS_NICE works for everything.  What else would we need?
Maybe CAP_NET_ADMIN for net_cls and net_prio?  CAP_SYS_RESOURCE|CAP_SYS_ADMIN
for pids?

>   cgroupfs only handles

> the latter.


If we need different checks for different controllers, we can add
checks to cgroupfs.

> CAP_CGROUP_MIGRATE ought to be okay.  Or maybe cgroupfs needs to gain

> a concept of "dangerous" cgroups and further restrict them and

> CAP_SYS_RESOURCE should be fine for non-dangerous cgroups?  I think I

> favor the latter, but it might be nice to hear from Tejun first.

> 

> --Andy
Tejun Heo Dec. 6, 2016, 4:55 p.m. UTC | #8
Hello,

On Mon, Dec 05, 2016 at 04:36:51PM -0800, Andy Lutomirski wrote:
> I really don't know.  The cgroupfs interface is a bit unfortunate in

> that it doesn't really express the constraints.  To safely migrate a

> task, ISTM you ought to have some form of privilege over the task

> *and* some form of privilege over the cgroup.  cgroupfs only handles

> the latter.

> 

> CAP_CGROUP_MIGRATE ought to be okay.  Or maybe cgroupfs needs to gain

> a concept of "dangerous" cgroups and further restrict them and

> CAP_SYS_RESOURCE should be fine for non-dangerous cgroups?  I think I

> favor the latter, but it might be nice to hear from Tejun first.


If we can't do CAP_SYS_RESOURCE due to overlaps, let's go with a
separate CAP.  While for android and cgroup v1, it's nice to have a
finer grained CAP for security control, privilege isolation in cgroup
should also primarily done through hierarchical delegation.  It
doesn't make sense to have another system in parallel.

We can't do it properly on v1 because some controllers aren't properly
hierarchical and delegation model isn't well defined.  e.g. nothing
prevents a process from being pulled across different subtrees with
the same delegation, but v2 can do it properly.  All that's necessary
is to make the CAP test OR'd to other perm checks instead of AND'ing
so that the CAP just allows overriding restrictions expressed through
delegation but it's normally possible to move processes around in
one's own delegated subtree.

Thanks.

-- 
tejun
Tejun Heo Dec. 6, 2016, 4:57 p.m. UTC | #9
Hello, Serge.

On Mon, Dec 05, 2016 at 08:00:11PM -0600, Serge E. Hallyn wrote:
> > I really don't know.  The cgroupfs interface is a bit unfortunate in

> > that it doesn't really express the constraints.  To safely migrate a

> > task, ISTM you ought to have some form of privilege over the task

> > *and* some form of privilege over the cgroup.

> 

> Agreed.  The problem is that the privilege required should depend on

> the controller (I guess).  For memory and cpuset, CAP_SYS_NICE seems

> right.  Perhaps CAP_SYS_RESOURCE would be needed for some..  but then,

> as I look through the lists (capabilities(7) and the list of controllers),

> it seems like CAP_SYS_NICE works for everything.  What else would we need?

> Maybe CAP_NET_ADMIN for net_cls and net_prio?  CAP_SYS_RESOURCE|CAP_SYS_ADMIN

> for pids?


Please see my other reply but I don't think it's a good idea to have
these extra checks on the side when there already is hierarchical
delegation mechanism which should be able to handle both resource
control and process management delegation.

Thanks.

-- 
tejun
Andy Lutomirski Dec. 6, 2016, 5:01 p.m. UTC | #10
On Tue, Dec 6, 2016 at 8:55 AM, Tejun Heo <tj@kernel.org> wrote:
> Hello,

>

> On Mon, Dec 05, 2016 at 04:36:51PM -0800, Andy Lutomirski wrote:

>> I really don't know.  The cgroupfs interface is a bit unfortunate in

>> that it doesn't really express the constraints.  To safely migrate a

>> task, ISTM you ought to have some form of privilege over the task

>> *and* some form of privilege over the cgroup.  cgroupfs only handles

>> the latter.

>>

>> CAP_CGROUP_MIGRATE ought to be okay.  Or maybe cgroupfs needs to gain

>> a concept of "dangerous" cgroups and further restrict them and

>> CAP_SYS_RESOURCE should be fine for non-dangerous cgroups?  I think I

>> favor the latter, but it might be nice to hear from Tejun first.

>

> If we can't do CAP_SYS_RESOURCE due to overlaps, let's go with a

> separate CAP.  While for android and cgroup v1, it's nice to have a

> finer grained CAP for security control, privilege isolation in cgroup

> should also primarily done through hierarchical delegation.  It

> doesn't make sense to have another system in parallel.

>

> We can't do it properly on v1 because some controllers aren't properly

> hierarchical and delegation model isn't well defined.  e.g. nothing

> prevents a process from being pulled across different subtrees with

> the same delegation, but v2 can do it properly.  All that's necessary

> is to make the CAP test OR'd to other perm checks instead of AND'ing

> so that the CAP just allows overriding restrictions expressed through

> delegation but it's normally possible to move processes around in

> one's own delegated subtree.


How would one be granted the right to move processes around in one's
own subtree?

Are you imagining that, if you're in /a/b and you want to move a
process that's currently in /a/b/c to /a/b/d then you're allowed to
because the target process is in your tree?  If so, I doubt this has
the security properties you want -- namely, if you can cooperate with
anyone in /, even if they're unprivileged, you can break it.
Tejun Heo Dec. 6, 2016, 6:12 p.m. UTC | #11
Hello,

On Tue, Dec 06, 2016 at 09:01:17AM -0800, Andy Lutomirski wrote:
> How would one be granted the right to move processes around in one's

> own subtree?


Through expicit delegation - chowning of the directory and
cgroup.procs file.

> Are you imagining that, if you're in /a/b and you want to move a

> process that's currently in /a/b/c to /a/b/d then you're allowed to

> because the target process is in your tree?  If so, I doubt this has

> the security properties you want -- namely, if you can cooperate with

> anyone in /, even if they're unprivileged, you can break it.


Delegation is an explicit operation and reflected in the ownership of
the subdirectories and cgroup interface files in them.  The
subhierarchy containment is achieved by requiring the user who's
trying to migrate a process to have write perm on cgroup.procs on the
common ancestor of the source and target in addition to the target.
So, a user who has a subhierarchy delegated to it can move processes
around inside but not out of or into it.  Also, if there are multiple
delegated disjoint subhierarchies, processes can't be moved across
them by the delegatee either.

Thanks.

-- 
tejun
Andy Lutomirski Dec. 6, 2016, 6:13 p.m. UTC | #12
On Tue, Dec 6, 2016 at 10:12 AM, Tejun Heo <tj@kernel.org> wrote:
> Hello,

>

> On Tue, Dec 06, 2016 at 09:01:17AM -0800, Andy Lutomirski wrote:

>> How would one be granted the right to move processes around in one's

>> own subtree?

>

> Through expicit delegation - chowning of the directory and

> cgroup.procs file.

>

>> Are you imagining that, if you're in /a/b and you want to move a

>> process that's currently in /a/b/c to /a/b/d then you're allowed to

>> because the target process is in your tree?  If so, I doubt this has

>> the security properties you want -- namely, if you can cooperate with

>> anyone in /, even if they're unprivileged, you can break it.

>

> Delegation is an explicit operation and reflected in the ownership of

> the subdirectories and cgroup interface files in them.  The

> subhierarchy containment is achieved by requiring the user who's

> trying to migrate a process to have write perm on cgroup.procs on the

> common ancestor of the source and target in addition to the target.


OK, I see what you're doing.  That's interesting.
John Stultz Dec. 9, 2016, 5:39 a.m. UTC | #13
On Tue, Dec 6, 2016 at 10:23 AM, Tejun Heo <tj@kernel.org> wrote:
> Hello,

>

> On Tue, Dec 06, 2016 at 10:13:53AM -0800, Andy Lutomirski wrote:

>> > Delegation is an explicit operation and reflected in the ownership of

>> > the subdirectories and cgroup interface files in them.  The

>> > subhierarchy containment is achieved by requiring the user who's

>> > trying to migrate a process to have write perm on cgroup.procs on the

>> > common ancestor of the source and target in addition to the target.

>>

>> OK, I see what you're doing.  That's interesting.

>

> It's something born out of usages of cgroup v1.  People used it that

> way (chowning files and directories) and combined with the uid checksn

> it yielded something which is useful sometimes, but it always had

> issues with hierarchical behaviors, which files to chmod and the weird

> combination of uid checks.  cgroup v2 has a clear delegation model but

> the uid checks are still left in as not changing was the default.

>

> It's not necessary and I'm thinking about queueing something like the

> following in the next cycle.

>

> As for the android CAP discussion, I think it'd be nice to share an

> existing CAP but if we can't find a good one to share, let's create a

> new one.


So just to clarify the discussion for my purposes and make sure I
understood, per-cgroup CAP rules was not desired, and instead we
should either utilize an existing cap (are there still objections to
CAP_SYS_RESOURCE? - this isn't clear to me) or create a new one (ie,
bring back the older CAP_CGROUP_MIGRATE patch).

Tejun: Do you have a more finished version of your patch that I should
add my changes on top of?

thanks
-john
Tejun Heo Dec. 9, 2016, 1:27 p.m. UTC | #14
Hello, John.

On Thu, Dec 08, 2016 at 09:39:38PM -0800, John Stultz wrote:
> So just to clarify the discussion for my purposes and make sure I

> understood, per-cgroup CAP rules was not desired, and instead we

> should either utilize an existing cap (are there still objections to

> CAP_SYS_RESOURCE? - this isn't clear to me) or create a new one (ie,

> bring back the older CAP_CGROUP_MIGRATE patch).


Let's create a new one.  It looks to be a bit too different to share
with an existing one.

> Tejun: Do you have a more finished version of your patch that I should

> add my changes on top of?


Oh, just submit the patch on top of the current for-next.  I can queue
mine on top of yours.  They are mostly orthogonal.

Thanks.

-- 
tejun
diff mbox

Patch

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 85bc9be..866059a 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2856,7 +2856,8 @@  static int cgroup_procs_write_permission(struct task_struct *task,
 	 */
 	if (!uid_eq(cred->euid, GLOBAL_ROOT_UID) &&
 	    !uid_eq(cred->euid, tcred->uid) &&
-	    !uid_eq(cred->euid, tcred->suid))
+	    !uid_eq(cred->euid, tcred->suid) &&
+	    !ns_capable(tcred->user_ns, CAP_SYS_RESOURCE))
 		ret = -EACCES;
 
 	if (!ret && cgroup_on_dfl(dst_cgrp)) {