[RFC,bpf-next,0/3] bpf: freeze a task cgroup from bpf

Message ID	20240327225334.58474-1-tixxdz@gmail.com
Headers	show Received: from mail-wr1-f54.google.com (mail-wr1-f54.google.com [209.85.221.54]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id C152D2EB04; Wed, 27 Mar 2024 22:54:48 +0000 (UTC) From: Djalal Harouni <tixxdz@gmail.com> To: tixxdz@gmail.com Cc: Tejun Heo <tj@kernel.org>, Zefan Li <lizefan.x@bytedance.com>, Johannes Weiner <hannes@cmpxchg.org>, Alexei Starovoitov <ast@kernel.org>, Daniel Borkmann <daniel@iogearbox.net>, Andrii Nakryiko <andrii@kernel.org>, Martin KaFai Lau <martin.lau@linux.dev>, Eduard Zingerman <eddyz87@gmail.com>, Song Liu <song@kernel.org>, Yonghong Song <yonghong.song@linux.dev>, John Fastabend <john.fastabend@gmail.com>, KP Singh <kpsingh@kernel.org>, Stanislav Fomichev <sdf@google.com>, Hao Luo <haoluo@google.com>, Jiri Olsa <jolsa@kernel.org>, Mykola Lysenko <mykolal@fb.com>, Shuah Khan <shuah@kernel.org>, linux-kernel@vger.kernel.org, cgroups@vger.kernel.org, bpf@vger.kernel.org, linux-kselftest@vger.kernel.org Subject: [RFC PATCH bpf-next 0/3] bpf: freeze a task cgroup from bpf Date: Wed, 27 Mar 2024 23:53:22 +0100 Message-Id: <20240327225334.58474-1-tixxdz@gmail.com> In-Reply-To: <20240327-ccb56fc7a6e80136db80876c@djalal> References: <20240327-ccb56fc7a6e80136db80876c@djalal> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	bpf: freeze a task cgroup from bpf \| expand [RFC,bpf-next,0/3] bpf: freeze a task cgroup from bpf [RFC,bpf-next,1/3] cgroup: add cgroup_freeze_no_kn() to freeze a cgroup from bpf [RFC,bpf-next,2/3] bpf: add bpf_task_freeze_cgroup() to freeze the cgroup of a task [RFC,bpf-next,3/3] selftests/bpf: add selftest for bpf_task_freeze_cgroup

Djalal Harouni March 27, 2024, 10:53 p.m. UTC

This patch series adds support to freeze the task cgroup hierarchy
that is on a default cgroup v2 without going through kernfs interface.

For some cases we want to freeze the cgroup of a task based on some
signals, doing so from bpf is better than user space which could be
too late.

Planned users of this feature are: tetragon and systemd when freezing
a cgroup hierarchy that could be a K8s pod, container, system service
or a user session.

Patch 1: cgroup: add cgroup_freeze_no_kn() to freeze a cgroup from bpf
Patch 2: bpf: add bpf_task_freeze_cgroup() to freeze the cgroup of a task
Patch 3: selftests/bpf: add selftest for bpf_task_freeze_cgroup

 include/linux/cgroup.h                                      |   2 ++
 kernel/bpf/helpers.c                                        |  31 ++++
 kernel/cgroup/cgroup.c                                      |  67 ++++++++
 tools/testing/selftests/bpf/prog_tests/task_freeze_cgroup.c | 165 +++++++++++++++++++++
 tools/testing/selftests/bpf/progs/test_task_freeze_cgroup.c | 110 ++++++++++++++
 5 files changed, 375 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/task_freeze_cgroup.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_task_freeze_cgroup.c

Tejun Heo March 28, 2024, 5:22 p.m. UTC | #1

Hello, Djalal.

On Wed, Mar 27, 2024 at 11:53:22PM +0100, Djalal Harouni wrote:
> This patch series adds support to freeze the task cgroup hierarchy
> that is on a default cgroup v2 without going through kernfs interface.
> 
> For some cases we want to freeze the cgroup of a task based on some
> signals, doing so from bpf is better than user space which could be
> too late.
> 
> Planned users of this feature are: tetragon and systemd when freezing
> a cgroup hierarchy that could be a K8s pod, container, system service
> or a user session.
> 
> Patch 1: cgroup: add cgroup_freeze_no_kn() to freeze a cgroup from bpf
> Patch 2: bpf: add bpf_task_freeze_cgroup() to freeze the cgroup of a task
> Patch 3: selftests/bpf: add selftest for bpf_task_freeze_cgroup

It bothers me a bit that it's adding a dedicated interface for something
which already has a defined userspace interface. Would it be better to have
kfunc wrappers for kernel_read() and kernel_write()?

Thanks.

Alexei Starovoitov March 28, 2024, 5:32 p.m. UTC | #2

On Thu, Mar 28, 2024 at 10:22 AM Tejun Heo <tj@kernel.org> wrote:
>
> Hello, Djalal.
>
> On Wed, Mar 27, 2024 at 11:53:22PM +0100, Djalal Harouni wrote:
> > This patch series adds support to freeze the task cgroup hierarchy
> > that is on a default cgroup v2 without going through kernfs interface.
> >
> > For some cases we want to freeze the cgroup of a task based on some
> > signals, doing so from bpf is better than user space which could be
> > too late.
> >
> > Planned users of this feature are: tetragon and systemd when freezing
> > a cgroup hierarchy that could be a K8s pod, container, system service
> > or a user session.
> >
> > Patch 1: cgroup: add cgroup_freeze_no_kn() to freeze a cgroup from bpf
> > Patch 2: bpf: add bpf_task_freeze_cgroup() to freeze the cgroup of a task
> > Patch 3: selftests/bpf: add selftest for bpf_task_freeze_cgroup
>
> It bothers me a bit that it's adding a dedicated interface for something
> which already has a defined userspace interface. Would it be better to have
> kfunc wrappers for kernel_read() and kernel_write()?

How would that look ?
prog cannot and shouldn't open a file.
The seq_file would be passed/pinned by user space?

Tejun Heo March 28, 2024, 5:58 p.m. UTC | #3

Hello, Alexei.

On Thu, Mar 28, 2024 at 10:32:24AM -0700, Alexei Starovoitov wrote:
> > It bothers me a bit that it's adding a dedicated interface for something
> > which already has a defined userspace interface. Would it be better to have
> > kfunc wrappers for kernel_read() and kernel_write()?
> 
> How would that look ?
> prog cannot and shouldn't open a file.

Oh, I didn't know. Why is that?

> The seq_file would be passed/pinned by user space?

Would it work if it's just "open this file, write this and then close it"?

Thanks.

Alexei Starovoitov March 28, 2024, 7:46 p.m. UTC | #4

On Thu, Mar 28, 2024 at 10:58 AM Tejun Heo <tj@kernel.org> wrote:
>
> Hello, Alexei.
>
> On Thu, Mar 28, 2024 at 10:32:24AM -0700, Alexei Starovoitov wrote:
> > > It bothers me a bit that it's adding a dedicated interface for something
> > > which already has a defined userspace interface. Would it be better to have
> > > kfunc wrappers for kernel_read() and kernel_write()?
> >
> > How would that look ?
> > prog cannot and shouldn't open a file.
>
> Oh, I didn't know. Why is that?
>
> > The seq_file would be passed/pinned by user space?
>
> Would it work if it's just "open this file, write this and then close it"?

Continuing discussion...
To use kernel_file_open() it would need path, inode, cred.
None of that is available now.
Allocating all these structures just to wrap a cgroup pointer
feels like overkill.
Of course, it would solve the need to introduce other
cgroup apis that are already available via text based cgroupfs
read/write. So there are pros and cons in both approaches.
Maybe the 3rd option would be to expose:
cgroup_lock() as a special blend of acquire plus lock.
Then there will be no need for bpf_task_freeze_cgroup() with task
argument. Instead cgroup_freeze() will be such kfunc that
takes cgroup argument and the verifier will check that
cgroup was acquired/locked.
Sort-of what we check to access bpf_rb_root.

Tejun Heo March 28, 2024, 8:02 p.m. UTC | #5

Hello,

On Thu, Mar 28, 2024 at 12:46:03PM -0700, Alexei Starovoitov wrote:
> To use kernel_file_open() it would need path, inode, cred.

Yeah, it's more involved and potentially more controversial. That said, just
to push the argument a bit further, at least for path, it'd be nice to have
a helper anyway which can return cgroup path. I don't know whether we'd need
direct inode access. For cred, just share some root cred?

> None of that is available now.
> Allocating all these structures just to wrap a cgroup pointer
> feels like overkill.
> Of course, it would solve the need to introduce other
> cgroup apis that are already available via text based cgroupfs
> read/write. So there are pros and cons in both approaches.
> Maybe the 3rd option would be to expose:
> cgroup_lock() as a special blend of acquire plus lock.

Oh, let's not expose that. That has been a source of problem for some use
cases and we may have to change how cgroups are locked.

> Then there will be no need for bpf_task_freeze_cgroup() with task
> argument. Instead cgroup_freeze() will be such kfunc that
> takes cgroup argument and the verifier will check that
> cgroup was acquired/locked.
> Sort-of what we check to access bpf_rb_root.

There's also cgroup.kill which would be useful for similar use cases. We can
add interface for both but idk. Let's say we have something like the
following (pardon the bad naming):

  bpf_cgroup_knob_write(struct cgroup *cgrp, char *filename, char *buf)

Would that work? I'm not necessarily in love with the idea or against adding
separate helpers but the duplication still bothers me a bit.

Thanks.

Alexei Starovoitov March 28, 2024, 8:45 p.m. UTC | #6

On Thu, Mar 28, 2024 at 1:02 PM Tejun Heo <tj@kernel.org> wrote:
>
> There's also cgroup.kill which would be useful for similar use cases. We can
> add interface for both but idk. Let's say we have something like the
> following (pardon the bad naming):
>
>   bpf_cgroup_knob_write(struct cgroup *cgrp, char *filename, char *buf)
>
> Would that work? I'm not necessarily in love with the idea or against adding
> separate helpers but the duplication still bothers me a bit.

I liked it.
So filename will be one of cgroup_base_files[].name ?
We probably don't want psi or cgroup1_base_files in there.

Tejun Heo March 28, 2024, 9:01 p.m. UTC | #7

Hello,

On Thu, Mar 28, 2024 at 01:45:56PM -0700, Alexei Starovoitov wrote:
> On Thu, Mar 28, 2024 at 1:02 PM Tejun Heo <tj@kernel.org> wrote:
> >
> > There's also cgroup.kill which would be useful for similar use cases. We can
> > add interface for both but idk. Let's say we have something like the
> > following (pardon the bad naming):
> >
> >   bpf_cgroup_knob_write(struct cgroup *cgrp, char *filename, char *buf)
> >
> > Would that work? I'm not necessarily in love with the idea or against adding
> > separate helpers but the duplication still bothers me a bit.
> 
> I liked it.
> So filename will be one of cgroup_base_files[].name ?
> We probably don't want psi or cgroup1_base_files in there.

Would it matter? If the user has root perm, they can do whatever with the
files anyway, so I'm not sure why we'd restrict any specific knob. Maybe we
wanna make sure @filename doesn't include '/'? Or is it that you don't want
to go through the usual file name look up?

> From the verifier pov 2nd arg can be "char *knob__str" and
> the verifier will make sure it's a constant NULL terminated string,
> so at runtime it will be easier to search cgroup_base_files array.
> And 'buf' can be: void *mem, int mem__sz with kfunc doing
> run-time validation that there a null there.

That all sound good.

Thanks.

Alexei Starovoitov March 28, 2024, 9:28 p.m. UTC | #8

On Thu, Mar 28, 2024 at 2:01 PM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Thu, Mar 28, 2024 at 01:45:56PM -0700, Alexei Starovoitov wrote:
> > On Thu, Mar 28, 2024 at 1:02 PM Tejun Heo <tj@kernel.org> wrote:
> > >
> > > There's also cgroup.kill which would be useful for similar use cases. We can
> > > add interface for both but idk. Let's say we have something like the
> > > following (pardon the bad naming):
> > >
> > >   bpf_cgroup_knob_write(struct cgroup *cgrp, char *filename, char *buf)
> > >
> > > Would that work? I'm not necessarily in love with the idea or against adding
> > > separate helpers but the duplication still bothers me a bit.
> >
> > I liked it.
> > So filename will be one of cgroup_base_files[].name ?
> > We probably don't want psi or cgroup1_base_files in there.
>
> Would it matter?

Few weak reasons:
. cgroup_psi_files have show/write/poll/release which
  doesn't map to this bpf_cgroup_knob_write/read ?
. cgroup1_base_files probably needs to a separate kfunc
  bpf_cgroup1_...

> If the user has root perm, they can do whatever with the
> files anyway, so I'm not sure why we'd restrict any specific knob. Maybe we
> wanna make sure @filename doesn't include '/'? Or is it that you don't want
> to go through the usual file name look up?

yeah. why do a file lookup? The names are there in the array.
cgroup pointer gives that "relative path" and knob name is the last
part of such "path". Easy to search in that array(s).

> > From the verifier pov 2nd arg can be "char *knob__str" and
> > the verifier will make sure it's a constant NULL terminated string,
> > so at runtime it will be easier to search cgroup_base_files array.
> > And 'buf' can be: void *mem, int mem__sz with kfunc doing
> > run-time validation that there a null there.
>
> That all sound good.
>
> Thanks.
>
> --
> tejun

Tejun Heo March 28, 2024, 11:23 p.m. UTC | #9

Hello,

On Thu, Mar 28, 2024 at 02:28:51PM -0700, Alexei Starovoitov wrote:
> > > So filename will be one of cgroup_base_files[].name ?
> > > We probably don't want psi or cgroup1_base_files in there.
> >
> > Would it matter?
> 
> Few weak reasons:
> . cgroup_psi_files have show/write/poll/release which
>   doesn't map to this bpf_cgroup_knob_write/read ?
> . cgroup1_base_files probably needs to a separate kfunc
>   bpf_cgroup1_...
> 
> > If the user has root perm, they can do whatever with the
> > files anyway, so I'm not sure why we'd restrict any specific knob. Maybe we
> > wanna make sure @filename doesn't include '/'? Or is it that you don't want
> > to go through the usual file name look up?
> 
> yeah. why do a file lookup? The names are there in the array.
> cgroup pointer gives that "relative path" and knob name is the last
> part of such "path". Easy to search in that array(s).

Difficult to tell without looking at the implementation but I don't have
strong opinions. The interface makes sense to me and as long as we can hook
it up in a reasonably way, it should be okay. We can always change internal
implementation later if necessary.

Thanks.

Djalal Harouni March 29, 2024, 1:22 p.m. UTC | #10

Hello Tejun, Alexei,

On 3/28/24 22:01, Tejun Heo wrote:
> Hello,
> 
> On Thu, Mar 28, 2024 at 01:45:56PM -0700, Alexei Starovoitov wrote:
>> On Thu, Mar 28, 2024 at 1:02 PM Tejun Heo <tj@kernel.org> wrote:
>>>
>>> There's also cgroup.kill which would be useful for similar use cases. We can
>>> add interface for both but idk. Let's say we have something like the
>>> following (pardon the bad naming):
>>>

Yes having the cgroup.kill from bpf would be useful!


>>>   bpf_cgroup_knob_write(struct cgroup *cgrp, char *filename, char *buf)
>>>
>>> Would that work? I'm not necessarily in love with the idea or against adding
>>> separate helpers but the duplication still bothers me a bit.
>>
>> I liked it.
>> So filename will be one of cgroup_base_files[].name ?
>> We probably don't want psi or cgroup1_base_files in there.
> 
> Would it matter? If the user has root perm, they can do whatever with the
> files anyway, so I'm not sure why we'd restrict any specific knob. Maybe we
> wanna make sure @filename doesn't include '/'? Or is it that you don't want
> to go through the usual file name look up?

It would be easy at least for me if I just start with cgroupv2 and
ensure that it has same available filenames as if we go through kernfs.
Not a root cgroup node and maybe only freeze and kill for now that are
part of cgroup_base_files.

So if I get it right, somehow like what I did but we endup with:

In bpf, cgroup was already acquired.

bpf_cgroup_knob_write(cgroup, "freeze", buf)
|_ parse params -> lock cgroup_mutex -> cgroup_freeze() -> unlock


cgroup_freeze_write(struct kernfs_open_file *of, char *buf,...)
|_ parse params -> cgroup_ref++ -> krnfs_active_ref--  ->
     -> lock cgroup_mutex -> cgroup_freeze() -> unlock + krnfs++ ...

Please let me know if I missed something.

Thanks!

Tejun Heo March 29, 2024, 9:39 p.m. UTC | #11

Hello,

On Fri, Mar 29, 2024 at 02:22:28PM +0100, Djalal Harouni wrote:
> It would be easy at least for me if I just start with cgroupv2 and
> ensure that it has same available filenames as if we go through kernfs.
> Not a root cgroup node and maybe only freeze and kill for now that are
> part of cgroup_base_files.
> 
> So if I get it right, somehow like what I did but we endup with:
> 
> In bpf, cgroup was already acquired.
> 
> bpf_cgroup_knob_write(cgroup, "freeze", buf)
> |_ parse params -> lock cgroup_mutex -> cgroup_freeze() -> unlock
> 
> 
> cgroup_freeze_write(struct kernfs_open_file *of, char *buf,...)
> |_ parse params -> cgroup_ref++ -> krnfs_active_ref--  ->
>      -> lock cgroup_mutex -> cgroup_freeze() -> unlock + krnfs++ ...
> 
> Please let me know if I missed something.

I've thought about it a bit and I wonder whether a better way to do this is
implementing this at the kernfs layer. Something like (hopefully with a
better name):

 s32 bpf_kernfs_knob_write(struct kernfs_node *dir, const char *knob, char *buf);

So, about the same, but takes kernfs_node directory instead of cgroup. This
would make the interface useful for accessing sysfs knobs too which use
similar conventions. For cgroup, @dir is just cgrp->kn and for sysfs it'd be
kobj->sd. This way we can avoid the internal object -> path -> internal
object ping-poinging while keeping the interface a lot more generic. What do
you think?

Thanks.

Alexei Starovoitov March 29, 2024, 11:04 p.m. UTC | #12

On Fri, Mar 29, 2024 at 2:39 PM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Fri, Mar 29, 2024 at 02:22:28PM +0100, Djalal Harouni wrote:
> > It would be easy at least for me if I just start with cgroupv2 and
> > ensure that it has same available filenames as if we go through kernfs.
> > Not a root cgroup node and maybe only freeze and kill for now that are
> > part of cgroup_base_files.
> >
> > So if I get it right, somehow like what I did but we endup with:
> >
> > In bpf, cgroup was already acquired.
> >
> > bpf_cgroup_knob_write(cgroup, "freeze", buf)
> > |_ parse params -> lock cgroup_mutex -> cgroup_freeze() -> unlock
> >
> >
> > cgroup_freeze_write(struct kernfs_open_file *of, char *buf,...)
> > |_ parse params -> cgroup_ref++ -> krnfs_active_ref--  ->
> >      -> lock cgroup_mutex -> cgroup_freeze() -> unlock + krnfs++ ...
> >
> > Please let me know if I missed something.
>
> I've thought about it a bit and I wonder whether a better way to do this is
> implementing this at the kernfs layer. Something like (hopefully with a
> better name):
>
>  s32 bpf_kernfs_knob_write(struct kernfs_node *dir, const char *knob, char *buf);
>
> So, about the same, but takes kernfs_node directory instead of cgroup. This
> would make the interface useful for accessing sysfs knobs too which use
> similar conventions. For cgroup, @dir is just cgrp->kn and for sysfs it'd be
> kobj->sd. This way we can avoid the internal object -> path -> internal
> object ping-poinging while keeping the interface a lot more generic. What do
> you think?

And helpers like cgroup_freeze_write() will be refactored
to take kernfs_node directly instead of kernfs_open_file?
Makes sense to me.
Sounds like a minimal amount of changes and flexible enough.

Michal Koutný April 2, 2024, 5:16 p.m. UTC | #13

Hello.

On Wed, Mar 27, 2024 at 11:53:22PM +0100, Djalal Harouni <tixxdz@gmail.com> wrote:
> ...
> For some cases we want to freeze the cgroup of a task based on some
> signals, doing so from bpf is better than user space which could be
> too late.

Notice that freezer itself is not immediate -- tasks are frozen as if a
signal (kill(2)) was delivered to them (i.e. returning to userspace).

What kind of signals (also kill?) are you talking about for
illustration?

> Planned users of this feature are: tetragon and systemd when freezing
> a cgroup hierarchy that could be a K8s pod, container, system service
> or a user session.

It sounds like the signals are related to a particular process. If so
what is it good for to freeze unrelated processes in the same cgroup?

I think those answers better clarify why this is needed.

As for the generalization to any cgroup attribute (or kernfs). Can this
be compared with sysctls -- I see there are helpers to intercept user
writes but no helpers to affect sysctl values without an outer writer.
What would justify different approaches between kernfs attributes and
sysctls (direct writes vs modified writes)?

Thanks,
Michal

Djalal Harouni April 2, 2024, 5:40 p.m. UTC | #14

Hello,

On 3/30/24 00:04, Alexei Starovoitov wrote:
> On Fri, Mar 29, 2024 at 2:39 PM Tejun Heo <tj@kernel.org> wrote:
>>
>> Hello,
>>
>> On Fri, Mar 29, 2024 at 02:22:28PM +0100, Djalal Harouni wrote:
>>> It would be easy at least for me if I just start with cgroupv2 and
>>> ensure that it has same available filenames as if we go through kernfs.
>>> Not a root cgroup node and maybe only freeze and kill for now that are
>>> part of cgroup_base_files.
>>>
>>> So if I get it right, somehow like what I did but we endup with:
>>>
>>> In bpf, cgroup was already acquired.
>>>
>>> bpf_cgroup_knob_write(cgroup, "freeze", buf)
>>> |_ parse params -> lock cgroup_mutex -> cgroup_freeze() -> unlock
>>>
>>>
>>> cgroup_freeze_write(struct kernfs_open_file *of, char *buf,...)
>>> |_ parse params -> cgroup_ref++ -> krnfs_active_ref--  ->
>>>      -> lock cgroup_mutex -> cgroup_freeze() -> unlock + krnfs++ ...
>>>
>>> Please let me know if I missed something.
>>
>> I've thought about it a bit and I wonder whether a better way to do this is
>> implementing this at the kernfs layer. Something like (hopefully with a
>> better name):
>>
>>  s32 bpf_kernfs_knob_write(struct kernfs_node *dir, const char *knob, char *buf);
>>
>> So, about the same, but takes kernfs_node directory instead of cgroup. This
>> would make the interface useful for accessing sysfs knobs too which use
>> similar conventions. For cgroup, @dir is just cgrp->kn and for sysfs it'd be
>> kobj->sd. This way we can avoid the internal object -> path -> internal
>> object ping-poinging while keeping the interface a lot more generic. What do
>> you think?
> 
> And helpers like cgroup_freeze_write() will be refactored
> to take kernfs_node directly instead of kernfs_open_file?
> Makes sense to me.
> Sounds like a minimal amount of changes and flexible enough.

Thank you Alexei, Tejun for the feedback. Will try to get back with a v2.

One particular thing is the kernfs_open_file->mutex nests outside of the
refcounting of kernfs_node,  let's see.

Thanks!

Djalal Harouni April 2, 2024, 6:20 p.m. UTC | #15

Hello Michal,

On 4/2/24 18:16, Michal Koutný wrote:
> Hello.
> 
> On Wed, Mar 27, 2024 at 11:53:22PM +0100, Djalal Harouni <tixxdz@gmail.com> wrote:
>> ...
>> For some cases we want to freeze the cgroup of a task based on some
>> signals, doing so from bpf is better than user space which could be
>> too late.
> 
> Notice that freezer itself is not immediate -- tasks are frozen as if a
> signal (kill(2)) was delivered to them (i.e. returning to userspace).

Thanks yes, I would expect freeze to behave like signal, and if one
wants to block immediately there is the LSM override return. The
selftest attached tries to do exactly that.

> What kind of signals (also kill?) are you talking about for
> illustration?

Could be security signals, reading sensitive files or related to any
operation management, for X reasons this user session should be freezed
or killed.

The kill is an effective defense against fork-bombs as an example.

>> Planned users of this feature are: tetragon and systemd when freezing
>> a cgroup hierarchy that could be a K8s pod, container, system service
>> or a user session.
> 
> It sounds like the signals are related to a particular process. If so
> what is it good for to freeze unrelated processes in the same cgroup?

Today some container/pod operations are performed at bpf level, having
the freeze and kill available is straightforward to perform this.

> I think those answers better clarify why this is needed.

Alright will add those in v2.

> 
> As for the generalization to any cgroup attribute (or kernfs). Can this
> be compared with sysctls -- I see there are helpers to intercept user
> writes but no helpers to affect sysctl values without an outer writer.
> What would justify different approaches between kernfs attributes and
> sysctls (direct writes vs modified writes)?

For generalizing this, haven't thought about it that much. First use
case is to try to get freeze and possibly kill support, and use a common
interface as requested.

Thank you!

> 
> Thanks,
> Michal

Yonghong Song April 11, 2024, 12:26 a.m. UTC | #16

On 4/9/24 8:32 AM, Michal Koutný wrote:
> Hi.
>
> On Tue, Apr 02, 2024 at 07:20:45PM +0100, Djalal Harouni <tixxdz@gmail.com> wrote:
>> Thanks yes, I would expect freeze to behave like signal, and if one
>> wants to block immediately there is the LSM override return. The
>> selftest attached tries to do exactly that.
> Are you refering to this part:
>
> 	int BPF_PROG(lsm_freeze_cgroup, int cmd, union bpf_attr *attr, unsigned int size)
> 		...
> 		ret = bpf_task_freeze_cgroup(task, 1);
> 		if (!ret) {
> 			ret = -EPERM;
> 			/* reset for next call */
> ?
>
>
>> Could be security signals, reading sensitive files or related to any
>> operation management, for X reasons this user session should be freezed
>> or killed.
> What can be done with a frozen cgroup after anything of that happens?
> Anything besides killing anyway?
>
> Killing of an offending process could be caught by its supervisor (like
> container runtime or systemd) and propagated accordingly to the whole
> cgroup.
>
>> The kill is an effective defense against fork-bombs as an example.
> There are several ways how to prevent fork-bombs in kernel already, it
> looks like a contrived example.
>
>> Today some container/pod operations are performed at bpf level, having
>> the freeze and kill available is straightforward to perform this.
> It seems to me like an extra step when the same operation can be done from
> (the managing) userspace already.
>
>> For generalizing this, haven't thought about it that much. First use
>> case is to try to get freeze and possibly kill support, and use a common
>> interface as requested.
> BTW, I notice that there is bpf_sys_bpf() helper that allows calling an
> arbitrary syscall. Wouldn't that be sufficient for everything?

This is not true. Currently, only 'bpf' and 'close' syscalls are supported.

static const struct bpf_func_proto *
syscall_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
{
         switch (func_id) {
         case BPF_FUNC_sys_bpf:
                 return !bpf_token_capable(prog->aux->token, CAP_PERFMON)
                        ? NULL : &bpf_sys_bpf_proto;
         case BPF_FUNC_btf_find_by_name_kind:
                 return &bpf_btf_find_by_name_kind_proto;
         case BPF_FUNC_sys_close:
                 return &bpf_sys_close_proto;
         case BPF_FUNC_kallsyms_lookup_name:
                 return &bpf_kallsyms_lookup_name_proto;
         default:
                 return tracing_prog_func_proto(func_id, prog);
         }
}

More syscalls can be added (through kfunc) if there is a use case for that.

>
> (Based on how I still understand the problem: either you must respond
> immediately and then the direct return from LSM is appropriate or timing
> is not sensitive but you want act on whole cgroup.)
>
> Thanks,
> Michal

Michal Koutný April 11, 2024, 8:25 a.m. UTC | #17

On Wed, Apr 10, 2024 at 05:26:18PM -0700, Yonghong Song <yonghong.song@linux.dev> wrote:
> This is not true.

Oh, I misunderstood a manpage, I can see now it's not for any syscall.

> More syscalls can be added (through kfunc) if there is a use case for that.

Thus, I don't want to open this up.

Michal

[RFC,bpf-next,0/3] bpf: freeze a task cgroup from bpf

Message

Comments