[bpf,v3,1/3] bpf, cgroups: Fix cgroup v2 fallback on v1/v2 mixed mode

Fix cgroup v1 interference when non-root cgroup v2 BPF programs are used.
Back in the day, commit bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup")
embedded per-socket cgroup information into sock->sk_cgrp_data and, in order
to save 8 bytes in struct sock, made both mutually exclusive. That is, when
cgroup v1 socket tagging (e.g. net_cls/net_prio) is used, cgroup v2 falls
back to the root cgroup in sock_cgroup_ptr() (&cgrp_dfl_root.cgrp).
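
To illustrate the mutual exclusiveness, here is a minimal userspace sketch
of the pre-fix behaviour. It is a simplified model, not the kernel code:
struct old_skcd, SKCD_IS_DATA, dfl_root and the old_*() helpers are
hypothetical names, and the real sock_cgroup_data layout differs in detail.
It only assumes that a single 64-bit word holds either the v2 cgroup pointer
or the v1 tag data.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel types, not the real definitions. */
struct cgroup { const char *path; };

static struct cgroup dfl_root = { "/" };	/* models cgrp_dfl_root.cgrp */

/* One 64-bit word holds EITHER a v2 cgroup pointer OR v1 tag data. */
struct old_skcd { uint64_t val; };

#define SKCD_IS_DATA 1ULL	/* low bit set: word carries v1 data */

static void old_set_cgroup(struct old_skcd *d, struct cgroup *cgrp)
{
	/* struct cgroup pointers are aligned, so the low bit is free. */
	assert(((uintptr_t)cgrp & SKCD_IS_DATA) == 0);
	d->val = (uint64_t)(uintptr_t)cgrp;
}

static void old_set_classid(struct old_skcd *d, uint32_t classid)
{
	/* Storing the v1 tag overwrites the v2 pointer. */
	d->val = ((uint64_t)classid << 32) | SKCD_IS_DATA;
}

static struct cgroup *old_cgroup_ptr(const struct old_skcd *d)
{
	/* v1 data present: the v2 cgroup is lost, fall back to the root. */
	if (d->val & SKCD_IS_DATA)
		return &dfl_root;
	return d->val ? (struct cgroup *)(uintptr_t)d->val : &dfl_root;
}
```

In this model, calling old_set_classid() on a socket that was previously
associated with a non-root cgroup makes old_cgroup_ptr() return the root,
which is the fallback described above.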

The assumption made was "there is no reason to mix the two and this is in line
with how legacy and v2 compatibility is handled" as stated in bd1060a1d671.
However, with Kubernetes more widely supporting cgroups v2 as well nowadays,
this assumption no longer holds, and the possibility of the v1/v2 mixed mode
with the v2 root fallback being hit becomes a real security issue.

Many cgroup v2 BPF programs are also used for policy enforcement, to pick
just _one_ example: programmatically denying socket-related system calls
like connect(2) or bind(2). A v2 root fallback would implicitly cause
a policy bypass for the affected Pods.

In production environments, we have recently seen this case due to various
circumstances: i) a different 3rd party agent and/or ii) a container runtime
such as [0] in the user's environment configuring legacy cgroup v1 net_cls
tags, which implicitly triggered the mentioned root fallback. Another case is
Kubernetes projects like kind [1], which create Kubernetes nodes in a
container and also add cgroup namespaces to the mix. This means that programs
attached to the cgroup v2 root of the cgroup namespace are in fact attached
to a non-root cgroup v2 path from the init namespace's point of view, and the
latter's root is out of reach for agents on a kind Kubernetes node to
configure. As a result, any entity on the node setting a cgroup v1 net_cls
tag will trigger the bypass despite cgroup v2 BPF programs being attached to
the namespace root.

Generally, this mutual exclusiveness does not hold anymore in today's user
environments and makes cgroup v2 usage from BPF side fragile and unreliable.
This fix adds a proper struct cgroup pointer for the cgroup v2 case to struct
sock_cgroup_data in order to address these issues; this implicitly also fixes
the trade-offs made back then with regard to races and refcount leaks
as stated in bd1060a1d671, and removes the fallback, so that cgroup v2 BPF
programs always operate as expected.
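
The idea behind the fix can be sketched in userspace as follows. This is
an illustrative model only: the cgroup member name follows skcd->cgroup
from the changelog below, while struct new_skcd, the new_*() helpers and
the exact types of the classid/prioidx members are assumptions made for
the example rather than the patched kernel definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel type, not the real definition. */
struct cgroup { const char *path; };

/* Dedicated v2 pointer next to the v1 tags: no shared storage anymore. */
struct new_skcd {
	struct cgroup *cgroup;	/* cgroup v2 */
	uint32_t classid;	/* cgroup v1 net_cls tag (assumed field) */
	uint16_t prioidx;	/* cgroup v1 net_prio index (assumed field) */
};

static void new_set_cgroup(struct new_skcd *d, struct cgroup *cgrp)
{
	/* The model always stores a valid v2 cgroup. */
	assert(cgrp);
	d->cgroup = cgrp;
}

static void new_set_classid(struct new_skcd *d, uint32_t classid)
{
	/* Only the v1 tag changes, the v2 pointer stays intact. */
	d->classid = classid;
}

static void new_set_prioidx(struct new_skcd *d, uint16_t prioidx)
{
	d->prioidx = prioidx;
}

static struct cgroup *new_cgroup_ptr(const struct new_skcd *d)
{
	/* No root fallback: the stored v2 cgroup is returned as is. */
	return d->cgroup;
}
```

With separate members, setting a v1 tag no longer affects what
new_cgroup_ptr() returns. The cost is that the structure grows again,
which is the space saving the original commit traded for.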

  [0] https://github.com/nestybox/sysbox/
  [1] https://kind.sigs.k8s.io/

Fixes: bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Stanislav Fomichev <sdf@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Martynas Pumputis <m@lambda.lt>
---
 v1 -> v2:
   - Remove unneeded READ_ONCE()/WRITE_ONCE() pair around skcd->cgroup,
     thanks Stanislav!

 include/linux/cgroup-defs.h  | 107 +++++++++--------------------------
 include/linux/cgroup.h       |  22 +------
 kernel/cgroup/cgroup.c       |  50 ++++------------
 net/core/netclassid_cgroup.c |   7 +--
 net/core/netprio_cgroup.c    |  10 +---
 5 files changed, 41 insertions(+), 155 deletions(-)

Message ID	d9744c5af3d9ad3a74de2a209caca81ff76c3d42.1631547359.git.daniel@iogearbox.net
State	New
Headers
	Return-Path: <netdev-owner@kernel.org>
	From: Daniel Borkmann <daniel@iogearbox.net>
	To: bpf@vger.kernel.org
	Cc: netdev@vger.kernel.org, tj@kernel.org, davem@davemloft.net, m@lambda.lt, alexei.starovoitov@gmail.com, andrii@kernel.org, sdf@google.com, Daniel Borkmann <daniel@iogearbox.net>
	Subject: [PATCH bpf v3 1/3] bpf, cgroups: Fix cgroup v2 fallback on v1/v2 mixed mode
	Date: Mon, 13 Sep 2021 17:40:08 +0200
	Message-Id: <d9744c5af3d9ad3a74de2a209caca81ff76c3d42.1631547359.git.daniel@iogearbox.net>
	MIME-Version: 1.0
	Content-Transfer-Encoding: 8bit
	Precedence: bulk
Series	[bpf,v3,1/3] bpf, cgroups: Fix cgroup v2 fallback on v1/v2 mixed mode
	[bpf,v3,1/3] bpf, cgroups: Fix cgroup v2 fallback on v1/v2 mixed mode
	[bpf,v3,2/3] bpf, selftests: Add cgroup v1 net_cls classid helpers
	[bpf,v3,3/3] bpf, selftests: Add test case for mixed cgroup v1/v2
