Message ID | 20210417033224.8063-1-alexei.starovoitov@gmail.com |
---|---|
Headers | show |
Series | bpf: syscall program, FD array, loader program, light skeleton. | expand |
On Fri, Apr 16, 2021 at 08:32:20PM -0700, Alexei Starovoitov wrote: > From: Alexei Starovoitov <ast@kernel.org> > > Add bpf_sys_close() helper to be used by the syscall/loader program to close > intermediate FDs and other cleanup. Conditional NAK. In a lot of contexts close_fd() is very much unsafe. In particular, anything that might call it between fdget() and fdput() is Right Fucking Out(tm). In which contexts can that thing be executed?
On Fri, Apr 16, 2021 at 8:42 PM Al Viro <viro@zeniv.linux.org.uk> wrote: > > On Fri, Apr 16, 2021 at 08:32:20PM -0700, Alexei Starovoitov wrote: > > From: Alexei Starovoitov <ast@kernel.org> > > > > Add bpf_sys_close() helper to be used by the syscall/loader program to close > > intermediate FDs and other cleanup. > > Conditional NAK. In a lot of contexts close_fd() is very much unsafe. > In particular, anything that might call it between fdget() and fdput() > is Right Fucking Out(tm). > In which contexts can that thing be executed? user context only. It's not for all of bpf _obviously_.
On Fri, Apr 16, 2021 at 9:04 PM Al Viro <viro@zeniv.linux.org.uk> wrote: > > On Fri, Apr 16, 2021 at 08:46:05PM -0700, Alexei Starovoitov wrote: > > On Fri, Apr 16, 2021 at 8:42 PM Al Viro <viro@zeniv.linux.org.uk> wrote: > > > > > > On Fri, Apr 16, 2021 at 08:32:20PM -0700, Alexei Starovoitov wrote: > > > > From: Alexei Starovoitov <ast@kernel.org> > > > > > > > > Add bpf_sys_close() helper to be used by the syscall/loader program to close > > > > intermediate FDs and other cleanup. > > > > > > Conditional NAK. In a lot of contexts close_fd() is very much unsafe. > > > In particular, anything that might call it between fdget() and fdput() > > > is Right Fucking Out(tm). > > > In which contexts can that thing be executed? > > > > user context only. > > It's not for all of bpf _obviously_. > > Let me restate the question: what call chains could lead to bpf_sys_close()? Already answered. User context only. It's all safe.
On Fri, Apr 16, 2021 at 10:01:43PM -0700, Alexei Starovoitov wrote: > On Fri, Apr 16, 2021 at 9:04 PM Al Viro <viro@zeniv.linux.org.uk> wrote: > > > > On Fri, Apr 16, 2021 at 08:46:05PM -0700, Alexei Starovoitov wrote: > > > On Fri, Apr 16, 2021 at 8:42 PM Al Viro <viro@zeniv.linux.org.uk> wrote: > > > > > > > > On Fri, Apr 16, 2021 at 08:32:20PM -0700, Alexei Starovoitov wrote: > > > > > From: Alexei Starovoitov <ast@kernel.org> > > > > > > > > > > Add bpf_sys_close() helper to be used by the syscall/loader program to close > > > > > intermediate FDs and other cleanup. > > > > > > > > Conditional NAK. In a lot of contexts close_fd() is very much unsafe. > > > > In particular, anything that might call it between fdget() and fdput() > > > > is Right Fucking Out(tm). > > > > In which contexts can that thing be executed? > > > > > > user context only. > > > It's not for all of bpf _obviously_. > > > > Let me restate the question: what call chains could lead to bpf_sys_close()? > > Already answered. User context only. It's all safe. Not only sys_close is safe to call. Literally all syscalls are safe to call. The current allowlist contains two syscalls. It may get extended as use cases come up. The following two codes are equivalent: 1. bpf_prog.c: SEC("syscall") int bpf_prog(struct args *ctx) { bpf_sys_close(1); bpf_sys_close(2); bpf_sys_close(3); return 0; } main.c: int main(int ac, char **av) { bpf_prog_load_and_run("bpf_prog.o"); } 2. main.c: int main(int ac, char **av) { close(1); close(2); close(3); } The kernel will perform the same work with FDs. The same locks are held and the same execution conditions are in both cases. The LSM hooks, fsnotify, etc will be called the same way. It's no different if new syscall was introduced "sys_foo(int num)" that would do { return close_fd(num); }. It would opearate in the same user context.
On Sat, Apr 17, 2021 at 07:36:39AM -0700, Alexei Starovoitov wrote: > The kernel will perform the same work with FDs. The same locks are held > and the same execution conditions are in both cases. The LSM hooks, > fsnotify, etc will be called the same way. > It's no different if new syscall was introduced "sys_foo(int num)" that > would do { return close_fd(num); }. > It would opearate in the same user context. Hmm... unless I'm misreading the code, one of the call chains would seem to be sys_bpf() -> bpf_prog_test_run() -> ->test_run() -> ... -> bpf_sys_close(). OK, as long as you make sure bpf_prog_get() does fdput() (i.e. that we don't have it restructured so that fdget()/fdput() pair would be lifted into bpf_prog_test_run(), with fdput() moved in place of bpf_prog_put()). Note that we *really* can not allow close_fd() on anything to be bracketed by fdget()/fdput() pair; we had bugs of that sort and, as the matter of fact, still have one in autofs_dev_ioctl(). The trouble happens if you have file F with 2 references, held by descriptor tables of different processes. Say, process A has descriptor 6 refering to it, while B has descriptor 42 doing the same. Descriptor tables of A and B are not shared with anyone. A: fdget(6) -> returns a reference to F, refcount _not_ touched A: close_fd(6) -> rips the reference to F from descriptor table, does fput(F) refcount drops to 1. B: close(42) -> rips the reference to F from B's descriptor table, does fput(F) This time refcount does reach 0 and we use task_work_add() to make sure the destructor (__fput()) runs before B returns to userland. sys_close() returns and B goes off to userland. On the way out __fput() is run, and among other things, ->release() of F is executed, doing whatever it wants to do. F is freed. And at that point A, which presumably is using the guts of F, gets screwed. In case of autofs_dev_ioctl(), it's possible for a thread to end up blocked inside copy_to_user(), with autofs functions in call chains *AND* module refcount of autofs not pinned by anything. The effects of returning into a function that used to reside in now unmapped page are obviously not pretty... Basically, the rule is * never remove from descriptor table if you currently have an outstadning fdget() (i.e. without having done the matching fdput() yet). That, obviously, covers all ioctls - there we have fdget() done by sys_ioctl() on the issuing descriptor. In your case you seem to be safe, but it's a bloody dangerous minefield - you really need a big warning in all call sites. The worst part is that it won't happen with intended use, so it doesn't show up in routine regression testing. In particular, for autofs the normal case is AUTOFS_DEV_IOCTL_CLOSEMOUNT getting passed a file descriptor of something mounted and *not* the descriptor of /dev/autofs we are holding fdget() on. However, there's no way to prevent a malicious call when we pass exactly that. So please, mark all call sites with "make very sure you never get here with unpaired fdget()". BTW, if my reading (re ->test_run()) is correct, what limits the recursion via bpf_sys_bpf()?
On Sat, Apr 17, 2021 at 04:48:53PM +0000, Al Viro wrote: > On Sat, Apr 17, 2021 at 07:36:39AM -0700, Alexei Starovoitov wrote: > > > The kernel will perform the same work with FDs. The same locks are held > > and the same execution conditions are in both cases. The LSM hooks, > > fsnotify, etc will be called the same way. > > It's no different if new syscall was introduced "sys_foo(int num)" that > > would do { return close_fd(num); }. > > It would opearate in the same user context. > > Hmm... unless I'm misreading the code, one of the call chains would seem to > be sys_bpf() -> bpf_prog_test_run() -> ->test_run() -> ... -> bpf_sys_close(). > OK, as long as you make sure bpf_prog_get() does fdput() (i.e. that we > don't have it restructured so that fdget()/fdput() pair would be lifted into > bpf_prog_test_run(), with fdput() moved in place of bpf_prog_put()). Got it. There is no fdget/put bracketing in the code. On the way to test_run we do __bpf_prog_get() which does fdget and immediately fdput after incrementing refcnt of the prog. I believe this pattern is consistent everywhere in kernel/bpf/* > Note that we *really* can not allow close_fd() on anything to be bracketed > by fdget()/fdput() pair; we had bugs of that sort and, as the matter of fact, > still have one in autofs_dev_ioctl(). > > The trouble happens if you have file F with 2 references, held by descriptor > tables of different processes. Say, process A has descriptor 6 refering to > it, while B has descriptor 42 doing the same. Descriptor tables of A and B > are not shared with anyone. > > A: fdget(6) -> returns a reference to F, refcount _not_ touched > A: close_fd(6) -> rips the reference to F from descriptor table, does fput(F) > refcount drops to 1. > B: close(42) -> rips the reference to F from B's descriptor table, does fput(F) > This time refcount does reach 0 and we use task_work_add() to > make sure the destructor (__fput()) runs before B returns to > userland. sys_close() returns and B goes off to userland. > On the way out __fput() is run, and among other things, > ->release() of F is executed, doing whatever it wants to do. > F is freed. > And at that point A, which presumably is using the guts of F, gets screwed. Thanks for these details. That's really helpful. > So please, mark all call sites with "make very sure you never get > here with unpaired fdget()". Good point. Will add this comment. > BTW, if my reading (re ->test_run()) is correct, what limits the recursion > via bpf_sys_bpf()? Glad you asked! This kind of code review questions are much appreciated. It's an allowlist of possible commands in bpf_sys_bpf(). 'case BPF_PROG_TEST_RUN:' is not there for this exact reason. I'll add a comment to make it more obvious.
From: Alexei Starovoitov <ast@kernel.org> This is a first step towards signed bpf programs and the third approach of that kind. The first approach was to bring libbpf into the kernel as a user-mode-driver. The second approach was to invent a new file format and let kernel execute that format as a sequence of syscalls that create maps and load programs. This third approach is using new type of bpf program instead of inventing file format. 1st and 2nd approaches had too many downsides comparing to this 3rd and were discarded after months of work. To make it work the following new concepts are introduced: 1. syscall bpf program type A kind of bpf program that can do sys_bpf and sys_close syscalls. It can only execute in user context. 2. FD array or FD index. Traditionally BPF instructions are patched with FDs. What it means that maps has to be created first and then instructions modified which breaks signature verification if the program is signed. Instead of patching each instruction with FD patch it with an index into array of FDs. That makes the program signature stable if it uses maps. 3. loader program that is generated as "strace of libbpf". When libbpf is loading bpf_file.o it does a bunch of sys_bpf() syscalls to load BTF, create maps, populate maps and finally load programs. Instead of actually doing the syscalls generate a trace of what libbpf would have done and represent it as the "loader program". The "loader program" consists of single map and single bpf program that does those syscalls. Executing such "loader program" via bpf_prog_test_run() command will replay the sequence of syscalls that libbpf would have done which will result the same maps created and programs loaded as specified in the elf file. The "loader program" removes libelf and majority of libbpf dependency from program loading process. 4. light skeleton Instead of embedding the whole elf file into skeleton and using libbpf to parse it later generate a loader program and embed it into "light skeleton". Such skeleton can load the same set of elf files, but it doesn't need libbpf and libelf to do that. It only needs few sys_bpf wrappers. Future steps: - support CO-RE in the kernel. This patch set is already too big, so that critical feature is left for the next step. - generate light skeleton in golang to allow such users use BTF and all other features provided by libbpf - generate light skeleton for kernel, so that bpf programs can be embeded in the kernel module. The UMD usage in bpf_preload will be replaced with such skeleton, so bpf_preload would become a standard kernel module without user space dependency. - finally do the signing of the loader program. The patches are work in progress with few rough edges. Alexei Starovoitov (15): bpf: Introduce bpf_sys_bpf() helper and program type. bpf: Introduce bpfptr_t user/kernel pointer. bpf: Prepare bpf syscall to be used from kernel and user space. libbpf: Support for syscall program type selftests/bpf: Test for syscall program type bpf: Make btf_load command to be bpfptr_t compatible. selftests/bpf: Test for btf_load command. bpf: Introduce fd_idx libbpf: Support for fd_idx bpf: Add bpf_btf_find_by_name_kind() helper. bpf: Add bpf_sys_close() helper. libbpf: Change the order of data and text relocations. libbpf: Generate loader program out of BPF ELF file. bpftool: Use syscall/loader program in "prog load" and "gen skeleton" command. selftests/bpf: Convert few tests to light skeleton. include/linux/bpf.h | 19 +- include/linux/bpf_types.h | 2 + include/linux/bpf_verifier.h | 1 + include/linux/bpfptr.h | 81 +++ include/linux/btf.h | 2 +- include/uapi/linux/bpf.h | 39 +- kernel/bpf/bpf_iter.c | 13 +- kernel/bpf/btf.c | 59 +- kernel/bpf/syscall.c | 179 ++++-- kernel/bpf/verifier.c | 81 ++- net/bpf/test_run.c | 45 +- tools/bpf/bpftool/Makefile | 2 +- tools/bpf/bpftool/gen.c | 263 ++++++++- tools/bpf/bpftool/main.c | 7 +- tools/bpf/bpftool/main.h | 1 + tools/bpf/bpftool/prog.c | 78 +++ tools/bpf/bpftool/xlated_dumper.c | 3 + tools/include/uapi/linux/bpf.h | 39 +- tools/lib/bpf/Build | 2 +- tools/lib/bpf/bpf.c | 62 ++ tools/lib/bpf/bpf.h | 35 ++ tools/lib/bpf/bpf_gen_internal.h | 38 ++ tools/lib/bpf/gen_trace.c | 529 ++++++++++++++++++ tools/lib/bpf/libbpf.c | 346 ++++++++++-- tools/lib/bpf/libbpf.map | 1 + tools/lib/bpf/libbpf_internal.h | 3 + tools/testing/selftests/bpf/.gitignore | 1 + tools/testing/selftests/bpf/Makefile | 16 +- .../selftests/bpf/prog_tests/fentry_fexit.c | 6 +- .../selftests/bpf/prog_tests/fentry_test.c | 4 +- .../selftests/bpf/prog_tests/fexit_sleep.c | 6 +- .../selftests/bpf/prog_tests/fexit_test.c | 4 +- .../selftests/bpf/prog_tests/kfunc_call.c | 6 +- .../selftests/bpf/prog_tests/syscall.c | 53 ++ tools/testing/selftests/bpf/progs/syscall.c | 121 ++++ .../selftests/bpf/progs/test_subprogs.c | 13 + 36 files changed, 1972 insertions(+), 188 deletions(-) create mode 100644 include/linux/bpfptr.h create mode 100644 tools/lib/bpf/bpf_gen_internal.h create mode 100644 tools/lib/bpf/gen_trace.c create mode 100644 tools/testing/selftests/bpf/prog_tests/syscall.c create mode 100644 tools/testing/selftests/bpf/progs/syscall.c