[bpf-next,v5,0/6] Follow-up BPF helper improvements

Message ID 20201010205447.5610-1-daniel@iogearbox.net

Message

Daniel Borkmann Oct. 10, 2020, 8:54 p.m. UTC
This series addresses most of the follow-up feedback [0] from the last
series, that is, improving the UAPI helper comments and getting rid of
the ifindex obj file hacks in the selftest by using a BPF map instead.
The __sk_buff data/data_end pointer work, as well as the mem*() BPF
improvements we have in Cilium for libbpf, I'm planning to do in a later
round. On top of that, the series adds two features: i) a redirect_peer()
helper to improve latency on netns switch, and ii) support for map-in-map
with dynamic inner array map sizes. Selftests for each are added as well.
For details, please check the individual patches, thanks!

  [0] https://lore.kernel.org/bpf/cover.1601477936.git.daniel@iogearbox.net/
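
As a rough illustration of i), a tc program attached on ingress of the
host-facing veth device can switch into the netns of the device's peer
along the lines of the sketch below (the peer ifindex and section name
are illustrative only; the actual program is in the new test_tc_peer.c
selftest):

  /* Sketch: redirect into the netns of the veth peer device. The
   * IFINDEX_PEER constant is made up for illustration purposes.
   */
  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  #define IFINDEX_PEER 4

  SEC("classifier")
  int tc_redirect_peer(struct __sk_buff *skb)
  {
          return bpf_redirect_peer(IFINDEX_PEER, 0);
  }

  char __license[] SEC("license") = "GPL";

The intent, as noted above, is to cut the latency of the netns switch
compared to a regular redirect into the veth device.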

v4 -> v5:
  - Replace cnt == -EOPNOTSUPP check with cnt < 0; I've used < 0
    here as I think it's useful to keep the existing cnt == 0 ||
    cnt >= ARRAY_SIZE(insn_buf) for error detection (Andrii)
v3 -> v4:
  - Rename new array map flag to BPF_F_INNER_MAP (Alexei)
v2 -> v3:
  - Remove tab that slipped into uapi helper desc (Jakub)
  - Rework map in map for array to error from map_gen_lookup (Andrii)
v1 -> v2:
  - Fixed selftest comment wrt inner1/inner2 value (Yonghong)

Daniel Borkmann (6):
  bpf: improve bpf_redirect_neigh helper description
  bpf: add redirect_peer helper
  bpf: allow for map-in-map with dynamic inner array map entries
  bpf, selftests: add test for different array inner map size
  bpf, selftests: make redirect_neigh test more extensible
  bpf, selftests: add redirect_peer selftest

 drivers/net/veth.c                            |   9 +
 include/linux/bpf.h                           |   2 +-
 include/linux/netdevice.h                     |   4 +
 include/uapi/linux/bpf.h                      |  30 ++-
 kernel/bpf/arraymap.c                         |  17 +-
 kernel/bpf/hashtab.c                          |   6 +-
 kernel/bpf/verifier.c                         |   4 +-
 net/core/dev.c                                |  15 +-
 net/core/filter.c                             |  54 ++++-
 net/xdp/xskmap.c                              |   2 +-
 tools/include/uapi/linux/bpf.h                |  30 ++-
 .../selftests/bpf/prog_tests/btf_map_in_map.c |  39 +++-
 .../selftests/bpf/progs/test_btf_map_in_map.c |  43 ++++
 .../selftests/bpf/progs/test_tc_neigh.c       |  40 ++--
 .../selftests/bpf/progs/test_tc_peer.c        |  45 ++++
 tools/testing/selftests/bpf/test_tc_neigh.sh  | 168 ---------------
 .../testing/selftests/bpf/test_tc_redirect.sh | 204 ++++++++++++++++++
 17 files changed, 488 insertions(+), 224 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/progs/test_tc_peer.c
 delete mode 100755 tools/testing/selftests/bpf/test_tc_neigh.sh
 create mode 100755 tools/testing/selftests/bpf/test_tc_redirect.sh

Comments

Andrii Nakryiko Oct. 10, 2020, 10:02 p.m. UTC | #1
On Sat, Oct 10, 2020 at 1:54 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> Recent work in f4d05259213f ("bpf: Add map_meta_equal map ops") and 134fede4eecf
> ("bpf: Relax max_entries check for most of the inner map types") added support
> for dynamic inner max elements for most map-in-map types. Exceptions were maps
> like array or prog array where the map_gen_lookup() callback uses the maps'
> max_entries field as a constant when emitting instructions.
>
> We recently implemented Maglev consistent hashing in Cilium's load balancer,
> which uses map-in-map with an outer hash map and inner array maps holding the
> Maglev backend table for each service. It has been designed this way in order
> to reduce overall memory consumption, given the outer hash map allows us to
> avoid preallocating a large, flat memory area for all services. Also, the
> number of service mappings is not always known a priori.
>
> The use case for dynamic inner array map entries is to further reduce memory
> overhead; for example, some services might just have a small number of
> backends while others could have a large number. Right now the Maglev backend
> table for small and large numbers of backends would need to have the same
> number of inner array map entries, which adds a lot of unneeded overhead.
>
> Dynamic inner array map entries can be realized by avoiding the inlined code
> generation for their lookup. The lookup will still be efficient since it will
> be calling into array_map_lookup_elem() directly and thus avoiding retpoline.
> The patch adds a BPF_F_INNER_MAP flag to map creation, which therefore skips
> the inline code generation and relaxes the array_map_meta_equal() check to
> ignore both maps' max_entries. This still allows faster, inlined lookups for
> map-in-map when BPF_F_INNER_MAP is not specified and dynamic max_entries are
> not needed.
>
> Example code generation where inner map is dynamic sized array:
>
>   # bpftool p d x i 125
>   int handle__sys_enter(void * ctx):
>   ; int handle__sys_enter(void *ctx)
>      0: (b4) w1 = 0
>   ; int key = 0;
>      1: (63) *(u32 *)(r10 -4) = r1
>      2: (bf) r2 = r10
>   ;
>      3: (07) r2 += -4
>   ; inner_map = bpf_map_lookup_elem(&outer_arr_dyn, &key);
>      4: (18) r1 = map[id:468]
>      6: (07) r1 += 272
>      7: (61) r0 = *(u32 *)(r2 +0)
>      8: (35) if r0 >= 0x3 goto pc+5
>      9: (67) r0 <<= 3
>     10: (0f) r0 += r1
>     11: (79) r0 = *(u64 *)(r0 +0)
>     12: (15) if r0 == 0x0 goto pc+1
>     13: (05) goto pc+1
>     14: (b7) r0 = 0
>     15: (b4) w6 = -1
>   ; if (!inner_map)
>     16: (15) if r0 == 0x0 goto pc+6
>     17: (bf) r2 = r10
>   ;
>     18: (07) r2 += -4
>   ; val = bpf_map_lookup_elem(inner_map, &key);
>     19: (bf) r1 = r0                               | No inlining but instead
>     20: (85) call array_map_lookup_elem#149280     | call to array_map_lookup_elem()
>   ; return val ? *val : -1;                        | for inner array lookup.
>     21: (15) if r0 == 0x0 goto pc+1
>   ; return val ? *val : -1;
>     22: (61) r6 = *(u32 *)(r0 +0)
>   ; }
>     23: (bc) w0 = w6
>     24: (95) exit
>
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
> ---
>  include/linux/bpf.h            |  2 +-
>  include/uapi/linux/bpf.h       |  3 +++
>  kernel/bpf/arraymap.c          | 17 +++++++++++------
>  kernel/bpf/hashtab.c           |  6 +++---
>  kernel/bpf/verifier.c          |  4 +++-
>  net/xdp/xskmap.c               |  2 +-
>  tools/include/uapi/linux/bpf.h |  3 +++
>  7 files changed, 25 insertions(+), 12 deletions(-)
>
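
For context, usage on the map definition side then roughly looks like
the following (a sketch with illustrative names, not verbatim from the
selftest); the inner arrays carry BPF_F_INNER_MAP and may then differ
in max_entries while being placed into the same outer map:

  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  struct inner_small {
          __uint(type, BPF_MAP_TYPE_ARRAY);
          __uint(map_flags, BPF_F_INNER_MAP);
          __uint(max_entries, 4);        /* few backends */
          __type(key, int);
          __type(value, int);
  } inner_a SEC(".maps");

  struct inner_large {
          __uint(type, BPF_MAP_TYPE_ARRAY);
          __uint(map_flags, BPF_F_INNER_MAP);
          __uint(max_entries, 1024);     /* many backends */
          __type(key, int);
          __type(value, int);
  } inner_b SEC(".maps");

  struct {
          __uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
          __uint(max_entries, 2);
          __uint(key_size, sizeof(int));
          __uint(value_size, sizeof(int));
          __array(values, struct inner_small);
  } outer_arr SEC(".maps") = {
          .values = { [0] = &inner_a, [1] = &inner_b },
  };

Without BPF_F_INNER_MAP, both inner maps would need the exact same
max_entries to be accepted into the same outer map.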

[...]

> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index f3e36eade3d4..d578875df1ad 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -11049,6 +11049,8 @@ static int fixup_bpf_calls(struct bpf_verifier_env *env)
>                         if (insn->imm == BPF_FUNC_map_lookup_elem &&
>                             ops->map_gen_lookup) {
>                                 cnt = ops->map_gen_lookup(map_ptr, insn_buf);
> +                               if (cnt < 0)
> +                                       goto patch_map_ops_generic;

but now any reported error will be silently skipped. The logic should be:

if (cnt == -EOPNOTSUPP)
    goto patch_map_ops_generic;
if (cnt <= 0 || cnt >= ARRAY_SIZE(insn_buf))
    verbose(env, "bpf verifier is misconfigured\n");

This way only -EOPNOTSUPP is silently skipped; all other cases where an
error is returned, cnt == 0, or cnt is too big would be reported as an
error.

>                                 if (cnt == 0 || cnt >= ARRAY_SIZE(insn_buf)) {
>                                         verbose(env, "bpf verifier is misconfigured\n");
>                                         return -EINVAL;
> @@ -11079,7 +11081,7 @@ static int fixup_bpf_calls(struct bpf_verifier_env *env)
>                                      (int (*)(struct bpf_map *map, void *value))NULL));
>                         BUILD_BUG_ON(!__same_type(ops->map_peek_elem,
>                                      (int (*)(struct bpf_map *map, void *value))NULL));

[...]
Daniel Borkmann Oct. 10, 2020, 11:42 p.m. UTC | #2
On 10/11/20 12:02 AM, Andrii Nakryiko wrote:
> On Sat, Oct 10, 2020 at 1:54 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
[...]
>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>> index f3e36eade3d4..d578875df1ad 100644
>> --- a/kernel/bpf/verifier.c
>> +++ b/kernel/bpf/verifier.c
>> @@ -11049,6 +11049,8 @@ static int fixup_bpf_calls(struct bpf_verifier_env *env)
>>                          if (insn->imm == BPF_FUNC_map_lookup_elem &&
>>                              ops->map_gen_lookup) {
>>                                  cnt = ops->map_gen_lookup(map_ptr, insn_buf);
>> +                               if (cnt < 0)
>> +                                       goto patch_map_ops_generic;
> 
> but now any reported error will be silently skipped. The logic should be:
> 
> if (cnt == -EOPNOTSUPP)
>      goto patch_map_ops_generic;
> if (cnt <= 0 || cnt >= ARRAY_SIZE(insn_buf))
>      verbose(env, "bpf verifier is misconfigured\n");
> 
> This way only -EOPNOTSUPP is silently skipped; all other cases where an
> error is returned, cnt == 0, or cnt is too big would be reported as an
> error.

Fair enough, I might have misunderstood the earlier mail, but I agree that
one is more robust overall. Fixed.
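
The lookup patching will then roughly read as follows (sketch only,
modulo the exact surrounding context):

  if (insn->imm == BPF_FUNC_map_lookup_elem &&
      ops->map_gen_lookup) {
          cnt = ops->map_gen_lookup(map_ptr, insn_buf);
          /* Only a map that opts out of lookup inlining, e.g. the
           * BPF_F_INNER_MAP array case, returns -EOPNOTSUPP here and
           * falls back to the generic call patching below.
           */
          if (cnt == -EOPNOTSUPP)
                  goto patch_map_ops_generic;
          if (cnt <= 0 || cnt >= ARRAY_SIZE(insn_buf)) {
                  verbose(env, "bpf verifier is misconfigured\n");
                  return -EINVAL;
          }
          /* ... patch in the generated insns as before ... */
  }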

Thanks,
Daniel