[bpf-next,V4,0/5] bpf: New approach for BPF MTU handling

Message ID 160381592923.1435097.2008820753108719855.stgit@firesoul

Message

Jesper Dangaard Brouer Oct. 27, 2020, 4:26 p.m. UTC
This patchset drops all the MTU checks in TC BPF-helpers that limit
growing the packet size. This is done because these BPF-helpers don't
take redirect into account, which can result in their MTU check being done
against the wrong netdev.

The new approach is to give BPF-programs knowledge about the MTU at the
netdev level (via ifindex) and at the fib route lookup level. To that end,
some BPF-helpers are added and extended to make it possible to do MTU
checks in the BPF-code.
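
To illustrate the kind of check that moves into BPF-code, here is a minimal
userspace sketch in plain C. It is not kernel or BPF code and does not show
the helper API from this series; the struct and function names are
hypothetical. It assumes the program already knows the MTU of the egress
netdev and only models the length comparison.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model, not kernel code: the values a BPF-prog would need in
 * order to decide whether a grown packet still fits the egress netdev. */
struct mtu_check_input {
    uint32_t egress_mtu; /* L3 MTU of the netdev the packet will leave on */
    uint32_t l3_len;     /* current L3 length (bytes after the L2 header) */
    int32_t  len_delta;  /* bytes the program intends to add (or remove) */
};

/* Returns true when the packet, after applying len_delta, fits the MTU. */
static bool fits_egress_mtu(const struct mtu_check_input *in)
{
    /* Widen to 64 bit so the addition cannot overflow. */
    int64_t new_len = (int64_t)in->l3_len + in->len_delta;

    if (new_len < 0)
        return false; /* cannot shrink below an empty packet */
    return (uint64_t)new_len <= in->egress_mtu;
}
```

The point of the sketch is that egress_mtu belongs to the netdev the packet
is redirected to, not to the netdev it arrived on (skb->dev).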

If a BPF-prog doesn't comply with the MTU then the packet will eventually
get dropped at some other layer. In some cases the existing kernel MTU
checks will drop the packet, but there are also cases where BPF can bypass
these checks. Specifically, a TC-redirect from the ingress step
(sch_handle_ingress) into the egress code path (basically calling
dev_queue_xmit()) bypasses them. It is left up to driver code to handle
these kinds of MTU violations.

One advantage of this approach is that an ingress BPF-prog can communicate
with an egress BPF-prog via packet data. With the MTU checks removed in
the helpers, and also not done in the skb_do_redirect() call, the ingress
BPF-prog can grow the packet to carry this information, as long as the
egress BPF-prog removes it prior to transmitting the packet.

This patchset is primarily focused on TC-BPF, but I've made sure that the
MTU BPF-helpers also work for XDP BPF-programs.

V2: Change BPF-helper API from lookup to check.
V3: Drop enforcement of MTU in net-core, leave it to drivers.
V4: Keep sanity limit + netdev "up" checks + rename BPF-helper.

---

Jesper Dangaard Brouer (5):
      bpf: Remove MTU check in __bpf_skb_max_len
      bpf: bpf_fib_lookup return MTU value as output when looked up
      bpf: add BPF-helper for MTU checking
      bpf: drop MTU check when doing TC-BPF redirect to ingress
      bpf: make it possible to identify BPF redirected SKBs


 include/linux/netdevice.h      |   31 +++++++-
 include/uapi/linux/bpf.h       |   81 +++++++++++++++++++-
 net/core/dev.c                 |   21 +----
 net/core/filter.c              |  163 ++++++++++++++++++++++++++++++++++++----
 net/sched/Kconfig              |    1 
 tools/include/uapi/linux/bpf.h |   81 +++++++++++++++++++-
 6 files changed, 339 insertions(+), 39 deletions(-)

--

Comments

John Fastabend Oct. 30, 2020, 7:24 p.m. UTC | #1
Jesper Dangaard Brouer wrote:
> Multiple BPF-helpers that can manipulate/increase the size of the SKB use
> __bpf_skb_max_len() as the max-length. This function limits the size
> against the current net_device MTU (skb->dev->mtu).
>
> When a BPF-prog grows the packet size, it should not be limited to the
> MTU. The MTU is a transmit limitation, and software receiving this packet
> should be allowed to increase the size. Furthermore, the current MTU check
> in __bpf_skb_max_len uses the MTU from the ingress/current net_device,
> which in case of redirects is the wrong net_device.
>
> Patch V4 keeps a sanity max limit of SKB_MAX_ALLOC (16KiB). The real limit
> is elsewhere in the system. Jesper's testing[1] showed it was not possible
> to exceed 8KiB when expanding the SKB size via BPF-helper. The limiting
> factor is the define KMALLOC_MAX_CACHE_SIZE, which is 8192 for the
> SLUB-allocator (CONFIG_SLUB) in case PAGE_SIZE is 4096. This define is
> in effect because this is called from softirq context; see the code in
> __gfp_pfmemalloc_flags() and __do_kmalloc_node(). Jakub's testing showed
> that frames above 16KiB can cause NICs to reset (but not crash). Keep this
> sanity limit at this level, as the memory layer can differ based on kernel
> config.
>
> [1] https://github.com/xdp-project/bpf-examples/tree/master/MTU-tests
>
> V3: replace __bpf_skb_max_len() with define and use IPv6 max MTU size.
>
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---

Acked-by: John Fastabend <john.fastabend@gmail.com>