mbox series

[bpf-next,V7,0/8] bpf: New approach for BPF MTU handling

Message ID 160588903254.2817268.4861837335793475314.stgit@firesoul
Headers show
Series bpf: New approach for BPF MTU handling | expand

Message

Jesper Dangaard Brouer Nov. 20, 2020, 4:18 p.m. UTC
This patchset drops all the MTU checks in TC BPF-helpers that limits
growing the packet size. This is done because these BPF-helpers doesn't
take redirect into account, which can result in their MTU check being done
against the wrong netdev.

The new approach is to give BPF-programs knowledge about the MTU on a
netdev (via ifindex) and fib route lookup level. Meaning some BPF-helpers
are added and extended to make it possible to do MTU checks in the
BPF-code.

If BPF-prog doesn't comply with the MTU then the packet will eventually
get dropped as some other layer. In some cases the existing kernel MTU
checks will drop the packet, but there are also cases where BPF can bypass
these checks. Specifically doing TC-redirect from ingress step
(sch_handle_ingress) into egress code path (basically calling
dev_queue_xmit()). It is left up to driver code to handle these kind of
MTU violations.

One advantage of this approach is that it ingress-to-egress BPF-prog can
send information via packet data. With the MTU checks removed in the
helpers, and also not done in skb_do_redirect() call, this allows for an
ingress BPF-prog to communicate with an egress BPF-prog via packet data,
as long as egress BPF-prog remove this prior to transmitting packet.

This patchset is primarily focused on TC-BPF, but I've made sure that the
MTU BPF-helpers also works for XDP BPF-programs.

V2: Change BPF-helper API from lookup to check.
V3: Drop enforcement of MTU in net-core, leave it to drivers.
V4: Keep sanity limit + netdev "up" checks + rename BPF-helper.
V5: Fix uninit variable + name struct output member mtu_result.
V6: Use bpf_check_mtu() in selftest
V7: Fix logic using tot_len and add another selftest

---

Jesper Dangaard Brouer (8):
      bpf: Remove MTU check in __bpf_skb_max_len
      bpf: fix bpf_fib_lookup helper MTU check for SKB ctx
      bpf: bpf_fib_lookup return MTU value as output when looked up
      bpf: add BPF-helper for MTU checking
      bpf: drop MTU check when doing TC-BPF redirect to ingress
      bpf: make it possible to identify BPF redirected SKBs
      selftests/bpf: use bpf_check_mtu in selftest test_cls_redirect
      bpf/selftests: activating bpf_check_mtu BPF-helper


 include/linux/netdevice.h                          |   31 +++
 include/uapi/linux/bpf.h                           |   78 ++++++++
 net/core/dev.c                                     |   21 --
 net/core/filter.c                                  |  184 ++++++++++++++++++--
 net/sched/Kconfig                                  |    1 
 tools/include/uapi/linux/bpf.h                     |   78 ++++++++
 tools/testing/selftests/bpf/prog_tests/check_mtu.c |   37 ++++
 tools/testing/selftests/bpf/progs/test_check_mtu.c |   33 ++++
 .../selftests/bpf/progs/test_cls_redirect.c        |    7 +
 9 files changed, 427 insertions(+), 43 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/check_mtu.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_check_mtu.c

--

Comments

Alexei Starovoitov Nov. 20, 2020, 9:50 p.m. UTC | #1
On Fri, Nov 20, 2020 at 05:18:47PM +0100, Jesper Dangaard Brouer wrote:
> Adding selftest for BPF-helper bpf_check_mtu(). Making sure
> it can be used from both XDP and TC.
> 
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---
>  tools/testing/selftests/bpf/prog_tests/check_mtu.c |   37 ++++++++++++++++++++
>  tools/testing/selftests/bpf/progs/test_check_mtu.c |   33 ++++++++++++++++++
>  2 files changed, 70 insertions(+)
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/check_mtu.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_check_mtu.c
> 
> diff --git a/tools/testing/selftests/bpf/prog_tests/check_mtu.c b/tools/testing/selftests/bpf/prog_tests/check_mtu.c
> new file mode 100644
> index 000000000000..09b8f986a17b
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/check_mtu.c
> @@ -0,0 +1,37 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2020 Red Hat */
> +#include <uapi/linux/bpf.h>
> +#include <linux/if_link.h>
> +#include <test_progs.h>
> +
> +#include "test_check_mtu.skel.h"
> +#define IFINDEX_LO 1
> +
> +void test_check_mtu_xdp(struct test_check_mtu *skel)
> +{
> +	int err = 0;
> +	int fd;
> +
> +	fd = bpf_program__fd(skel->progs.xdp_use_helper);
> +	err = bpf_set_link_xdp_fd(IFINDEX_LO, fd, XDP_FLAGS_SKB_MODE);
> +	if (CHECK_FAIL(err))
> +		return;
> +
> +	bpf_set_link_xdp_fd(IFINDEX_LO, -1, 0);
> +}
> +
> +void test_check_mtu(void)
> +{
> +	struct test_check_mtu *skel;
> +
> +	skel = test_check_mtu__open_and_load();
> +	if (CHECK_FAIL(!skel)) {
> +		perror("test_check_mtu__open_and_load");
> +		return;
> +	}
> +
> +	if (test__start_subtest("bpf_check_mtu XDP-attach"))
> +		test_check_mtu_xdp(skel);
> +
> +	test_check_mtu__destroy(skel);
> +}
> diff --git a/tools/testing/selftests/bpf/progs/test_check_mtu.c b/tools/testing/selftests/bpf/progs/test_check_mtu.c
> new file mode 100644
> index 000000000000..ab97ec925a32
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/test_check_mtu.c
> @@ -0,0 +1,33 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2020 Red Hat */
> +#include <linux/bpf.h>
> +#include <bpf/bpf_helpers.h>
> +
> +#include <stddef.h>
> +#include <stdint.h>
> +
> +char _license[] SEC("license") = "GPL";
> +
> +SEC("xdp")
> +int xdp_use_helper(struct xdp_md *ctx)
> +{
> +	uint32_t mtu_len = 0;
> +	int delta = 20;
> +
> +	if (bpf_check_mtu(ctx, 0, &mtu_len, delta, 0)) {
> +		return XDP_ABORTED;
> +	}
> +	return XDP_PASS;
> +}
> +
> +SEC("classifier")
> +int tc_use_helper(struct __sk_buff *ctx)
> +{
> +	uint32_t mtu_len = 0;
> +	int delta = -20;
> +
> +	if (bpf_check_mtu(ctx, 0, &mtu_len, delta, 0)) {
> +		return BPF_DROP;
> +	}
> +	return BPF_OK;
> +}

Patches 7 and 8 are not adequate as tests.
They do not testing the functionality of earlier patches.
bpf_check_mtu() could be returning random value and "tests" 7 and 8 would still pass.
Please fix.
Andrii Nakryiko Nov. 21, 2020, 7:41 a.m. UTC | #2
On Fri, Nov 20, 2020 at 8:21 AM Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
>

> Adding selftest for BPF-helper bpf_check_mtu(). Making sure

> it can be used from both XDP and TC.

>

> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>

> ---

>  tools/testing/selftests/bpf/prog_tests/check_mtu.c |   37 ++++++++++++++++++++

>  tools/testing/selftests/bpf/progs/test_check_mtu.c |   33 ++++++++++++++++++

>  2 files changed, 70 insertions(+)

>  create mode 100644 tools/testing/selftests/bpf/prog_tests/check_mtu.c

>  create mode 100644 tools/testing/selftests/bpf/progs/test_check_mtu.c

>

> diff --git a/tools/testing/selftests/bpf/prog_tests/check_mtu.c b/tools/testing/selftests/bpf/prog_tests/check_mtu.c

> new file mode 100644

> index 000000000000..09b8f986a17b

> --- /dev/null

> +++ b/tools/testing/selftests/bpf/prog_tests/check_mtu.c

> @@ -0,0 +1,37 @@

> +// SPDX-License-Identifier: GPL-2.0

> +/* Copyright (c) 2020 Red Hat */

> +#include <uapi/linux/bpf.h>

> +#include <linux/if_link.h>

> +#include <test_progs.h>

> +

> +#include "test_check_mtu.skel.h"

> +#define IFINDEX_LO 1

> +

> +void test_check_mtu_xdp(struct test_check_mtu *skel)


this should be static func, otherwise it's treated as an independent selftest.

> +{

> +       int err = 0;

> +       int fd;

> +

> +       fd = bpf_program__fd(skel->progs.xdp_use_helper);

> +       err = bpf_set_link_xdp_fd(IFINDEX_LO, fd, XDP_FLAGS_SKB_MODE);

> +       if (CHECK_FAIL(err))


please use CHECK() or one of ASSERT_xxx() helpers. CHECK_FAIL() should
be used for high-volume unlikely to fail test (i.e., very rarely).

> +               return;

> +

> +       bpf_set_link_xdp_fd(IFINDEX_LO, -1, 0);

> +}

> +

> +void test_check_mtu(void)

> +{

> +       struct test_check_mtu *skel;

> +

> +       skel = test_check_mtu__open_and_load();

> +       if (CHECK_FAIL(!skel)) {

> +               perror("test_check_mtu__open_and_load");

> +               return;

> +       }

> +

> +       if (test__start_subtest("bpf_check_mtu XDP-attach"))

> +               test_check_mtu_xdp(skel);

> +

> +       test_check_mtu__destroy(skel);

> +}

> diff --git a/tools/testing/selftests/bpf/progs/test_check_mtu.c b/tools/testing/selftests/bpf/progs/test_check_mtu.c

> new file mode 100644

> index 000000000000..ab97ec925a32

> --- /dev/null

> +++ b/tools/testing/selftests/bpf/progs/test_check_mtu.c

> @@ -0,0 +1,33 @@

> +// SPDX-License-Identifier: GPL-2.0

> +/* Copyright (c) 2020 Red Hat */

> +#include <linux/bpf.h>

> +#include <bpf/bpf_helpers.h>

> +

> +#include <stddef.h>

> +#include <stdint.h>

> +

> +char _license[] SEC("license") = "GPL";

> +

> +SEC("xdp")

> +int xdp_use_helper(struct xdp_md *ctx)

> +{

> +       uint32_t mtu_len = 0;

> +       int delta = 20;

> +

> +       if (bpf_check_mtu(ctx, 0, &mtu_len, delta, 0)) {

> +               return XDP_ABORTED;

> +       }


nit: unnecessary {} for single-line if; same below

> +       return XDP_PASS;

> +}

> +

> +SEC("classifier")

> +int tc_use_helper(struct __sk_buff *ctx)

> +{

> +       uint32_t mtu_len = 0;

> +       int delta = -20;

> +

> +       if (bpf_check_mtu(ctx, 0, &mtu_len, delta, 0)) {

> +               return BPF_DROP;

> +       }

> +       return BPF_OK;

> +}

>

>
Jesper Dangaard Brouer Nov. 24, 2020, 2:33 p.m. UTC | #3
On Fri, 20 Nov 2020 23:41:11 -0800
Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:

> On Fri, Nov 20, 2020 at 8:21 AM Jesper Dangaard Brouer

> <brouer@redhat.com> wrote:

> >

> > Adding selftest for BPF-helper bpf_check_mtu(). Making sure

> > it can be used from both XDP and TC.

> >

> > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>

> > ---

> >  tools/testing/selftests/bpf/prog_tests/check_mtu.c |   37 ++++++++++++++++++++

> >  tools/testing/selftests/bpf/progs/test_check_mtu.c |   33 ++++++++++++++++++

> >  2 files changed, 70 insertions(+)

> >  create mode 100644 tools/testing/selftests/bpf/prog_tests/check_mtu.c

> >  create mode 100644 tools/testing/selftests/bpf/progs/test_check_mtu.c

> >

> > diff --git a/tools/testing/selftests/bpf/prog_tests/check_mtu.c b/tools/testing/selftests/bpf/prog_tests/check_mtu.c

> > new file mode 100644

> > index 000000000000..09b8f986a17b

> > --- /dev/null

> > +++ b/tools/testing/selftests/bpf/prog_tests/check_mtu.c

> > @@ -0,0 +1,37 @@

> > +// SPDX-License-Identifier: GPL-2.0

> > +/* Copyright (c) 2020 Red Hat */

> > +#include <uapi/linux/bpf.h>

> > +#include <linux/if_link.h>

> > +#include <test_progs.h>

> > +

> > +#include "test_check_mtu.skel.h"

> > +#define IFINDEX_LO 1

> > +

> > +void test_check_mtu_xdp(struct test_check_mtu *skel)  

> 

> this should be static func, otherwise it's treated as an independent selftest.


Ok, fixed.

> > +{

> > +       int err = 0;

> > +       int fd;

> > +

> > +       fd = bpf_program__fd(skel->progs.xdp_use_helper);

> > +       err = bpf_set_link_xdp_fd(IFINDEX_LO, fd, XDP_FLAGS_SKB_MODE);

> > +       if (CHECK_FAIL(err))  

> 

> please use CHECK() or one of ASSERT_xxx() helpers. CHECK_FAIL() should

> be used for high-volume unlikely to fail test (i.e., very rarely).


I could not get CHECK() macro working.  I now realize that this is
because I've not defined a global static variable named 'duration'.

 static __u32 duration;

I wonder, are there any best-practice documentation or blogpost on
howto write these bpf-selftests?


Below signature is the compile error for others to Google for, and
solution above.
- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


$ make
  TEST-OBJ [test_progs] check_mtu.test.o
In file included from /home/jbrouer/git/kernel/bpf-next/tools/testing/selftests/bpf/prog_tests/check_mtu.c:6:
/home/jbrouer/git/kernel/bpf-next/tools/testing/selftests/bpf/prog_tests/check_mtu.c: In function ‘test_check_mtu’:
./test_progs.h:129:25: error: ‘duration’ undeclared (first use in this function)
  129 |  _CHECK(condition, tag, duration, format)
      |                         ^~~~~~~~
./test_progs.h:111:25: note: in definition of macro ‘_CHECK’
  111 |          __func__, tag, duration);   \
      |                         ^~~~~~~~
/home/jbrouer/git/kernel/bpf-next/tools/testing/selftests/bpf/prog_tests/check_mtu.c:33:6: note: in expansion of macro ‘CHECK’
   33 |  if (CHECK(!skel, "open and load skel", "failed"))
      |      ^~~~~
./test_progs.h:129:25: note: each undeclared identifier is reported only once for each function it appears in
  129 |  _CHECK(condition, tag, duration, format)
      |                         ^~~~~~~~
./test_progs.h:111:25: note: in definition of macro ‘_CHECK’
  111 |          __func__, tag, duration);   \
      |                         ^~~~~~~~
/home/jbrouer/git/kernel/bpf-next/tools/testing/selftests/bpf/prog_tests/check_mtu.c:33:6: note: in expansion of macro ‘CHECK’
   33 |  if (CHECK(!skel, "open and load skel", "failed"))
      |      ^~~~~
make: *** [Makefile:396: /home/jbrouer/git/kernel/bpf-next/tools/testing/selftests/bpf/check_mtu.test.o] Error 1
Yonghong Song Nov. 24, 2020, 4:50 p.m. UTC | #4
On 11/24/20 6:33 AM, Jesper Dangaard Brouer wrote:
> On Fri, 20 Nov 2020 23:41:11 -0800

> Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:

> 

>> On Fri, Nov 20, 2020 at 8:21 AM Jesper Dangaard Brouer

>> <brouer@redhat.com> wrote:

>>>

>>> Adding selftest for BPF-helper bpf_check_mtu(). Making sure

>>> it can be used from both XDP and TC.

>>>

>>> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>

>>> ---

>>>   tools/testing/selftests/bpf/prog_tests/check_mtu.c |   37 ++++++++++++++++++++

>>>   tools/testing/selftests/bpf/progs/test_check_mtu.c |   33 ++++++++++++++++++

>>>   2 files changed, 70 insertions(+)

>>>   create mode 100644 tools/testing/selftests/bpf/prog_tests/check_mtu.c

>>>   create mode 100644 tools/testing/selftests/bpf/progs/test_check_mtu.c

>>>

>>> diff --git a/tools/testing/selftests/bpf/prog_tests/check_mtu.c b/tools/testing/selftests/bpf/prog_tests/check_mtu.c

>>> new file mode 100644

>>> index 000000000000..09b8f986a17b

>>> --- /dev/null

>>> +++ b/tools/testing/selftests/bpf/prog_tests/check_mtu.c

>>> @@ -0,0 +1,37 @@

>>> +// SPDX-License-Identifier: GPL-2.0

>>> +/* Copyright (c) 2020 Red Hat */

>>> +#include <uapi/linux/bpf.h>

>>> +#include <linux/if_link.h>

>>> +#include <test_progs.h>

>>> +

>>> +#include "test_check_mtu.skel.h"

>>> +#define IFINDEX_LO 1

>>> +

>>> +void test_check_mtu_xdp(struct test_check_mtu *skel)

>>

>> this should be static func, otherwise it's treated as an independent selftest.

> 

> Ok, fixed.

> 

>>> +{

>>> +       int err = 0;

>>> +       int fd;

>>> +

>>> +       fd = bpf_program__fd(skel->progs.xdp_use_helper);

>>> +       err = bpf_set_link_xdp_fd(IFINDEX_LO, fd, XDP_FLAGS_SKB_MODE);

>>> +       if (CHECK_FAIL(err))

>>

>> please use CHECK() or one of ASSERT_xxx() helpers. CHECK_FAIL() should

>> be used for high-volume unlikely to fail test (i.e., very rarely).

> 

> I could not get CHECK() macro working.  I now realize that this is

> because I've not defined a global static variable named 'duration'.

> 

>   static __u32 duration;

> 

> I wonder, are there any best-practice documentation or blogpost on

> howto write these bpf-selftests?


The 'duration' in old days is used to measure performance. Today most
of selftests actually do not need this. We do not have doc/blogpost
for this. The best is to look at other files under prog_tests to see
how they handle duration ...

> 

> 

> Below signature is the compile error for others to Google for, and

> solution above.

> -

> Best regards,

>    Jesper Dangaard Brouer

>    MSc.CS, Principal Kernel Engineer at Red Hat

>    LinkedIn: http://www.linkedin.com/in/brouer

> 

> 

> $ make

>    TEST-OBJ [test_progs] check_mtu.test.o

> In file included from /home/jbrouer/git/kernel/bpf-next/tools/testing/selftests/bpf/prog_tests/check_mtu.c:6:

> /home/jbrouer/git/kernel/bpf-next/tools/testing/selftests/bpf/prog_tests/check_mtu.c: In function ‘test_check_mtu’:

> ./test_progs.h:129:25: error: ‘duration’ undeclared (first use in this function)

>    129 |  _CHECK(condition, tag, duration, format)

>        |                         ^~~~~~~~

> ./test_progs.h:111:25: note: in definition of macro ‘_CHECK’

>    111 |          __func__, tag, duration);   \

>        |                         ^~~~~~~~

> /home/jbrouer/git/kernel/bpf-next/tools/testing/selftests/bpf/prog_tests/check_mtu.c:33:6: note: in expansion of macro ‘CHECK’

>     33 |  if (CHECK(!skel, "open and load skel", "failed"))

>        |      ^~~~~

> ./test_progs.h:129:25: note: each undeclared identifier is reported only once for each function it appears in

>    129 |  _CHECK(condition, tag, duration, format)

>        |                         ^~~~~~~~

> ./test_progs.h:111:25: note: in definition of macro ‘_CHECK’

>    111 |          __func__, tag, duration);   \

>        |                         ^~~~~~~~

> /home/jbrouer/git/kernel/bpf-next/tools/testing/selftests/bpf/prog_tests/check_mtu.c:33:6: note: in expansion of macro ‘CHECK’

>     33 |  if (CHECK(!skel, "open and load skel", "failed"))

>        |      ^~~~~

> make: *** [Makefile:396: /home/jbrouer/git/kernel/bpf-next/tools/testing/selftests/bpf/check_mtu.test.o] Error 1

>
Daniel Borkmann Dec. 2, 2020, 9:44 p.m. UTC | #5
On 11/20/20 5:18 PM, Jesper Dangaard Brouer wrote:
> BPF end-user on Cilium slack-channel (Carlo Carraro) wants to use

> bpf_fib_lookup for doing MTU-check, but *prior* to extending packet size,

> by adjusting fib_params 'tot_len' with the packet length plus the

> expected encap size. (Just like the bpf_check_mtu helper supports). He

> discovered that for SKB ctx the param->tot_len was not used, instead

> skb->len was used (via MTU check in is_skb_forwardable()).

> 

> Fix this by using fib_params 'tot_len' for MTU check.  If not provided

> (e.g. zero) then keep existing behaviour intact.

> 

> Fixes: 4c79579b44b1 ("bpf: Change bpf_fib_lookup to return lookup status")

> Reported-by: Carlo Carraro <colrack@gmail.com>

> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>

> ---

>   net/core/filter.c |   14 ++++++++++++--

>   1 file changed, 12 insertions(+), 2 deletions(-)

> 

> diff --git a/net/core/filter.c b/net/core/filter.c

> index 1ee97fdeea64..84d77c425fbe 100644

> --- a/net/core/filter.c

> +++ b/net/core/filter.c

> @@ -5565,11 +5565,21 @@ BPF_CALL_4(bpf_skb_fib_lookup, struct sk_buff *, skb,

>   #endif

>   	}

>   

> -	if (!rc) {

> +	if (rc == BPF_FIB_LKUP_RET_SUCCESS) {

>   		struct net_device *dev;

> +		u32 mtu;

>   

>   		dev = dev_get_by_index_rcu(net, params->ifindex);

> -		if (!is_skb_forwardable(dev, skb))

> +		mtu = READ_ONCE(dev->mtu);

> +

> +		/* Using tot_len for (L3) MTU check if provided by user */

> +		if (params->tot_len && params->tot_len > mtu)

> +			rc = BPF_FIB_LKUP_RET_FRAG_NEEDED;


Is there a reason why we cannot reuse and at the same time optimize the built-in
bpf_ipv{4,6}_fib_lookup() check_mtu as we do in XDP when params->tot_len was
specified.. something as presented earlier [0]? My biggest concern for gso skbs
is that the above might be subject to breakage from one kernel version to another
if it was filled from the packet. So going back and building upon [0], to be on
safe side we could have a new flag like below to indicate wanted behavior:

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index c3458ec1f30a..d3cd2f47f011 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -4934,8 +4934,9 @@ struct bpf_raw_tracepoint_args {
   * OUTPUT:  Do lookup from egress perspective; default is ingress
   */
  enum {
-	BPF_FIB_LOOKUP_DIRECT  = (1U << 0),
-	BPF_FIB_LOOKUP_OUTPUT  = (1U << 1),
+	BPF_FIB_LOOKUP_DIRECT		= (1U << 0),
+	BPF_FIB_LOOKUP_OUTPUT		= (1U << 1),
+	BPF_FIB_LOOKUP_ALWAYS_MTU_CHECK	= (1U << 2),
  };

  enum {
diff --git a/net/core/filter.c b/net/core/filter.c
index 2ca5eecebacf..4fb876ebd6a0 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5547,37 +5547,27 @@ static const struct bpf_func_proto bpf_xdp_fib_lookup_proto = {
  BPF_CALL_4(bpf_skb_fib_lookup, struct sk_buff *, skb,
  	   struct bpf_fib_lookup *, params, int, plen, u32, flags)
  {
-	struct net *net = dev_net(skb->dev);
-	int rc = -EAFNOSUPPORT;
+	bool check_mtu = !skb_is_gso(skb) ||
+			 (flags & BPF_FIB_LOOKUP_ALWAYS_MTU_CHECK);

  	if (plen < sizeof(*params))
  		return -EINVAL;

-	if (flags & ~(BPF_FIB_LOOKUP_DIRECT | BPF_FIB_LOOKUP_OUTPUT))
+	if (flags & ~(BPF_FIB_LOOKUP_DIRECT | BPF_FIB_LOOKUP_OUTPUT |
+		      BPF_FIB_LOOKUP_ALWAYS_MTU_CHECK))
  		return -EINVAL;

  	switch (params->family) {
  #if IS_ENABLED(CONFIG_INET)
  	case AF_INET:
-		rc = bpf_ipv4_fib_lookup(net, params, flags, false);
-		break;
+		return bpf_ipv4_fib_lookup(net, params, flags, check_mtu);
  #endif
  #if IS_ENABLED(CONFIG_IPV6)
  	case AF_INET6:
-		rc = bpf_ipv6_fib_lookup(net, params, flags, false);
-		break;
+		return bpf_ipv6_fib_lookup(net, params, flags, check_mtu);
  #endif
  	}
-
-	if (!rc) {
-		struct net_device *dev;
-
-		dev = dev_get_by_index_rcu(net, params->ifindex);
-		if (!is_skb_forwardable(dev, skb))
-			rc = BPF_FIB_LKUP_RET_FRAG_NEEDED;
-	}
-
-	return rc;
+	return -EAFNOSUPPORT;
  }

  static const struct bpf_func_proto bpf_skb_fib_lookup_proto = {


   [0] https://lore.kernel.org/bpf/65d8f988-5b41-24c2-8501-7cbbddb1238e@iogearbox.net/


> +		/* Notice at this TC cls_bpf level skb->len contains L2 size,

> +		 * but is_skb_forwardable takes that into account

> +		 */

> +		if (params->tot_len == 0 && !is_skb_forwardable(dev, skb))

>   			rc = BPF_FIB_LKUP_RET_FRAG_NEEDED;

>   	}

>   

> 

>
Daniel Borkmann Dec. 2, 2020, 10 p.m. UTC | #6
On 12/2/20 10:44 PM, Daniel Borkmann wrote:
> On 11/20/20 5:18 PM, Jesper Dangaard Brouer wrote:

>> BPF end-user on Cilium slack-channel (Carlo Carraro) wants to use

>> bpf_fib_lookup for doing MTU-check, but *prior* to extending packet size,

>> by adjusting fib_params 'tot_len' with the packet length plus the

>> expected encap size. (Just like the bpf_check_mtu helper supports). He

>> discovered that for SKB ctx the param->tot_len was not used, instead

>> skb->len was used (via MTU check in is_skb_forwardable()).

>>

>> Fix this by using fib_params 'tot_len' for MTU check.  If not provided

>> (e.g. zero) then keep existing behaviour intact.

>>

>> Fixes: 4c79579b44b1 ("bpf: Change bpf_fib_lookup to return lookup status")

>> Reported-by: Carlo Carraro <colrack@gmail.com>

>> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>

>> ---

>>   net/core/filter.c |   14 ++++++++++++--

>>   1 file changed, 12 insertions(+), 2 deletions(-)

>>

>> diff --git a/net/core/filter.c b/net/core/filter.c

>> index 1ee97fdeea64..84d77c425fbe 100644

>> --- a/net/core/filter.c

>> +++ b/net/core/filter.c

>> @@ -5565,11 +5565,21 @@ BPF_CALL_4(bpf_skb_fib_lookup, struct sk_buff *, skb,

>>   #endif

>>       }

>> -    if (!rc) {

>> +    if (rc == BPF_FIB_LKUP_RET_SUCCESS) {

>>           struct net_device *dev;

>> +        u32 mtu;

>>           dev = dev_get_by_index_rcu(net, params->ifindex);

>> -        if (!is_skb_forwardable(dev, skb))

>> +        mtu = READ_ONCE(dev->mtu);

>> +

>> +        /* Using tot_len for (L3) MTU check if provided by user */

>> +        if (params->tot_len && params->tot_len > mtu)

>> +            rc = BPF_FIB_LKUP_RET_FRAG_NEEDED;

> 

> Is there a reason why we cannot reuse and at the same time optimize the built-in

> bpf_ipv{4,6}_fib_lookup() check_mtu as we do in XDP when params->tot_len was

> specified.. something as presented earlier [0]? My biggest concern for gso skbs

> is that the above might be subject to breakage from one kernel version to another

> if it was filled from the packet. So going back and building upon [0], to be on

> safe side we could have a new flag like below to indicate wanted behavior:


Also means that if params->tot_len from prog input was 0 we won't break either,
and we would be closer to XDP semantics overall which is desirable. Yes, extra
flag for skb case to force it which is not great, but I'm not sure how it would
be compatible without it tbh.

> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h

> index c3458ec1f30a..d3cd2f47f011 100644

> --- a/include/uapi/linux/bpf.h

> +++ b/include/uapi/linux/bpf.h

> @@ -4934,8 +4934,9 @@ struct bpf_raw_tracepoint_args {

>    * OUTPUT:  Do lookup from egress perspective; default is ingress

>    */

>   enum {

> -    BPF_FIB_LOOKUP_DIRECT  = (1U << 0),

> -    BPF_FIB_LOOKUP_OUTPUT  = (1U << 1),

> +    BPF_FIB_LOOKUP_DIRECT        = (1U << 0),

> +    BPF_FIB_LOOKUP_OUTPUT        = (1U << 1),

> +    BPF_FIB_LOOKUP_ALWAYS_MTU_CHECK    = (1U << 2),

>   };

> 

>   enum {

> diff --git a/net/core/filter.c b/net/core/filter.c

> index 2ca5eecebacf..4fb876ebd6a0 100644

> --- a/net/core/filter.c

> +++ b/net/core/filter.c

> @@ -5547,37 +5547,27 @@ static const struct bpf_func_proto bpf_xdp_fib_lookup_proto = {

>   BPF_CALL_4(bpf_skb_fib_lookup, struct sk_buff *, skb,

>          struct bpf_fib_lookup *, params, int, plen, u32, flags)

>   {

> -    struct net *net = dev_net(skb->dev);

> -    int rc = -EAFNOSUPPORT;

> +    bool check_mtu = !skb_is_gso(skb) ||

> +             (flags & BPF_FIB_LOOKUP_ALWAYS_MTU_CHECK);

> 

>       if (plen < sizeof(*params))

>           return -EINVAL;

> 

> -    if (flags & ~(BPF_FIB_LOOKUP_DIRECT | BPF_FIB_LOOKUP_OUTPUT))

> +    if (flags & ~(BPF_FIB_LOOKUP_DIRECT | BPF_FIB_LOOKUP_OUTPUT |

> +              BPF_FIB_LOOKUP_ALWAYS_MTU_CHECK))

>           return -EINVAL;

> 

>       switch (params->family) {

>   #if IS_ENABLED(CONFIG_INET)

>       case AF_INET:

> -        rc = bpf_ipv4_fib_lookup(net, params, flags, false);

> -        break;

> +        return bpf_ipv4_fib_lookup(net, params, flags, check_mtu);

>   #endif

>   #if IS_ENABLED(CONFIG_IPV6)

>       case AF_INET6:

> -        rc = bpf_ipv6_fib_lookup(net, params, flags, false);

> -        break;

> +        return bpf_ipv6_fib_lookup(net, params, flags, check_mtu);

>   #endif

>       }

> -

> -    if (!rc) {

> -        struct net_device *dev;

> -

> -        dev = dev_get_by_index_rcu(net, params->ifindex);

> -        if (!is_skb_forwardable(dev, skb))

> -            rc = BPF_FIB_LKUP_RET_FRAG_NEEDED;

> -    }

> -

> -    return rc;

> +    return -EAFNOSUPPORT;

>   }

> 

>   static const struct bpf_func_proto bpf_skb_fib_lookup_proto = {

> 

> 

>    [0] https://lore.kernel.org/bpf/65d8f988-5b41-24c2-8501-7cbbddb1238e@iogearbox.net/
Stanislav Fomichev Dec. 3, 2020, 6:25 p.m. UTC | #7
On 11/20, Jesper Dangaard Brouer wrote:
> This BPF-helper bpf_check_mtu() works for both XDP and TC-BPF programs.


> The SKB object is complex and the skb->len value (accessible from

> BPF-prog) also include the length of any extra GRO/GSO segments, but

> without taking into account that these GRO/GSO segments get added

> transport (L4) and network (L3) headers before being transmitted. Thus,

> this BPF-helper is created such that the BPF-programmer don't need to

> handle these details in the BPF-prog.


> The API is designed to help the BPF-programmer, that want to do packet

> context size changes, which involves other helpers. These other helpers

> usually does a delta size adjustment. This helper also support a delta

> size (len_diff), which allow BPF-programmer to reuse arguments needed by

> these other helpers, and perform the MTU check prior to doing any actual

> size adjustment of the packet context.


> It is on purpose, that we allow the len adjustment to become a negative

> result, that will pass the MTU check. This might seem weird, but it's not

> this helpers responsibility to "catch" wrong len_diff adjustments. Other

> helpers will take care of these checks, if BPF-programmer chooses to do

> actual size adjustment.


> V6:

> - Took John's advice and dropped BPF_MTU_CHK_RELAX

> - Returned MTU is kept at L3-level (like fib_lookup)


> V4: Lot of changes

>   - ifindex 0 now use current netdev for MTU lookup

>   - rename helper from bpf_mtu_check to bpf_check_mtu

>   - fix bug for GSO pkt length (as skb->len is total len)

>   - remove __bpf_len_adj_positive, simply allow negative len adj


> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>

> ---

>   include/uapi/linux/bpf.h       |   67 ++++++++++++++++++++++

>   net/core/filter.c              |  122  

> ++++++++++++++++++++++++++++++++++++++++

>   tools/include/uapi/linux/bpf.h |   67 ++++++++++++++++++++++

>   3 files changed, 256 insertions(+)


> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h

> index beacd312ea17..2619ea8c5a08 100644

> --- a/include/uapi/linux/bpf.h

> +++ b/include/uapi/linux/bpf.h

> @@ -3790,6 +3790,61 @@ union bpf_attr {

>    *		*ARG_PTR_TO_BTF_ID* of type *task_struct*.

>    *	Return

>    *		Pointer to the current task.

> + *

> + * int bpf_check_mtu(void *ctx, u32 ifindex, u32 *mtu_len, s32 len_diff,  

> u64 flags)

> + *	Description

> + *		Check ctx packet size against MTU of net device (based on

> + *		*ifindex*).  This helper will likely be used in combination with

> + *		helpers that adjust/change the packet size.  The argument

> + *		*len_diff* can be used for querying with a planned size

> + *		change. This allows to check MTU prior to changing packet ctx.

> + *

> + *		Specifying *ifindex* zero means the MTU check is performed

> + *		against the current net device.  This is practical if this isn't

> + *		used prior to redirect.

> + *

> + *		The Linux kernel route table can configure MTUs on a more

> + *		specific per route level, which is not provided by this helper.

> + *		For route level MTU checks use the **bpf_fib_lookup**\ ()

> + *		helper.

> + *

> + *		*ctx* is either **struct xdp_md** for XDP programs or

> + *		**struct sk_buff** for tc cls_act programs.

> + *

> + *		The *flags* argument can be a combination of one or more of the

> + *		following values:

> + *

> + *		**BPF_MTU_CHK_SEGS**

> + *			This flag will only works for *ctx* **struct sk_buff**.

> + *			If packet context contains extra packet segment buffers

> + *			(often knows as GSO skb), then MTU check is harder to

> + *			check at this point, because in transmit path it is

> + *			possible for the skb packet to get re-segmented

> + *			(depending on net device features).  This could still be

> + *			a MTU violation, so this flag enables performing MTU

> + *			check against segments, with a different violation

> + *			return code to tell it apart. Check cannot use len_diff.

> + *

> + *		On return *mtu_len* pointer contains the MTU value of the net

> + *		device.  Remember the net device configured MTU is the L3 size,

> + *		which is returned here and XDP and TX length operate at L2.

> + *		Helper take this into account for you, but remember when using

> + *		MTU value in your BPF-code.  On input *mtu_len* must be a valid

> + *		pointer and be initialized (to zero), else verifier will reject

> + *		BPF program.

> + *

> + *	Return

> + *		* 0 on success, and populate MTU value in *mtu_len* pointer.

> + *

> + *		* < 0 if any input argument is invalid (*mtu_len* not updated)

> + *

> + *		MTU violations return positive values, but also populate MTU

> + *		value in *mtu_len* pointer, as this can be needed for

> + *		implementing PMTU handing:

> + *

> + *		* **BPF_MTU_CHK_RET_FRAG_NEEDED**

> + *		* **BPF_MTU_CHK_RET_SEGS_TOOBIG**

> + *

>    */

>   #define __BPF_FUNC_MAPPER(FN)		\

>   	FN(unspec),			\

> @@ -3951,6 +4006,7 @@ union bpf_attr {

>   	FN(task_storage_get),		\

>   	FN(task_storage_delete),	\

>   	FN(get_current_task_btf),	\

> +	FN(check_mtu),			\

>   	/* */


>   /* integer value in 'imm' field of BPF_CALL instruction selects which  

> helper

> @@ -4978,6 +5034,17 @@ struct bpf_redir_neigh {

>   	};

>   };


> +/* bpf_check_mtu flags*/

> +enum  bpf_check_mtu_flags {

> +	BPF_MTU_CHK_SEGS  = (1U << 0),

> +};

> +

> +enum bpf_check_mtu_ret {

> +	BPF_MTU_CHK_RET_SUCCESS,      /* check and lookup successful */

> +	BPF_MTU_CHK_RET_FRAG_NEEDED,  /* fragmentation required to fwd */

> +	BPF_MTU_CHK_RET_SEGS_TOOBIG,  /* GSO re-segmentation needed to fwd */

> +};

> +

>   enum bpf_task_fd_type {

>   	BPF_FD_TYPE_RAW_TRACEPOINT,	/* tp name */

>   	BPF_FD_TYPE_TRACEPOINT,		/* tp name */

> diff --git a/net/core/filter.c b/net/core/filter.c

> index 25b137ffdced..d6125cfc49c3 100644

> --- a/net/core/filter.c

> +++ b/net/core/filter.c

> @@ -5604,6 +5604,124 @@ static const struct bpf_func_proto  

> bpf_skb_fib_lookup_proto = {

>   	.arg4_type	= ARG_ANYTHING,

>   };


> +static struct net_device *__dev_via_ifindex(struct net_device *dev_curr,

> +					    u32 ifindex)

> +{

> +	struct net *netns = dev_net(dev_curr);

> +

> +	/* Non-redirect use-cases can use ifindex=0 and save ifindex lookup */

> +	if (ifindex == 0)

> +		return dev_curr;

> +

> +	return dev_get_by_index_rcu(netns, ifindex);

> +}

> +

> +BPF_CALL_5(bpf_skb_check_mtu, struct sk_buff *, skb,

> +	   u32, ifindex, u32 *, mtu_len, s32, len_diff, u64, flags)

> +{

> +	int ret = BPF_MTU_CHK_RET_FRAG_NEEDED;

> +	struct net_device *dev = skb->dev;

> +	int len;

> +	int mtu;

> +

> +	if (flags & ~(BPF_MTU_CHK_SEGS))

> +		return -EINVAL;

> +

> +	dev = __dev_via_ifindex(dev, ifindex);

> +	if (!dev)

> +		return -ENODEV;

> +

> +	mtu = READ_ONCE(dev->mtu);

> +

> +	/* TC len is L2, remove L2-header as dev MTU is L3 size */


[..]
> +	len = skb->len - ETH_HLEN;

Any reason not to do s/ETH_HLEN/dev->hard_header_len/ (or min_header_len?)
thought this patch?
Jesper Dangaard Brouer Dec. 14, 2020, 11:52 a.m. UTC | #8
On Thu, 3 Dec 2020 10:25:41 -0800
sdf@google.com wrote:

> > +BPF_CALL_5(bpf_skb_check_mtu, struct sk_buff *, skb,

> > +	   u32, ifindex, u32 *, mtu_len, s32, len_diff, u64, flags)

> > +{

> > +	int ret = BPF_MTU_CHK_RET_FRAG_NEEDED;

> > +	struct net_device *dev = skb->dev;

> > +	int len;

> > +	int mtu;

> > +

> > +	if (flags & ~(BPF_MTU_CHK_SEGS))

> > +		return -EINVAL;

> > +

> > +	dev = __dev_via_ifindex(dev, ifindex);

> > +	if (!dev)

> > +		return -ENODEV;

> > +

> > +	mtu = READ_ONCE(dev->mtu);

> > +

> > +	/* TC len is L2, remove L2-header as dev MTU is L3 size */  

> 

> [..]

> > +	len = skb->len - ETH_HLEN;  

> Any reason not to do s/ETH_HLEN/dev->hard_header_len/ (or min_header_len?)

> thought this patch?


Will fix in V9.

There is a very small (performance) overhead, but mostly because
net_device struct layout have placed mtu and hard_header_len on
different cache-lines. (This is something that should be fixed
separately).

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer