[net,v4] net: xdp: account for layer 3 packets in generic skb handler

Message ID 20200813195816.67222-1-Jason@zx2c4.com
State New

Commit Message

Jason A. Donenfeld Aug. 13, 2020, 7:58 p.m. UTC
A user reported that packets from wireguard were possibly ignored by XDP
[1]. Another user reported that modifying packets from layer 3
interfaces results in impossible-to-diagnose drops.

The generic skb XDP handler path assumes that packets always have an
ethernet header, which isn't the case for layer 3 packets, produced by
multiple drivers. This patch fixes the oversight. If mac_len is 0 and
so is hard_header_len, then we know that the skb is a layer 3 packet,
and in that case we prepend a pseudo ethhdr to the packet, whose
h_proto is copied from skb->protocol and thus carries the appropriate
v4 or v6 ethertype. This keeps XDP programs' assumption that packets
always have an ethernet header correct, so that existing code doesn't
break, while still allowing layer 3 devices to use the generic XDP
handler.

For the XDP_PASS case, we remove the added ethernet header after the
program runs, and for the XDP_REDIRECT and XDP_TX cases, we leave it on.
This mirrors the logic for layer 2 packets, where mac_len is part of the
buffer given to xdp and is then pushed back on for the XDP_REDIRECT and
XDP_TX cases, while for the XDP_PASS case it has already been naturally
removed, since it always existed in the head space. This should preserve
semantics for both cases.
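
To illustrate the assumption being preserved, here is a minimal sketch
of an XDP program (not part of this patch; the name and logic are made
up for illustration) that, like most existing programs, begins by
parsing an ethernet header. With the pseudo header prepended, the same
program behaves sensibly when attached to a layer 3 device such as
wireguard:

    #include <linux/bpf.h>
    #include <linux/if_ether.h>
    #include <linux/ip.h>
    #include <bpf/bpf_endian.h>
    #include <bpf/bpf_helpers.h>

    SEC("xdp")
    int drop_low_ttl(struct xdp_md *ctx)
    {
            void *data_end = (void *)(long)ctx->data_end;
            void *data = (void *)(long)ctx->data;
            struct ethhdr *eth = data;
            struct iphdr *iph;

            /* Bounds-check and parse the ethernet header. On a layer 3
             * device this now sees the pseudo header, whose h_proto was
             * copied from skb->protocol. */
            if ((void *)(eth + 1) > data_end)
                    return XDP_PASS;
            if (eth->h_proto != bpf_htons(ETH_P_IP))
                    return XDP_PASS;

            iph = (void *)(eth + 1);
            if ((void *)(iph + 1) > data_end)
                    return XDP_PASS;

            /* Drop packets whose TTL is about to expire. */
            return iph->ttl <= 1 ? XDP_DROP : XDP_PASS;
    }

    char _license[] SEC("license") = "GPL";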

Previous discussions have included the point that maybe XDP should
remain intentionally broken on layer 3 interfaces, and that layer 3
people should be using cls_bpf instead. However, I think there are good
grounds to reconsider this perspective:

- Complicated deployments wind up applying XDP modifications to a
  variety of different devices on a given host, some of which use
  specialized ethernet cards while others use virtual layer 3
  interfaces, such as WireGuard. Being able to apply one codebase to
  each of these winds up being essential.

- cls_bpf does not support the same feature set as XDP, and operates at
  a slightly different stage in the networking stack. You may reply,
  "then add all the features you want to cls_bpf", but that seems to be
  missing the point, and would still result in there being two ways to
  do everything, which is not desirable for anyone actually _using_ this
  code.

- While XDP was originally made for hardware offloading, and while many
  look disdainfully upon the generic mode, it nevertheless remains a
  highly useful and popular way of adding bespoke packet
  transformations, and from that perspective, a difference between layer
  2 and layer 3 packets is immaterial if the user is primarily concerned
  with transformations to layer 3 and beyond.

[1] https://lore.kernel.org/wireguard/M5WzVK5--3-2@tuta.io/

Reported-by: Thomas Ptacek <thomas@sockpuppet.org>
Reported-by: Adhipati Blambangan <adhipati@tuta.io>
Cc: David Ahern <dsahern@gmail.com>
Cc: Toke Høiland-Jørgensen <toke@redhat.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>

---
I had originally dropped this patch, but the issue kept coming up in
user reports, so here's a v4 of it. Testing of it is still rather slim,
but hopefully that will change in the coming days.

Changes v3->v4:
- We now preserve the same logic for XDP_TX/XDP_REDIRECT as before.
- hard_header_len is checked in addition to mac_len.

 net/core/dev.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

-- 
2.28.0

Comments

Jakub Kicinski Aug. 13, 2020, 9:01 p.m. UTC | #1
On Thu, 13 Aug 2020 21:58:16 +0200 Jason A. Donenfeld wrote:
> - cls_bpf does not support the same feature set as XDP, and operates at
>   a slightly different stage in the networking stack.

Please elaborate.

> I had originally dropped this patch, but the issue kept coming up in
> user reports, so here's a v4 of it. Testing of it is still rather slim,
> but hopefully that will change in the coming days.

Here's an alternative patch, untested:

diff --git a/net/core/dev.c b/net/core/dev.c
index d3d53dc601f9..7f68831bdd4f 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5434,6 +5434,11 @@ static int generic_xdp_install(struct net_device *dev, struct netdev_bpf *xdp)
        struct bpf_prog *new = xdp->prog;
        int ret = 0;
 
+       if (!dev->hard_header_len) {
+               NL_SET_ERR_MSG(xdp->extack, "generic XDP not supported on L3 devices, please use cls_bpf");
+               return -EINVAL;
+       }
+
        if (new) {
                u32 i;
Jason A. Donenfeld Aug. 14, 2020, 6:56 a.m. UTC | #2
On Thu, Aug 13, 2020 at 11:01 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > I had originally dropped this patch, but the issue kept coming up in
> > user reports, so here's a v4 of it. Testing of it is still rather slim,
> > but hopefully that will change in the coming days.
>
> Here's an alternative patch, untested:

Funny. But come on now... Why would we want to deprive our users of
system consistency?

Doesn't it make sense to allow users to use the same code across
interfaces? You actually want them to rewrite their code to use a
totally different trigger point just because of some weird kernel
internals between interfaces?

Why not make XDP more useful and more generic across interfaces? It's
very common for systems to be receiving packets with a heavy ethernet
card from the current data center, in addition to receiving packets
from a tunnel interface connected to a remote data center, with a need
to run the same XDP program on both interfaces. Why not support that
kind of simplicity?

This is _actually_ something that's come up _repeatedly_. This is a
real world need from real users who are doing real things. Why not
help them?

It's not at the expense of any formal consistency, or performance, or
even semantic perfection. It costs very little to support these
popular use cases.

[FYI, there's one tweak I'd like to make, so I'll probably send v5 ~soon.]

Jason
Jakub Kicinski Aug. 14, 2020, 3:31 p.m. UTC | #3
On Fri, 14 Aug 2020 08:56:48 +0200 Jason A. Donenfeld wrote:
> On Thu, Aug 13, 2020 at 11:01 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > > I had originally dropped this patch, but the issue kept coming up in
> > > user reports, so here's a v4 of it. Testing of it is still rather slim,
> > > but hopefully that will change in the coming days.
> >
> > Here's an alternative patch, untested:
>
> Funny. But come on now... Why would we want to deprive our users of
> system consistency?

We should try for consistency between xdp and cls_bpf instead.

> Doesn't it make sense to allow users to use the same code across
> interfaces? You actually want them to rewrite their code to use a
> totally different trigger point just because of some weird kernel
> internals between interfaces?

We're not building an abstraction over the kernel stack so that users
won't have to worry how things work. Users need to have a minimal
understanding of how specific hooks integrate with the stack and what
they are for, and that's why cls_bpf is actually more efficient to use
in the L3 tunnel case.

> Why not make XDP more useful and more generic across interfaces? It's
> very common for systems to be receiving packets with a heavy ethernet
> card from the current data center, in addition to receiving packets
> from a tunnel interface connected to a remote data center, with a need
> to run the same XDP program on both interfaces. Why not support that
> kind of simplicity?
>
> This is _actually_ something that's come up _repeatedly_. This is a
> real world need from real users who are doing real things. Why not
> help them?

I'm sure it comes up repeatedly because we don't return any errors,
so people waste time investigating why it doesn't work.

> It's not at the expense of any formal consistency, or performance, or
> even semantic perfection. It costs very little to support these
> popular use cases.
Jason A. Donenfeld Aug. 14, 2020, 9:04 p.m. UTC | #4
On 8/14/20, Jakub Kicinski <kuba@kernel.org> wrote:
> On Fri, 14 Aug 2020 08:56:48 +0200 Jason A. Donenfeld wrote:
>> On Thu, Aug 13, 2020 at 11:01 PM Jakub Kicinski <kuba@kernel.org> wrote:
>> > > I had originally dropped this patch, but the issue kept coming up in
>> > > user reports, so here's a v4 of it. Testing of it is still rather
>> > > slim, but hopefully that will change in the coming days.
>> >
>> > Here's an alternative patch, untested:
>>
>> Funny. But come on now... Why would we want to deprive our users of
>> system consistency?
>
> We should try for consistency between xdp and cls_bpf instead.

And still require users to reimplement their packet processing logic twice?

>> Doesn't it make sense to allow users to use the same code across
>> interfaces? You actually want them to rewrite their code to use a
>> totally different trigger point just because of some weird kernel
>> internals between interfaces?
>
> We're not building an abstraction over the kernel stack so that users
> won't have to worry how things work. Users need to have a minimal
> understanding of how specific hooks integrate with the stack and what
> they are for, and that's why cls_bpf is actually more efficient to use
> in the L3 tunnel case.

It's not like adding 7 lines of code constitutes adding an abstraction
layer. It's a pretty basic fix to make real things work for real
users. While you might argue that users should do something different,
you also can't deny that being able to hook up the same packet
processing to eth0, eth1, extrafancyeth2, and tun0 is a huge
convenience.

>> Why not make XDP more useful and more generic across interfaces? It's
>> very common for systems to be receiving packets with a heavy ethernet
>> card from the current data center, in addition to receiving packets
>> from a tunnel interface connected to a remote data center, with a need
>> to run the same XDP program on both interfaces. Why not support that
>> kind of simplicity?
>>
>> This is _actually_ something that's come up _repeatedly_. This is a
>> real world need from real users who are doing real things. Why not
>> help them?
>
> I'm sure it comes up repeatedly because we don't return any errors,
> so people waste time investigating why it doesn't work.

What? No. It comes up repeatedly because people want to reuse their
XDP processing logic with layer 3 devices. You might be right that if
we tell them to go away, maybe they will, but on the other hand, why
not make this actually work for them? It seems pretty easy to do, and
saves everyone a lot of time.

Are you worried about adding a branch to the
already-slower-and-discouraged non-hardware generic path? If so, I
wouldn't object if you wanted to put unlikely() around the branch
condition in that if statement.
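
That is, against the patch at the bottom of this thread, simply:

-	if (skb_is_l3) {
+	if (unlikely(skb_is_l3)) {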

Jason
David Miller Aug. 14, 2020, 9:14 p.m. UTC | #5
From: Jakub Kicinski <kuba@kernel.org>
Date: Fri, 14 Aug 2020 08:31:53 -0700

> I'm sure it comes up repeatedly because we don't return any errors,
> so people waste time investigating why it doesn't work.

+1
David Miller Aug. 14, 2020, 9:26 p.m. UTC | #6
From: "Jason A. Donenfeld" <Jason@zx2c4.com>

Date: Fri, 14 Aug 2020 23:04:56 +0200

> What? No. It comes up repeatedly because people want to reuse their

> XDP processing logic with layer 3 devices.


XDP is a layer 2 packet processing technology.  It assumes an L2
ethernet and/or VLAN header is going to be there.

Putting a pretend ethernet header there doesn't really change that.
Jason A. Donenfeld Aug. 15, 2020, 7:54 a.m. UTC | #7
On Fri, Aug 14, 2020 at 11:27 PM David Miller <davem@davemloft.net> wrote:
> From: "Jason A. Donenfeld" <Jason@zx2c4.com>
> Date: Fri, 14 Aug 2020 23:04:56 +0200
>
> > What? No. It comes up repeatedly because people want to reuse their
> > XDP processing logic with layer 3 devices.
>
> XDP is a layer 2 packet processing technology. It assumes an L2
> ethernet and/or VLAN header is going to be there.
>
> Putting a pretend ethernet header there doesn't really change that.

Actually, I wasn't aware that XDP was explicitly limited to L2-only,
as some kind of fundamental thing. A while back when this patchset
first came up, I initially posted something that gave unmodified L3
frames to XDP programs in the generic handler, but commenters pointed
out that this loses the skb->protocol changing capability, which could
be useful (e.g. some kind of custom v4->v6 modifying code), and adding
the pretend ethernet header would allow people to keep the same
programs for the L2 case as for the L3 case, which seemed *extremely*
compelling to me. Hence, this patch series has gone in the pretend
ethernet header direction.
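
As a rough sketch of what that enables from the program's side (purely
illustrative, not from this patchset; a real v4->v6 translator would
also need bpf_xdp_adjust_head() to grow the packet for the larger IPv6
header, which is elided here):

    #include <linux/bpf.h>
    #include <linux/if_ether.h>
    #include <bpf/bpf_endian.h>
    #include <bpf/bpf_helpers.h>

    SEC("xdp")
    int xlat_4to6(struct xdp_md *ctx)
    {
            void *data_end = (void *)(long)ctx->data_end;
            void *data = (void *)(long)ctx->data;
            struct ethhdr *eth = data;

            if ((void *)(eth + 1) > data_end)
                    return XDP_PASS;
            if (eth->h_proto == bpf_htons(ETH_P_IP)) {
                    /* ... translate the IPv4 header to IPv6 in place ... */
                    /* Signal the protocol change by rewriting h_proto in
                     * the pseudo header; the generic handler notices the
                     * changed ethernet header and rederives skb->protocol. */
                    eth->h_proto = bpf_htons(ETH_P_IPV6);
            }
            return XDP_PASS;
    }

    char _license[] SEC("license") = "GPL";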

But, anyway, as I said, I wasn't aware that XDP was explicitly limited
to L2-only, as some kind of fundamental thing. This actually surprises
me. I always thought the original motivation of XDP was that it
allowed putting a lot of steering and modification logic into the
network card hardware, for fancy new cards that support eBPF. Later,
the generic handler got added on so people could reuse those programs
in heterogeneous environments, where some cards have hardware support
and others do not. That seemed like a good idea to me. Extending that
a step further for layer 3 devices seems like a logical next step, in
the sense that if that step is invalid, surely the previous one of
adding the generic handler must be invalid too? At least that's my
impression of the historical evolution of this; I'm confident you know
much better than I do.

It makes me wonder, though, what will you say when hardware comes out
that has layer 3 semantics and a thirst for eBPF? Deny XDP to that
hardware too, because "XDP is a layer 2 packet processing
technology"? I know what you'll say now: "we don't design our
networking stack around hypothetical hardware, so why even bring it
up? I won't entertain that." But nevertheless, contemplating those
hypotheticals might be a good exercise for seeing how committed you
are to the XDP=L2-only assertion. For example, maybe there are
fundamental L2 semantics that XDP needs, and can't be emulated with my
pretend ethernet header -- like if you really are relying on the MACs
for something I'm not aware of; were that the case, it'd be compelling
to me. But if it's a bit weaker, in the form of, "we just don't want
to try anything with L3 at all because," then I admit I'm still a bit
mystified.

Nevertheless, at the risk of irritating you further, I will willingly
drop this patchset at your request, even though I don't completely
understand the entire scope of reasoning for doing so. (For posterity,
I just posted a v6, which splits out that other bug fix into something
separate for you to take, so that this one here can exist on its own.)

Jason
Patch

diff --git a/net/core/dev.c b/net/core/dev.c
index 7df6c9617321..e6403e74a6aa 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4591,12 +4591,12 @@  static u32 netif_receive_generic_xdp(struct sk_buff *skb,
 				     struct xdp_buff *xdp,
 				     struct bpf_prog *xdp_prog)
 {
+	bool orig_bcast, skb_is_l3 = false;
 	struct netdev_rx_queue *rxqueue;
 	void *orig_data, *orig_data_end;
 	u32 metalen, act = XDP_DROP;
 	__be16 orig_eth_type;
 	struct ethhdr *eth;
-	bool orig_bcast;
 	int hlen, off;
 	u32 mac_len;
 
@@ -4630,6 +4630,13 @@  static u32 netif_receive_generic_xdp(struct sk_buff *skb,
 	 * header.
 	 */
 	mac_len = skb->data - skb_mac_header(skb);
+	skb_is_l3 = !mac_len && !skb->dev->hard_header_len;
+	if (skb_is_l3) {
+		eth = skb_push(skb, sizeof(struct ethhdr));
+		eth_zero_addr(eth->h_source);
+		eth_zero_addr(eth->h_dest);
+		eth->h_proto = skb->protocol;
+	}
 	hlen = skb_headlen(skb) + mac_len;
 	xdp->data = skb->data - mac_len;
 	xdp->data_meta = xdp->data;
@@ -4684,6 +4691,8 @@  static u32 netif_receive_generic_xdp(struct sk_buff *skb,
 		__skb_push(skb, mac_len);
 		break;
 	case XDP_PASS:
+		if (skb_is_l3)
+			skb_pull(skb, sizeof(struct ethhdr));
 		metalen = xdp->data - xdp->data_meta;
 		if (metalen)
 			skb_metadata_set(skb, metalen);