Message ID | 03ee62602dd7b7101f78e0802249a6e2e4c10b7f.camel@infradead.org |
---|---|
State | Superseded |
Headers | show |
Series | net: tun: fix tun_xdp_one() for IFF_TUN mode | expand |
在 2021/6/19 下午9:33, David Woodhouse 写道: > From: David Woodhouse <dwmw@amazon.co.uk> > > In tun_get_user(), skb->protocol is either taken from the tun_pi header > or inferred from the first byte of the packet in IFF_TUN mode, while > eth_type_trans() is called only in the IFF_TAP mode where the payload > is expected to be an Ethernet frame. > > The alternative path in tun_xdp_one() was unconditionally using > eth_type_trans(), which corrupts packets in IFF_TUN mode. Fix it to > do the correct thing for IFF_TUN mode, as tun_get_user() does. > > Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> > Fixes: 043d222f93ab ("tuntap: accept an array of XDP buffs through sendmsg()") > --- > How is my userspace application going to know that the kernel has this > fix? Should we add a flag to TUN_FEATURES to show that vhost-net in > *IFF_TUN* mode is supported? I think it's probably too late to fix? Since it should work before 043d222f93ab. The only way is to backport this fix to stable. > > drivers/net/tun.c | 44 +++++++++++++++++++++++++++++++++++++++++++- > 1 file changed, 43 insertions(+), 1 deletion(-) > > diff --git a/drivers/net/tun.c b/drivers/net/tun.c > index 4cf38be26dc9..f812dcdc640e 100644 > --- a/drivers/net/tun.c > +++ b/drivers/net/tun.c > @@ -2394,8 +2394,50 @@ static int tun_xdp_one(struct tun_struct *tun, > err = -EINVAL; > goto out; > } > + switch (tun->flags & TUN_TYPE_MASK) { > + case IFF_TUN: > + if (tun->flags & IFF_NO_PI) { > + u8 ip_version = skb->len ? (skb->data[0] >> 4) : 0; > + > + switch (ip_version) { > + case 4: > + skb->protocol = htons(ETH_P_IP); > + break; > + case 6: > + skb->protocol = htons(ETH_P_IPV6); > + break; > + default: > + atomic_long_inc(&tun->dev->rx_dropped); > + kfree_skb(skb); > + err = -EINVAL; > + goto out; > + } > + } else { > + struct tun_pi *pi = (struct tun_pi *)skb->data; > + if (!pskb_may_pull(skb, sizeof(*pi))) { > + atomic_long_inc(&tun->dev->rx_dropped); > + kfree_skb(skb); > + err = -ENOMEM; > + goto out; > + } > + skb_pull_inline(skb, sizeof(*pi)); > + skb->protocol = pi->proto; > + } > + > + skb_reset_mac_header(skb); > + skb->dev = tun->dev; > + break; > + case IFF_TAP: > + if (!pskb_may_pull(skb, ETH_HLEN)) { > + atomic_long_inc(&tun->dev->rx_dropped); > + kfree_skb(skb); > + err = -ENOMEM; > + goto out; > + } > + skb->protocol = eth_type_trans(skb, tun->dev); > + break; I wonder whether we can have some codes unification with tun_get_user(). Thanks > + } > > - skb->protocol = eth_type_trans(skb, tun->dev); > skb_reset_network_header(skb); > skb_probe_transport_header(skb); > skb_record_rx_queue(skb, tfile->queue_index);
On Mon, 2021-06-21 at 15:00 +0800, Jason Wang wrote: > I think it's probably too late to fix? Since it should work before > 043d222f93ab. > > The only way is to backport this fix to stable. Yeah, I assumed the fix would be backported; if not then the "does the kernel have it" check is fairly trivial. I *can* avoid it for now by just using TUNSNDBUF to reduce the sndbuf and then we never take the XDP path at all. My initial crappy hacks are slowly turning into something that I might actually want to commit to mainline (once I've fixed endianness and memory ordering issues): https://gitlab.com/openconnect/openconnect/-/compare/master...vhost I have a couple of remaining problems using vhost-net directly from userspace though. Firstly, I don't think I can set IFF_VNET_HDR on the tun device after opening it. So my model of "open the tun device, then *see* if we can use vhost to accelerate it" doesn't work. I tried setting VHOST_NET_F_VIRTIO_NET_HDR in the vhost features instead, but that gives me a weird failure mode where it drops around half the incoming packets, and I haven't yet worked out why. Of course I don't *actually* want a vnet header at all but the vhost code really assumes that *someone* will add one; if I *don't* set VHOST_NET_F_VIRTIO_NET_HDR then it always *assumes* it can read ten bytes more from the tun socket than the 'peek' says, and barfs when it can't. (Or such was my initial half-thought-through diagnosis before I made it go away by setting IFF_VNET_HDR, at least). Secondly, I need to pull numbers out of my posterior for the VHOST_SET_MEM_TABLE call. This works for x86_64: vmem->nregions = 1; vmem->regions[0].guest_phys_addr = 4096; vmem->regions[0].memory_size = 0x7fffffffe000; vmem->regions[0].userspace_addr = 4096; if (ioctl(vpninfo->vhost_fd, VHOST_SET_MEM_TABLE, vmem) < 0) { Is there a way to bypass that and just unconditionally set a 1:1 mapping of *all* userspace address space? It's possible that one or the other of those problems will result in a new advertised "feature" which is so simple (like a 1:1 map) that we can call it a bugfix and backport it along with the tun fix I already posted, and the presence of *that* can indicate that the tun bug is fixed :)
On Mon, 2021-06-21 at 11:52 +0100, David Woodhouse wrote: > > Firstly, I don't think I can set IFF_VNET_HDR on the tun device after > opening it. So my model of "open the tun device, then *see* if we can > use vhost to accelerate it" doesn't work. > > I tried setting VHOST_NET_F_VIRTIO_NET_HDR in the vhost features > instead, but that gives me a weird failure mode where it drops around > half the incoming packets, and I haven't yet worked out why. FWIW that problem also goes away if I set TUNSNDBUF and avoid the XDP data path.
On Mon, 2021-06-21 at 15:50 +0100, David Woodhouse wrote: > On Mon, 2021-06-21 at 11:52 +0100, David Woodhouse wrote: > > > > Firstly, I don't think I can set IFF_VNET_HDR on the tun device after > > opening it. So my model of "open the tun device, then *see* if we can > > use vhost to accelerate it" doesn't work. > > > > I tried setting VHOST_NET_F_VIRTIO_NET_HDR in the vhost features > > instead, but that gives me a weird failure mode where it drops around > > half the incoming packets, and I haven't yet worked out why. > > FWIW that problem also goes away if I set TUNSNDBUF and avoid the XDP > data path. Looks like there are two problems there. Firstly, vhost_net_build_xdp() doesn't cope well with sock_hlen being zero. It reads those zero bytes into its buffer, then points 'gso' at the buffer with no valid data in it, and checks gso->flags for the NEEDS_CSUM flag. Secondly, tun_xdp_one() doesn't cope with receiving packets without the virtio header either. While tun_get_user() correctly checks IFF_VNET_HDR, tun_xdp_one() does not, and treats the start of my IP packets as if they were a virtio_net_hdr. I'll look at turning my code into a test case for kernel selftests.
在 2021/6/21 下午6:52, David Woodhouse 写道: > On Mon, 2021-06-21 at 15:00 +0800, Jason Wang wrote: >> I think it's probably too late to fix? Since it should work before >> 043d222f93ab. >> >> The only way is to backport this fix to stable. > Yeah, I assumed the fix would be backported; if not then the "does the > kernel have it" check is fairly trivial. > > I *can* avoid it for now by just using TUNSNDBUF to reduce the sndbuf > and then we never take the XDP path at all. > > My initial crappy hacks are slowly turning into something that I might > actually want to commit to mainline (once I've fixed endianness and > memory ordering issues): > https://gitlab.com/openconnect/openconnect/-/compare/master...vhost > > I have a couple of remaining problems using vhost-net directly from > userspace though. > > Firstly, I don't think I can set IFF_VNET_HDR on the tun device after > opening it. So my model of "open the tun device, then *see* if we can > use vhost to accelerate it" doesn't work. Yes, IFF_VNET_HDR is set during TUN_SET_IFF which can't be changed afterwards. > > I tried setting VHOST_NET_F_VIRTIO_NET_HDR in the vhost features > instead, but that gives me a weird failure mode where it drops around > half the incoming packets, and I haven't yet worked out why. > > Of course I don't *actually* want a vnet header at all but the vhost > code really assumes that *someone* will add one; if I *don't* set > VHOST_NET_F_VIRTIO_NET_HDR then it always *assumes* it can read ten > bytes more from the tun socket than the 'peek' says, and barfs when it > can't. (Or such was my initial half-thought-through diagnosis before I > made it go away by setting IFF_VNET_HDR, at least). Yes, vhost always assumes there's a vnet header. > > > Secondly, I need to pull numbers out of my posterior for the > VHOST_SET_MEM_TABLE call. This works for x86_64: > > vmem->nregions = 1; > vmem->regions[0].guest_phys_addr = 4096; > vmem->regions[0].memory_size = 0x7fffffffe000; > vmem->regions[0].userspace_addr = 4096; > if (ioctl(vpninfo->vhost_fd, VHOST_SET_MEM_TABLE, vmem) < 0) { > > Is there a way to bypass that and just unconditionally set a 1:1 > mapping of *all* userspace address space? Memory Table is one of the basic abstraction of the vhost. Basically, you only need to map the userspace buffers. This is how DPDK virtio-user PMD did. Vhost will validate the addresses through access_ok() during VHOST_SET_MEM_TABLE. The range of all usersapce space seems architecture specific, I'm not sure if it's worth to bother. Thanks > > > > It's possible that one or the other of those problems will result in a > new advertised "feature" which is so simple (like a 1:1 map) that we > can call it a bugfix and backport it along with the tun fix I already > posted, and the presence of *that* can indicate that the tun bug is > fixed :)
在 2021/6/21 下午10:50, David Woodhouse 写道: > On Mon, 2021-06-21 at 11:52 +0100, David Woodhouse wrote: >> Firstly, I don't think I can set IFF_VNET_HDR on the tun device after >> opening it. So my model of "open the tun device, then *see* if we can >> use vhost to accelerate it" doesn't work. >> >> I tried setting VHOST_NET_F_VIRTIO_NET_HDR in the vhost features >> instead, but that gives me a weird failure mode where it drops around >> half the incoming packets, and I haven't yet worked out why. > FWIW that problem also goes away if I set TUNSNDBUF and avoid the XDP > data path. That looks a workaround. Thanks
在 2021/6/22 上午4:43, David Woodhouse 写道: > On Mon, 2021-06-21 at 15:50 +0100, David Woodhouse wrote: >> On Mon, 2021-06-21 at 11:52 +0100, David Woodhouse wrote: >>> Firstly, I don't think I can set IFF_VNET_HDR on the tun device after >>> opening it. So my model of "open the tun device, then *see* if we can >>> use vhost to accelerate it" doesn't work. >>> >>> I tried setting VHOST_NET_F_VIRTIO_NET_HDR in the vhost features >>> instead, but that gives me a weird failure mode where it drops around >>> half the incoming packets, and I haven't yet worked out why. >> FWIW that problem also goes away if I set TUNSNDBUF and avoid the XDP >> data path. > Looks like there are two problems there. > > Firstly, vhost_net_build_xdp() doesn't cope well with sock_hlen being > zero. It reads those zero bytes into its buffer, then points 'gso' at > the buffer with no valid data in it, and checks gso->flags for the > NEEDS_CSUM flag. > > Secondly, tun_xdp_one() doesn't cope with receiving packets without the > virtio header either. While tun_get_user() correctly checks > IFF_VNET_HDR, tun_xdp_one() does not, and treats the start of my IP > packets as if they were a virtio_net_hdr. > > I'll look at turning my code into a test case for kernel selftests. I cook two patches. Please see and check if they fix the problem. (compile test only for me). Thanks > > From 5d785027da87e40138407d76841f5cdd64914541 Mon Sep 17 00:00:00 2001 From: Jason Wang <jasowang@redhat.com> Date: Tue, 22 Jun 2021 12:07:59 +0800 Subject: [PATCH 1/2] vhost_net: validate gso metadata only if socket has vnet header When sock_hlen is zero, there's no need to validate the socket vnet header since it doesn't support that. Signed-off-by: Jason Wang <jasowang@redhat.com> --- drivers/vhost/net.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index df82b124170e..5034c4949bc4 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -725,7 +725,7 @@ static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq, hdr = buf; gso = &hdr->gso; - if ((gso->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) && + if (nvq->sock_hlen && (gso->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) && vhost16_to_cpu(vq, gso->csum_start) + vhost16_to_cpu(vq, gso->csum_offset) + 2 > vhost16_to_cpu(vq, gso->hdr_len)) {
On Tue, 2021-06-22 at 12:52 +0800, Jason Wang wrote: > > > I cook two patches. Please see and check if they fix the problem. > (compile test only for me). I did the second one slightly differently (below) but those are what I came up with too, which seems to be working. @@ -2331,7 +2344,7 @@ static int tun_xdp_one(struct tun_struct *tun, { unsigned int datasize = xdp->data_end - xdp->data; struct tun_xdp_hdr *hdr = xdp->data_hard_start; - struct virtio_net_hdr *gso = &hdr->gso; + struct virtio_net_hdr *gso = NULL; struct bpf_prog *xdp_prog; struct sk_buff *skb = NULL; u32 rxhash = 0, act; @@ -2340,9 +2353,12 @@ static int tun_xdp_one(struct tun_struct *tun, bool skb_xdp = false; struct page *page; + if (tun->flags & IFF_VNET_HDR) + gso = &hdr->gso; + xdp_prog = rcu_dereference(tun->xdp_prog); if (xdp_prog) { - if (gso->gso_type) { + if (gso && gso->gso_type) { skb_xdp = true; goto build; } @@ -2388,14 +2406,18 @@ static int tun_xdp_one(struct tun_struct *tun, skb_reserve(skb, xdp->data - xdp->data_hard_start); skb_put(skb, xdp->data_end - xdp->data); - if (virtio_net_hdr_to_skb(skb, gso, tun_is_little_endian(tun))) { + if (!gso) + skb_reset_mac_header(skb); + else if (virtio_net_hdr_to_skb(skb, gso, tun_is_little_endian(tun))) { atomic_long_inc(&tun->rx_frame_errors); kfree_skb(skb);
On Tue, 2021-06-22 at 12:34 +0800, Jason Wang wrote: > > > > Secondly, I need to pull numbers out of my posterior for the > > VHOST_SET_MEM_TABLE call. This works for x86_64: > > > > vmem->nregions = 1; > > vmem->regions[0].guest_phys_addr = 4096; > > vmem->regions[0].memory_size = 0x7fffffffe000; > > vmem->regions[0].userspace_addr = 4096; > > if (ioctl(vpninfo->vhost_fd, VHOST_SET_MEM_TABLE, vmem) < 0) { > > > > Is there a way to bypass that and just unconditionally set a 1:1 > > mapping of *all* userspace address space? > > > Memory Table is one of the basic abstraction of the vhost. Basically, > you only need to map the userspace buffers. This is how DPDK virtio-user > PMD did. Vhost will validate the addresses through access_ok() during > VHOST_SET_MEM_TABLE. > > The range of all usersapce space seems architecture specific, I'm not > sure if it's worth to bother. The buffers are just malloc'd. I just need a full 1:1 mapping of all "guest" memory to userspace addresses, and was trying to avoid having to map them on demand *just* because I don't know the full range of possible addresses that malloc will return, in advance. I'm tempted to add a new feature for that 1:1 access, with no ->umem or ->iotlb at all. And then I can use it as a key to know that the XDP bugs are fixed too :)
在 2021/6/22 下午3:24, David Woodhouse 写道: > On Tue, 2021-06-22 at 12:52 +0800, Jason Wang wrote: >> >> I cook two patches. Please see and check if they fix the problem. >> (compile test only for me). > I did the second one slightly differently (below) but those are what I > came up with too, which seems to be working. > > @@ -2331,7 +2344,7 @@ static int tun_xdp_one(struct tun_struct *tun, > { > unsigned int datasize = xdp->data_end - xdp->data; > struct tun_xdp_hdr *hdr = xdp->data_hard_start; > - struct virtio_net_hdr *gso = &hdr->gso; > + struct virtio_net_hdr *gso = NULL; > struct bpf_prog *xdp_prog; > struct sk_buff *skb = NULL; > u32 rxhash = 0, act; > @@ -2340,9 +2353,12 @@ static int tun_xdp_one(struct tun_struct *tun, > bool skb_xdp = false; > struct page *page; > > + if (tun->flags & IFF_VNET_HDR) > + gso = &hdr->gso; > + > xdp_prog = rcu_dereference(tun->xdp_prog); > if (xdp_prog) { > - if (gso->gso_type) { > + if (gso && gso->gso_type) { > skb_xdp = true; > goto build; > } > @@ -2388,14 +2406,18 @@ static int tun_xdp_one(struct tun_struct *tun, > skb_reserve(skb, xdp->data - xdp->data_hard_start); > skb_put(skb, xdp->data_end - xdp->data); > > - if (virtio_net_hdr_to_skb(skb, gso, tun_is_little_endian(tun))) { > + if (!gso) > + skb_reset_mac_header(skb); > + else if (virtio_net_hdr_to_skb(skb, gso, tun_is_little_endian(tun))) { > atomic_long_inc(&tun->rx_frame_errors); > kfree_skb(skb); This should work as well. Thanks
在 2021/6/22 下午3:28, David Woodhouse 写道: > On Tue, 2021-06-22 at 12:34 +0800, Jason Wang wrote: >>> Secondly, I need to pull numbers out of my posterior for the >>> VHOST_SET_MEM_TABLE call. This works for x86_64: >>> >>> vmem->nregions = 1; >>> vmem->regions[0].guest_phys_addr = 4096; >>> vmem->regions[0].memory_size = 0x7fffffffe000; >>> vmem->regions[0].userspace_addr = 4096; >>> if (ioctl(vpninfo->vhost_fd, VHOST_SET_MEM_TABLE, vmem) < 0) { >>> >>> Is there a way to bypass that and just unconditionally set a 1:1 >>> mapping of *all* userspace address space? >> >> Memory Table is one of the basic abstraction of the vhost. Basically, >> you only need to map the userspace buffers. This is how DPDK virtio-user >> PMD did. Vhost will validate the addresses through access_ok() during >> VHOST_SET_MEM_TABLE. >> >> The range of all usersapce space seems architecture specific, I'm not >> sure if it's worth to bother. > The buffers are just malloc'd. I just need a full 1:1 mapping of all > "guest" memory to userspace addresses, and was trying to avoid having > to map them on demand *just* because I don't know the full range of > possible addresses that malloc will return, in advance. > > I'm tempted to add a new feature for that 1:1 access, with no ->umem or > ->iotlb at all. And then I can use it as a key to know that the XDP > bugs are fixed too :) This means we need validate the userspace address each time before vhost tries to use that. This will de-gradate the performance. So we still need to figure out the legal userspace address range which might not be easy. Thanks >
On 22 June 2021 08:51:43 BST, Jason Wang <jasowang@redhat.com> wrote: > >在 2021/6/22 下午3:24, David Woodhouse 写道: >> On Tue, 2021-06-22 at 12:52 +0800, Jason Wang wrote: >>> >>> I cook two patches. Please see and check if they fix the problem. >>> (compile test only for me). >> I did the second one slightly differently (below) but those are what >I >> came up with too, which seems to be working. >> >> @@ -2331,7 +2344,7 @@ static int tun_xdp_one(struct tun_struct *tun, >> { >> unsigned int datasize = xdp->data_end - xdp->data; >> struct tun_xdp_hdr *hdr = xdp->data_hard_start; >> - struct virtio_net_hdr *gso = &hdr->gso; >> + struct virtio_net_hdr *gso = NULL; >> struct bpf_prog *xdp_prog; >> struct sk_buff *skb = NULL; >> u32 rxhash = 0, act; >> @@ -2340,9 +2353,12 @@ static int tun_xdp_one(struct tun_struct *tun, >> bool skb_xdp = false; >> struct page *page; >> >> + if (tun->flags & IFF_VNET_HDR) >> + gso = &hdr->gso; >> + >> xdp_prog = rcu_dereference(tun->xdp_prog); >> if (xdp_prog) { >> - if (gso->gso_type) { >> + if (gso && gso->gso_type) { >> skb_xdp = true; >> goto build; >> } >> @@ -2388,14 +2406,18 @@ static int tun_xdp_one(struct tun_struct >*tun, >> skb_reserve(skb, xdp->data - xdp->data_hard_start); >> skb_put(skb, xdp->data_end - xdp->data); >> >> - if (virtio_net_hdr_to_skb(skb, gso, >tun_is_little_endian(tun))) { >> + if (!gso) >> + skb_reset_mac_header(skb); >> + else if (virtio_net_hdr_to_skb(skb, gso, >tun_is_little_endian(tun))) { >> atomic_long_inc(&tun->rx_frame_errors); >> kfree_skb(skb); > > >This should work as well. I'll rip out the rest of my debugging hacks and check that these two are sufficient, and that I hadn't accidentally papered over something else as I debugged it. Then I'll look at the case of having no virtio_net_hdr on either side, and also different sized headers on the tun device (why does it even support that?). And the test case, of course. -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Tue, 2021-06-22 at 16:00 +0800, Jason Wang wrote: > > I'm tempted to add a new feature for that 1:1 access, with no ->umem or > > ->iotlb at all. And then I can use it as a key to know that the XDP > > bugs are fixed too :) > > > This means we need validate the userspace address each time before vhost > tries to use that. This will de-gradate the performance. So we still > need to figure out the legal userspace address range which might not be > easy. Easier from the kernel than from userspace though :) But I don't know that we need it. Isn't a call to access_ok() going to be faster than what translate_desc() does to look things up anyway? In the 1:1 mode, the access_ok() is all that's needed since there's no translation. @@ -2038,6 +2065,14 @@ static int translate_desc(struct vhost_virtqueue *vq, u64 addr, u32 len, u64 s = 0; int ret = 0; + if (vhost_has_feature(vq, VHOST_F_IDENTITY_MAPPING)) { + if (!access_ok((void __user *)addr, len)) + return -EFAULT; + + iov[0].iov_len = len; + iov[0].iov_base = (void __user *)addr; + return 1; + } while ((u64)len > s) { u64 size; if (unlikely(ret >= iov_size)) {
On Tue, 2021-06-22 at 15:51 +0800, Jason Wang wrote: > > @@ -2388,14 +2406,18 @@ static int tun_xdp_one(struct tun_struct *tun, > > skb_reserve(skb, xdp->data - xdp->data_hard_start); > > skb_put(skb, xdp->data_end - xdp->data); > > > > - if (virtio_net_hdr_to_skb(skb, gso, tun_is_little_endian(tun))) { > > + if (!gso) > > + skb_reset_mac_header(skb); > > + else if (virtio_net_hdr_to_skb(skb, gso, tun_is_little_endian(tun))) { > > atomic_long_inc(&tun->rx_frame_errors); > > kfree_skb(skb); > > > This should work as well. Actually there's no need for the skb_reset_mac_header() as that's going to happen anyway just a few lines down — either from eth_type_trans() or explicitly in the IFF_TUN code path that I added in my first patch. I also stripped out a lot more code from vhost_net_build_xdp() in the !sock_hlen case where it was pointless. https://git.infradead.org/users/dwmw2/linux.git/shortlog/vhost-net I'll repost the whole series after I take look at whether I can get it to work without *either* side doing the vnet header, and add a test case based on my userspace code in https://gitlab.com/openconnect/openconnect/-/blob/vhost/vhost.c
在 2021/6/22 下午4:29, David Woodhouse 写道: > On Tue, 2021-06-22 at 16:00 +0800, Jason Wang wrote: >>> I'm tempted to add a new feature for that 1:1 access, with no ->umem or >>> ->iotlb at all. And then I can use it as a key to know that the XDP >>> bugs are fixed too :) >> >> This means we need validate the userspace address each time before vhost >> tries to use that. This will de-gradate the performance. So we still >> need to figure out the legal userspace address range which might not be >> easy. > Easier from the kernel than from userspace though :) Yes. > > But I don't know that we need it. Isn't a call to access_ok() going to > be faster than what translate_desc() does to look things up anyway? Right. > > In the 1:1 mode, the access_ok() is all that's needed since there's no > translation. > > @@ -2038,6 +2065,14 @@ static int translate_desc(struct vhost_virtqueue *vq, u64 addr, u32 len, > u64 s = 0; > int ret = 0; > > + if (vhost_has_feature(vq, VHOST_F_IDENTITY_MAPPING)) { Using vhost_has_feature() is kind of tricky since it's used for virtio feature negotiation. We probably need to use backend_features instead. I think we should probably do more: 1) forbid the feature to be set when mem table / IOTLB has at least one mapping 2) forbid the mem table / IOTLB updating after the feature is set Thanks > + if (!access_ok((void __user *)addr, len)) > + return -EFAULT; > + > + iov[0].iov_len = len; > + iov[0].iov_base = (void __user *)addr; > + return 1; > + } > while ((u64)len > s) { > u64 size; > if (unlikely(ret >= iov_size)) {
On Wed, 2021-06-23 at 11:39 +0800, Jason Wang wrote: > > > > In the 1:1 mode, the access_ok() is all that's needed since there's no > > translation. > > > > @@ -2038,6 +2065,14 @@ static int translate_desc(struct vhost_virtqueue *vq, u64 addr, u32 len, > > u64 s = 0; > > int ret = 0; > > > > + if (vhost_has_feature(vq, VHOST_F_IDENTITY_MAPPING)) { > > > Using vhost_has_feature() is kind of tricky since it's used for virtio > feature negotiation. > > We probably need to use backend_features instead. > > I think we should probably do more: > > 1) forbid the feature to be set when mem table / IOTLB has at least one > mapping > 2) forbid the mem table / IOTLB updating after the feature is set Yes, that all makes sense. I confess I hadn't actually *implemented* the feature at all; the only time I'd typed 'VHOST_F_IDENTITY_MAPPING' was to show that snippet of "patch" as an example of what translate_desc() would do. I just wanted *something* to put in the 'if()' statement as a placeholder, so I used that. It could *even* have just been 'if (!umem)' but that might let it happen by *accident* which probably isn't a good idea.
在 2021/6/24 下午8:30, David Woodhouse 写道: > From: David Woodhouse <dwmw@amazon.co.uk> > > The vhost-net driver was making wild assumptions about the header length > of the underlying tun/tap socket. It's by design to depend on the userspace to co-ordinate the vnet header setting with the underlying sockets. > Then it was discarding packets if > the number of bytes it got from sock_recvmsg() didn't precisely match > its guess. Anything that is broken by this? The failure is a hint for the userspace that something is wrong during the coordination. > > Fix it to get the correct information along with the socket itself. I'm not sure what is fixed by this. It looks to me it tires to let packet go even if the userspace set the wrong attributes to tap or vhost. This is even sub-optimal than failing explicitly fail the RX. > As a side-effect, this means that tun_get_socket() won't work if the > tun file isn't actually connected to a device, since there's no 'tun' > yet in that case to get the information from. This may break the existing application. Vhost-net is tied to the socket instead of the device that the socket is loosely coupled. > > On the receive side, where the tun device generates the virtio_net_hdr > but VIRITO_NET_F_MSG_RXBUF was negotiated and vhost-net needs to fill > in the 'num_buffers' field on top of the existing virtio_net_hdr, fix > that to use 'sock_hlen - 2' as the location, which means that it goes > in the right place regardless of whether the tun device is using an > additional tun_pi header or not. In this case, the user should have > configured the tun device with a vnet hdr size of 12, to make room. > > Fixes: 8dd014adfea6f ("vhost-net: mergeable buffers support") > Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> > --- > drivers/net/tap.c | 5 ++++- > drivers/net/tun.c | 16 +++++++++++++++- > drivers/vhost/net.c | 31 +++++++++++++++---------------- > include/linux/if_tap.h | 4 ++-- > include/linux/if_tun.h | 4 ++-- > 5 files changed, 38 insertions(+), 22 deletions(-) > > diff --git a/drivers/net/tap.c b/drivers/net/tap.c > index 8e3a28ba6b28..2170a0d3d34c 100644 > --- a/drivers/net/tap.c > +++ b/drivers/net/tap.c > @@ -1246,7 +1246,7 @@ static const struct proto_ops tap_socket_ops = { > * attached to a device. The returned object works like a packet socket, it > * can be used for sock_sendmsg/sock_recvmsg. The caller is responsible for > * holding a reference to the file for as long as the socket is in use. */ > -struct socket *tap_get_socket(struct file *file) > +struct socket *tap_get_socket(struct file *file, size_t *hlen) > { > struct tap_queue *q; > if (file->f_op != &tap_fops) > @@ -1254,6 +1254,9 @@ struct socket *tap_get_socket(struct file *file) > q = file->private_data; > if (!q) > return ERR_PTR(-EBADFD); > + if (hlen) > + *hlen = (q->flags & IFF_VNET_HDR) ? q->vnet_hdr_sz : 0; > + > return &q->sock; > } > EXPORT_SYMBOL_GPL(tap_get_socket); > diff --git a/drivers/net/tun.c b/drivers/net/tun.c > index 4cf38be26dc9..67b406fa0881 100644 > --- a/drivers/net/tun.c > +++ b/drivers/net/tun.c > @@ -3649,7 +3649,7 @@ static void tun_cleanup(void) > * attached to a device. The returned object works like a packet socket, it > * can be used for sock_sendmsg/sock_recvmsg. The caller is responsible for > * holding a reference to the file for as long as the socket is in use. */ > -struct socket *tun_get_socket(struct file *file) > +struct socket *tun_get_socket(struct file *file, size_t *hlen) > { > struct tun_file *tfile; > if (file->f_op != &tun_fops) > @@ -3657,6 +3657,20 @@ struct socket *tun_get_socket(struct file *file) > tfile = file->private_data; > if (!tfile) > return ERR_PTR(-EBADFD); > + > + if (hlen) { > + struct tun_struct *tun = tun_get(tfile); > + size_t len = 0; > + > + if (!tun) > + return ERR_PTR(-ENOTCONN); > + if (tun->flags & IFF_VNET_HDR) > + len += READ_ONCE(tun->vnet_hdr_sz); > + if (!(tun->flags & IFF_NO_PI)) > + len += sizeof(struct tun_pi); > + tun_put(tun); > + *hlen = len; > + } > return &tfile->socket; > } > EXPORT_SYMBOL_GPL(tun_get_socket); > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c > index df82b124170e..b92a7144ed90 100644 > --- a/drivers/vhost/net.c > +++ b/drivers/vhost/net.c > @@ -1143,7 +1143,8 @@ static void handle_rx(struct vhost_net *net) > > vq_log = unlikely(vhost_has_feature(vq, VHOST_F_LOG_ALL)) ? > vq->log : NULL; > - mergeable = vhost_has_feature(vq, VIRTIO_NET_F_MRG_RXBUF); > + mergeable = vhost_has_feature(vq, VIRTIO_NET_F_MRG_RXBUF) && > + (vhost_hlen || sock_hlen >= sizeof(num_buffers)); So this change the behavior. When mergeable buffer is enabled, userspace expects the vhost to merge buffers. If the feature is disabled silently, it violates virtio spec. If anything wrong in the setup, userspace just breaks itself. E.g if sock_hlen is less that struct virtio_net_hdr_mrg_buf. The packet header might be overwrote by the vnet header. > > do { > sock_len = vhost_net_rx_peek_head_len(net, sock->sk, > @@ -1213,9 +1214,10 @@ static void handle_rx(struct vhost_net *net) > } > } else { > /* Header came from socket; we'll need to patch > - * ->num_buffers over if VIRTIO_NET_F_MRG_RXBUF > + * ->num_buffers over the last two bytes if > + * VIRTIO_NET_F_MRG_RXBUF is enabled. > */ > - iov_iter_advance(&fixup, sizeof(hdr)); > + iov_iter_advance(&fixup, sock_hlen - 2); I'm not sure what did the above code want to fix. It doesn't change anything if vnet header is set correctly in TUN. It only prevents the the packet header from being rewrote. Thanks
在 2021/6/24 下午8:30, David Woodhouse 写道: > From: David Woodhouse <dwmw@amazon.co.uk> > > Sometimes it's just a data packet. The virtio_net_hdr processing should be > conditional on IFF_VNET_HDR, just as it is in tun_get_user(). > > Fixes: 043d222f93ab ("tuntap: accept an array of XDP buffs through sendmsg()") > Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Acked-by: Jason Wang <jasowang@redhat.com> > --- > drivers/net/tun.c | 9 ++++++--- > 1 file changed, 6 insertions(+), 3 deletions(-) > > diff --git a/drivers/net/tun.c b/drivers/net/tun.c > index 67b406fa0881..9acd448e6dfc 100644 > --- a/drivers/net/tun.c > +++ b/drivers/net/tun.c > @@ -2331,7 +2331,7 @@ static int tun_xdp_one(struct tun_struct *tun, > { > unsigned int datasize = xdp->data_end - xdp->data; > struct tun_xdp_hdr *hdr = xdp->data_hard_start; > - struct virtio_net_hdr *gso = &hdr->gso; > + struct virtio_net_hdr *gso = NULL; > struct bpf_prog *xdp_prog; > struct sk_buff *skb = NULL; > u32 rxhash = 0, act; > @@ -2340,9 +2340,12 @@ static int tun_xdp_one(struct tun_struct *tun, > bool skb_xdp = false; > struct page *page; > > + if (tun->flags & IFF_VNET_HDR) > + gso = &hdr->gso; > + > xdp_prog = rcu_dereference(tun->xdp_prog); > if (xdp_prog) { > - if (gso->gso_type) { > + if (gso && gso->gso_type) { > skb_xdp = true; > goto build; > } > @@ -2388,7 +2391,7 @@ static int tun_xdp_one(struct tun_struct *tun, > skb_reserve(skb, xdp->data - xdp->data_hard_start); > skb_put(skb, xdp->data_end - xdp->data); > > - if (virtio_net_hdr_to_skb(skb, gso, tun_is_little_endian(tun))) { > + if (gso && virtio_net_hdr_to_skb(skb, gso, tun_is_little_endian(tun))) { > atomic_long_inc(&tun->rx_frame_errors); > kfree_skb(skb); > err = -EINVAL;
On Fri, 2021-06-25 at 13:00 +0800, Jason Wang wrote: > 在 2021/6/24 下午8:30, David Woodhouse 写道: > > From: David Woodhouse <dwmw@amazon.co.uk> > > > > The vhost-net driver was making wild assumptions about the header length > > of the underlying tun/tap socket. > > > It's by design to depend on the userspace to co-ordinate the vnet header > setting with the underlying sockets. > > > > Then it was discarding packets if > > the number of bytes it got from sock_recvmsg() didn't precisely match > > its guess. > > > Anything that is broken by this? The failure is a hint for the userspace > that something is wrong during the coordination. I am not a fan of this approach. I firmly believe that for a given configuration, the kernel should either *work* or it should gracefully refuse to set it up that way. And the requirements should be clearly documented. Having been on the receiving end of this "hint" of which you speak, I found it distinctly suboptimal as a user interface. I was left scrabbling around trying to find a set of options which *would* work, and it was only through debugging the kernel that I managed to work out that I: • MUST set IFF_NO_PI • MUST use TUNSETSNDBUF to reduce the sndbuf from INT_MAX • MUST use a virtio_net_hdr that I don't want If my application failed to do any of those things, I got a silent failure to transport any packets. The only thing I could do *without* debugging the kernel was tcpdump on the 'tun0' interface and see if the TX packets I put into the ring were even making it to the interface, and what they looked like if they did. (Losing the first 14 bytes and having the *next* 14 bytes scribbled on by an Ethernet header was a fun one.) > > > > Fix it to get the correct information along with the socket itself. > > > I'm not sure what is fixed by this. It looks to me it tires to let > packet go even if the userspace set the wrong attributes to tap or > vhost. This is even sub-optimal than failing explicitly fail the RX. I'm OK with explicit failure. But once I'd let it *get* the information from the underlying socket in order to decide whether it should fail or not, it turned out to be easy enough just to make those configs work anyway. The main case where that "easy enough" is stretched a little (IMO) was when there's a tun_pi header. I have one more of your emails to reply to after this, and I'll address that there. > > > As a side-effect, this means that tun_get_socket() won't work if the > > tun file isn't actually connected to a device, since there's no 'tun' > > yet in that case to get the information from. > > > This may break the existing application. Vhost-net is tied to the socket > instead of the device that the socket is loosely coupled. Hm. Perhaps the PI and vnet hdr should be considered an option of the *socket* (which is tied to the tfile), not purely an option of the underlying device? Or maybe it's sufficient just to get the flags from *either* tfile->tun or tfile->detached, so that it works when the queue is detached. I'll take a look. I suppose we could even have a fallback that makes stuff up like we do today. If the user attempts to attach a tun file descriptor to vhost without ever calling TUNSETIFF on it first, *then* we make the same assumptions we do today? > > --- a/drivers/vhost/net.c > > +++ b/drivers/vhost/net.c > > @@ -1143,7 +1143,8 @@ static void handle_rx(struct vhost_net *net) > > > > vq_log = unlikely(vhost_has_feature(vq, VHOST_F_LOG_ALL)) ? > > vq->log : NULL; > > - mergeable = vhost_has_feature(vq, VIRTIO_NET_F_MRG_RXBUF); > > + mergeable = vhost_has_feature(vq, VIRTIO_NET_F_MRG_RXBUF) && > > + (vhost_hlen || sock_hlen >= sizeof(num_buffers)); > > > So this change the behavior. When mergeable buffer is enabled, userspace > expects the vhost to merge buffers. If the feature is disabled silently, > it violates virtio spec. > > If anything wrong in the setup, userspace just breaks itself. > > E.g if sock_hlen is less that struct virtio_net_hdr_mrg_buf. The packet > header might be overwrote by the vnet header. This wasn't intended to change the behaviour of any code path that is already working today. If *either* vhost or the underlying device have provided a vnet header, we still merge. If *neither* provide a vnet hdr, there's nowhere to put num_buffers and we can't merge. That code path doesn't work at all today, but does after my patches. But you're right, we should explicitly refuse to negotiate VIRITO_NET_F_MSG_RXBUF in that case. > > > > > do { > > sock_len = vhost_net_rx_peek_head_len(net, sock->sk, > > @@ -1213,9 +1214,10 @@ static void handle_rx(struct vhost_net *net) > > } > > } else { > > /* Header came from socket; we'll need to patch > > - * ->num_buffers over if VIRTIO_NET_F_MRG_RXBUF > > + * ->num_buffers over the last two bytes if > > + * VIRTIO_NET_F_MRG_RXBUF is enabled. > > */ > > - iov_iter_advance(&fixup, sizeof(hdr)); > > + iov_iter_advance(&fixup, sock_hlen - 2); > > > I'm not sure what did the above code want to fix. It doesn't change > anything if vnet header is set correctly in TUN. It only prevents the > the packet header from being rewrote. > It fixes the case where the virtio_net_hdr isn't at the start of the tun header, because the tun actually puts the tun_pi struct *first*, and *then* the virtio_net_hdr. The num_buffers field needs to go at the *end* of sock_hlen. Not at a fixed offset from the *start* of it. At least, that's true unless we want to just declare that we *only* support TUN with the IFF_NO_PI flag. (qv).
On Thu, Jun 24, 2021 at 8:30 AM David Woodhouse <dwmw2@infradead.org> wrote: > > From: David Woodhouse <dwmw@amazon.co.uk> > > The vhost-net driver was making wild assumptions about the header length If respinning, please more concretely describe which configuration is currently broken. IFF_NO_PI + IFF_VNET_HDR, if I understand correctly. But I got that from the discussion, not the commit message. > of the underlying tun/tap socket. Then it was discarding packets if > the number of bytes it got from sock_recvmsg() didn't precisely match > its guess. > > Fix it to get the correct information along with the socket itself. > As a side-effect, this means that tun_get_socket() won't work if the > tun file isn't actually connected to a device, since there's no 'tun' > yet in that case to get the information from. > > On the receive side, where the tun device generates the virtio_net_hdr > but VIRITO_NET_F_MSG_RXBUF was negotiated and vhost-net needs to fill Nit: VIRTIO_NET_F_MSG_RXBUF > in the 'num_buffers' field on top of the existing virtio_net_hdr, fix > that to use 'sock_hlen - 2' as the location, which means that it goes Please use sizeof(hdr.num_buffers) instead of a raw constant 2, to self document the code. Should this be an independent one-line fix?
On Fri, 2021-06-25 at 14:13 -0400, Willem de Bruijn wrote: > On Thu, Jun 24, 2021 at 8:30 AM David Woodhouse <dwmw2@infradead.org> > wrote: > > > > From: David Woodhouse <dwmw@amazon.co.uk> > > > > The vhost-net driver was making wild assumptions about the header > > length > > If respinning, please more concretely describe which configuration is > currently broken. Fairly much all of them. Here's a test run on the 5.12.8 kernel: $ sudo ./test_vhost_net TEST: (hdr 0, xdp 0, pi 0, features 0) RESULT: -1 TEST: (hdr 10, xdp 0, pi 0, features 0) RESULT: 0 TEST: (hdr 12, xdp 0, pi 0, features 0) RESULT: -1 TEST: (hdr 20, xdp 0, pi 0, features 0) RESULT: -1 TEST: (hdr 0, xdp 1, pi 0, features 0) RESULT: -1 TEST: (hdr 10, xdp 1, pi 0, features 0) RESULT: -1 TEST: (hdr 12, xdp 1, pi 0, features 0) RESULT: -1 TEST: (hdr 20, xdp 1, pi 0, features 0) RESULT: -1 TEST: (hdr 0, xdp 0, pi 1, features 0) RESULT: -1 TEST: (hdr 10, xdp 0, pi 1, features 0) RESULT: -1 TEST: (hdr 12, xdp 0, pi 1, features 0) RESULT: -1 TEST: (hdr 20, xdp 0, pi 1, features 0) RESULT: -1 TEST: (hdr 0, xdp 1, pi 1, features 0) RESULT: -1 TEST: (hdr 10, xdp 1, pi 1, features 0) RESULT: -1 TEST: (hdr 12, xdp 1, pi 1, features 0) RESULT: -1 TEST: (hdr 20, xdp 1, pi 1, features 0) RESULT: -1 TEST: (hdr 0, xdp 0, pi 0, features 100000000) RESULT: -1 TEST: (hdr 10, xdp 0, pi 0, features 100000000) RESULT: -1 TEST: (hdr 12, xdp 0, pi 0, features 100000000) RESULT: 0 TEST: (hdr 20, xdp 0, pi 0, features 100000000) RESULT: -1 TEST: (hdr 0, xdp 1, pi 0, features 100000000) RESULT: -1 TEST: (hdr 10, xdp 1, pi 0, features 100000000) RESULT: -1 TEST: (hdr 12, xdp 1, pi 0, features 100000000) RESULT: -1 TEST: (hdr 20, xdp 1, pi 0, features 100000000) RESULT: -1 TEST: (hdr 0, xdp 0, pi 1, features 100000000) RESULT: -1 TEST: (hdr 10, xdp 0, pi 1, features 100000000) RESULT: -1 TEST: (hdr 12, xdp 0, pi 1, features 100000000) RESULT: -1 TEST: (hdr 20, xdp 0, pi 1, features 100000000) RESULT: -1 TEST: (hdr 0, xdp 1, pi 1, features 100000000) RESULT: -1 TEST: (hdr 10, xdp 1, pi 1, features 100000000) RESULT: -1 TEST: (hdr 12, xdp 1, pi 1, features 100000000) RESULT: -1 TEST: (hdr 20, xdp 1, pi 1, features 100000000) RESULT: -1 TEST: (hdr 0, xdp 0, pi 0, features 8000000) RESULT: 0 TEST: (hdr 0, xdp 1, pi 0, features 8000000) RESULT: -1 TEST: (hdr 0, xdp 0, pi 1, features 8000000) RESULT: -1 TEST: (hdr 0, xdp 1, pi 1, features 8000000) RESULT: -1 TEST: (hdr 0, xdp 0, pi 0, features 108000000) RESULT: 0 TEST: (hdr 0, xdp 1, pi 0, features 108000000) RESULT: -1 TEST: (hdr 0, xdp 0, pi 1, features 108000000) RESULT: -1 TEST: (hdr 0, xdp 1, pi 1, features 108000000) RESULT: -1 > IFF_NO_PI + IFF_VNET_HDR, if I understand correctly. That's fairly much the only one that *did* work. As long as you use TUNSETSNDBUF which has the undocumented side-effect of turning off the XDP path. > > On the receive side, where the tun device generates the virtio_net_hdr > > but VIRITO_NET_F_MSG_RXBUF was negotiated and vhost-net needs to fill > > Nit: VIRTIO_NET_F_MSG_RXBUF Thanks. > > in the 'num_buffers' field on top of the existing virtio_net_hdr, fix > > that to use 'sock_hlen - 2' as the location, which means that it goes > > Please use sizeof(hdr.num_buffers) instead of a raw constant 2, to > self document the code. Makes sense. > Should this be an independent one-line fix? I don't think so; it's very much intertwined with the way it makes assumptions about someone else's data.
在 2021/6/25 下午4:23, David Woodhouse 写道: > On Fri, 2021-06-25 at 13:00 +0800, Jason Wang wrote: >> 在 2021/6/24 下午8:30, David Woodhouse 写道: >>> From: David Woodhouse <dwmw@amazon.co.uk> >>> >>> The vhost-net driver was making wild assumptions about the header length >>> of the underlying tun/tap socket. >> >> It's by design to depend on the userspace to co-ordinate the vnet header >> setting with the underlying sockets. >> >> >>> Then it was discarding packets if >>> the number of bytes it got from sock_recvmsg() didn't precisely match >>> its guess. >> >> Anything that is broken by this? The failure is a hint for the userspace >> that something is wrong during the coordination. > I am not a fan of this approach. I firmly believe that for a given > configuration, the kernel should either *work* or it should gracefully > refuse to set it up that way. And the requirements should be clearly > documented. That works only if all the logic were implemented in the same module but not the case in the e.g networking stack that a packet need to iterate several modules. E.g in this case, the vnet header size of the TAP could be changed at anytime via TUNSETVNETHDRSZ, and tuntap is unaware of the existence of vhost_net. This makes it impossible to do refuse in the case of setup (SET_BACKEND). > > Having been on the receiving end of this "hint" of which you speak, I > found it distinctly suboptimal as a user interface. I was left > scrabbling around trying to find a set of options which *would* work, > and it was only through debugging the kernel that I managed to work out > that I: > > • MUST set IFF_NO_PI > • MUST use TUNSETSNDBUF to reduce the sndbuf from INT_MAX > • MUST use a virtio_net_hdr that I don't want > > If my application failed to do any of those things, I got a silent > failure to transport any packets. Yes, this is because the bug when using vhost_net + PI/TUN. And I guess the reason is that nobody tries to use that combination in the past. I'm not even sure if it's a valid setup since vhost-net is a virtio-net kernel server which is not expected to handle L3 packets or PI header (which is Linux specific and out of the scope virtio spec). > The only thing I could do *without* > debugging the kernel was tcpdump on the 'tun0' interface and see if the > TX packets I put into the ring were even making it to the interface, > and what they looked like if they did. (Losing the first 14 bytes and > having the *next* 14 bytes scribbled on by an Ethernet header was a fun > one.) The tricky part is that, the networking stack thinks the packet is successfully received but it was actually dropped by vhost-net. And there's no obvious userspace API to report such dropping as statistics counters or trace-points. Maybe we can tweak the vhost for a better logging in this case. > > > > > >>> Fix it to get the correct information along with the socket itself. >> >> I'm not sure what is fixed by this. It looks to me it tires to let >> packet go even if the userspace set the wrong attributes to tap or >> vhost. This is even sub-optimal than failing explicitly fail the RX. > I'm OK with explicit failure. But once I'd let it *get* the information > from the underlying socket in order to decide whether it should fail or > not, it turned out to be easy enough just to make those configs work > anyway. The problem is that this change may make some wrong configuration "works" silently at the level of vhost or TAP. When using this for VM, it would make the debugging even harder. > > The main case where that "easy enough" is stretched a little (IMO) was > when there's a tun_pi header. I have one more of your emails to reply > to after this, and I'll address that there. > > >>> As a side-effect, this means that tun_get_socket() won't work if the >>> tun file isn't actually connected to a device, since there's no 'tun' >>> yet in that case to get the information from. >> >> This may break the existing application. Vhost-net is tied to the socket >> instead of the device that the socket is loosely coupled. > Hm. Perhaps the PI and vnet hdr should be considered an option of the > *socket* (which is tied to the tfile), not purely an option of the > underlying device? Though this is how it is done in macvtap. It's probably too late to change tuntap. > > Or maybe it's sufficient just to get the flags from *either* tfile->tun > or tfile->detached, so that it works when the queue is detached. I'll > take a look. > > I suppose we could even have a fallback that makes stuff up like we do > today. If the user attempts to attach a tun file descriptor to vhost > without ever calling TUNSETIFF on it first, *then* we make the same > assumptions we do today? Then I would rather keep the using the assumption: 1) the value get from get_socket() might not be correct 2) the complexity or risk for bring a very little improvement of the debug-ability (which is still suspicious). > >>> --- a/drivers/vhost/net.c >>> +++ b/drivers/vhost/net.c >>> @@ -1143,7 +1143,8 @@ static void handle_rx(struct vhost_net *net) >>> >>> vq_log = unlikely(vhost_has_feature(vq, VHOST_F_LOG_ALL)) ? >>> vq->log : NULL; >>> - mergeable = vhost_has_feature(vq, VIRTIO_NET_F_MRG_RXBUF); >>> + mergeable = vhost_has_feature(vq, VIRTIO_NET_F_MRG_RXBUF) && >>> + (vhost_hlen || sock_hlen >= sizeof(num_buffers)); >> >> So this change the behavior. When mergeable buffer is enabled, userspace >> expects the vhost to merge buffers. If the feature is disabled silently, >> it violates virtio spec. >> >> If anything wrong in the setup, userspace just breaks itself. >> >> E.g if sock_hlen is less that struct virtio_net_hdr_mrg_buf. The packet >> header might be overwrote by the vnet header. > This wasn't intended to change the behaviour of any code path that is > already working today. If *either* vhost or the underlying device have > provided a vnet header, we still merge. > > If *neither* provide a vnet hdr, there's nowhere to put num_buffers and > we can't merge. > > That code path doesn't work at all today, but does after my patches. It looks to me it's a bug that userspace can keep working in this case. After mrg rx buffer is negotiated, userspace should always assumes the vhost-net to provide num_buffers. > But you're right, we should explicitly refuse to negotiate > VIRITO_NET_F_MSG_RXBUF in that case. This would be very hard: 1) VHOST_SET_FEATURES and VHOST_NET_SET_BACKEND are two different ioctls 2) vhost_net is not tightly coupled with tuntap, vnet header size could be changed by userspace at any time > >>> >>> do { >>> sock_len = vhost_net_rx_peek_head_len(net, sock->sk, >>> @@ -1213,9 +1214,10 @@ static void handle_rx(struct vhost_net *net) >>> } >>> } else { >>> /* Header came from socket; we'll need to patch >>> - * ->num_buffers over if VIRTIO_NET_F_MRG_RXBUF >>> + * ->num_buffers over the last two bytes if >>> + * VIRTIO_NET_F_MRG_RXBUF is enabled. >>> */ >>> - iov_iter_advance(&fixup, sizeof(hdr)); >>> + iov_iter_advance(&fixup, sock_hlen - 2); >> >> I'm not sure what did the above code want to fix. It doesn't change >> anything if vnet header is set correctly in TUN. It only prevents the >> the packet header from being rewrote. >> > It fixes the case where the virtio_net_hdr isn't at the start of the > tun header, because the tun actually puts the tun_pi struct *first*, > and *then* the virtio_net_hdr. Right. > The num_buffers field needs to go at the *end* of sock_hlen. Not at a > fixed offset from the *start* of it. > > At least, that's true unless we want to just declare that we *only* > support TUN with the IFF_NO_PI flag. (qv). Yes, that's a good question. This is probably a hint that "vhost-net is never designed to work of PI", and even if it's not true, I'm not sure if it's too late to fix. Thanks
diff --git a/drivers/net/tun.c b/drivers/net/tun.c index 4cf38be26dc9..f812dcdc640e 100644 --- a/drivers/net/tun.c +++ b/drivers/net/tun.c @@ -2394,8 +2394,50 @@ static int tun_xdp_one(struct tun_struct *tun, err = -EINVAL; goto out; } + switch (tun->flags & TUN_TYPE_MASK) { + case IFF_TUN: + if (tun->flags & IFF_NO_PI) { + u8 ip_version = skb->len ? (skb->data[0] >> 4) : 0; + + switch (ip_version) { + case 4: + skb->protocol = htons(ETH_P_IP); + break; + case 6: + skb->protocol = htons(ETH_P_IPV6); + break; + default: + atomic_long_inc(&tun->dev->rx_dropped); + kfree_skb(skb); + err = -EINVAL; + goto out; + } + } else { + struct tun_pi *pi = (struct tun_pi *)skb->data; + if (!pskb_may_pull(skb, sizeof(*pi))) { + atomic_long_inc(&tun->dev->rx_dropped); + kfree_skb(skb); + err = -ENOMEM; + goto out; + } + skb_pull_inline(skb, sizeof(*pi)); + skb->protocol = pi->proto; + } + + skb_reset_mac_header(skb); + skb->dev = tun->dev; + break; + case IFF_TAP: + if (!pskb_may_pull(skb, ETH_HLEN)) { + atomic_long_inc(&tun->dev->rx_dropped); + kfree_skb(skb); + err = -ENOMEM; + goto out; + } + skb->protocol = eth_type_trans(skb, tun->dev); + break; + } - skb->protocol = eth_type_trans(skb, tun->dev); skb_reset_network_header(skb); skb_probe_transport_header(skb); skb_record_rx_queue(skb, tfile->queue_index);