diff mbox series

[v3,3/5] vhost_net: remove virtio_net_hdr validation, let tun/tap do it themselves

Message ID 20210624123005.1301761-3-dwmw2@infradead.org
State New
Headers show
Series net: tun: fix tun_xdp_one() for IFF_TUN mode | expand

Commit Message

David Woodhouse June 24, 2021, 12:30 p.m. UTC
From: David Woodhouse <dwmw@amazon.co.uk>

When the underlying socket isn't configured with a virtio_net_hdr, the
existing code in vhost_net_build_xdp() would attempt to validate
uninitialised data, by copying zero bytes (sock_hlen) into the local
copy of the header and then trying to validate that.

Fixing it is somewhat non-trivial because the tun device might put a
struct tun_pi *before* the virtio_net_hdr, which makes it hard to find.
So just stop messing with someone else's data in vhost_net_build_xdp(),
and let tap and tun validate it for themselves, as they do in the
non-XDP case anyway.

This means that the 'gso' member of struct tun_xdp_hdr can die, leaving
only 'int buflen'.

The socket header of sock_hlen is still copied separately from the
data payload because there may be a gap between them to ensure suitable
alignment of the latter.

Fixes: 0a0be13b8fe2 ("vhost_net: batch submitting XDP buffers to underlayer sockets")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 drivers/net/tap.c      | 25 ++++++++++++++++++++++---
 drivers/net/tun.c      | 21 ++++++++++++++++++---
 drivers/vhost/net.c    | 30 +++++++++---------------------
 include/linux/if_tun.h |  1 -
 4 files changed, 49 insertions(+), 28 deletions(-)

Comments

Jason Wang June 25, 2021, 7:33 a.m. UTC | #1
On 2021/6/24 20:30, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
>
> When the underlying socket isn't configured with a virtio_net_hdr, the
> existing code in vhost_net_build_xdp() would attempt to validate
> uninitialised data, by copying zero bytes (sock_hlen) into the local
> copy of the header and then trying to validate that.
>
> Fixing it is somewhat non-trivial because the tun device might put a
> struct tun_pi *before* the virtio_net_hdr, which makes it hard to find.
> So just stop messing with someone else's data in vhost_net_build_xdp(),
> and let tap and tun validate it for themselves, as they do in the
> non-XDP case anyway.



Thinking about it another way: all the XDP work in vhost was done for
TAP; XDP is not expected to work for TUN.

So we could simply have vhost avoid the XDP path when the underlying
socket is TUN.

Thanks


>
> This means that the 'gso' member of struct tun_xdp_hdr can die, leaving
> only 'int buflen'.
>
> The socket header of sock_hlen is still copied separately from the
> data payload because there may be a gap between them to ensure suitable
> alignment of the latter.
>
> Fixes: 0a0be13b8fe2 ("vhost_net: batch submitting XDP buffers to underlayer sockets")
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>   drivers/net/tap.c      | 25 ++++++++++++++++++++++---
>   drivers/net/tun.c      | 21 ++++++++++++++++++---
>   drivers/vhost/net.c    | 30 +++++++++---------------------
>   include/linux/if_tun.h |  1 -
>   4 files changed, 49 insertions(+), 28 deletions(-)
David Woodhouse June 25, 2021, 8:37 a.m. UTC | #2
On Fri, 2021-06-25 at 15:33 +0800, Jason Wang wrote:
> On 2021/6/24 20:30, David Woodhouse wrote:
> > From: David Woodhouse <dwmw@amazon.co.uk>
> > 
> > When the underlying socket isn't configured with a virtio_net_hdr, the
> > existing code in vhost_net_build_xdp() would attempt to validate
> > uninitialised data, by copying zero bytes (sock_hlen) into the local
> > copy of the header and then trying to validate that.
> > 
> > Fixing it is somewhat non-trivial because the tun device might put a
> > struct tun_pi *before* the virtio_net_hdr, which makes it hard to find.
> > So just stop messing with someone else's data in vhost_net_build_xdp(),
> > and let tap and tun validate it for themselves, as they do in the
> > non-XDP case anyway.
> 
> Thinking about it another way: all the XDP work in vhost was done for
> TAP; XDP is not expected to work for TUN.
> 
> So we could simply have vhost avoid the XDP path when the underlying
> socket is TUN.


Actually, IFF_TUN mode per se isn't that complex. It's fixed purely on
the tun side by that first patch I posted, which I later expanded a
little to factor out tun_skb_set_protocol().

The next two patches in my original set were fixing up the fact that
XDP currently assumes that the *socket* will be doing the vhdr, not
vhost. Those two weren't tun-specific at all.

It's supporting the PI header (which tun puts *before* the virtio
header as I just said) which introduces a tiny bit more complexity.

So yes, avoiding the XDP path if PI is being used would make some
sense.

In fact I wouldn't be entirely averse to refusing PI mode completely,
as long as we fail gracefully at setup time by refusing the
SET_BACKEND. Not by just silently failing to receive packets.

But then again, it's not actually *that* hard to support, and it's
working fine in my selftests at the end of my patch series.
Jason Wang June 28, 2021, 4:23 a.m. UTC | #3
On 2021/6/25 16:37, David Woodhouse wrote:
> On Fri, 2021-06-25 at 15:33 +0800, Jason Wang wrote:
>> On 2021/6/24 20:30, David Woodhouse wrote:
>>> From: David Woodhouse <dwmw@amazon.co.uk>
>>>
>>> When the underlying socket isn't configured with a virtio_net_hdr, the
>>> existing code in vhost_net_build_xdp() would attempt to validate
>>> uninitialised data, by copying zero bytes (sock_hlen) into the local
>>> copy of the header and then trying to validate that.
>>>
>>> Fixing it is somewhat non-trivial because the tun device might put a
>>> struct tun_pi *before* the virtio_net_hdr, which makes it hard to find.
>>> So just stop messing with someone else's data in vhost_net_build_xdp(),
>>> and let tap and tun validate it for themselves, as they do in the
>>> non-XDP case anyway.
>> Thinking about it another way: all the XDP work in vhost was done for
>> TAP; XDP is not expected to work for TUN.
>>
>> So we could simply have vhost avoid the XDP path when the underlying
>> socket is TUN.
> Actually, IFF_TUN mode per se isn't that complex. It's fixed purely on
> the tun side by that first patch I posted, which I later expanded a
> little to factor out tun_skb_set_protocol().
>
> The next two patches in my original set were fixing up the fact that
> XDP currently assumes that the *socket* will be doing the vhdr, not
> vhost. Those two weren't tun-specific at all.
>
> It's supporting the PI header (which tun puts *before* the virtio
> header as I just said) which introduces a tiny bit more complexity.



This reminds me that we need to fix tun_put_user_xdp(), but as we've
discussed, we first need to figure out whether PI is worth supporting
for vhost-net.


>
> So yes, avoiding the XDP path if PI is being used would make some
> sense.
>
> In fact I wouldn't be entirely averse to refusing PI mode completely,
> as long as we fail gracefully at setup time by refusing the
> SET_BACKEND. Not by just silently failing to receive packets.



That's another way. Actually, macvtap mandates IFF_TAP | IFF_NO_PI.

Thanks


>
> But then again, it's not actually *that* hard to support, and it's
> working fine in my selftests at the end of my patch series.
David Woodhouse June 28, 2021, 11:23 a.m. UTC | #4
On Mon, 2021-06-28 at 12:23 +0800, Jason Wang wrote:
> On 2021/6/25 16:37, David Woodhouse wrote:
> > On Fri, 2021-06-25 at 15:33 +0800, Jason Wang wrote:
> > > On 2021/6/24 20:30, David Woodhouse wrote:
> > > > From: David Woodhouse <dwmw@amazon.co.uk>
> > > > 
> > > > When the underlying socket isn't configured with a virtio_net_hdr, the
> > > > existing code in vhost_net_build_xdp() would attempt to validate
> > > > uninitialised data, by copying zero bytes (sock_hlen) into the local
> > > > copy of the header and then trying to validate that.
> > > > 
> > > > Fixing it is somewhat non-trivial because the tun device might put a
> > > > struct tun_pi *before* the virtio_net_hdr, which makes it hard to find.
> > > > So just stop messing with someone else's data in vhost_net_build_xdp(),
> > > > and let tap and tun validate it for themselves, as they do in the
> > > > non-XDP case anyway.
> > > 
> > > Thinking about it another way: all the XDP work in vhost was done for
> > > TAP; XDP is not expected to work for TUN.
> > > 
> > > So we could simply have vhost avoid the XDP path when the underlying
> > > socket is TUN.
> > 
> > Actually, IFF_TUN mode per se isn't that complex. It's fixed purely on
> > the tun side by that first patch I posted, which I later expanded a
> > little to factor out tun_skb_set_protocol().
> > 
> > The next two patches in my original set were fixing up the fact that
> > XDP currently assumes that the *socket* will be doing the vhdr, not
> > vhost. Those two weren't tun-specific at all.
> > 
> > It's supporting the PI header (which tun puts *before* the virtio
> > header as I just said) which introduces a tiny bit more complexity.
> 
> This reminds me that we need to fix tun_put_user_xdp(),


Good point; thanks.

> but as we've discussed, we first need to figure out whether PI is
> worth supporting for vhost-net.


FWIW I certainly don't care about PI support. The only time anyone
would want PI support is if they need to support protocols *other* than
IPv6 and Legacy IP, over tun mode.

I'm fixing this stuff because when I tried to use vhost-tun + tun for
*sensible* use cases, I ended up having to flounder around trying to
find a combination of settings that actually worked. And that offended
me :)

So I wrote a test case to iterate over various possible combinations of
settings, and then kept typing until that all worked.

The only thing I do feel quite strongly about is that stuff should
either *work*, or *explicitly* fail if it's unsupported.

At this point, although I have no actual use for it myself, I'd
probably just about come down on the side of supporting PI. On the
basis that:

 • I've basically made it work already.

 • It allows those code paths like tun_skb_set_protocol() to be
   consolidated as both calling code paths need the same thing.

 • Even in the kernel, and even when modules are as incestuously
   intertwined as vhost-net and tun already are, I'm a strong
   believer in *not* making assumptions about someone else's data,
   so letting *tun* handle its own headers without making those
   assumptions seems like the right thing to do.



If we want to support PI, I need to go fix tun_put_user_xdp() as you
noted (and work out how to add that to the test case). And resolve the
fact that configuration might change after tun_get_socket() is called —
and indeed that there might not *be* a configuration at all when
tun_get_socket() is called.


If we *don't* want to support PI, well, the *interesting* part of the
above needs fixing anyway. Because I strongly believe we should
*prevent* it if we don't support it, and we *still* have the case you
point out of the tun vhdr_size being changed at runtime.

I'll take a look at whether we can pass the socklen back from tun to
vhost-net on *every* packet. Is there a MSG_XXX flag we can abuse, and
somewhere in the msghdr that could return the header length used for
*this* packet? Or could we make vhost_net_rx_peek_head_len() call
explicitly into the tun device instead of making stuff up in
peek_head_len()?


To be clear: from the point of view of my *application* I don't care
about any of this; my only motivation here is to clean up the kernel
behaviour and make life easier for potential future users. I have found
a setup that works in today's kernels (even though I have to disable
XDP, and have to use a virtio header that I don't want), and will stick
with that for now, if I actually commit it to my master branch at all:
https://gitlab.com/openconnect/openconnect/-/commit/0da4fe43b886403e6

I might yet abandon it because I haven't *yet* seen it go any faster
than the code which just does read()/write() on the tun device from
userspace. And without XDP or zerocopy it's not clear that it could
ever give me any benefit that I couldn't achieve purely in userspace by
having a separate thread to do tun device I/O. But we'll see...
David Woodhouse June 28, 2021, 11:29 p.m. UTC | #5
On Mon, 2021-06-28 at 12:23 +0100, David Woodhouse wrote:
>
> To be clear: from the point of view of my *application* I don't care
> about any of this; my only motivation here is to clean up the kernel
> behaviour and make life easier for potential future users. I have found
> a setup that works in today's kernels (even though I have to disable
> XDP, and have to use a virtio header that I don't want), and will stick
> with that for now, if I actually commit it to my master branch at all:
> https://gitlab.com/openconnect/openconnect/-/commit/0da4fe43b886403e6
>
> I might yet abandon it because I haven't *yet* seen it go any faster
> than the code which just does read()/write() on the tun device from
> userspace. And without XDP or zerocopy it's not clear that it could
> ever give me any benefit that I couldn't achieve purely in userspace by
> having a separate thread to do tun device I/O. But we'll see...


I managed to do some proper testing, between EC2 c5 (Skylake) virtual
instances.

The kernel on a c5.metal can transmit (AES128-SHA1) ESP at about
1.2Gb/s from iperf, as it seems to be doing it all from the iperf
thread.

Before I started messing with OpenConnect, it could transmit 1.6Gb/s.

When I pull in the 'stitched' AES+SHA code from OpenSSL instead of
doing the encryption and the HMAC in separate passes, I get to 2.1Gb/s.

Adding vhost support on top of that takes me to 2.46Gb/s, which is a
decent enough win. That's with OpenConnect taking 100% CPU, iperf3
taking 50% of another one, and the vhost kernel thread taking ~20%.
Jason Wang June 29, 2021, 3:21 a.m. UTC | #6
On 2021/6/28 19:23, David Woodhouse wrote:
> On Mon, 2021-06-28 at 12:23 +0800, Jason Wang wrote:
>> On 2021/6/25 16:37, David Woodhouse wrote:
>>> On Fri, 2021-06-25 at 15:33 +0800, Jason Wang wrote:
>>>> On 2021/6/24 20:30, David Woodhouse wrote:
>>>>> From: David Woodhouse <dwmw@amazon.co.uk>
>>>>>
>>>>> When the underlying socket isn't configured with a virtio_net_hdr, the
>>>>> existing code in vhost_net_build_xdp() would attempt to validate
>>>>> uninitialised data, by copying zero bytes (sock_hlen) into the local
>>>>> copy of the header and then trying to validate that.
>>>>>
>>>>> Fixing it is somewhat non-trivial because the tun device might put a
>>>>> struct tun_pi *before* the virtio_net_hdr, which makes it hard to find.
>>>>> So just stop messing with someone else's data in vhost_net_build_xdp(),
>>>>> and let tap and tun validate it for themselves, as they do in the
>>>>> non-XDP case anyway.
>>>> Thinking about it another way: all the XDP work in vhost was done for
>>>> TAP; XDP is not expected to work for TUN.
>>>>
>>>> So we could simply have vhost avoid the XDP path when the underlying
>>>> socket is TUN.
>>> Actually, IFF_TUN mode per se isn't that complex. It's fixed purely on
>>> the tun side by that first patch I posted, which I later expanded a
>>> little to factor out tun_skb_set_protocol().
>>>
>>> The next two patches in my original set were fixing up the fact that
>>> XDP currently assumes that the *socket* will be doing the vhdr, not
>>> vhost. Those two weren't tun-specific at all.
>>>
>>> It's supporting the PI header (which tun puts *before* the virtio
>>> header as I just said) which introduces a tiny bit more complexity.
>>
>> This reminds me that we need to fix tun_put_user_xdp(),
> Good point; thanks.
>
>> but as we've discussed, we first need to figure out whether PI is worth
>> supporting for vhost-net.
> FWIW I certainly don't care about PI support. The only time anyone
> would want PI support is if they need to support protocols *other* than
> IPv6 and Legacy IP, over tun mode.
>
> I'm fixing this stuff because when I tried to use vhost-tun + tun for
> *sensible* use cases, I ended up having to flounder around trying to
> find a combination of settings that actually worked. And that offended
> me :)
>
> So I wrote a test case to iterate over various possible combinations of
> settings, and then kept typing until that all worked.
>
> The only thing I do feel quite strongly about is that stuff should
> either *work*, or *explicitly* fail if it's unsupported.



I fully agree, but I suspect this may only work when we invent something
new; otherwise I'm not sure we can fix it without breaking existing
applications.


>
> At this point, although I have no actual use for it myself, I'd
> probably just about come down on the side of supporting PI. On the
> basis that:
>
>   • I've basically made it work already.
>
>   • It allows those code paths like tun_skb_set_protocol() to be
>     consolidated as both calling code paths need the same thing.
>
>   • Even in the kernel, and even when modules are as incestuously
>     intertwined as vhost-net and tun already are, I'm a strong
>     believer in *not* making assumptions about someone else's data,
>     so letting *tun* handle its own headers without making those
>     assumptions seems like the right thing to do.
>
> If we want to support PI, I need to go fix tun_put_user_xdp() as you
> noted (and work out how to add that to the test case). And resolve the
> fact that configuration might change after tun_get_socket() is called —
> and indeed that there might not *be* a configuration at all when
> tun_get_socket() is called.



Yes, but I tend to leave the PI part of the code as-is, considering that
no one is interested in vhost_net + PI.


>
> If we *don't* want to support PI, well, the *interesting* part of the
> above needs fixing anyway. Because I strongly believe we should
> *prevent* it if we don't support it, and we *still* have the case you
> point out of the tun vhdr_size being changed at runtime.



As discussed in another thread, it looks to me like it's sufficient to
have some statistics counters/API in vhost_net, or to simply use
msg_control to reuse the tx_errors counter of TUN/TAP or macvtap.


>
> I'll take a look at whether we can pass the socklen back from tun to
> vhost-net on *every* packet. Is there a MSG_XXX flag we can abuse, and
> somewhere in the msghdr that could return the header length used for
> *this* packet?



msg_control is probably the best place to do this.


> Or could we make vhost_net_rx_peek_head_len() call
> explicitly into the tun device instead of making stuff up in
> peek_head_len()?



They work at the skb/XDP level, which is unaware of PI.

But again, I think it would be much cheaper to just add error reporting
in this case, and that should be sufficient.


>
> To be clear: from the point of view of my *application* I don't care
> about any of this; my only motivation here is to clean up the kernel
> behaviour and make life easier for potential future users.



Yes, thanks a lot for having a look at this.

Though I'm not quite sure vhost_net was designed to work with those
setups, let's ask Michael (the author of vhost/net) for his opinion:

Michael, do you think it's worth supporting

1) vhost_net + TUN
2) vhost_net + PI

?


> I have found
> a setup that works in today's kernels (even though I have to disable
> XDP, and have to use a virtio header that I don't want), and will stick
> with that for now, if I actually commit it to my master branch at all:
> https://gitlab.com/openconnect/openconnect/-/commit/0da4fe43b886403e6



Yes, but unfortunately it needs some tricks to avoid hitting bugs in
the kernel.


>
> I might yet abandon it because I haven't *yet* seen it go any faster
> than the code which just does read()/write() on the tun device from
> userspace. And without XDP or zerocopy it's not clear that it could
> ever give me any benefit that I couldn't achieve purely in userspace by
> having a separate thread to do tun device I/O. But we'll see...



Ok.

Thanks.
Jason Wang June 29, 2021, 3:43 a.m. UTC | #7
On 2021/6/29 07:29, David Woodhouse wrote:
> On Mon, 2021-06-28 at 12:23 +0100, David Woodhouse wrote:
>> To be clear: from the point of view of my *application* I don't care
>> about any of this; my only motivation here is to clean up the kernel
>> behaviour and make life easier for potential future users. I have found
>> a setup that works in today's kernels (even though I have to disable
>> XDP, and have to use a virtio header that I don't want), and will stick
>> with that for now, if I actually commit it to my master branch at all:
>> https://gitlab.com/openconnect/openconnect/-/commit/0da4fe43b886403e6
>>
>> I might yet abandon it because I haven't *yet* seen it go any faster
>> than the code which just does read()/write() on the tun device from
>> userspace. And without XDP or zerocopy it's not clear that it could
>> ever give me any benefit that I couldn't achieve purely in userspace by
>> having a separate thread to do tun device I/O. But we'll see...
> I managed to do some proper testing, between EC2 c5 (Skylake) virtual
> instances.
>
> The kernel on a c5.metal can transmit (AES128-SHA1) ESP at about
> 1.2Gb/s from iperf, as it seems to be doing it all from the iperf
> thread.
>
> Before I started messing with OpenConnect, it could transmit 1.6Gb/s.
>
> When I pull in the 'stitched' AES+SHA code from OpenSSL instead of
> doing the encryption and the HMAC in separate passes, I get to 2.1Gb/s.
>
> Adding vhost support on top of that takes me to 2.46Gb/s, which is a
> decent enough win.



Interesting, I think the latency should be improved as well in this case.

Thanks


> That's with OpenConnect taking 100% CPU, iperf3
> taking 50% of another one, and the vhost kernel thread taking ~20%.
David Woodhouse June 29, 2021, 6:59 a.m. UTC | #8
On 29 June 2021 04:43:15 BST, Jason Wang <jasowang@redhat.com> wrote:
> On 2021/6/29 07:29, David Woodhouse wrote:
>> On Mon, 2021-06-28 at 12:23 +0100, David Woodhouse wrote:
>>> To be clear: from the point of view of my *application* I don't care
>>> about any of this; my only motivation here is to clean up the kernel
>>> behaviour and make life easier for potential future users. I have found
>>> a setup that works in today's kernels (even though I have to disable
>>> XDP, and have to use a virtio header that I don't want), and will stick
>>> with that for now, if I actually commit it to my master branch at all:
>>> https://gitlab.com/openconnect/openconnect/-/commit/0da4fe43b886403e6
>>>
>>> I might yet abandon it because I haven't *yet* seen it go any faster
>>> than the code which just does read()/write() on the tun device from
>>> userspace. And without XDP or zerocopy it's not clear that it could
>>> ever give me any benefit that I couldn't achieve purely in userspace by
>>> having a separate thread to do tun device I/O. But we'll see...
>> I managed to do some proper testing, between EC2 c5 (Skylake) virtual
>> instances.
>>
>> The kernel on a c5.metal can transmit (AES128-SHA1) ESP at about
>> 1.2Gb/s from iperf, as it seems to be doing it all from the iperf
>> thread.
>>
>> Before I started messing with OpenConnect, it could transmit 1.6Gb/s.
>>
>> When I pull in the 'stitched' AES+SHA code from OpenSSL instead of
>> doing the encryption and the HMAC in separate passes, I get to 2.1Gb/s.
>>
>> Adding vhost support on top of that takes me to 2.46Gb/s, which is a
>> decent enough win.
>
> Interesting, I think the latency should be improved as well in this
> case.


I don't know about that. I figured it would be worse in the packet by packet case (especially VoIP traffic) since instead of just *writing* a packet to the tun device, we stick it in the ring and then make the same write() syscall on an eventfd to wake up the vhost thread which then has to do the *same* copy_from_user() that could have happened directly in our own process.

Maybe if I have a batch of only 1 or 2 packets I should just write it directly. I still could :)

>> That's with OpenConnect taking 100% CPU, iperf3
>> taking 50% of another one, and the vhost kernel thread taking ~20%.


-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.
David Woodhouse June 29, 2021, 10:49 a.m. UTC | #9
On Tue, 2021-06-29 at 11:43 +0800, Jason Wang wrote:
> > The kernel on a c5.metal can transmit (AES128-SHA1) ESP at about
> > 1.2Gb/s from iperf, as it seems to be doing it all from the iperf
> > thread.
> > 
> > Before I started messing with OpenConnect, it could transmit 1.6Gb/s.
> > 
> > When I pull in the 'stitched' AES+SHA code from OpenSSL instead of
> > doing the encryption and the HMAC in separate passes, I get to 2.1Gb/s.
> > 
> > Adding vhost support on top of that takes me to 2.46Gb/s, which is a
> > decent enough win.
> 
> Interesting, I think the latency should be improved as well in this
> case.


I tried using 'ping -i 0.1' to get an idea of latency for the
interesting VoIP-like case of packets where we have to wake up each
time.

For the *inbound* case, RX on the tun device followed by TX of the
replies, I see results like this:

     --- 172.16.0.2 ping statistics ---
     141 packets transmitted, 141 received, 0% packet loss, time 14557ms
     rtt min/avg/max/mdev = 0.380/0.419/0.461/0.024 ms


The opposite direction (tun TX then RX) is similar:

     --- 172.16.0.1 ping statistics ---
     295 packets transmitted, 295 received, 0% packet loss, time 30573ms
     rtt min/avg/max/mdev = 0.454/0.545/0.718/0.024 ms


Using vhost-net (and TUNSNDBUF of INT_MAX-1 just to avoid XDP), it
looks like this. Inbound:

     --- 172.16.0.2 ping statistics ---
     139 packets transmitted, 139 received, 0% packet loss, time 14350ms
     rtt min/avg/max/mdev = 0.432/0.578/0.658/0.058 ms

Outbound:

     --- 172.16.0.1 ping statistics ---
     149 packets transmitted, 149 received, 0% packet loss, time 15391ms
     rtt min/avg/max/mdev = 0.496/0.682/0.935/0.036 ms


So as I expected, the throughput is better with vhost-net once I get to
the point of 100% CPU usage in my main thread, because it offloads the
kernel←→user copies. But latency is somewhat worse.

I'm still using select() instead of epoll() which would give me a
little back — but only a little, as I only poll on 3-4 fds, and more to
the point it'll give me just as much win in the non-vhost case too, so
it won't make much difference to the vhost vs. non-vhost comparison.

Perhaps I really should look into that trick of "if the vhost TX ring
is already stopped and would need a kick, and I only have a few packets
in the batch, just write them directly to /dev/net/tun".

I'm wondering how that optimisation would translate to actual guests,
which presumably have the same problem. Perhaps it would be an
operation on the vhost fd, which ends up processing the ring right
there in the context of *that* process instead of doing a wakeup?

FWIW if I pull in my kernel patches and stop working around those bugs,
enabling the TX XDP path and dropping the virtio-net header that I
don't need, I get some of that latency back:

     --- 172.16.0.2 ping statistics ---
     151 packets transmitted, 151 received, 0% packet loss, time 15599ms
     rtt min/avg/max/mdev = 0.372/0.550/0.661/0.061 ms

     --- 172.16.0.1 ping statistics ---
     214 packets transmitted, 214 received, 0% packet loss, time 22151ms
     rtt min/avg/max/mdev = 0.453/0.626/0.765/0.049 ms

My bandwidth tests go up from 2.46Gb/s with the workarounds, to
2.50Gb/s once I enable XDP, and 2.52Gb/s when I drop the virtio-net
header. But there's no way for userspace to *detect* that those bugs
are fixed, which makes it hard to ship that version.
David Woodhouse June 29, 2021, 1:15 p.m. UTC | #10
On Tue, 2021-06-29 at 11:49 +0100, David Woodhouse wrote:
> On Tue, 2021-06-29 at 11:43 +0800, Jason Wang wrote:
> > > The kernel on a c5.metal can transmit (AES128-SHA1) ESP at about
> > > 1.2Gb/s from iperf, as it seems to be doing it all from the iperf
> > > thread.
> > > 
> > > Before I started messing with OpenConnect, it could transmit 1.6Gb/s.
> > > 
> > > When I pull in the 'stitched' AES+SHA code from OpenSSL instead of
> > > doing the encryption and the HMAC in separate passes, I get to 2.1Gb/s.
> > > 
> > > Adding vhost support on top of that takes me to 2.46Gb/s, which is a
> > > decent enough win.
> > 
> > Interesting, I think the latency should be improved as well in this
> > case.
> 
> I tried using 'ping -i 0.1' to get an idea of latency for the
> interesting VoIP-like case of packets where we have to wake up each
> time.
> 
> For the *inbound* case, RX on the tun device followed by TX of the
> replies, I see results like this:
> 
>      --- 172.16.0.2 ping statistics ---
>      141 packets transmitted, 141 received, 0% packet loss, time 14557ms
>      rtt min/avg/max/mdev = 0.380/0.419/0.461/0.024 ms
> 
> The opposite direction (tun TX then RX) is similar:
> 
>      --- 172.16.0.1 ping statistics ---
>      295 packets transmitted, 295 received, 0% packet loss, time 30573ms
>      rtt min/avg/max/mdev = 0.454/0.545/0.718/0.024 ms
> 
> Using vhost-net (and TUNSNDBUF of INT_MAX-1 just to avoid XDP), it
> looks like this. Inbound:
> 
>      --- 172.16.0.2 ping statistics ---
>      139 packets transmitted, 139 received, 0% packet loss, time 14350ms
>      rtt min/avg/max/mdev = 0.432/0.578/0.658/0.058 ms
> 
> Outbound:
> 
>      --- 172.16.0.1 ping statistics ---
>      149 packets transmitted, 149 received, 0% packet loss, time 15391ms
>      rtt min/avg/max/mdev = 0.496/0.682/0.935/0.036 ms
> 
> So as I expected, the throughput is better with vhost-net once I get to
> the point of 100% CPU usage in my main thread, because it offloads the
> kernel←→user copies. But latency is somewhat worse.
> 
> I'm still using select() instead of epoll() which would give me a
> little back — but only a little, as I only poll on 3-4 fds, and more to
> the point it'll give me just as much win in the non-vhost case too, so
> it won't make much difference to the vhost vs. non-vhost comparison.
> 
> Perhaps I really should look into that trick of "if the vhost TX ring
> is already stopped and would need a kick, and I only have a few packets
> in the batch, just write them directly to /dev/net/tun".
> 
> I'm wondering how that optimisation would translate to actual guests,
> which presumably have the same problem. Perhaps it would be an
> operation on the vhost fd, which ends up processing the ring right
> there in the context of *that* process instead of doing a wakeup?


That turns out to be fairly trivial: 
https://gitlab.com/openconnect/openconnect/-/commit/668ff1399541be927

It gives me back about half the latency I lost by moving to vhost-net:

     --- 172.16.0.2 ping statistics ---
     133 packets transmitted, 133 received, 0% packet loss, time 13725ms
     rtt min/avg/max/mdev = 0.437/0.510/0.621/0.035 ms

     --- 172.16.0.1 ping statistics ---
     133 packets transmitted, 133 received, 0% packet loss, time 13728ms
     rtt min/avg/max/mdev = 0.541/0.605/0.658/0.022 ms

I think it's definitely worth looking at whether we can/should do
something roughly equivalent for actual guests.
Jason Wang June 30, 2021, 4:39 a.m. UTC | #11
On 2021/6/29 18:49, David Woodhouse wrote:
> On Tue, 2021-06-29 at 11:43 +0800, Jason Wang wrote:

>>> The kernel on a c5.metal can transmit (AES128-SHA1) ESP at about

>>> 1.2Gb/s from iperf, as it seems to be doing it all from the iperf

>>> thread.

>>>

>>> Before I started messing with OpenConnect, it could transmit 1.6Gb/s.

>>>

>>> When I pull in the 'stitched' AES+SHA code from OpenSSL instead of

>>> doing the encryption and the HMAC in separate passes, I get to 2.1Gb/s.

>>>

>>> Adding vhost support on top of that takes me to 2.46Gb/s, which is a

>>> decent enough win.

>>

>> Interesting, I think the latency should be improved as well in this

>> case.

> I tried using 'ping -i 0.1' to get an idea of latency for the

> interesting VoIP-like case of packets where we have to wake up each

> time.

>

> For the *inbound* case, RX on the tun device followed by TX of the

> replies, I see results like this:

>

>       --- 172.16.0.2 ping statistics ---

>       141 packets transmitted, 141 received, 0% packet loss, time 14557ms

>       rtt min/avg/max/mdev = 0.380/0.419/0.461/0.024 ms

>

>

> The opposite direction (tun TX then RX) is similar:

>

>       --- 172.16.0.1 ping statistics ---

>       295 packets transmitted, 295 received, 0% packet loss, time 30573ms

>       rtt min/avg/max/mdev = 0.454/0.545/0.718/0.024 ms

>

>

> Using vhost-net (and TUNSNDBUF of INT_MAX-1 just to avoid XDP), it

> looks like this. Inbound:

>

>       --- 172.16.0.2 ping statistics ---

>       139 packets transmitted, 139 received, 0% packet loss, time 14350ms

>       rtt min/avg/max/mdev = 0.432/0.578/0.658/0.058 ms

>

> Outbound:

>

>       --- 172.16.0.1 ping statistics ---

>       149 packets transmitted, 149 received, 0% packet loss, time 15391ms

>       rtt min/avg/max/mdev = 0.496/0.682/0.935/0.036 ms

>

>

> So as I expected, the throughput is better with vhost-net once I get to

> the point of 100% CPU usage in my main thread, because it offloads the

> kernel←→user copies. But latency is somewhat worse.

>

> I'm still using select() instead of epoll() which would give me a

> little back — but only a little, as I only poll on 3-4 fds, and more to

> the point it'll give me just as much win in the non-vhost case too, so

> it won't make much difference to the vhost vs. non-vhost comparison.

>

> Perhaps I really should look into that trick of "if the vhost TX ring

> is already stopped and would need a kick, and I only have a few packets

> in the batch, just write them directly to /dev/net/tun".



That should work on low throughput.


>

> I'm wondering how that optimisation would translate to actual guests,

> which presumably have the same problem. Perhaps it would be an

> operation on the vhost fd, which ends up processing the ring right

> there in the context of *that* process instead of doing a wakeup?



It might improve the latency in an ideal case, but there are several possible issues:

1) this will block the vCPU from running until the send is done
2) copy_from_user() may sleep, which will block the vCPU thread further


>

> FWIW if I pull in my kernel patches and stop working around those bugs,

> enabling the TX XDP path and dropping the virtio-net header that I

> don't need, I get some of that latency back:

>

>       --- 172.16.0.2 ping statistics ---

>       151 packets transmitted, 151 received, 0% packet loss, time 15599ms

>       rtt min/avg/max/mdev = 0.372/0.550/0.661/0.061 ms

>

>       --- 172.16.0.1 ping statistics ---

>       214 packets transmitted, 214 received, 0% packet loss, time 22151ms

>       rtt min/avg/max/mdev = 0.453/0.626/0.765/0.049 ms

>

> My bandwidth tests go up from 2.46Gb/s with the workarounds, to

> 2.50Gb/s once I enable XDP, and 2.52Gb/s when I drop the virtio-net

> header. But there's no way for userspace to *detect* that those bugs

> are fixed, which makes it hard to ship that version.



Yes, that's sad. One possible way is to advertise a VHOST_NET_TUN flag via
VHOST_GET_BACKEND_FEATURES?

Thanks


>
David Woodhouse June 30, 2021, 10:02 a.m. UTC | #12
On Wed, 2021-06-30 at 12:39 +0800, Jason Wang wrote:
> 在 2021/6/29 下午6:49, David Woodhouse 写道:

> > On Tue, 2021-06-29 at 11:43 +0800, Jason Wang wrote:

> > > > The kernel on a c5.metal can transmit (AES128-SHA1) ESP at about

> > > > 1.2Gb/s from iperf, as it seems to be doing it all from the iperf

> > > > thread.

> > > > 

> > > > Before I started messing with OpenConnect, it could transmit 1.6Gb/s.

> > > > 

> > > > When I pull in the 'stitched' AES+SHA code from OpenSSL instead of

> > > > doing the encryption and the HMAC in separate passes, I get to 2.1Gb/s.

> > > > 

> > > > Adding vhost support on top of that takes me to 2.46Gb/s, which is a

> > > > decent enough win.

> > > 

> > > Interesting, I think the latency should be improved as well in this

> > > case.

> > 

> > I tried using 'ping -i 0.1' to get an idea of latency for the

> > interesting VoIP-like case of packets where we have to wake up each

> > time.

> > 

> > For the *inbound* case, RX on the tun device followed by TX of the

> > replies, I see results like this:

> > 

> >       --- 172.16.0.2 ping statistics ---

> >       141 packets transmitted, 141 received, 0% packet loss, time 14557ms

> >       rtt min/avg/max/mdev = 0.380/0.419/0.461/0.024 ms

> > 

> > 

> > The opposite direction (tun TX then RX) is similar:

> > 

> >       --- 172.16.0.1 ping statistics ---

> >       295 packets transmitted, 295 received, 0% packet loss, time 30573ms

> >       rtt min/avg/max/mdev = 0.454/0.545/0.718/0.024 ms

> > 

> > 

> > Using vhost-net (and TUNSNDBUF of INT_MAX-1 just to avoid XDP), it

> > looks like this. Inbound:

> > 

> >       --- 172.16.0.2 ping statistics ---

> >       139 packets transmitted, 139 received, 0% packet loss, time 14350ms

> >       rtt min/avg/max/mdev = 0.432/0.578/0.658/0.058 ms

> > 

> > Outbound:

> > 

> >       --- 172.16.0.1 ping statistics ---

> >       149 packets transmitted, 149 received, 0% packet loss, time 15391ms

> >       rtt min/avg/max/mdev = 0.496/0.682/0.935/0.036 ms

> > 

> > 

> > So as I expected, the throughput is better with vhost-net once I get to

> > the point of 100% CPU usage in my main thread, because it offloads the

> > kernel←→user copies. But latency is somewhat worse.

> > 

> > I'm still using select() instead of epoll() which would give me a

> > little back — but only a little, as I only poll on 3-4 fds, and more to

> > the point it'll give me just as much win in the non-vhost case too, so

> > it won't make much difference to the vhost vs. non-vhost comparison.

> > 

> > Perhaps I really should look into that trick of "if the vhost TX ring

> > is already stopped and would need a kick, and I only have a few packets

> > in the batch, just write them directly to /dev/net/tun".

> 

> 

> That should work on low throughput.


Indeed it works remarkably well, as I noted in my follow-up. I also
fixed a minor stupidity where I was reading from the 'call' eventfd
*before* doing the real work of moving packets around. And that gives
me a few tens of microseconds back too.

> > I'm wondering how that optimisation would translate to actual guests,

> > which presumably have the same problem. Perhaps it would be an

> > operation on the vhost fd, which ends up processing the ring right

> > there in the context of *that* process instead of doing a wakeup?

> 

> 

> It might improve the latency in an ideal case, but there are several possible issues:

> 

> 1) this will block the vCPU from running until the send is done

> 2) copy_from_user() may sleep, which will block the vCPU thread further


Yes, it would block the vCPU for a short period of time, but we could
limit that. The real win is to improve latency of single, short packets
like a first SYN, or SYNACK. It should work fine even if it's limited
to *one* *short* packet which *is* resident in memory.

Although actually I'm not *overly* worried about the 'resident' part.
For a transmit packet, especially a short one not a sendpage(), it's
fairly likely that the guest has touched the buffer right before sending
it. And taken the hit of faulting it in then, if necessary. If the host
is paging out memory which is *active* use by a guest, that guest is
screwed anyway :)

I'm thinking of something like an ioctl on the vhost-net fd which *if*
the thread is currently sleeping and there's a single short packet,
processes it immediately. {Else,then} it wakes the thread just like the
eventfd would have done. (Perhaps just by signalling the kick eventfd,
but perhaps there's a more efficient way anyway).

> > My bandwidth tests go up from 2.46Gb/s with the workarounds, to

> > 2.50Gb/s once I enable XDP, and 2.52Gb/s when I drop the virtio-net

> > header. But there's no way for userspace to *detect* that those bugs

> > are fixed, which makes it hard to ship that version.


I'm up to 2.75Gb/s now with epoll and other fixes (including using
sendmmsg() on the other side). Since the kernel can only do *half*
that, I'm now wondering if I really want my data plane in the kernel at
all, which was my long-term plan :)

> Yes, that's sad. One possible way to advertise a VHOST_NET_TUN flag via 

> VHOST_GET_BACKEND_FEATURES?


Strictly it isn't VHOST_NET_TUN, as that *does* work today if you pick
the right (very non-intuitive) combination of features. The feature is
really "VHOST_NET_TUN_WITHOUT_TUNSNDBUF_OR_UNWANTED_VNET_HEADER" :)

But we don't need a feature specifically for that; I only need to check
for *any* feature that goes in after the fixes.

Maybe if we do add a new low-latency kick then I could key on *that*
feature to assume the bugs are fixed.

Alternatively, there's still the memory map thing I need to fix before
I can commit this in my application:

#ifdef __x86_64__
	vmem->regions[0].guest_phys_addr = 4096;
	vmem->regions[0].memory_size = 0x7fffffffe000;
	vmem->regions[0].userspace_addr = 4096;
#else
#error FIXME
#endif
	if (ioctl(vpninfo->vhost_fd, VHOST_SET_MEM_TABLE, vmem) < 0) {

Perhaps if we end up with a user-visible feature to deal with that,
then I could use the presence of *that* feature to infer that the tun
bugs are fixed.
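For reference, the fragment quoted above comes from a setup that maps (nearly) the whole of userspace virtual address space as a single identity region, something like this sketch (make_identity_map() is an illustrative name, and the error handling is trimmed):

```c
#include <stdlib.h>
#include <linux/vhost.h>

/* Build a single identity-mapped region covering almost all of
 * userspace VA on x86_64, skipping the first page so that a 0
 * "guest address" never translates to anything. */
static struct vhost_memory *make_identity_map(void)
{
    struct vhost_memory *vmem;

    vmem = calloc(1, sizeof(*vmem) + sizeof(vmem->regions[0]));
    if (!vmem)
        return NULL;

    vmem->nregions = 1;
    vmem->regions[0].guest_phys_addr = 4096;
    vmem->regions[0].memory_size     = 0x7fffffffe000ULL;
    vmem->regions[0].userspace_addr  = 4096;
    return vmem;
}
```

The resulting table is what gets handed to VHOST_SET_MEM_TABLE; the #ifdef in the quoted code exists precisely because those constants are x86_64-specific.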

Another random thought as I stare at this... can't we handle checksums
in tun_get_user() / tun_put_user()? We could always set NETIF_F_HW_CSUM
on the tun device, and just do it *while* we're copying the packet to
userspace, if userspace doesn't support it. That would be better than
having the kernel complete the checksum in a separate pass *before*
handing the skb to tun_net_xmit().

We could similarly do a partial checksum in tun_get_user() and hand it
off to the network stack with ->ip_summed == CHECKSUM_PARTIAL.
Jason Wang July 1, 2021, 4:13 a.m. UTC | #13
在 2021/6/30 下午6:02, David Woodhouse 写道:
> On Wed, 2021-06-30 at 12:39 +0800, Jason Wang wrote:

>> 在 2021/6/29 下午6:49, David Woodhouse 写道:

>>> On Tue, 2021-06-29 at 11:43 +0800, Jason Wang wrote:

>>>>> The kernel on a c5.metal can transmit (AES128-SHA1) ESP at about

>>>>> 1.2Gb/s from iperf, as it seems to be doing it all from the iperf

>>>>> thread.

>>>>>

>>>>> Before I started messing with OpenConnect, it could transmit 1.6Gb/s.

>>>>>

>>>>> When I pull in the 'stitched' AES+SHA code from OpenSSL instead of

>>>>> doing the encryption and the HMAC in separate passes, I get to 2.1Gb/s.

>>>>>

>>>>> Adding vhost support on top of that takes me to 2.46Gb/s, which is a

>>>>> decent enough win.

>>>> Interesting, I think the latency should be improved as well in this

>>>> case.

>>> I tried using 'ping -i 0.1' to get an idea of latency for the

>>> interesting VoIP-like case of packets where we have to wake up each

>>> time.

>>>

>>> For the *inbound* case, RX on the tun device followed by TX of the

>>> replies, I see results like this:

>>>

>>>        --- 172.16.0.2 ping statistics ---

>>>        141 packets transmitted, 141 received, 0% packet loss, time 14557ms

>>>        rtt min/avg/max/mdev = 0.380/0.419/0.461/0.024 ms

>>>

>>>

>>> The opposite direction (tun TX then RX) is similar:

>>>

>>>        --- 172.16.0.1 ping statistics ---

>>>        295 packets transmitted, 295 received, 0% packet loss, time 30573ms

>>>        rtt min/avg/max/mdev = 0.454/0.545/0.718/0.024 ms

>>>

>>>

>>> Using vhost-net (and TUNSNDBUF of INT_MAX-1 just to avoid XDP), it

>>> looks like this. Inbound:

>>>

>>>        --- 172.16.0.2 ping statistics ---

>>>        139 packets transmitted, 139 received, 0% packet loss, time 14350ms

>>>        rtt min/avg/max/mdev = 0.432/0.578/0.658/0.058 ms

>>>

>>> Outbound:

>>>

>>>        --- 172.16.0.1 ping statistics ---

>>>        149 packets transmitted, 149 received, 0% packet loss, time 15391ms

>>>        rtt min/avg/max/mdev = 0.496/0.682/0.935/0.036 ms

>>>

>>>

>>> So as I expected, the throughput is better with vhost-net once I get to

>>> the point of 100% CPU usage in my main thread, because it offloads the

>>> kernel←→user copies. But latency is somewhat worse.

>>>

>>> I'm still using select() instead of epoll() which would give me a

>>> little back — but only a little, as I only poll on 3-4 fds, and more to

>>> the point it'll give me just as much win in the non-vhost case too, so

>>> it won't make much difference to the vhost vs. non-vhost comparison.

>>>

>>> Perhaps I really should look into that trick of "if the vhost TX ring

>>> is already stopped and would need a kick, and I only have a few packets

>>> in the batch, just write them directly to /dev/net/tun".

>>

>> That should work on low throughput.

> Indeed it works remarkably well, as I noted in my follow-up. I also

> fixed a minor stupidity where I was reading from the 'call' eventfd

> *before* doing the real work of moving packets around. And that gives

> me a few tens of microseconds back too.

>

>>> I'm wondering how that optimisation would translate to actual guests,

>>> which presumably have the same problem. Perhaps it would be an

>>> operation on the vhost fd, which ends up processing the ring right

>>> there in the context of *that* process instead of doing a wakeup?

>>

>> It might improve the latency in an ideal case, but there are several possible issues:

>>

>> 1) this will block the vCPU from running until the send is done

>> 2) copy_from_user() may sleep, which will block the vCPU thread further

> Yes, it would block the vCPU for a short period of time, but we could

> limit that. The real win is to improve latency of single, short packets

> like a first SYN, or SYNACK. It should work fine even if it's limited

> to *one* *short* packet which *is* resident in memory.



This looks tricky, since we need to poke both the virtqueue metadata and
the payload.

And we need to let the packet traverse the network stack, which might add
extra latency (qdiscs, eBPF, switch/OVS).

So it looks to me it's better to use vhost_net busy polling instead 
(VHOST_SET_VRING_BUSYLOOP_TIMEOUT).

Userspace can detect this feature by validating the existence of the ioctl.
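A sketch of that detection from userspace (on a kernel without the ioctl the call falls through to -ENOIOCTLCMD, which userspace sees as ENOTTY; a zero timeout is assumed here to be a harmless probe value):

```c
#include <sys/ioctl.h>
#include <errno.h>
#include <linux/vhost.h>

/* Returns nonzero if the running kernel knows the busyloop ioctl. */
static int have_busyloop_timeout(int vhost_fd)
{
    struct vhost_vring_state s = { .index = 0, .num = 0 };

    if (ioctl(vhost_fd, VHOST_SET_VRING_BUSYLOOP_TIMEOUT, &s) == 0)
        return 1;
    /* Any failure other than "unknown ioctl" still proves it exists. */
    return errno != ENOTTY;
}
```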


>

> Although actually I'm not *overly* worried about the 'resident' part.

> For a transmit packet, especially a short one not a sendpage(), it's

> fairly likely that the guest has touched the buffer right before sending

> it. And taken the hit of faulting it in then, if necessary. If the host

> is paging out memory which is *active* use by a guest, that guest is

> screwed anyway :)



Right, but there could be workloads that are unrelated to networking.
Blocking the vCPU thread in this case seems sub-optimal.


>

> I'm thinking of something like an ioctl on the vhost-net fd which *if*

> the thread is currently sleeping and there's a single short packet,

> processes it immediately. {Else,then} it wakes the thread just like the

> eventfd would have done. (Perhaps just by signalling the kick eventfd,

> but perhaps there's a more efficient way anyway).

>

>>> My bandwidth tests go up from 2.46Gb/s with the workarounds, to

>>> 2.50Gb/s once I enable XDP, and 2.52Gb/s when I drop the virtio-net

>>> header. But there's no way for userspace to *detect* that those bugs

>>> are fixed, which makes it hard to ship that version.

> I'm up to 2.75Gb/s now with epoll and other fixes (including using

> sendmmsg() on the other side). Since the kernel can only do *half*

> that, I'm now wondering if I really want my data plane in the kernel at

> all, which was my long-term plan :)



Good to know that.


>

>> Yes, that's sad. One possible way to advertise a VHOST_NET_TUN flag via

>> VHOST_GET_BACKEND_FEATURES?

> Strictly it isn't VHOST_NET_TUN, as that *does* work today if you pick

> the right (very non-intuitive) combination of features. The feature is

> really "VHOST_NET_TUN_WITHOUT_TUNSNDBUF_OR_UNWANTED_VNET_HEADER" :)



Yes, but it's a hint for userspace that TUN could work without any
workarounds.


>

> But we don't need a feature specifically for that; I only need to check

> for *any* feature that goes in after the fixes.

>

> Maybe if we do add a new low-latency kick then I could key on *that*

> feature to assume the bugs are fixed.

>

> Alternatively, there's still the memory map thing I need to fix before

> I can commit this in my application:

>

> #ifdef __x86_64__

> 	vmem->regions[0].guest_phys_addr = 4096;

> 	vmem->regions[0].memory_size = 0x7fffffffe000;

> 	vmem->regions[0].userspace_addr = 4096;

> #else

> #error FIXME

> #endif

> 	if (ioctl(vpninfo->vhost_fd, VHOST_SET_MEM_TABLE, vmem) < 0) {

>

> Perhaps if we end up with a user-visible feature to deal with that,

> then I could use the presence of *that* feature to infer that the tun

> bugs are fixed.



As we discussed before, it could be a new backend feature. VHOST_NET_SVA
(shared virtual address)?


>

> Another random thought as I stare at this... can't we handle checksums

> in tun_get_user() / tun_put_user()? We could always set NETIF_F_HW_CSUM

> on the tun device, and just do it *while* we're copying the packet to

> userspace, if userspace doesn't support it. That would be better than

> having the kernel complete the checksum in a separate pass *before*

> handing the skb to tun_net_xmit().



I'm not sure I get this, but for performance reasons we don't do any csum
in this case?


>

> We could similarly do a partial checksum in tun_get_user() and hand it

> off to the network stack with ->ip_summed == CHECKSUM_PARTIAL.



I think that is how it is expected to work (via
virtio_net_hdr_to_skb()).
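Concretely, that CHECKSUM_PARTIAL handoff travels in the vnet header. For an Ethernet frame carrying an options-free IPv4 + UDP packet it would look roughly like this (the offsets are for that specific layout, purely as an illustration):

```c
#include <linux/virtio_net.h>

/* Ask the consumer of the packet to finish the checksum: sum from
 * csum_start to the end of the packet, store the result at
 * csum_start + csum_offset. */
static void vnet_hdr_udp_partial_csum(struct virtio_net_hdr *h)
{
    h->flags       = VIRTIO_NET_HDR_F_NEEDS_CSUM;
    h->gso_type    = VIRTIO_NET_HDR_GSO_NONE;
    h->csum_start  = 14 + 20;  /* Ethernet header + bare IPv4 header */
    h->csum_offset = 6;        /* checksum field offset within UDP */
}
```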

Thanks


>

>
David Woodhouse July 1, 2021, 5:39 p.m. UTC | #14
On Thu, 2021-07-01 at 12:13 +0800, Jason Wang wrote:
> 在 2021/6/30 下午6:02, David Woodhouse 写道:

> > On Wed, 2021-06-30 at 12:39 +0800, Jason Wang wrote:

> > > 在 2021/6/29 下午6:49, David Woodhouse 写道:

> > > > So as I expected, the throughput is better with vhost-net once I get to

> > > > the point of 100% CPU usage in my main thread, because it offloads the

> > > > kernel←→user copies. But latency is somewhat worse.

> > > > 

> > > > I'm still using select() instead of epoll() which would give me a

> > > > little back — but only a little, as I only poll on 3-4 fds, and more to

> > > > the point it'll give me just as much win in the non-vhost case too, so

> > > > it won't make much difference to the vhost vs. non-vhost comparison.

> > > > 

> > > > Perhaps I really should look into that trick of "if the vhost TX ring

> > > > is already stopped and would need a kick, and I only have a few packets

> > > > in the batch, just write them directly to /dev/net/tun".

> > > 

> > > That should work on low throughput.

> > 

> > Indeed it works remarkably well, as I noted in my follow-up. I also

> > fixed a minor stupidity where I was reading from the 'call' eventfd

> > *before* doing the real work of moving packets around. And that gives

> > me a few tens of microseconds back too.

> > 

> > > > I'm wondering how that optimisation would translate to actual guests,

> > > > which presumably have the same problem. Perhaps it would be an

> > > > operation on the vhost fd, which ends up processing the ring right

> > > > there in the context of *that* process instead of doing a wakeup?

> > > 

> > > It might improve the latency in an ideal case, but there are several possible issues:

> > >

> > > 1) this will block the vCPU from running until the send is done

> > > 2) copy_from_user() may sleep, which will block the vCPU thread further

> > 

> > Yes, it would block the vCPU for a short period of time, but we could

> > limit that. The real win is to improve latency of single, short packets

> > like a first SYN, or SYNACK. It should work fine even if it's limited

> > to *one* *short* packet which *is* resident in memory.

> 

> 

> This looks tricky since we need to poke both virtqueue metadata as well 

> as the payload.


That's OK as we'd *only* do it if the thread is quiescent anyway.

> And we need to let the packet iterate the network stack which might have 

> extra latency (qdiscs, eBPF, switch/OVS).

> 

> So it looks to me it's better to use vhost_net busy polling instead 

> (VHOST_SET_VRING_BUSYLOOP_TIMEOUT).


Or something very similar, with the *trylock* and bailing out.

> Userspace can detect this feature by validating the existence of the ioctl.


Yep. Or if we want to get fancy, we could even offer it to the guest.
As a *different* doorbell register to poke if they want to relinquish
the physical CPU to process the packet quicker. We wouldn't even *need*
to go through userspace at all, if we put that into a separate page...
but that probably *is* overengineering it :)

> > Although actually I'm not *overly* worried about the 'resident' part.

> > For a transmit packet, especially a short one not a sendpage(), it's

> > fairly likely that the guest has touched the buffer right before sending

> > it. And taken the hit of faulting it in then, if necessary. If the host

> > is paging out memory which is *active* use by a guest, that guest is

> > screwed anyway :)

> 

> 

> Right, but there could be workloads that are unrelated to networking.

> Blocking the vCPU thread in this case seems sub-optimal.

> 


Right, but the VMM (or the guest, if we're letting the guest choose)
wouldn't have to use it for those cases.

> > Alternatively, there's still the memory map thing I need to fix before

> > I can commit this in my application:

> > 

> > #ifdef __x86_64__

> > 	vmem->regions[0].guest_phys_addr = 4096;

> > 	vmem->regions[0].memory_size = 0x7fffffffe000;

> > 	vmem->regions[0].userspace_addr = 4096;

> > #else

> > #error FIXME

> > #endif

> > 	if (ioctl(vpninfo->vhost_fd, VHOST_SET_MEM_TABLE, vmem) < 0) {

> > 

> > Perhaps if we end up with a user-visible feature to deal with that,

> > then I could use the presence of *that* feature to infer that the tun

> > bugs are fixed.

> 

> 

> As we discussed before it could be a new backend feature. VHOST_NET_SVA 

> (shared virtual address)?


Yeah, I'll take a look.

> > Another random thought as I stare at this... can't we handle checksums

> > in tun_get_user() / tun_put_user()? We could always set NETIF_F_HW_CSUM

> > on the tun device, and just do it *while* we're copying the packet to

> > userspace, if userspace doesn't support it. That would be better than

> > having the kernel complete the checksum in a separate pass *before*

> > handing the skb to tun_net_xmit().

> 

> 

> I'm not sure I get this, but for performance reason we don't do any csum 

> in this case?


I think we have to; the packets can't leave the box without a valid
checksum. If the skb isn't CHECKSUM_COMPLETE at the time it's handed
off to the ->hard_start_xmit of a netdev which doesn't advertise
hardware checksum support, the network stack will do it manually in an
extra pass.

Which is kind of silly if the tun device is going to do a pass over all
the data *anyway* as it copies it up to userspace. Even in the normal
case without vhost-net.

> > We could similarly do a partial checksum in tun_get_user() and hand it

> > off to the network stack with ->ip_summed == CHECKSUM_PARTIAL.

> 

> 

> I think that is how it is expected to work (via

> virtio_net_hdr_to_skb()).


But only if the "guest" supports it; it doesn't get handled by the tun
device. It *could*, since we already have the helpers to checksum *as*
we copy to/from userspace.

It doesn't help for me to advertise that I support TX checksums in
userspace because I'd have to do an extra pass for that. I only do one
pass over the data as I encrypt it, and in many block cipher modes the
encryption of the early blocks affects the IV for the subsequent
blocks... so I can't just go back and "fix" the checksum at the start
of the packet, once I'm finished.

So doing the checksum as the packet is copied up to userspace would be
very useful.
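The attraction is that a one-pass copy-and-sum is essentially free. The core of the idea in outline (plain userspace C using the RFC 1071 ones'-complement sum, just to show the shape; it is not the kernel's actual csum_and_copy helpers):

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Copy src to dst and compute the 16-bit ones'-complement checksum
 * of the data in the same pass. */
static uint16_t csum_copy(void *dst, const void *src, size_t len)
{
    const uint8_t *s = src;
    uint8_t *d = dst;
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2) {
        d[i] = s[i];
        d[i + 1] = s[i + 1];
        sum += (uint32_t)s[i] << 8 | s[i + 1];
    }
    if (i < len) {                 /* odd trailing byte */
        d[i] = s[i];
        sum += (uint32_t)s[i] << 8;
    }
    while (sum >> 16)              /* fold carries back in */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}
```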
Jason Wang July 2, 2021, 3:13 a.m. UTC | #15
在 2021/7/2 上午1:39, David Woodhouse 写道:
> On Thu, 2021-07-01 at 12:13 +0800, Jason Wang wrote:

>> 在 2021/6/30 下午6:02, David Woodhouse 写道:

>>> On Wed, 2021-06-30 at 12:39 +0800, Jason Wang wrote:

>>>> 在 2021/6/29 下午6:49, David Woodhouse 写道:

>>>>> So as I expected, the throughput is better with vhost-net once I get to

>>>>> the point of 100% CPU usage in my main thread, because it offloads the

>>>>> kernel←→user copies. But latency is somewhat worse.

>>>>>

>>>>> I'm still using select() instead of epoll() which would give me a

>>>>> little back — but only a little, as I only poll on 3-4 fds, and more to

>>>>> the point it'll give me just as much win in the non-vhost case too, so

>>>>> it won't make much difference to the vhost vs. non-vhost comparison.

>>>>>

>>>>> Perhaps I really should look into that trick of "if the vhost TX ring

>>>>> is already stopped and would need a kick, and I only have a few packets

>>>>> in the batch, just write them directly to /dev/net/tun".

>>>> That should work on low throughput.

>>> Indeed it works remarkably well, as I noted in my follow-up. I also

>>> fixed a minor stupidity where I was reading from the 'call' eventfd

>>> *before* doing the real work of moving packets around. And that gives

>>> me a few tens of microseconds back too.

>>>

>>>>> I'm wondering how that optimisation would translate to actual guests,

>>>>> which presumably have the same problem. Perhaps it would be an

>>>>> operation on the vhost fd, which ends up processing the ring right

>>>>> there in the context of *that* process instead of doing a wakeup?

>>>> It might improve the latency in an ideal case, but there are several possible issues:

>>>>

>>>> 1) this will block the vCPU from running until the send is done

>>>> 2) copy_from_user() may sleep, which will block the vCPU thread further

>>> Yes, it would block the vCPU for a short period of time, but we could

>>> limit that. The real win is to improve latency of single, short packets

>>> like a first SYN, or SYNACK. It should work fine even if it's limited

>>> to *one* *short* packet which *is* resident in memory.

>>

>> This looks tricky since we need to poke both virtqueue metadata as well

>> as the payload.

> That's OK as we'd *only* do it if the thread is quiescent anyway.

>

>> And we need to let the packet iterate the network stack which might have

>> extra latency (qdiscs, eBPF, switch/OVS).

>>

>> So it looks to me it's better to use vhost_net busy polling instead

>> (VHOST_SET_VRING_BUSYLOOP_TIMEOUT).

> Or something very similar, with the *trylock* and bailing out.

>

>> Userspace can detect this feature by validating the existence of the ioctl.

> Yep. Or if we want to get fancy, we could even offer it to the guest.

> As a *different* doorbell register to poke if they want to relinquish

> the physical CPU to process the packet quicker. We wouldn't even *need*

> to go through userspace at all, if we put that into a separate page...

> but that probably *is* overengineering it :)



Yes. Actually, it makes virtio a PV driver, which requires an
architectural way to detect the existence of that kind of doorbell.

This seems to contradict the direction virtio wants to go: a general
device/driver interface that is not limited to the world of virtualization.


>

>>> Although actually I'm not *overly* worried about the 'resident' part.

>>> For a transmit packet, especially a short one not a sendpage(), it's

>>> fairly likely that the guest has touched the buffer right before sending

>>> it. And taken the hit of faulting it in then, if necessary. If the host

>>> is paging out memory which is *active* use by a guest, that guest is

>>> screwed anyway :)

>>

>> Right, but there could be workloads that are unrelated to networking.

>> Blocking the vCPU thread in this case seems sub-optimal.

>>

> Right, but the VMM (or the guest, if we're letting the guest choose)

> wouldn't have to use it for those cases.



I'm not sure I follow here. If so, simply writing to the TUN device directly would work.


>

>>> Alternatively, there's still the memory map thing I need to fix before

>>> I can commit this in my application:

>>>

>>> #ifdef __x86_64__

>>> 	vmem->regions[0].guest_phys_addr = 4096;

>>> 	vmem->regions[0].memory_size = 0x7fffffffe000;

>>> 	vmem->regions[0].userspace_addr = 4096;

>>> #else

>>> #error FIXME

>>> #endif

>>> 	if (ioctl(vpninfo->vhost_fd, VHOST_SET_MEM_TABLE, vmem) < 0) {

>>>

>>> Perhaps if we end up with a user-visible feature to deal with that,

>>> then I could use the presence of *that* feature to infer that the tun

>>> bugs are fixed.

>>

>> As we discussed before it could be a new backend feature. VHOST_NET_SVA

>> (shared virtual address)?

> Yeah, I'll take a look.

>

>>> Another random thought as I stare at this... can't we handle checksums

>>> in tun_get_user() / tun_put_user()? We could always set NETIF_F_HW_CSUM

>>> on the tun device, and just do it *while* we're copying the packet to

>>> userspace, if userspace doesn't support it. That would be better than

>>> having the kernel complete the checksum in a separate pass *before*

>>> handing the skb to tun_net_xmit().

>>

>> I'm not sure I get this, but for performance reason we don't do any csum

>> in this case?

> I think we have to; the packets can't leave the box without a valid

> checksum. If the skb isn't CHECKSUM_COMPLETE at the time it's handed

> off to the ->hard_start_xmit of a netdev which doesn't advertise

> hardware checksum support, the network stack will do it manually in an

> extra pass.



Yes.


>

> Which is kind of silly if the tun device is going to do a pass over all

> the data *anyway* as it copies it up to userspace. Even in the normal

> case without vhost-net.



I think the design is to delay the TX checksum as much as possible:

1) host RX -> TAP TX -> Guest RX -> Guest TX -> TAP RX -> host TX
2) VM1 TX -> TAP RX -> switch -> TAP TX -> VM2 RX

E.g. if checksum offload is supported along all of those paths, we don't
need any software checksum at all. And if any part is not capable of doing
checksums, the checksum will be done by the networking core before calling
the hard_start_xmit of that device.


>

>>> We could similarly do a partial checksum in tun_get_user() and hand it

>>> off to the network stack with ->ip_summed == CHECKSUM_PARTIAL.

>>

>> I think that is how it is expected to work (via

>> virtio_net_hdr_to_skb()).

> But only if the "guest" supports it; it doesn't get handled by the tun

> device. It *could*, since we already have the helpers to checksum *as*

> we copy to/from userspace.

>

> It doesn't help for me to advertise that I support TX checksums in

> userspace because I'd have to do an extra pass for that. I only do one

> pass over the data as I encrypt it, and in many block cipher modes the

> encryption of the early blocks affects the IV for the subsequent

> blocks... so I can't just go back and "fix" the checksum at the start

> of the packet, once I'm finished.

>

> So doing the checksum as the packet is copied up to userspace would be

> very useful.



I think I get this, but it requires a new TUN feature (and maybe making
it userspace-controllable via tun_set_csum()).
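For reference, the existing userspace-facing knob here is the TUNSETOFFLOAD ioctl. Today the checksum-related bit looks like this (a sketch; TUN_F_CSUM is the real flag, and a "do the checksum on the way up" mode would presumably be a new flag beside it):

```c
#include <sys/ioctl.h>
#include <linux/if_tun.h>

/* Tell tun that userspace can accept packets with unfinished
 * (partial) checksums; this is the flag the tun csum path keys on. */
static int enable_csum_offload(int tun_fd)
{
    return ioctl(tun_fd, TUNSETOFFLOAD, TUN_F_CSUM);
}
```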

Thanks
David Woodhouse July 2, 2021, 8:08 a.m. UTC | #16
On Fri, 2021-07-02 at 11:13 +0800, Jason Wang wrote:
> 在 2021/7/2 上午1:39, David Woodhouse 写道:

> > 

> > Right, but the VMM (or the guest, if we're letting the guest choose)

> > wouldn't have to use it for those cases.

> 

> 

> I'm not sure I follow here. If so, simply writing to the TUN device directly would work.


As noted, that works nicely for me in OpenConnect; I just write it to
the tun device *instead* of putting it in the vring. My TX latency is
now fine; it's just RX which takes *two* scheduler wakeups (tun wakes
vhost thread, wakes guest).

But it's not clear to me that a VMM could use it. Because the guest has
already put that packet *into* the vring. Now if the VMM is in the path
of all wakeups for that vring, I suppose we *might* be able to contrive
some hackish way to be 'sure' that the kernel isn't servicing it, so we
could try to 'steal' that packet from the ring in order to send it
directly... but no. That's awful :)

I do think it'd be interesting to look at a way to reduce the latency
of the vring wakeup especially for that case of a virtio-net guest with
a single small packet to send. But realistically speaking, I'm unlikely
to get to it any time soon except for showing the numbers with the
userspace equivalent and observing that there's probably a similar win
to be had for guests too.

In the short term, I should focus on what we want to do to finish off
my existing fixes. Did we have a consensus on whether to bother
supporting PI? As I said, I'm mildly inclined to do so just because it
mostly comes out in the wash as we fix everything else, and making it
gracefully *refuse* that setup reliably is just as hard.

I think I'll try to make the vhost-net code much more resilient to
finding that tun_recvmsg() returns a header other than the sock_hlen it
expects, and see how much still actually needs "fixing" if we can do
that.


> I think the design is to delay the tx checksum as much as possible:
>
> 1) host RX -> TAP TX -> Guest RX -> Guest TX -> TAP RX -> host TX
> 2) VM1 TX -> TAP RX -> switch -> TAP TX -> VM2 RX
>
> E.g. if the checksum is supported in all those paths, we don't need any
> software checksum at all in the above path. And if any part is not
> capable of doing checksum, the checksum will be done by the networking
> core before calling the hard_start_xmit of that device.

Right, but in *any* case where the 'device' is going to memcpy the data
around (like tun_put_user() does), it's a waste of time having the
networking core do a *separate* pass over the data just to checksum it.

> > > > We could similarly do a partial checksum in tun_get_user() and hand it
> > > > off to the network stack with ->ip_summed == CHECKSUM_PARTIAL.
> > >
> > > I think that is how it is expected to work (via the vnet header), see
> > > virtio_net_hdr_to_skb().
> >
> > But only if the "guest" supports it; it doesn't get handled by the tun
> > device. It *could*, since we already have the helpers to checksum *as*
> > we copy to/from userspace.
> >
> > It doesn't help for me to advertise that I support TX checksums in
> > userspace because I'd have to do an extra pass for that. I only do one
> > pass over the data as I encrypt it, and in many block cipher modes the
> > encryption of the early blocks affects the IV for the subsequent
> > blocks... so I can't just go back and "fix" the checksum at the start
> > of the packet, once I'm finished.
> >
> > So doing the checksum as the packet is copied up to userspace would be
> > very useful.
>
> I think I get this, but it requires a new TUN feature (and maybe making
> it userspace-controllable via tun_set_csum()).


I don't think it's visible to userspace at all; it's purely between the
tun driver and the network stack. We *always* set NETIF_F_HW_CSUM,
regardless of what the user can cope with. And if the user *didn't*
support checksum offload then tun will transparently do the checksum
*during* the copy_to_iter() (in either tun_put_user_xdp() or
tun_put_user()).

Userspace sees precisely what it did before. If it doesn't support
checksum offload then it gets a pre-checksummed packet just as before.
It's just that the kernel will do that checksum *while* it's already
touching the data as it copies it to userspace, instead of in a
separate pass.

Although actually, for my *benchmark* case with iperf3 sending UDP, I
spotted in the perf traces that we actually do the checksum as we're
copying from userspace in the udp_sendmsg() call. There's a check in
__ip_append_data() which looks to see if the destination device has
HW_CSUM|IP_CSUM features, and does the copy-and-checksum if not. There
are definitely use cases which *don't* have that kind of optimisation
though, and packets that would reach tun_net_xmit() with CHECKSUM_NONE.
So I think it's worth looking at.
Jason Wang July 2, 2021, 8:50 a.m. UTC | #17
On 2021/7/2 at 4:08 PM, David Woodhouse wrote:
> On Fri, 2021-07-02 at 11:13 +0800, Jason Wang wrote:
>> On 2021/7/2 at 1:39 AM, David Woodhouse wrote:
>>> Right, but the VMM (or the guest, if we're letting the guest choose)
>>> wouldn't have to use it for those cases.
>>
>> I'm not sure I follow here. If so, simply writing to TUN directly would work.
>
> As noted, that works nicely for me in OpenConnect; I just write it to
> the tun device *instead* of putting it in the vring. My TX latency is
> now fine; it's just RX which takes *two* scheduler wakeups (tun wakes
> the vhost thread, which wakes the guest).

Note that busy polling is used for KVM to improve the latency as well.
It is enabled by default, if I'm not wrong.

> But it's not clear to me that a VMM could use it. Because the guest has
> already put that packet *into* the vring. Now if the VMM is in the path
> of all wakeups for that vring, I suppose we *might* be able to contrive
> some hackish way to be 'sure' that the kernel isn't servicing it, so we
> could try to 'steal' that packet from the ring in order to send it
> directly... but no. That's awful :)

Yes.


> I do think it'd be interesting to look at a way to reduce the latency
> of the vring wakeup, especially for that case of a virtio-net guest with
> a single small packet to send. But realistically speaking, I'm unlikely
> to get to it any time soon except for showing the numbers with the
> userspace equivalent and observing that there's probably a similar win
> to be had for guests too.
>
> In the short term, I should focus on what we want to do to finish off
> my existing fixes.

I think so.

> Did we have a consensus on whether to bother
> supporting PI?

Michael, any thoughts on this?

> As I said, I'm mildly inclined to do so just because it
> mostly comes out in the wash as we fix everything else, and making it
> gracefully *refuse* that setup reliably is just as hard.
>
> I think I'll try to make the vhost-net code much more resilient to
> finding that tun_recvmsg() returns a header other than the sock_hlen it
> expects, and see how much still actually needs "fixing" if we can do
> that.

Let's see how well it goes.


>

>

>> I think the design is to delay the tx checksum as much as possible:

>>

>> 1) host RX -> TAP TX -> Guest RX -> Guest TX -> TX RX -> host TX

>> 2) VM1 TX -> TAP RX -> switch -> TX TX -> VM2 TX

>>

>> E.g  if the checksum is supported in all those path, we don't need any

>> software checksum at all in the above path. And if any part is not

>> capable of doing checksum, the checksum will be done by networking core

>> before calling the hard_start_xmit of that device.

> Right, but in *any* case where the 'device' is going to memcpy the data

> around (like tun_put_user() does), it's a waste of time having the

> networking core do a *separate* pass over the data just to checksum it.



See below.


>>>>> We could similarly do a partial checksum in tun_get_user() and hand it
>>>>> off to the network stack with ->ip_summed == CHECKSUM_PARTIAL.
>>>>
>>>> I think that is how it is expected to work (via the vnet header), see
>>>> virtio_net_hdr_to_skb().
>>>
>>> But only if the "guest" supports it; it doesn't get handled by the tun
>>> device. It *could*, since we already have the helpers to checksum *as*
>>> we copy to/from userspace.
>>>
>>> It doesn't help for me to advertise that I support TX checksums in
>>> userspace because I'd have to do an extra pass for that. I only do one
>>> pass over the data as I encrypt it, and in many block cipher modes the
>>> encryption of the early blocks affects the IV for the subsequent
>>> blocks... so I can't just go back and "fix" the checksum at the start
>>> of the packet, once I'm finished.
>>>
>>> So doing the checksum as the packet is copied up to userspace would be
>>> very useful.
>>
>> I think I get this, but it requires a new TUN feature (and maybe making
>> it userspace-controllable via tun_set_csum()).
>
> I don't think it's visible to userspace at all; it's purely between the
> tun driver and the network stack. We *always* set NETIF_F_HW_CSUM,
> regardless of what the user can cope with. And if the user *didn't*
> support checksum offload then tun will transparently do the checksum
> *during* the copy_to_iter() (in either tun_put_user_xdp() or
> tun_put_user()).
>
> Userspace sees precisely what it did before. If it doesn't support
> checksum offload then it gets a pre-checksummed packet just as before.
> It's just that the kernel will do that checksum *while* it's already
> touching the data as it copies it to userspace, instead of in a
> separate pass.

So I think I get what you meant:

1) Don't disable NETIF_F_HW_CSUM in tun_set_csum() even if userspace
clears TUN_F_CSUM.
2) Use the csum iov iterator helper in tun_put_user() and tun_put_user_xdp().

It may help performance, since we get better cache locality when
userspace doesn't support checksum offload.

But in this case we need to know whether userspace can do the checksum
offload, which we didn't need to care about previously (via NETIF_F_HW_CSUM).

And we probably need to sync with tun_set_offload().


> Although actually, for my *benchmark* case with iperf3 sending UDP, I
> spotted in the perf traces that we actually do the checksum as we're
> copying from userspace in the udp_sendmsg() call. There's a check in
> __ip_append_data() which looks to see if the destination device has
> HW_CSUM|IP_CSUM features, and does the copy-and-checksum if not. There
> are definitely use cases which *don't* have that kind of optimisation
> though, and packets that would reach tun_net_xmit() with CHECKSUM_NONE.
> So I think it's worth looking at.

Yes.

Thanks
Eugenio Perez Martin July 9, 2021, 3:04 p.m. UTC | #18
On Fri, Jul 2, 2021 at 10:08 AM David Woodhouse <dwmw2@infradead.org> wrote:
> On Fri, 2021-07-02 at 11:13 +0800, Jason Wang wrote:
> > On 2021/7/2 at 1:39 AM, David Woodhouse wrote:
> > > Right, but the VMM (or the guest, if we're letting the guest choose)
> > > wouldn't have to use it for those cases.
> >
> > I'm not sure I follow here. If so, simply writing to TUN directly would work.
>
> As noted, that works nicely for me in OpenConnect; I just write it to
> the tun device *instead* of putting it in the vring. My TX latency is
> now fine; it's just RX which takes *two* scheduler wakeups (tun wakes
> the vhost thread, which wakes the guest).

Maybe we can do a small test to see the effect of warming up the userland?
* Make vhost write to the irqfd BEFORE adding the packet to the ring, not after.
* Make userland (I think your selftest would be fine for this) spin
reading the used idx until it sees at least one buffer.

I think this introduces races in the general virtio ring management,
but it should work well for the testing. Any thoughts?

We could also check what happens if we burn userland CPU checking
used_idx with notifications disabled, and see if it is worth continuing
to shave latency in that direction :).

Patch

diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index 2170a0d3d34c..d1b1f1de374e 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -1132,16 +1132,35 @@  static const struct file_operations tap_fops = {
 static int tap_get_user_xdp(struct tap_queue *q, struct xdp_buff *xdp)
 {
 	struct tun_xdp_hdr *hdr = xdp->data_hard_start;
-	struct virtio_net_hdr *gso = &hdr->gso;
+	struct virtio_net_hdr *gso = NULL;
 	int buflen = hdr->buflen;
 	int vnet_hdr_len = 0;
 	struct tap_dev *tap;
 	struct sk_buff *skb;
 	int err, depth;
 
-	if (q->flags & IFF_VNET_HDR)
+	if (q->flags & IFF_VNET_HDR) {
 		vnet_hdr_len = READ_ONCE(q->vnet_hdr_sz);
+		if (xdp->data != xdp->data_hard_start + sizeof(*hdr) + vnet_hdr_len) {
+			err = -EINVAL;
+			goto err;
+		}
+
+		gso = (void *)&hdr[1];
 
+		if ((gso->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) &&
+		     tap16_to_cpu(q, gso->csum_start) +
+		     tap16_to_cpu(q, gso->csum_offset) + 2 >
+			     tap16_to_cpu(q, gso->hdr_len))
+			gso->hdr_len = cpu_to_tap16(q,
+				 tap16_to_cpu(q, gso->csum_start) +
+				 tap16_to_cpu(q, gso->csum_offset) + 2);
+
+		if (tap16_to_cpu(q, gso->hdr_len) > xdp->data_end - xdp->data) {
+			err = -EINVAL;
+			goto err;
+		}
+	}
 	skb = build_skb(xdp->data_hard_start, buflen);
 	if (!skb) {
 		err = -ENOMEM;
@@ -1155,7 +1174,7 @@  static int tap_get_user_xdp(struct tap_queue *q, struct xdp_buff *xdp)
 	skb_reset_mac_header(skb);
 	skb->protocol = eth_hdr(skb)->h_proto;
 
-	if (vnet_hdr_len) {
+	if (gso) {
 		err = virtio_net_hdr_to_skb(skb, gso, tap_is_little_endian(q));
 		if (err)
 			goto err_kfree;
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 9acd448e6dfc..1b553f79adb0 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -2331,6 +2331,7 @@  static int tun_xdp_one(struct tun_struct *tun,
 {
 	unsigned int datasize = xdp->data_end - xdp->data;
 	struct tun_xdp_hdr *hdr = xdp->data_hard_start;
+	void *tun_hdr = &hdr[1];
 	struct virtio_net_hdr *gso = NULL;
 	struct bpf_prog *xdp_prog;
 	struct sk_buff *skb = NULL;
@@ -2340,8 +2341,22 @@  static int tun_xdp_one(struct tun_struct *tun,
 	bool skb_xdp = false;
 	struct page *page;
 
-	if (tun->flags & IFF_VNET_HDR)
-		gso = &hdr->gso;
+	if (tun->flags & IFF_VNET_HDR) {
+		gso = tun_hdr;
+		tun_hdr += sizeof(*gso);
+
+		if (tun_hdr > xdp->data) {
+			atomic_long_inc(&tun->rx_frame_errors);
+			return -EINVAL;
+		}
+
+		if ((gso->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) &&
+		    tun16_to_cpu(tun, gso->csum_start) + tun16_to_cpu(tun, gso->csum_offset) + 2 > tun16_to_cpu(tun, gso->hdr_len))
+			gso->hdr_len = cpu_to_tun16(tun, tun16_to_cpu(tun, gso->csum_start) + tun16_to_cpu(tun, gso->csum_offset) + 2);
+
+		if (tun16_to_cpu(tun, gso->hdr_len) > datasize)
+			return -EINVAL;
+	}
 
 	xdp_prog = rcu_dereference(tun->xdp_prog);
 	if (xdp_prog) {
@@ -2389,7 +2404,7 @@  static int tun_xdp_one(struct tun_struct *tun,
 	}
 
 	skb_reserve(skb, xdp->data - xdp->data_hard_start);
-	skb_put(skb, xdp->data_end - xdp->data);
+	skb_put(skb, datasize);
 
 	if (gso && virtio_net_hdr_to_skb(skb, gso, tun_is_little_endian(tun))) {
 		atomic_long_inc(&tun->rx_frame_errors);
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index b92a7144ed90..7cae18151c60 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -690,7 +690,6 @@  static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq,
 					     dev);
 	struct socket *sock = vhost_vq_get_backend(vq);
 	struct page_frag *alloc_frag = &net->page_frag;
-	struct virtio_net_hdr *gso;
 	struct xdp_buff *xdp = &nvq->xdp[nvq->batched_xdp];
 	struct tun_xdp_hdr *hdr;
 	size_t len = iov_iter_count(from);
@@ -715,29 +714,18 @@  static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq,
 		return -ENOMEM;
 
 	buf = (char *)page_address(alloc_frag->page) + alloc_frag->offset;
-	copied = copy_page_from_iter(alloc_frag->page,
-				     alloc_frag->offset +
-				     offsetof(struct tun_xdp_hdr, gso),
-				     sock_hlen, from);
-	if (copied != sock_hlen)
-		return -EFAULT;
-
 	hdr = buf;
-	gso = &hdr->gso;
-
-	if ((gso->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) &&
-	    vhost16_to_cpu(vq, gso->csum_start) +
-	    vhost16_to_cpu(vq, gso->csum_offset) + 2 >
-	    vhost16_to_cpu(vq, gso->hdr_len)) {
-		gso->hdr_len = cpu_to_vhost16(vq,
-			       vhost16_to_cpu(vq, gso->csum_start) +
-			       vhost16_to_cpu(vq, gso->csum_offset) + 2);
-
-		if (vhost16_to_cpu(vq, gso->hdr_len) > len)
-			return -EINVAL;
+	if (sock_hlen) {
+		copied = copy_page_from_iter(alloc_frag->page,
+					     alloc_frag->offset +
+					     sizeof(struct tun_xdp_hdr),
+					     sock_hlen, from);
+		if (copied != sock_hlen)
+			return -EFAULT;
+
+		len -= sock_hlen;
 	}
 
-	len -= sock_hlen;
 	copied = copy_page_from_iter(alloc_frag->page,
 				     alloc_frag->offset + pad,
 				     len, from);
diff --git a/include/linux/if_tun.h b/include/linux/if_tun.h
index 8a7debd3f663..8d78b6bbc228 100644
--- a/include/linux/if_tun.h
+++ b/include/linux/if_tun.h
@@ -21,7 +21,6 @@  struct tun_msg_ctl {
 
 struct tun_xdp_hdr {
 	int buflen;
-	struct virtio_net_hdr gso;
 };
 
 #if defined(CONFIG_TUN) || defined(CONFIG_TUN_MODULE)