[RFC,v8,00/11] vhost: ring format independence

Message ID 20200611113404.17810-1-mst@redhat.com

Message

Michael S. Tsirkin June 11, 2020, 11:34 a.m. UTC
This still causes corruption issues for people so don't try
to use in production please. Posting to expedite debugging.

This adds infrastructure required for supporting
multiple ring formats.

The idea is as follows: we first convert descriptors to a
format-independent representation, and convert that to an
iov later.

The used ring is handled similarly: we fetch into an independent struct
first, and convert that to an IOV later.

The point is that we have a tight loop that fetches
descriptors, which is good for cache utilization.
This will also allow all kinds of batching tricks;
e.g. it seems possible to keep SMAP disabled while
we are fetching multiple descriptors.

For used descriptors, this allows keeping track of the buffer length
without the need to rescan the IOV.
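
As a rough illustration of the two-stage idea above, here is a minimal
userspace sketch. It is not the vhost code: FetchedDesc, fetch_descs and
to_iov are hypothetical names, and the address translation that the real
code performs is left out. Only the 16-byte split-ring descriptor layout
and the two flag values are taken from the virtio spec.

```python
import struct
from dataclasses import dataclass

# Split-ring descriptor layout per the virtio spec:
# le64 addr, le32 len, le16 flags, le16 next (16 bytes).
DESC_FMT = "<QIHH"
DESC_SIZE = struct.calcsize(DESC_FMT)
VRING_DESC_F_NEXT = 1   # chain continues via the next field
VRING_DESC_F_WRITE = 2  # buffer is device-writable


@dataclass
class FetchedDesc:
    """Hypothetical format-independent descriptor, not the vhost struct."""
    addr: int
    length: int
    writable: bool


def fetch_descs(table, head):
    """Stage 1: tight loop that only decodes the ring format."""
    max_descs = len(table) // DESC_SIZE
    fetched = []
    index = head
    while True:
        # Bound the walk so a bad index or a looping chain cannot spin forever.
        if index >= max_descs or len(fetched) >= max_descs:
            raise ValueError("descriptor index out of range or chain loops")
        addr, length, flags, nxt = struct.unpack_from(
            DESC_FMT, table, index * DESC_SIZE)
        fetched.append(
            FetchedDesc(addr, length, bool(flags & VRING_DESC_F_WRITE)))
        if not flags & VRING_DESC_F_NEXT:
            return fetched
        index = nxt


def to_iov(fetched):
    """Stage 2: build (addr, len) pairs and the total length in one pass."""
    iov = [(d.addr, d.length) for d in fetched]
    return iov, sum(d.length for d in fetched)


# Two chained descriptors: a 64-byte readable buffer, then a 128-byte
# writable one. The addresses are made up for the example.
table = (struct.pack(DESC_FMT, 0x1000, 64, VRING_DESC_F_NEXT, 1)
         + struct.pack(DESC_FMT, 0x2000, 128, VRING_DESC_F_WRITE, 0))
iov, total = to_iov(fetch_descs(table, 0))
print(iov, total)
```

Because the second stage already sees every length, the total comes for
free and nothing has to rescan the iov afterwards.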

This seems to perform exactly the same as the original
code based on a microbenchmark.
Lightly tested.
More testing would be very much appreciated.

changes from v7:
	- squashed in fixes. no longer hangs but still known
	  to cause data corruption for some people. under debug.

changes from v6:
	- fixes some bugs introduced in v6 and v5

changes from v5:
	- addressed comments by Jason: squashed API changes, fixed up discard

changes from v4:
	- added used descriptor format independence
	- addressed comments by Jason
	- fixed a crash detected by the lkp robot.

changes from v3:
	- fixed error handling in case of indirect descriptors
	- add BUG_ON to detect buffer overflow in case of bugs
		in response to comment by Jason Wang
	- minor code tweaks

changes from v2:
	- fixed indirect descriptor batching
		reported by Jason Wang

changes from v1:
	- typo fixes


Michael S. Tsirkin (11):
  vhost: option to fetch descriptors through an independent struct
  vhost: use batched get_vq_desc version
  vhost/net: pass net specific struct pointer
  vhost: reorder functions
  vhost: format-independent API for used buffers
  vhost/net: convert to new API: heads->bufs
  vhost/net: avoid iov length math
  vhost/test: convert to the buf API
  vhost/scsi: switch to buf APIs
  vhost/vsock: switch to the buf API
  vhost: drop head based APIs

 drivers/vhost/net.c   | 174 +++++++++----------
 drivers/vhost/scsi.c  |  73 ++++----
 drivers/vhost/test.c  |  22 +--
 drivers/vhost/vhost.c | 378 +++++++++++++++++++++++++++---------------
 drivers/vhost/vhost.h |  44 +++--
 drivers/vhost/vsock.c |  30 ++--
 6 files changed, 439 insertions(+), 282 deletions(-)

Comments

Eugenio Perez Martin July 9, 2020, 4:46 p.m. UTC | #1
On Wed, Jul 1, 2020 at 4:10 PM Jason Wang <jasowang@redhat.com> wrote:
>

>

> On 2020/7/1 9:04 PM, Eugenio Perez Martin wrote:

> > On Wed, Jul 1, 2020 at 2:40 PM Jason Wang <jasowang@redhat.com> wrote:

> >>

> >> On 2020/7/1 6:43 PM, Eugenio Perez Martin wrote:

> >>> On Tue, Jun 23, 2020 at 6:15 PM Eugenio Perez Martin

> >>> <eperezma@redhat.com> wrote:

> >>>> On Mon, Jun 22, 2020 at 6:29 PM Michael S. Tsirkin <mst@redhat.com> wrote:

> >>>>> On Mon, Jun 22, 2020 at 06:11:21PM +0200, Eugenio Perez Martin wrote:

> >>>>>> On Mon, Jun 22, 2020 at 5:55 PM Michael S. Tsirkin <mst@redhat.com> wrote:

> >>>>>>> On Fri, Jun 19, 2020 at 08:07:57PM +0200, Eugenio Perez Martin wrote:

> >>>>>>>> On Mon, Jun 15, 2020 at 2:28 PM Eugenio Perez Martin

> >>>>>>>> <eperezma@redhat.com> wrote:

> >>>>>>>>> On Thu, Jun 11, 2020 at 5:22 PM Konrad Rzeszutek Wilk

> >>>>>>>>> <konrad.wilk@oracle.com> wrote:

> >>>>>>>>>> On Thu, Jun 11, 2020 at 07:34:19AM -0400, Michael S. Tsirkin wrote:

> >>>>>>>>>>> As testing shows no performance change, switch to that now.

> >>>>>>>>>> What kind of testing? 100GiB? Low latency?

> >>>>>>>>>>

> >>>>>>>>> Hi Konrad.

> >>>>>>>>>

> >>>>>>>>> I tested this version of the patch:

> >>>>>>>>> https://lkml.org/lkml/2019/10/13/42

> >>>>>>>>>

> >>>>>>>>> It was tested for throughput with DPDK's testpmd (as described in

> >>>>>>>>> http://doc.dpdk.org/guides/howto/virtio_user_as_exceptional_path.html)

> >>>>>>>>> and kernel pktgen. No latency tests were performed by me. Maybe it is

> >>>>>>>>> interesting to perform a latency test or just a different set of tests

> >>>>>>>>> over a recent version.

> >>>>>>>>>

> >>>>>>>>> Thanks!

> >>>>>>>> I have repeated the tests with v9, and results are a little bit different:

> >>>>>>>> * If I test opening it with testpmd, I see no change between versions

> >>>>>>> OK that is testpmd on guest, right? And vhost-net on the host?

> >>>>>>>

> >>>>>> Hi Michael.

> >>>>>>

> >>>>>> No, sorry, as described in

> >>>>>> http://doc.dpdk.org/guides/howto/virtio_user_as_exceptional_path.html.

> >>>>>> But I could add to test it in the guest too.

> >>>>>>

> >>>>>> These kinds of raw packets "bursts" do not show performance

> >>>>>> differences, but I could test deeper if you think it would be worth

> >>>>>> it.

> >>>>> Oh ok, so this is without guest, with virtio-user.

> >>>>> It might be worth checking dpdk within guest too just

> >>>>> as another data point.

> >>>>>

> >>>> Ok, I will do it!

> >>>>

> >>>>>>>> * If I forward packets between two vhost-net interfaces in the guest

> >>>>>>>> using a linux bridge in the host:

> >>>>>>> And here I guess you mean virtio-net in the guest kernel?

> >>>>>> Yes, sorry: Two virtio-net interfaces connected with a linux bridge in

> >>>>>> the host. More precisely:

> >>>>>> * Adding one of the interfaces to another namespace, assigning it an

> >>>>>> IP, and starting netserver there.

> >>>>>> * Assign another IP in the range manually to the other virtual net

> >>>>>> interface, and start the desired test there.

> >>>>>>

> >>>>>> If you think it would be better to perform then differently please let me know.

> >>>>> Not sure why you bother with namespaces since you said you are

> >>>>> using L2 bridging. I guess it's unimportant.

> >>>>>

> >>>> Sorry, I think I should have provided more context about that.

> >>>>

> >>>> The only reason to use namespaces is to force the traffic of these

> >>>> netperf tests to go through the external bridge. To test netperf

> >>>> different possibilities than the testpmd (or pktgen or others "blast

> >>>> of frames unconditionally" tests).

> >>>>

> >>>> This way, I make sure that is the same version of everything in the

> >>>> guest, and is a little bit easier to manage cpu affinity, start and

> >>>> stop testing...

> >>>>

> >>>> I could use a different VM for sending and receiving, but I find this

> >>>> way a faster one and it should not introduce a lot of noise. I can

> >>>> test with two VM if you think that this use of network namespace

> >>>> introduces too much noise.

> >>>>

> >>>> Thanks!

> >>>>

> >>>>>>>>     - netperf UDP_STREAM shows a performance increase of 1.8, almost

> >>>>>>>> doubling performance. This gets lower as frame size increase.

> >>> Regarding UDP_STREAM:

> >>> * with event_idx=on: The performance difference is reduced a lot if

> >>> applied affinity properly (manually assigning CPU on host/guest and

> >>> setting IRQs on guest), making them perform equally with and without

> >>> the patch again. Maybe the batching makes the scheduler perform

> >>> better.

> >>

> >> Note that for UDP_STREAM, the result is pretty trick to be analyzed. E.g

> >> setting a sndbuf for TAP may help for the performance (reduce the drop).

> >>

> > Ok, will add that to the test. Thanks!

>

>

> Actually, it's better to skip the UDP_STREAM test since:

>

> - My understanding is very few application is using raw UDP stream

> - It's hard to analyze (usually you need to count the drop ratio etc)

>

>

> >

> >>>>>>>>     - rests of the test goes noticeably worse: UDP_RR goes from ~6347

> >>>>>>>> transactions/sec to 5830

> >>> * Regarding UDP_RR, TCP_STREAM, and TCP_RR, proper CPU pinning makes

> >>> them perform similarly again, only a very small performance drop

> >>> observed. It could be just noise.

> >>> ** All of them perform better than vanilla if event_idx=off, not sure

> >>> why. I can try to repeat them if you suspect that can be a test

> >>> failure.

> >>>

> >>> * With testpmd and event_idx=off, if I send from the VM to host, I see

> >>> a performance increment especially in small packets. The buf api also

> >>> increases performance compared with only batching: Sending the minimum

> >>> packet size in testpmd makes pps go from 356kpps to 473 kpps.

> >>

> >> What's your setup for this. The number looks rather low. I'd expected

> >> 1-2 Mpps at least.

> >>

> > Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz, 2 NUMA nodes of 16G memory

> > each, and no device assigned to the NUMA node I'm testing in. Too low

> > for testpmd AF_PACKET driver too?

>

>

> I don't test AF_PACKET, I guess it should use the V3 which mmap based

> zerocopy interface.

>

> And it might worth to check the cpu utilization of vhost thread. It's

> required to stress it as 100% otherwise there could be a bottleneck

> somewhere.

>

>

> >

> >>> Sending

> >>> 1024 length UDP-PDU makes it go from 570kpps to 64 kpps.

> >>>

> >>> Something strange I observe in these tests: I get more pps the bigger

> >>> the transmitted buffer size is. Not sure why.

> >>>

> >>> ** Sending from the host to the VM does not make a big change with the

> >>> patches in small packets scenario (minimum, 64 bytes, about 645

> >>> without the patch, ~625 with batch and batch+buf api). If the packets

> >>> are bigger, I can see a performance increase: with 256 bits,

> >>

> >> I think you meant bytes?

> >>

> > Yes, sorry.

> >

> >>>    it goes

> >>> from 590kpps to about 600kpps, and in case of 1500 bytes payload it

> >>> gets from 348kpps to 528kpps, so it is clearly an improvement.

> >>>

> >>> * with testpmd and event_idx=on, batching+buf api perform similarly in

> >>> both directions.

> >>>

> >>> All of testpmd tests were performed with no linux bridge, just a

> >>> host's tap interface (<interface type='ethernet'> in xml),

> >>

> >> What DPDK driver did you use in the test (AF_PACKET?).

> >>

> > Yes, both testpmd are using AF_PACKET driver.

>

>

> I see, using AF_PACKET means extra layers of issues need to be analyzed

> which is probably not good.

>

>

> >

> >>> with a

> >>> testpmd txonly and another in rxonly forward mode, and using the

> >>> receiving side packets/bytes data. Guest's rps, xps and interrupts,

> >>> and host's vhost threads affinity were also tuned in each test to

> >>> schedule both testpmd and vhost in different processors.

> >>

> >> My feeling is that if we start from simple setup, it would be more

> >> easier as a start. E.g start without an VM.

> >>

> >> 1) TX: testpmd(txonly) -> virtio-user -> vhost_net -> XDP_DROP on TAP

> >> 2) RX: pktgen -> TAP -> vhost_net -> testpmd(rxonly)

> >>

> > Got it. Is there a reason to prefer pktgen over testpmd?

>

>

> I think the reason is using testpmd you must use a userspace kernel

> interface (AF_PACKET), and it could not be as fast as pktgen since:

>

> - it talks directly to xmit of TAP

> - skb can be cloned

>


Hi!

Here are the results of the tests. Details are in [1].

Tx:
===

For Tx it seems that the batching patch makes things a little
bit worse, but the buf_api outperforms the baseline by about 8%:

* We start with a baseline of 4208772.571 pps and 269361444.6 bytes/s [2].
* When we add the batching, I see a small performance decrease:
4133292.308 pps and 264530707.7 bytes/s.
* However, the buf api outperforms the baseline: 4551319.631 pps,
291205178.1 bytes/s.
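
For reference, the relative changes behind the Tx bullets can be
recomputed from the pps figures quoted above. This is just arithmetic on
those numbers; pct_change is a helper name made up for the example. It
gives roughly -1.8% for batching and +8.1% for the buf api.

```python
def pct_change(new, base):
    """Relative change of new versus base, in percent."""
    return (new - base) / base * 100.0


# Tx packet rates in pps, copied from the bullets above.
tx_pps = {
    "baseline": 4208772.571,
    "batch": 4133292.308,
    "buf_api": 4551319.631,
}

tx_change = {name: pct_change(pps, tx_pps["baseline"])
             for name, pps in tx_pps.items()}

for name, delta in tx_change.items():
    print(f"{name}: {delta:+.2f}%")
```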

I don't have numbers on the receiver side since it is just an XDP_DROP.
I think it would be interesting to see them.

Rx:
===

For Rx the reverse holds: batching gives a small performance
increase (~2%), while buf_api performs about the same as the
baseline.

pktgen was called using pktgen_sample01_simple.sh, with the environment:
DEV="$tap_name" F_THREAD=1 DST_MAC=$MAC_ADDR COUNT=$((2500000*25))
SKB_CLONE=$((2**31))

And testpmd is the same as Tx but with forward-mode=rxonly.

Pktgen reports:
Baseline: 1853025pps 622Mb/sec (622616400bps) errors: 7915231
Batch: 1891404pps 635Mb/sec (635511744bps) errors: 4926093
Buf_api: 1844008pps 619Mb/sec (619586688bps) errors: 47766692

Testpmd reports:
Baseline: 1854448pps, 860464156 bps. [3]
Batch: 1892844.25pps, 878280070bps.
Buf_api: 1846139.75pps, 856609120bps.
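
As a cross-check of the Rx figures, the sketch below recomputes the
change versus baseline from both reports above. The dictionary layout is
only for illustration; the numbers are the ones quoted. Both sources
agree on about +2.1% for batching and about -0.5% for buf_api.

```python
# Rx packet rates in pps, copied from the two reports above.
rx_pps = {
    "pktgen": {"baseline": 1853025, "batch": 1891404, "buf_api": 1844008},
    "testpmd": {"baseline": 1854448, "batch": 1892844.25,
                "buf_api": 1846139.75},
}

rx_change = {}
for source, rates in rx_pps.items():
    base = rates["baseline"]
    # Percent change of each variant relative to that source's baseline.
    rx_change[source] = {name: (pps - base) / base * 100.0
                         for name, pps in rates.items() if name != "baseline"}

for source, deltas in rx_change.items():
    for name, delta in deltas.items():
        print(f"{source} {name}: {delta:+.2f}%")
```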

Any thoughts?

Thanks!

[1]
Testpmd options: -l 1,3
--vdev=virtio_user0,mac=01:02:03:04:05:06,path=/dev/vhost-net,queue_size=1024
-- --auto-start --stats-period 5 --tx-offloads="$TX_OFFLOADS"
--rx-offloads="$RX_OFFLOADS" --txd=4096 --rxd=4096 --burst=512
--forward-mode=txonly

Where the offloads were obtained by manually running with
--[tr]x-offloads=0x8fff and examining testpmd's response:
declare -r RX_OFFLOADS=0x81d
declare -r TX_OFFLOADS=0x802d

All of the test results are an average of at least 3 samples of
testpmd, discarding the obvious deviations at start/end (like warming
up or waiting for pktgen to start). The pktgen results are copied and
pasted directly from its output.

The numbers do not change very much from one testpmd stats print to
the next.

[2] Obtained by subtracting the accumulated tx-packets of each stats
print from those of the next one. If we look at testpmd's own Tx-pps
output instead, it reports somewhat lower performance, but it follows
the same pattern:

Testpmd pps/bps stats:
Baseline: 3510826.25 pps, 1797887912bps = 224735989bytes/sec
Batch: 3448515.571pps, 1765640226bps = 220705028.3bytes/sec
Buf api: 3794115.333pps, 1942587286bps = 242823410.8bytes/sec

[3] This is obtained using the rx-pps/rx-bps report of testpmd.

It seems strange to me that the ratio between pps and bps is ~336 this
time, while the ratio between accumulated packets and accumulated bytes
is ~58. Also, the relation between them is not even close to 8.

However, testpmd shows a lot of absolute packets received. If we take
the packets received in a period by subtracting the previous counter,
testpmd reports receiving more pps than pktgen's tx-pps:
Baseline: ~2222668.667pps 128914784.3bps.
Batch: 2269260.933pps, 131617134.9bps
Buf_api: 2213226.467pps, 128367135.9bps
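
To make the ratios discussed in [3] concrete, the sketch below divides
the reported rate by the reported pps for the three baseline Rx figures
quoted in this message. It comes out at roughly 336 for pktgen, 464 for
the testpmd rx-pps/rx-bps report and 58 for the accumulated deltas. Note
that 464 / 8 = 58, which would fit the accumulated figure being bytes
rather than bits, but that reading is an assumption and not something
the reports confirm.

```python
# (pps, reported rate) for the baseline Rx run, copied from this message.
reports = {
    "pktgen": (1853025, 622616400),
    "testpmd rx-pps/rx-bps": (1854448, 860464156),
    "testpmd accumulated deltas": (2222668.667, 128914784.3),
}

# Rate units per packet for each report.
ratios = {name: rate / pps for name, (pps, rate) in reports.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f} per packet")
```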
Michael S. Tsirkin July 9, 2020, 5:37 p.m. UTC | #2
How about playing with the batch size? Make it a mod parameter instead
of the hard coded 64, and measure for all values 1 to 64 ...
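
A possible shape for such a sweep, as a sketch only: measure_pps is a
hypothetical hook that would set the batch size (e.g. through whatever
module parameter ends up being added), run the test and return pps. The
fake_measure function and its numbers are synthetic and exist only so
the example runs; they are not measurements.

```python
from statistics import mean


def sweep_batch_sizes(measure_pps, sizes=range(1, 65), samples=3):
    """Average several samples of measure_pps(size) for every batch size."""
    return {size: mean(measure_pps(size) for _ in range(samples))
            for size in sizes}


def best_batch_size(results):
    """Smallest batch size that reaches the highest average pps."""
    return max(results, key=results.get)


def fake_measure(batch_size):
    # Synthetic stand-in, NOT real data: gains flatten out at 16.
    return 1_000_000 + 1000 * min(batch_size, 16)


results = sweep_batch_sizes(fake_measure)
print(best_batch_size(results), results[best_batch_size(results)])
```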
Jason Wang July 10, 2020, 3:56 a.m. UTC | #3
On 2020/7/10 1:37 AM, Michael S. Tsirkin wrote:
> On Thu, Jul 09, 2020 at 06:46:13PM +0200, Eugenio Perez Martin wrote:

>> On Wed, Jul 1, 2020 at 4:10 PM Jason Wang <jasowang@redhat.com> wrote:

>>>

>>> On 2020/7/1 9:04 PM, Eugenio Perez Martin wrote:

>>>> On Wed, Jul 1, 2020 at 2:40 PM Jason Wang <jasowang@redhat.com> wrote:

>>>>> On 2020/7/1 6:43 PM, Eugenio Perez Martin wrote:

>>>>>> On Tue, Jun 23, 2020 at 6:15 PM Eugenio Perez Martin

>>>>>> <eperezma@redhat.com> wrote:

>>>>>>> On Mon, Jun 22, 2020 at 6:29 PM Michael S. Tsirkin <mst@redhat.com> wrote:

>>>>>>>> On Mon, Jun 22, 2020 at 06:11:21PM +0200, Eugenio Perez Martin wrote:

>>>>>>>>> On Mon, Jun 22, 2020 at 5:55 PM Michael S. Tsirkin <mst@redhat.com> wrote:

>>>>>>>>>> On Fri, Jun 19, 2020 at 08:07:57PM +0200, Eugenio Perez Martin wrote:

>>>>>>>>>>> On Mon, Jun 15, 2020 at 2:28 PM Eugenio Perez Martin

>>>>>>>>>>> <eperezma@redhat.com> wrote:

>>>>>>>>>>>> On Thu, Jun 11, 2020 at 5:22 PM Konrad Rzeszutek Wilk

>>>>>>>>>>>> <konrad.wilk@oracle.com> wrote:

>>>>>>>>>>>>> On Thu, Jun 11, 2020 at 07:34:19AM -0400, Michael S. Tsirkin wrote:

>>>>>>>>>>>>>> As testing shows no performance change, switch to that now.

>>>>>>>>>>>>> What kind of testing? 100GiB? Low latency?

>>>>>>>>>>>>>

>>>>>>>>>>>> Hi Konrad.

>>>>>>>>>>>>

>>>>>>>>>>>> I tested this version of the patch:

>>>>>>>>>>>> https://lkml.org/lkml/2019/10/13/42

>>>>>>>>>>>>

>>>>>>>>>>>> It was tested for throughput with DPDK's testpmd (as described in

>>>>>>>>>>>> http://doc.dpdk.org/guides/howto/virtio_user_as_exceptional_path.html)

>>>>>>>>>>>> and kernel pktgen. No latency tests were performed by me. Maybe it is

>>>>>>>>>>>> interesting to perform a latency test or just a different set of tests

>>>>>>>>>>>> over a recent version.

>>>>>>>>>>>>

>>>>>>>>>>>> Thanks!

>>>>>>>>>>> I have repeated the tests with v9, and results are a little bit different:

>>>>>>>>>>> * If I test opening it with testpmd, I see no change between versions

>>>>>>>>>> OK that is testpmd on guest, right? And vhost-net on the host?

>>>>>>>>>>

>>>>>>>>> Hi Michael.

>>>>>>>>>

>>>>>>>>> No, sorry, as described in

>>>>>>>>> http://doc.dpdk.org/guides/howto/virtio_user_as_exceptional_path.html.

>>>>>>>>> But I could add to test it in the guest too.

>>>>>>>>>

>>>>>>>>> These kinds of raw packets "bursts" do not show performance

>>>>>>>>> differences, but I could test deeper if you think it would be worth

>>>>>>>>> it.

>>>>>>>> Oh ok, so this is without guest, with virtio-user.

>>>>>>>> It might be worth checking dpdk within guest too just

>>>>>>>> as another data point.

>>>>>>>>

>>>>>>> Ok, I will do it!

>>>>>>>

>>>>>>>>>>> * If I forward packets between two vhost-net interfaces in the guest

>>>>>>>>>>> using a linux bridge in the host:

>>>>>>>>>> And here I guess you mean virtio-net in the guest kernel?

>>>>>>>>> Yes, sorry: Two virtio-net interfaces connected with a linux bridge in

>>>>>>>>> the host. More precisely:

>>>>>>>>> * Adding one of the interfaces to another namespace, assigning it an

>>>>>>>>> IP, and starting netserver there.

>>>>>>>>> * Assign another IP in the range manually to the other virtual net

>>>>>>>>> interface, and start the desired test there.

>>>>>>>>>

>>>>>>>>> If you think it would be better to perform then differently please let me know.

>>>>>>>> Not sure why you bother with namespaces since you said you are

>>>>>>>> using L2 bridging. I guess it's unimportant.

>>>>>>>>

>>>>>>> Sorry, I think I should have provided more context about that.

>>>>>>>

>>>>>>> The only reason to use namespaces is to force the traffic of these

>>>>>>> netperf tests to go through the external bridge. To test netperf

>>>>>>> different possibilities than the testpmd (or pktgen or others "blast

>>>>>>> of frames unconditionally" tests).

>>>>>>>

>>>>>>> This way, I make sure that is the same version of everything in the

>>>>>>> guest, and is a little bit easier to manage cpu affinity, start and

>>>>>>> stop testing...

>>>>>>>

>>>>>>> I could use a different VM for sending and receiving, but I find this

>>>>>>> way a faster one and it should not introduce a lot of noise. I can

>>>>>>> test with two VM if you think that this use of network namespace

>>>>>>> introduces too much noise.

>>>>>>>

>>>>>>> Thanks!

>>>>>>>

>>>>>>>>>>>      - netperf UDP_STREAM shows a performance increase of 1.8, almost

>>>>>>>>>>> doubling performance. This gets lower as frame size increase.

>>>>>> Regarding UDP_STREAM:

>>>>>> * with event_idx=on: The performance difference is reduced a lot if

>>>>>> applied affinity properly (manually assigning CPU on host/guest and

>>>>>> setting IRQs on guest), making them perform equally with and without

>>>>>> the patch again. Maybe the batching makes the scheduler perform

>>>>>> better.

>>>>> Note that for UDP_STREAM, the result is pretty trick to be analyzed. E.g

>>>>> setting a sndbuf for TAP may help for the performance (reduce the drop).

>>>>>

>>>> Ok, will add that to the test. Thanks!

>>>

>>> Actually, it's better to skip the UDP_STREAM test since:

>>>

>>> - My understanding is that very few applications use raw UDP streams

>>> - It's hard to analyze (usually you need to count the drop ratio etc)

>>>

>>>

>>>>>>>>>>>      - rests of the test goes noticeably worse: UDP_RR goes from ~6347

>>>>>>>>>>> transactions/sec to 5830

>>>>>> * Regarding UDP_RR, TCP_STREAM, and TCP_RR, proper CPU pinning makes

>>>>>> them perform similarly again, only a very small performance drop

>>>>>> observed. It could be just noise.

>>>>>> ** All of them perform better than vanilla if event_idx=off, not sure

>>>>>> why. I can try to repeat them if you suspect that can be a test

>>>>>> failure.

>>>>>>

>>>>>> * With testpmd and event_idx=off, if I send from the VM to host, I see

>>>>>> a performance increment especially in small packets. The buf api also

>>>>>> increases performance compared with only batching: Sending the minimum

>>>>>> packet size in testpmd makes pps go from 356kpps to 473 kpps.

>>>>> What's your setup for this. The number looks rather low. I'd expected

>>>>> 1-2 Mpps at least.

>>>>>

>>>> Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz, 2 NUMA nodes of 16G memory

>>>> each, and no device assigned to the NUMA node I'm testing in. Too low

>>>> for testpmd AF_PACKET driver too?

>>>

>>> I don't test AF_PACKET, I guess it should use the V3 which mmap based

>>> zerocopy interface.

>>>

>>> And it might worth to check the cpu utilization of vhost thread. It's

>>> required to stress it as 100% otherwise there could be a bottleneck

>>> somewhere.

>>>

>>>

>>>>>> Sending

>>>>>> 1024 length UDP-PDU makes it go from 570kpps to 64 kpps.

>>>>>>

>>>>>> Something strange I observe in these tests: I get more pps the bigger

>>>>>> the transmitted buffer size is. Not sure why.

>>>>>>

>>>>>> ** Sending from the host to the VM does not make a big change with the

>>>>>> patches in small packets scenario (minimum, 64 bytes, about 645

>>>>>> without the patch, ~625 with batch and batch+buf api). If the packets

>>>>>> are bigger, I can see a performance increase: with 256 bits,

>>>>> I think you meant bytes?

>>>>>

>>>> Yes, sorry.

>>>>

>>>>>>     it goes

>>>>>> from 590kpps to about 600kpps, and in case of 1500 bytes payload it

>>>>>> gets from 348kpps to 528kpps, so it is clearly an improvement.

>>>>>>

>>>>>> * with testpmd and event_idx=on, batching+buf api perform similarly in

>>>>>> both directions.

>>>>>>

>>>>>> All of testpmd tests were performed with no linux bridge, just a

>>>>>> host's tap interface (<interface type='ethernet'> in xml),

>>>>> What DPDK driver did you use in the test (AF_PACKET?).

>>>>>

>>>> Yes, both testpmd are using AF_PACKET driver.

>>>

>>> I see, using AF_PACKET means extra layers of issues need to be analyzed

>>> which is probably not good.

>>>

>>>

>>>>>> with a

>>>>>> testpmd txonly and another in rxonly forward mode, and using the

>>>>>> receiving side packets/bytes data. Guest's rps, xps and interrupts,

>>>>>> and host's vhost threads affinity were also tuned in each test to

>>>>>> schedule both testpmd and vhost in different processors.

>>>>> My feeling is that if we start from simple setup, it would be more

>>>>> easier as a start. E.g start without an VM.

>>>>>

>>>>> 1) TX: testpmd(txonly) -> virtio-user -> vhost_net -> XDP_DROP on TAP

>>>>> 2) RX: pktgen -> TAP -> vhost_net -> testpmd(rxonly)

>>>>>

>>>> Got it. Is there a reason to prefer pktgen over testpmd?

>>>

>>> I think the reason is using testpmd you must use a userspace kernel

>>> interface (AF_PACKET), and it could not be as fast as pktgen since:

>>>

>>> - it talks directly to xmit of TAP

>>> - skb can be cloned

>>>

>> Hi!

>>

>> Here it is the result of the tests. Details on [1].

>>

>> Tx:

>> ===

>>

>> For tx packets it seems that the batching patch makes things a little

>> bit worse, but the buf_api outperforms baseline by a 7%:

>>

>> * We start with a baseline of 4208772.571 pps and 269361444.6 bytes/s [2].

>> * When we add the batching, I see a small performance decrease:

>> 4133292.308 and 264530707.7 bytes/s.

>> * However, the buf api outperforms the baseline: 4551319.631pps,

>> 291205178.1 bytes/s

>>

>> I don't have numbers on the receiver side since it is just a XDP_DROP.

>> I think it would be interesting to see them.

>>

>> Rx:

>> ===

>>

>> Regarding Rx, the reverse is observed: a small performance increase is

>> observed with batching (~2%), but buf_api makes tests perform equally

>> to baseline.

>>

>> pktgen was called using pktgen_sample01_simple.sh, with the environment:

>> DEV="$tap_name" F_THREAD=1 DST_MAC=$MAC_ADDR COUNT=$((2500000*25))

>> SKB_CLONE=$((2**31))

>>

>> And testpmd is the same as Tx but with forward-mode=rxonly.

>>

>> Pktgen reports:

>> Baseline: 1853025pps 622Mb/sec (622616400bps) errors: 7915231

>> Batch: 1891404pps 635Mb/sec (635511744bps) errors: 4926093

>> Buf_api: 1844008pps 619Mb/sec (619586688bps) errors: 47766692

>>

>> Testpmd reports:

>> Baseline: 1854448pps, 860464156 bps. [3]

>> Batch: 1892844.25pps, 878280070bps.

>> Buf_api: 1846139.75pps, 856609120bps.

>>

>> Any thoughts?

>>

>> Thanks!

>>

>> [1]

>> Testpmd options: -l 1,3

>> --vdev=virtio_user0,mac=01:02:03:04:05:06,path=/dev/vhost-net,queue_size=1024

>> -- --auto-start --stats-period 5 --tx-offloads="$TX_OFFLOADS"

>> --rx-offloads="$RX_OFFLOADS" --txd=4096 --rxd=4096 --burst=512

>> --forward-mode=txonly

>>

>> Where offloads were obtained manually running with

>> --[tr]x-offloads=0x8fff and examining testpmd response:

>> declare -r RX_OFFLOADS=0x81d

>> declare -r TX_OFFLOADS=0x802d

>>

>> All of the tests results are an average of at least 3 samples of

>> testpmd, discarding the obvious deviations at start/end (like warming

>> up or waiting for pktgen to start). The result of pktgen is directly

>> c&p from its output.

>>

>> The numbers do not change very much from one stats printing to another

>> of testpmd.

>>

>> [2] Obtained subtracting each accumulated tx-packets from one stats

>> print to the previous one. If we attend testpmd output about Tx-pps,

>> it counts a little bit less performance, but it follows the same

>> pattern:

>>

>> Testpmd pps/bps stats:

>> Baseline: 3510826.25 pps, 1797887912bps = 224735989bytes/sec

>> Batch: 3448515.571pps, 1765640226bps = 220705028.3bytes/sec

>> Buf api: 3794115.333pps, 1942587286bps = 242823410.8bytes/sec

>>

>> [3] This is obtained using the rx-pps/rx-bps report of testpmd.

>>

>> Seems strange to me that the relation between pps/bps is ~336 this

>> time, and between accumulated pkts/accumulated bytes is ~58. Also, the

>> relation between them is not even close to 8.

>>

>> However, testpmd shows a lot of absolute packets received. If we see

>> the received packets in a period subtracting from the previous one,

>> testpmd tells that receive more pps than pktgen tx-pps:

>> Baseline: ~2222668.667pps 128914784.3bps.

>> Batch: 2269260.933pps, 131617134.9bps

>> Buf_api: 2213226.467pps, 128367135.9bps

> How about playing with the batch size? Make it a mod parameter instead

> of the hard coded 64, and measure for all values 1 to 64 ...



Right, according to the test result, 64 seems to be too aggressive in 
the case of TX.

It might also be worth checking:

1) Whether the vhost thread is at 100% CPU utilization; if not,
there is a bottleneck elsewhere
2) For the RX test, that the pktgen kthread is running on the same NUMA
node as virtio-user
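
For the second check, a minimal sketch of how one could confirm that two CPUs sit on the same NUMA node, assuming a Linux host that exposes /sys/devices/system/node/nodeN/cpulist. The helper names (parse_cpulist, numa_node_of_cpu, same_numa_node) are made up for this illustration and are not part of pktgen or DPDK.

```python
from pathlib import Path
from typing import Optional, Set


def parse_cpulist(text: str) -> Set[int]:
    """Parse a kernel cpulist string such as "0-3,8,10-11" into a set of CPU ids."""
    cpus: Set[int] = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def numa_node_of_cpu(cpu: int,
                     sysfs_root: str = "/sys/devices/system/node") -> Optional[int]:
    """Return the NUMA node whose cpulist contains `cpu`, or None if unknown."""
    for node_dir in Path(sysfs_root).glob("node[0-9]*"):
        cpulist_file = node_dir / "cpulist"
        if not cpulist_file.is_file():
            continue
        if cpu in parse_cpulist(cpulist_file.read_text()):
            return int(node_dir.name[len("node"):])
    return None


def same_numa_node(cpu_a: int, cpu_b: int) -> bool:
    """True only if both CPUs are found and belong to the same NUMA node."""
    node_a = numa_node_of_cpu(cpu_a)
    node_b = numa_node_of_cpu(cpu_b)
    return node_a is not None and node_a == node_b
```

With the testpmd options used in this thread (-l 1,3), something like same_numa_node(1, 3) would tell whether the two lcores share a node; the same call can be used for the CPU that the pktgen kthread runs on.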

Thanks


>
Eugenio Perez Martin July 10, 2020, 5:39 a.m. UTC | #4
On Fri, Jul 10, 2020 at 5:56 AM Jason Wang <jasowang@redhat.com> wrote:
>

>

> On 2020/7/10 1:37 AM, Michael S. Tsirkin wrote:

> > On Thu, Jul 09, 2020 at 06:46:13PM +0200, Eugenio Perez Martin wrote:

> >> On Wed, Jul 1, 2020 at 4:10 PM Jason Wang <jasowang@redhat.com> wrote:

> >>>

> >>> On 2020/7/1 9:04 PM, Eugenio Perez Martin wrote:

> >>>> On Wed, Jul 1, 2020 at 2:40 PM Jason Wang <jasowang@redhat.com> wrote:

> >>>>> On 2020/7/1 6:43 PM, Eugenio Perez Martin wrote:

> >>>>>> On Tue, Jun 23, 2020 at 6:15 PM Eugenio Perez Martin

> >>>>>> <eperezma@redhat.com> wrote:

> >>>>>>> On Mon, Jun 22, 2020 at 6:29 PM Michael S. Tsirkin <mst@redhat.com> wrote:

> >>>>>>>> On Mon, Jun 22, 2020 at 06:11:21PM +0200, Eugenio Perez Martin wrote:

> >>>>>>>>> On Mon, Jun 22, 2020 at 5:55 PM Michael S. Tsirkin <mst@redhat.com> wrote:

> >>>>>>>>>> On Fri, Jun 19, 2020 at 08:07:57PM +0200, Eugenio Perez Martin wrote:

> >>>>>>>>>>> On Mon, Jun 15, 2020 at 2:28 PM Eugenio Perez Martin

> >>>>>>>>>>> <eperezma@redhat.com> wrote:

> >>>>>>>>>>>> On Thu, Jun 11, 2020 at 5:22 PM Konrad Rzeszutek Wilk

> >>>>>>>>>>>> <konrad.wilk@oracle.com> wrote:

> >>>>>>>>>>>>> On Thu, Jun 11, 2020 at 07:34:19AM -0400, Michael S. Tsirkin wrote:

> >>>>>>>>>>>>>> As testing shows no performance change, switch to that now.

> >>>>>>>>>>>>> What kind of testing? 100GiB? Low latency?

> >>>>>>>>>>>>>

> >>>>>>>>>>>> Hi Konrad.

> >>>>>>>>>>>>

> >>>>>>>>>>>> I tested this version of the patch:

> >>>>>>>>>>>> https://lkml.org/lkml/2019/10/13/42

> >>>>>>>>>>>>

> >>>>>>>>>>>> It was tested for throughput with DPDK's testpmd (as described in

> >>>>>>>>>>>> http://doc.dpdk.org/guides/howto/virtio_user_as_exceptional_path.html)

> >>>>>>>>>>>> and kernel pktgen. No latency tests were performed by me. Maybe it is

> >>>>>>>>>>>> interesting to perform a latency test or just a different set of tests

> >>>>>>>>>>>> over a recent version.

> >>>>>>>>>>>>

> >>>>>>>>>>>> Thanks!

> >>>>>>>>>>> I have repeated the tests with v9, and results are a little bit different:

> >>>>>>>>>>> * If I test opening it with testpmd, I see no change between versions

> >>>>>>>>>> OK that is testpmd on guest, right? And vhost-net on the host?

> >>>>>>>>>>

> >>>>>>>>> Hi Michael.

> >>>>>>>>>

> >>>>>>>>> No, sorry, as described in

> >>>>>>>>> http://doc.dpdk.org/guides/howto/virtio_user_as_exceptional_path.html.

> >>>>>>>>> But I could add to test it in the guest too.

> >>>>>>>>>

> >>>>>>>>> These kinds of raw packets "bursts" do not show performance

> >>>>>>>>> differences, but I could test deeper if you think it would be worth

> >>>>>>>>> it.

> >>>>>>>> Oh ok, so this is without guest, with virtio-user.

> >>>>>>>> It might be worth checking dpdk within guest too just

> >>>>>>>> as another data point.

> >>>>>>>>

> >>>>>>> Ok, I will do it!

> >>>>>>>

> >>>>>>>>>>> * If I forward packets between two vhost-net interfaces in the guest

> >>>>>>>>>>> using a linux bridge in the host:

> >>>>>>>>>> And here I guess you mean virtio-net in the guest kernel?

> >>>>>>>>> Yes, sorry: Two virtio-net interfaces connected with a linux bridge in

> >>>>>>>>> the host. More precisely:

> >>>>>>>>> * Adding one of the interfaces to another namespace, assigning it an

> >>>>>>>>> IP, and starting netserver there.

> >>>>>>>>> * Assign another IP in the range manually to the other virtual net

> >>>>>>>>> interface, and start the desired test there.

> >>>>>>>>>

> >>>>>>>>> If you think it would be better to perform them differently, please let me know.

> >>>>>>>> Not sure why you bother with namespaces since you said you are

> >>>>>>>> using L2 bridging. I guess it's unimportant.

> >>>>>>>>

> >>>>>>> Sorry, I think I should have provided more context about that.

> >>>>>>>

> >>>>>>> The only reason to use namespaces is to force the traffic of these

> >>>>>>> netperf tests to go through the external bridge. To test netperf

> >>>>>>> different possibilities than the testpmd (or pktgen or others "blast

> >>>>>>> of frames unconditionally" tests).

> >>>>>>>

> >>>>>>> This way, I make sure that is the same version of everything in the

> >>>>>>> guest, and is a little bit easier to manage cpu affinity, start and

> >>>>>>> stop testing...

> >>>>>>>

> >>>>>>> I could use a different VM for sending and receiving, but I find this

> >>>>>>> way a faster one and it should not introduce a lot of noise. I can

> >>>>>>> test with two VM if you think that this use of network namespace

> >>>>>>> introduces too much noise.

> >>>>>>>

> >>>>>>> Thanks!

> >>>>>>>

> >>>>>>>>>>>      - netperf UDP_STREAM shows a performance increase of 1.8, almost

> >>>>>>>>>>> doubling performance. This gets lower as frame size increase.

> >>>>>> Regarding UDP_STREAM:

> >>>>>> * with event_idx=on: The performance difference is reduced a lot if

> >>>>>> applied affinity properly (manually assigning CPU on host/guest and

> >>>>>> setting IRQs on guest), making them perform equally with and without

> >>>>>> the patch again. Maybe the batching makes the scheduler perform

> >>>>>> better.

> >>>>> Note that for UDP_STREAM, the result is pretty tricky to analyze. E.g.

> >>>>> setting a sndbuf for TAP may help for the performance (reduce the drop).

> >>>>>

> >>>> Ok, will add that to the test. Thanks!

> >>>

> >>> Actually, it's better to skip the UDP_STREAM test since:

> >>>

> >>> - My understanding is that very few applications use raw UDP streams

> >>> - It's hard to analyze (usually you need to count the drop ratio etc)

> >>>

> >>>

> >>>>>>>>>>>      - rests of the test goes noticeably worse: UDP_RR goes from ~6347

> >>>>>>>>>>> transactions/sec to 5830

> >>>>>> * Regarding UDP_RR, TCP_STREAM, and TCP_RR, proper CPU pinning makes

> >>>>>> them perform similarly again, only a very small performance drop

> >>>>>> observed. It could be just noise.

> >>>>>> ** All of them perform better than vanilla if event_idx=off, not sure

> >>>>>> why. I can try to repeat them if you suspect that can be a test

> >>>>>> failure.

> >>>>>>

> >>>>>> * With testpmd and event_idx=off, if I send from the VM to host, I see

> >>>>>> a performance increment especially in small packets. The buf api also

> >>>>>> increases performance compared with only batching: Sending the minimum

> >>>>>> packet size in testpmd makes pps go from 356kpps to 473 kpps.

> >>>>> What's your setup for this. The number looks rather low. I'd expected

> >>>>> 1-2 Mpps at least.

> >>>>>

> >>>> Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz, 2 NUMA nodes of 16G memory

> >>>> each, and no device assigned to the NUMA node I'm testing in. Too low

> >>>> for testpmd AF_PACKET driver too?

> >>>

> >>> I don't test AF_PACKET, I guess it should use the V3 which mmap based

> >>> zerocopy interface.

> >>>

> >>> And it might worth to check the cpu utilization of vhost thread. It's

> >>> required to stress it as 100% otherwise there could be a bottleneck

> >>> somewhere.

> >>>

> >>>

> >>>>>> Sending

> >>>>>> 1024 length UDP-PDU makes it go from 570kpps to 64 kpps.

> >>>>>>

> >>>>>> Something strange I observe in these tests: I get more pps the bigger

> >>>>>> the transmitted buffer size is. Not sure why.

> >>>>>>

> >>>>>> ** Sending from the host to the VM does not make a big change with the

> >>>>>> patches in small packets scenario (minimum, 64 bytes, about 645

> >>>>>> without the patch, ~625 with batch and batch+buf api). If the packets

> >>>>>> are bigger, I can see a performance increase: with 256 bits,

> >>>>> I think you meant bytes?

> >>>>>

> >>>> Yes, sorry.

> >>>>

> >>>>>>     it goes

> >>>>>> from 590kpps to about 600kpps, and in case of 1500 bytes payload it

> >>>>>> gets from 348kpps to 528kpps, so it is clearly an improvement.

> >>>>>>

> >>>>>> * with testpmd and event_idx=on, batching+buf api perform similarly in

> >>>>>> both directions.

> >>>>>>

> >>>>>> All of testpmd tests were performed with no linux bridge, just a

> >>>>>> host's tap interface (<interface type='ethernet'> in xml),

> >>>>> What DPDK driver did you use in the test (AF_PACKET?).

> >>>>>

> >>>> Yes, both testpmd are using AF_PACKET driver.

> >>>

> >>> I see, using AF_PACKET means extra layers of issues need to be analyzed

> >>> which is probably not good.

> >>>

> >>>

> >>>>>> with a

> >>>>>> testpmd txonly and another in rxonly forward mode, and using the

> >>>>>> receiving side packets/bytes data. Guest's rps, xps and interrupts,

> >>>>>> and host's vhost threads affinity were also tuned in each test to

> >>>>>> schedule both testpmd and vhost in different processors.

> >>>>> My feeling is that if we start from simple setup, it would be more

> >>>>> easier as a start. E.g start without an VM.

> >>>>>

> >>>>> 1) TX: testpmd(txonly) -> virtio-user -> vhost_net -> XDP_DROP on TAP

> >>>>> 2) RX: pktgen -> TAP -> vhost_net -> testpmd(rxonly)

> >>>>>

> >>>> Got it. Is there a reason to prefer pktgen over testpmd?

> >>>

> >>> I think the reason is using testpmd you must use a userspace kernel

> >>> interface (AF_PACKET), and it could not be as fast as pktgen since:

> >>>

> >>> - it talks directly to xmit of TAP

> >>> - skb can be cloned

> >>>

> >> Hi!

> >>

> >> Here it is the result of the tests. Details on [1].

> >>

> >> Tx:

> >> ===

> >>

> >> For tx packets it seems that the batching patch makes things a little

> >> bit worse, but the buf_api outperforms baseline by a 7%:

> >>

> >> * We start with a baseline of 4208772.571 pps and 269361444.6 bytes/s [2].

> >> * When we add the batching, I see a small performance decrease:

> >> 4133292.308 and 264530707.7 bytes/s.

> >> * However, the buf api outperforms the baseline: 4551319.631pps,

> >> 291205178.1 bytes/s

> >>

> >> I don't have numbers on the receiver side since it is just a XDP_DROP.

> >> I think it would be interesting to see them.

> >>

> >> Rx:

> >> ===

> >>

> >> Regarding Rx, the reverse is observed: a small performance increase is

> >> observed with batching (~2%), but buf_api makes tests perform equally

> >> to baseline.

> >>

> >> pktgen was called using pktgen_sample01_simple.sh, with the environment:

> >> DEV="$tap_name" F_THREAD=1 DST_MAC=$MAC_ADDR COUNT=$((2500000*25))

> >> SKB_CLONE=$((2**31))

> >>

> >> And testpmd is the same as Tx but with forward-mode=rxonly.

> >>

> >> Pktgen reports:

> >> Baseline: 1853025pps 622Mb/sec (622616400bps) errors: 7915231

> >> Batch: 1891404pps 635Mb/sec (635511744bps) errors: 4926093

> >> Buf_api: 1844008pps 619Mb/sec (619586688bps) errors: 47766692

> >>

> >> Testpmd reports:

> >> Baseline: 1854448pps, 860464156 bps. [3]

> >> Batch: 1892844.25pps, 878280070bps.

> >> Buf_api: 1846139.75pps, 856609120bps.

> >>

> >> Any thoughts?

> >>

> >> Thanks!

> >>

> >> [1]

> >> Testpmd options: -l 1,3

> >> --vdev=virtio_user0,mac=01:02:03:04:05:06,path=/dev/vhost-net,queue_size=1024

> >> -- --auto-start --stats-period 5 --tx-offloads="$TX_OFFLOADS"

> >> --rx-offloads="$RX_OFFLOADS" --txd=4096 --rxd=4096 --burst=512

> >> --forward-mode=txonly

> >>

> >> Where offloads were obtained manually running with

> >> --[tr]x-offloads=0x8fff and examining testpmd response:

> >> declare -r RX_OFFLOADS=0x81d

> >> declare -r TX_OFFLOADS=0x802d

> >>

> >> All of the tests results are an average of at least 3 samples of

> >> testpmd, discarding the obvious deviations at start/end (like warming

> >> up or waiting for pktgen to start). The result of pktgen is directly

> >> c&p from its output.

> >>

> >> The numbers do not change very much from one stats printing to another

> >> of testpmd.

> >>

> >> [2] Obtained subtracting each accumulated tx-packets from one stats

> >> print to the previous one. If we attend testpmd output about Tx-pps,

> >> it counts a little bit less performance, but it follows the same

> >> pattern:

> >>

> >> Testpmd pps/bps stats:

> >> Baseline: 3510826.25 pps, 1797887912bps = 224735989bytes/sec

> >> Batch: 3448515.571pps, 1765640226bps = 220705028.3bytes/sec

> >> Buf api: 3794115.333pps, 1942587286bps = 242823410.8bytes/sec

> >>

> >> [3] This is obtained using the rx-pps/rx-bps report of testpmd.

> >>

> >> Seems strange to me that the relation between pps/bps is ~336 this

> >> time, and between accumulated pkts/accumulated bytes is ~58. Also, the

> >> relation between them is not even close to 8.

> >>

> >> However, testpmd shows a lot of absolute packets received. If we see

> >> the received packets in a period subtracting from the previous one,

> >> testpmd tells that receive more pps than pktgen tx-pps:

> >> Baseline: ~2222668.667pps 128914784.3bps.

> >> Batch: 2269260.933pps, 131617134.9bps

> >> Buf_api: 2213226.467pps, 128367135.9bps

> > How about playing with the batch size? Make it a mod parameter instead

> > of the hard coded 64, and measure for all values 1 to 64 ...

>

>

> Right, according to the test result, 64 seems to be too aggressive in

> the case of TX.

>


Got it, thanks both!

> And it might also be worth to check:

>

> 1) Whether vhost thread is stressed as 100% CPU utilization, if not,

> there's bottleneck elsewhere


I forgot to check this, sorry. Will check in the next test.

> 2) For RX test, make sure pktgen kthread is running in the same NUMA

> node with virtio-user

>


pktgen is given one thread on lcore 1 (F_THREAD=1), which belongs to the
same NUMA node as testpmd. Actually, that is the testpmd master core, so it
would be a good idea to move it to another lcore of the same NUMA
node.

Is this enough for pktgen to allocate its memory on that NUMA node?
Since the script only writes parameters to /proc, I assume that running it
under numactl/taskset has no effect, and that pktgen will allocate
memory based on the lcore it is running on. Am I right?

Thanks!

> Thanks

>

>

> >

>
Michael S. Tsirkin July 10, 2020, 5:58 a.m. UTC | #5
On Fri, Jul 10, 2020 at 07:39:26AM +0200, Eugenio Perez Martin wrote:
> > > How about playing with the batch size? Make it a mod parameter instead

> > > of the hard coded 64, and measure for all values 1 to 64 ...

> >

> >

> > Right, according to the test result, 64 seems to be too aggressive in

> > the case of TX.

> >

> 

> Got it, thanks both!


In particular I wonder whether with batch size 1
we get the same performance as without batching
(which would indicate 64 is too aggressive)
or not (which would indicate one of the code changes
affects performance in an unexpected way).
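
To make the two outcomes explicit, here is a minimal sketch of that comparison. The function name and the 2% noise tolerance are assumptions picked for illustration only; nothing in this thread fixes a threshold.

```python
def interpret_batch1(no_batch_pps: float, batch1_pps: float,
                     noise: float = 0.02) -> str:
    """Compare pps at batch size 1 against pps without batching.

    `noise` is the relative difference treated as measurement noise
    (an assumed value, not one taken from the tests in this thread).
    """
    relative_diff = (batch1_pps - no_batch_pps) / no_batch_pps
    if abs(relative_diff) <= noise:
        # Same performance: the batching code path itself is not the cost,
        # so the batch size of 64 is the suspect.
        return "matches no batching: suspect the batch size (64 too aggressive)"
    # Different performance even with a batch of 1: some other code change
    # in the series affects performance.
    return "differs from no batching: suspect another code change"
```

The useful property of the batch-size-1 run is that it keeps all the new code in place while removing the batching itself, so the two explanations can be told apart with a single extra measurement.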

-- 
MST
Jason Wang July 10, 2020, 6:44 a.m. UTC | #6
On 2020/7/10 1:39 PM, Eugenio Perez Martin wrote:
> It is allocated 1 thread in lcore 1 (F_THREAD=1) which belongs to the

> same NUMA as testpmd. Actually, it is the testpmd master core, so it

> should be a good idea to move it to another lcore of the same NUMA

> node.

>

> Is this enough for pktgen to allocate the memory in that numa node?

> Since the script only write parameters to /proc, I assume that it has

> no effect to run it under numactl/taskset, and pktgen will allocate

> memory based on the lcore is running. Am I right?

>

> Thanks!

>


I think you're right.

Thanks
Eugenio Perez Martin July 16, 2020, 5:16 p.m. UTC | #7
On Fri, Jul 10, 2020 at 7:58 AM Michael S. Tsirkin <mst@redhat.com> wrote:
>

> On Fri, Jul 10, 2020 at 07:39:26AM +0200, Eugenio Perez Martin wrote:

> > > > How about playing with the batch size? Make it a mod parameter instead

> > > > of the hard coded 64, and measure for all values 1 to 64 ...

> > >

> > >

> > > Right, according to the test result, 64 seems to be too aggressive in

> > > the case of TX.

> > >

> >

> > Got it, thanks both!

>

> In particular I wonder whether with batch size 1

> we get same performance as without batching

> (would indicate 64 is too aggressive)

> or not (would indicate one of the code changes

> affects performance in an unexpected way).

>

> --

> MST

>


Hi!

I varied batch_size by changing drivers/vhost/net.c:VHOST_NET_BATCH and
measured the pps as described in the previous mail. This means that we have
either only vhost_net batching (in the base test, as before applying this
patch) or both batching sizes set to the same value.

I've also checked that the vhost process (and pktgen) reaches 100% CPU.

For tx: Batching always decreases performance, in all cases. Not
sure why bufapi made things better the last time.

Increasing the batch size helps up to 64 bufs; beyond that I still see pps
increase, but only by about 1%.

For rx: Batching always improves performance. It seems that if we
batch only a few buffers, bufapi decreases performance, but beyond 64, bufapi
is much better. The bufapi version keeps improving until I set a batch size
of 1024. So I guess it is very good to have a bunch of buffers to
receive.

Since with this test I cannot disable event_idx or things like that,
what would be the next step for testing?

Thanks!

--
Results:
# Buf size: 1,16,32,64,128,256,512

# Tx
# ===
# Base
2293304.308,3396057.769,3540860.615,3636056.077,3332950.846,3694276.154,3689820
# Batch
2286723.857,3307191.643,3400346.571,3452527.786,3460766.857,3431042.5,3440722.286
# Batch + Bufapi
2257970.769,3151268.385,3260150.538,3379383.846,3424028.846,3433384.308,3385635.231,3406554.538

# Rx
# ==
# pktgen results (pps)
1223275,1668868,1728794,1769261,1808574,1837252,1846436
1456924,1797901,1831234,1868746,1877508,1931598,1936402
1368923,1719716,1794373,1865170,1884803,1916021,1975160

# Testpmd pps results
1222698.143,1670604,1731040.6,1769218,1811206,1839308.75,1848478.75
1450140.5,1799985.75,1834089.75,1871290,1880005.5,1934147.25,1939034
1370621,1721858,1796287.75,1866618.5,1885466.5,1918670.75,1976173.5,1988760.75,1978316

pktgen was run again for rx with 1024 and 2048 buf size, giving
1988760.75 and 1978316 pps. Testpmd goes the same way.
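
For reading the rows above, a small sketch that computes the relative change of Batch against Base per buf size. The tx numbers are copied from the table; the rx pktgen rows carry no labels there, so treating them as Base and Batch in that order is an assumption. The tx "Batch + Bufapi" row has eight values for seven buf sizes, so it is left out rather than guessed at.

```python
BUF_SIZES = [1, 16, 32, 64, 128, 256, 512]

# Tx rows, copied from the results above.
TX_BASE = [2293304.308, 3396057.769, 3540860.615, 3636056.077,
           3332950.846, 3694276.154, 3689820.0]
TX_BATCH = [2286723.857, 3307191.643, 3400346.571, 3452527.786,
            3460766.857, 3431042.5, 3440722.286]

# Rx pktgen rows; ASSUMED to be Base and Batch, in the same order as for tx.
RX_BASE = [1223275, 1668868, 1728794, 1769261, 1808574, 1837252, 1846436]
RX_BATCH = [1456924, 1797901, 1831234, 1868746, 1877508, 1931598, 1936402]


def relative_change_pct(base, other):
    """Percentage change of `other` relative to `base`, element by element."""
    return [100.0 * (o - b) / b for b, o in zip(base, other)]


def report(title, base, other):
    """Print one line per buf size with the signed percentage change."""
    print(title)
    for size, pct in zip(BUF_SIZES, relative_change_pct(base, other)):
        print(f"  buf size {size:>3}: {pct:+.2f}%")


report("tx, Batch vs Base", TX_BASE, TX_BATCH)
report("rx (pktgen), Batch vs Base", RX_BASE, RX_BATCH)
```

With these inputs the tx change is negative for every buf size except 128, where the Base value itself dips, and the rx change is positive in every column.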
Jason Wang July 20, 2020, 8:55 a.m. UTC | #8
On 2020/7/17 1:16 AM, Eugenio Perez Martin wrote:
> On Fri, Jul 10, 2020 at 7:58 AM Michael S. Tsirkin <mst@redhat.com> wrote:

>> On Fri, Jul 10, 2020 at 07:39:26AM +0200, Eugenio Perez Martin wrote:

>>>>> How about playing with the batch size? Make it a mod parameter instead

>>>>> of the hard coded 64, and measure for all values 1 to 64 ...

>>>>

>>>> Right, according to the test result, 64 seems to be too aggressive in

>>>> the case of TX.

>>>>

>>> Got it, thanks both!

>> In particular I wonder whether with batch size 1

>> we get same performance as without batching

>> (would indicate 64 is too aggressive)

>> or not (would indicate one of the code changes

>> affects performance in an unexpected way).

>>

>> --

>> MST

>>

> Hi!

>

> Varying batch_size as drivers/vhost/net.c:VHOST_NET_BATCH,



Did you mean varying the value of VHOST_NET_BATCH itself or the number 
of batched descriptors?


> and testing

> the pps as previous mail says. This means that we have either only

> vhost_net batching (in base testing, like previously to apply this

> patch) or both batching sizes the same.

>

> I've checked that vhost process (and pktgen) goes 100% cpu also.

>

> For tx: Batching decrements always the performance, in all cases. Not

> sure why bufapi made things better the last time.

>

> Batching makes improvements until 64 bufs, I see increments of pps but like 1%.

>

> For rx: Batching always improves performance. It seems that if we

> batch little, bufapi decreases performance, but beyond 64, bufapi is

> much better. The bufapi version keeps improving until I set a batching

> of 1024. So I guess it is super good to have a bunch of buffers to

> receive.

>

> Since with this test I cannot disable event_idx or things like that,

> what would be the next step for testing?

>

> Thanks!

>

> --

> Results:

> # Buf size: 1,16,32,64,128,256,512

>

> # Tx

> # ===

> # Base

> 2293304.308,3396057.769,3540860.615,3636056.077,3332950.846,3694276.154,3689820



What's the meaning of buf size in the context of "base"?

And I wonder whether perf diff could help.

Thanks


> # Batch

> 2286723.857,3307191.643,3400346.571,3452527.786,3460766.857,3431042.5,3440722.286

> # Batch + Bufapi

> 2257970.769,3151268.385,3260150.538,3379383.846,3424028.846,3433384.308,3385635.231,3406554.538

>

> # Rx

> # ==

> # pktgen results (pps)

> 1223275,1668868,1728794,1769261,1808574,1837252,1846436

> 1456924,1797901,1831234,1868746,1877508,1931598,1936402

> 1368923,1719716,1794373,1865170,1884803,1916021,1975160

>

> # Testpmd pps results

> 1222698.143,1670604,1731040.6,1769218,1811206,1839308.75,1848478.75

> 1450140.5,1799985.75,1834089.75,1871290,1880005.5,1934147.25,1939034

> 1370621,1721858,1796287.75,1866618.5,1885466.5,1918670.75,1976173.5,1988760.75,1978316

>

> pktgen was run again for rx with 1024 and 2048 buf size, giving

> 1988760.75 and 1978316 pps. Testpmd goes the same way.

>
Michael S. Tsirkin July 20, 2020, 9:27 a.m. UTC | #9
On Thu, Jul 16, 2020 at 07:16:27PM +0200, Eugenio Perez Martin wrote:
> On Fri, Jul 10, 2020 at 7:58 AM Michael S. Tsirkin <mst@redhat.com> wrote:

> >

> > On Fri, Jul 10, 2020 at 07:39:26AM +0200, Eugenio Perez Martin wrote:

> > > > > How about playing with the batch size? Make it a mod parameter instead

> > > > > of the hard coded 64, and measure for all values 1 to 64 ...

> > > >

> > > >

> > > > Right, according to the test result, 64 seems to be too aggressive in

> > > > the case of TX.

> > > >

> > >

> > > Got it, thanks both!

> >

> > In particular I wonder whether with batch size 1

> > we get same performance as without batching

> > (would indicate 64 is too aggressive)

> > or not (would indicate one of the code changes

> > affects performance in an unexpected way).

> >

> > --

> > MST

> >

> 

> Hi!

> 

> Varying batch_size as drivers/vhost/net.c:VHOST_NET_BATCH,


Sorry, this is not what I meant.

I mean something like this:


diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 0b509be8d7b1..b94680e5721d 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -1279,6 +1279,10 @@ static void handle_rx_net(struct vhost_work *work)
 	handle_rx(net);
 }
 
+static int batch_num;
+module_param(batch_num, int, 0644);
+MODULE_PARM_DESC(batch_num, "Number of batched descriptors. (offset from 64)");
+
 static int vhost_net_open(struct inode *inode, struct file *f)
 {
 	struct vhost_net *n;
@@ -1333,7 +1337,7 @@ static int vhost_net_open(struct inode *inode, struct file *f)
 		vhost_net_buf_init(&n->vqs[i].rxq);
 	}
 	vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX,
-		       UIO_MAXIOV + VHOST_NET_BATCH,
+		       UIO_MAXIOV + VHOST_NET_BATCH + batch_num,
 		       VHOST_NET_PKT_WEIGHT, VHOST_NET_WEIGHT, true,
 		       NULL);
 

Then you can tweak the batching by playing with the module parameter,
without recompiling.


VHOST_NET_BATCH affects lots of other things.
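
As an illustration of how such a sweep could be driven, here is a sketch that maps a target batch size to the value of the batch_num parameter. It assumes the sketch patch above is applied, that the parameter really is an offset from 64 as its description says, and that the module is loaded as vhost_net so the parameter shows up under /sys/module/vhost_net/parameters/. Whether every negative offset is safe has not been checked here.

```python
DEFAULT_BATCH = 64  # the value batch_num is described as an offset from

# Present only if the sketch patch is applied; a 0644 module parameter is
# writable by root through sysfs.
PARAM_PATH = "/sys/module/vhost_net/parameters/batch_num"


def batch_num_for(target_batch: int) -> int:
    """Offset to write so that the effective batch is `target_batch`."""
    return target_batch - DEFAULT_BATCH


def sweep_commands(targets):
    """Shell commands (run as root) that set each target batch size in turn."""
    return [f"echo {batch_num_for(t)} > {PARAM_PATH}" for t in targets]


for cmd in sweep_commands([1, 16, 32, 64]):
    print(cmd)
```

Since the sketch reads batch_num in vhost_net_open(), a new value should only apply to devices opened after the write, so the test device has to be reopened (for example by restarting testpmd) between runs.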


> and testing

> the pps as previous mail says. This means that we have either only

> vhost_net batching (in base testing, like previously to apply this

> patch) or both batching sizes the same.

> 

> I've checked that vhost process (and pktgen) goes 100% cpu also.

> 

> For tx: Batching decrements always the performance, in all cases. Not

> sure why bufapi made things better the last time.

> 

> Batching makes improvements until 64 bufs, I see increments of pps but like 1%.

> 

> For rx: Batching always improves performance. It seems that if we

> batch little, bufapi decreases performance, but beyond 64, bufapi is

> much better. The bufapi version keeps improving until I set a batching

> of 1024. So I guess it is super good to have a bunch of buffers to

> receive.

> 

> Since with this test I cannot disable event_idx or things like that,

> what would be the next step for testing?

> 

> Thanks!

> 

> --

> Results:

> # Buf size: 1,16,32,64,128,256,512

> 

> # Tx

> # ===

> # Base

> 2293304.308,3396057.769,3540860.615,3636056.077,3332950.846,3694276.154,3689820

> # Batch

> 2286723.857,3307191.643,3400346.571,3452527.786,3460766.857,3431042.5,3440722.286

> # Batch + Bufapi

> 2257970.769,3151268.385,3260150.538,3379383.846,3424028.846,3433384.308,3385635.231,3406554.538

> 

> # Rx

> # ==

> # pktgen results (pps)

> 1223275,1668868,1728794,1769261,1808574,1837252,1846436

> 1456924,1797901,1831234,1868746,1877508,1931598,1936402

> 1368923,1719716,1794373,1865170,1884803,1916021,1975160

> 

> # Testpmd pps results

> 1222698.143,1670604,1731040.6,1769218,1811206,1839308.75,1848478.75

> 1450140.5,1799985.75,1834089.75,1871290,1880005.5,1934147.25,1939034

> 1370621,1721858,1796287.75,1866618.5,1885466.5,1918670.75,1976173.5,1988760.75,1978316

> 

> pktgen was run again for rx with 1024 and 2048 buf size, giving

> 1988760.75 and 1978316 pps. Testpmd goes the same way.


I don't really understand what this data means.
How many descs are batched in each run?

-- 
MST
Eugenio Perez Martin July 20, 2020, 11:16 a.m. UTC | #10
On Mon, Jul 20, 2020 at 11:27 AM Michael S. Tsirkin <mst@redhat.com> wrote:
> On Thu, Jul 16, 2020 at 07:16:27PM +0200, Eugenio Perez Martin wrote:

> > On Fri, Jul 10, 2020 at 7:58 AM Michael S. Tsirkin <mst@redhat.com> wrote:

> > > On Fri, Jul 10, 2020 at 07:39:26AM +0200, Eugenio Perez Martin wrote:

> > > > > > How about playing with the batch size? Make it a mod parameter instead

> > > > > > of the hard coded 64, and measure for all values 1 to 64 ...

> > > > > 

> > > > > Right, according to the test result, 64 seems to be too aggressive in

> > > > > the case of TX.

> > > > > 

> > > > 

> > > > Got it, thanks both!

> > > 

> > > In particular I wonder whether with batch size 1

> > > we get same performance as without batching

> > > (would indicate 64 is too aggressive)

> > > or not (would indicate one of the code changes

> > > affects performance in an unexpected way).

> > > 

> > > --

> > > MST

> > > 

> > 

> > Hi!

> > 

> > Varying batch_size as drivers/vhost/net.c:VHOST_NET_BATCH,

> 

> sorry this is not what I meant.

> 

> I mean something like this:

> 

> 

> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c

> index 0b509be8d7b1..b94680e5721d 100644

> --- a/drivers/vhost/net.c

> +++ b/drivers/vhost/net.c

> @@ -1279,6 +1279,10 @@ static void handle_rx_net(struct vhost_work *work)

>         handle_rx(net);

>  }

> 

> +MODULE_PARM_DESC(batch_num, "Number of batched descriptors. (offset from 64)");

> +module_param(batch_num, int, 0644);

> +static int batch_num = 0;

> +

>  static int vhost_net_open(struct inode *inode, struct file *f)

>  {

>         struct vhost_net *n;

> @@ -1333,7 +1337,7 @@ static int vhost_net_open(struct inode *inode, struct file *f)

>                 vhost_net_buf_init(&n->vqs[i].rxq);

>         }

>         vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX,

> -                      UIO_MAXIOV + VHOST_NET_BATCH,

> +                      UIO_MAXIOV + VHOST_NET_BATCH + batch_num,

>                        VHOST_NET_PKT_WEIGHT, VHOST_NET_WEIGHT, true,

>                        NULL);

> 

> 

> then you can try tweaking batching and playing with mod parameter without

> recompiling.

> 

> 

> VHOST_NET_BATCH affects lots of other things.

> 


Ok, got it. Since they were aligned from the start, I thought it was a good idea to keep them in sync.

> > and testing

> > the pps as previous mail says. This means that we have either only

> > vhost_net batching (in base testing, like previously to apply this

> > patch) or both batching sizes the same.

> > 

> > I've checked that vhost process (and pktgen) goes 100% cpu also.

> > 

> > For tx: Batching decrements always the performance, in all cases. Not

> > sure why bufapi made things better the last time.

> > 

> > Batching makes improvements until 64 bufs, I see increments of pps but like 1%.

> > 

> > For rx: Batching always improves performance. It seems that if we

> > batch little, bufapi decreases performance, but beyond 64, bufapi is

> > much better. The bufapi version keeps improving until I set a batching

> > of 1024. So I guess it is super good to have a bunch of buffers to

> > receive.

> > 

> > Since with this test I cannot disable event_idx or things like that,

> > what would be the next step for testing?

> > 

> > Thanks!

> > 

> > --

> > Results:

> > # Buf size: 1,16,32,64,128,256,512

> > 

> > # Tx

> > # ===

> > # Base

> > 2293304.308,3396057.769,3540860.615,3636056.077,3332950.846,3694276.154,3689820

> > # Batch

> > 2286723.857,3307191.643,3400346.571,3452527.786,3460766.857,3431042.5,3440722.286

> > # Batch + Bufapi

> > 2257970.769,3151268.385,3260150.538,3379383.846,3424028.846,3433384.308,3385635.231,3406554.538

> > 

> > # Rx

> > # ==

> > # pktgen results (pps)

> > 1223275,1668868,1728794,1769261,1808574,1837252,1846436

> > 1456924,1797901,1831234,1868746,1877508,1931598,1936402

> > 1368923,1719716,1794373,1865170,1884803,1916021,1975160

> > 

> > # Testpmd pps results

> > 1222698.143,1670604,1731040.6,1769218,1811206,1839308.75,1848478.75

> > 1450140.5,1799985.75,1834089.75,1871290,1880005.5,1934147.25,1939034

> > 1370621,1721858,1796287.75,1866618.5,1885466.5,1918670.75,1976173.5,1988760.75,1978316

> > 

> > pktgen was run again for rx with 1024 and 2048 buf size, giving

> > 1988760.75 and 1978316 pps. Testpmd goes the same way.

> 

> Don't really understand what does this data mean.

> Which number of descs is batched for each run?

> 


Sorry, I should have explained this better. I will expand here, but feel free to skip it, since we are going to discard
the data anyway, or to propose a better way to present it.

It is a CSV with the values I obtained, in pps, from pktgen and testpmd. That makes them easy to plot.

Maybe they are easier to read as tables, if mail readers/gmail do not misalign them.

> > # Tx

> > # ===


Base: the previous code, without any patch of the series applied. testpmd is in txonly mode, and the tap interface
drops everything with XDP_DROP. We vary VHOST_NET_BATCH (1, 16, 32, ...). As Jason put it in a previous mail:

TX: testpmd(txonly) -> virtio-user -> vhost_net -> XDP_DROP on TAP


     1     |     16     |     32     |     64     |     128    |    256     |   512  |
2293304.308| 3396057.769| 3540860.615| 3636056.077| 3332950.846| 3694276.154| 3689820|

If we add the batching part of the series, but not the bufapi:

      1     |     16     |     32     |     64     |     128    |    256    |     512    |
2286723.857 | 3307191.643| 3400346.571| 3452527.786| 3460766.857| 3431042.5 | 3440722.286|

And if we add the bufapi part, i.e., all the series:

      1    |     16     |     32     |     64     |     128    |     256    |     512    |    1024
2257970.769| 3151268.385| 3260150.538| 3379383.846| 3424028.846| 3433384.308| 3385635.231| 3406554.538

For easier treatment, all in the same table:

     1      |     16      |     32      |      64     |     128     |    256      |   512      |    1024
------------+-------------+-------------+-------------+-------------+-------------+------------+------------
2293304.308 | 3396057.769 | 3540860.615 | 3636056.077 | 3332950.846 | 3694276.154 | 3689820    |
2286723.857 | 3307191.643 | 3400346.571 | 3452527.786 | 3460766.857 | 3431042.5   | 3440722.286|
2257970.769 | 3151268.385 | 3260150.538 | 3379383.846 | 3424028.846 | 3433384.308 | 3385635.231| 3406554.538
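
To read the combined Tx table as deltas rather than raw pps, a small sketch along these lines can help. The numbers are copied from the table (only the seven columns that all three rows share); the names pct_change() and print_tx_deltas() are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>

#define TX_COLS 7

/* Values copied from the combined Tx table (pps), columns 1..512. */
const int tx_batch[TX_COLS] = { 1, 16, 32, 64, 128, 256, 512 };
const double tx_base[TX_COLS] = {
    2293304.308, 3396057.769, 3540860.615, 3636056.077,
    3332950.846, 3694276.154, 3689820.0
};
const double tx_batching[TX_COLS] = {
    2286723.857, 3307191.643, 3400346.571, 3452527.786,
    3460766.857, 3431042.5, 3440722.286
};
const double tx_bufapi[TX_COLS] = {
    2257970.769, 3151268.385, 3260150.538, 3379383.846,
    3424028.846, 3433384.308, 3385635.231
};

/* Relative change of val against base, in percent. */
double pct_change(double base, double val)
{
    assert(base > 0.0);
    return 100.0 * (val - base) / base;
}

/* Print one line per batch size with both deltas against the base row. */
void print_tx_deltas(void)
{
    for (int i = 0; i < TX_COLS; i++)
        printf("%4d: batch %+6.2f%%  batch+bufapi %+6.2f%%\n",
               tx_batch[i],
               pct_change(tx_base[i], tx_batching[i]),
               pct_change(tx_base[i], tx_bufapi[i]));
}
```

With these numbers the 64 column comes out about 5% below base for batching alone, and 128 is the only column where base falls below both patched rows.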
 
> > # Rx

> > # ==


The rx tests are done with pktgen injecting packets into the tap interface, and testpmd in rxonly forward mode. Again,
each column is a different value of VHOST_NET_BATCH, and the rows are base, +batching, and +buf_api:

> > # pktgen results (pps)


(I didn't record extreme cases such as batching more than 512 bufs.)

   1   |   16   |   32   |   64   |   128  |  256   |   512
-------+--------+--------+--------+--------+--------+--------
1223275| 1668868| 1728794| 1769261| 1808574| 1837252| 1846436
1456924| 1797901| 1831234| 1868746| 1877508| 1931598| 1936402
1368923| 1719716| 1794373| 1865170| 1884803| 1916021| 1975160

> > # Testpmd pps results


      1     |     16     |     32     |     64    |    128    |    256     |    512     |    1024    |   2048
------------+------------+------------+-----------+-----------+------------+------------+------------+---------
1222698.143 | 1670604    | 1731040.6  | 1769218   | 1811206   | 1839308.75 | 1848478.75 |
1450140.5   | 1799985.75 | 1834089.75 | 1871290   | 1880005.5 | 1934147.25 | 1939034    |
1370621     | 1721858    | 1796287.75 | 1866618.5 | 1885466.5 | 1918670.75 | 1976173.5  | 1988760.75 | 1978316

The last extreme cases (>512 bufs batched) were recorded just for the bufapi case.
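
One way to check where bufapi starts to win on rx is to look for the first column in which the +buf_api row exceeds the +batching row. The sketch below does that for the testpmd rows, restricted to the columns recorded for both; first_col_greater() is a hypothetical helper, not part of any tool used in the test.

```c
#include <assert.h>

#define RX_COLS 7

/* testpmd rx results (pps) copied from the table, columns 1..512. */
const int rx_batch[RX_COLS] = { 1, 16, 32, 64, 128, 256, 512 };
const double rx_batching[RX_COLS] = {
    1450140.5, 1799985.75, 1834089.75, 1871290.0,
    1880005.5, 1934147.25, 1939034.0
};
const double rx_bufapi[RX_COLS] = {
    1370621.0, 1721858.0, 1796287.75, 1866618.5,
    1885466.5, 1918670.75, 1976173.5
};

/* Index of the first column where a[i] > b[i], or -1 if there is none. */
int first_col_greater(const double *a, const double *b, int n)
{
    assert(a && b);
    for (int i = 0; i < n; i++)
        if (a[i] > b[i])
            return i;
    return -1;
}
```

For these rows the first such column is 128, which matches "beyond 64", although the 256 column again favours batching alone.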

Does that make sense now?

Thanks!
Michael S. Tsirkin July 20, 2020, 11:45 a.m. UTC | #11
On Mon, Jul 20, 2020 at 01:16:47PM +0200, Eugenio Pérez wrote:
> 

> On Mon, Jul 20, 2020 at 11:27 AM Michael S. Tsirkin <mst@redhat.com> wrote:

> > On Thu, Jul 16, 2020 at 07:16:27PM +0200, Eugenio Perez Martin wrote:

> > > On Fri, Jul 10, 2020 at 7:58 AM Michael S. Tsirkin <mst@redhat.com> wrote:

> > > > On Fri, Jul 10, 2020 at 07:39:26AM +0200, Eugenio Perez Martin wrote:

> > > > > > > How about playing with the batch size? Make it a mod parameter instead

> > > > > > > of the hard coded 64, and measure for all values 1 to 64 ...

> > > > > > 

> > > > > > Right, according to the test result, 64 seems to be too aggressive in

> > > > > > the case of TX.

> > > > > > 

> > > > > 

> > > > > Got it, thanks both!

> > > > 

> > > > In particular I wonder whether with batch size 1

> > > > we get same performance as without batching

> > > > (would indicate 64 is too aggressive)

> > > > or not (would indicate one of the code changes

> > > > affects performance in an unexpected way).

> > > > 

> > > > --

> > > > MST

> > > > 

> > > 

> > > Hi!

> > > 

> > > Varying batch_size as drivers/vhost/net.c:VHOST_NET_BATCH,

> > 

> > sorry this is not what I meant.

> > 

> > I mean something like this:

> > 

> > 

> > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c

> > index 0b509be8d7b1..b94680e5721d 100644

> > --- a/drivers/vhost/net.c

> > +++ b/drivers/vhost/net.c

> > @@ -1279,6 +1279,10 @@ static void handle_rx_net(struct vhost_work *work)

> >         handle_rx(net);

> >  }

> > 

> > +MODULE_PARM_DESC(batch_num, "Number of batched descriptors. (offset from 64)");

> > +module_param(batch_num, int, 0644);

> > +static int batch_num = 0;

> > +

> >  static int vhost_net_open(struct inode *inode, struct file *f)

> >  {

> >         struct vhost_net *n;

> > @@ -1333,7 +1337,7 @@ static int vhost_net_open(struct inode *inode, struct file *f)

> >                 vhost_net_buf_init(&n->vqs[i].rxq);

> >         }

> >         vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX,

> > -                      UIO_MAXIOV + VHOST_NET_BATCH,

> > +                      UIO_MAXIOV + VHOST_NET_BATCH + batch_num,

> >                        VHOST_NET_PKT_WEIGHT, VHOST_NET_WEIGHT, true,

> >                        NULL);

> > 

> > 

> > then you can try tweaking batching and playing with mod parameter without

> > recompiling.

> > 

> > 

> > VHOST_NET_BATCH affects lots of other things.

> > 

> 

> Ok, got it. Since they were aligned from the start, I thought it was a good idea to maintain them in-sync.

> 

> > > and testing

> > > the pps as previous mail says. This means that we have either only

> > > vhost_net batching (in base testing, like previously to apply this

> > > patch) or both batching sizes the same.

> > > 

> > > I've checked that vhost process (and pktgen) goes 100% cpu also.

> > > 

> > > For tx: Batching decrements always the performance, in all cases. Not

> > > sure why bufapi made things better the last time.

> > > 

> > > Batching makes improvements until 64 bufs, I see increments of pps but like 1%.

> > > 

> > > For rx: Batching always improves performance. It seems that if we

> > > batch little, bufapi decreases performance, but beyond 64, bufapi is

> > > much better. The bufapi version keeps improving until I set a batching

> > > of 1024. So I guess it is super good to have a bunch of buffers to

> > > receive.

> > > 

> > > Since with this test I cannot disable event_idx or things like that,

> > > what would be the next step for testing?

> > > 

> > > Thanks!

> > > 

> > > --

> > > Results:

> > > # Buf size: 1,16,32,64,128,256,512

> > > 

> > > # Tx

> > > # ===

> > > # Base

> > > 2293304.308,3396057.769,3540860.615,3636056.077,3332950.846,3694276.154,3689820

> > > # Batch

> > > 2286723.857,3307191.643,3400346.571,3452527.786,3460766.857,3431042.5,3440722.286

> > > # Batch + Bufapi

> > > 2257970.769,3151268.385,3260150.538,3379383.846,3424028.846,3433384.308,3385635.231,3406554.538

> > > 

> > > # Rx

> > > # ==

> > > # pktgen results (pps)

> > > 1223275,1668868,1728794,1769261,1808574,1837252,1846436

> > > 1456924,1797901,1831234,1868746,1877508,1931598,1936402

> > > 1368923,1719716,1794373,1865170,1884803,1916021,1975160

> > > 

> > > # Testpmd pps results

> > > 1222698.143,1670604,1731040.6,1769218,1811206,1839308.75,1848478.75

> > > 1450140.5,1799985.75,1834089.75,1871290,1880005.5,1934147.25,1939034

> > > 1370621,1721858,1796287.75,1866618.5,1885466.5,1918670.75,1976173.5,1988760.75,1978316

> > > 

> > > pktgen was run again for rx with 1024 and 2048 buf size, giving

> > > 1988760.75 and 1978316 pps. Testpmd goes the same way.

> > 

> > Don't really understand what does this data mean.

> > Which number of descs is batched for each run?

> > 

> 

> Sorry, I should have explained better. I will expand here, but feel free to skip it since we are going to discard the

> data anyway. Or to propose a better way to tell them.

> 

> Is a CSV with the values I've obtained, in pps, from pktgen and testpmd. This way is easy to plot them.

> 

> Maybe is easier as tables, if mail readers/gmail does not misalign them.

> 

> > > # Tx

> > > # ===

> 

> Base: With the previous code, not integrating any patch. testpmd is txonly mode, tap interface is XDP_DROP everything.

> We vary VHOST_NET_BATCH (1, 16, 32, ...). As Jason put in a previous mail:

> 

> TX: testpmd(txonly) -> virtio-user -> vhost_net -> XDP_DROP on TAP

> 

> 

>      1     |     16     |     32     |     64     |     128    |    256     |   512  |

> 2293304.308| 3396057.769| 3540860.615| 3636056.077| 3332950.846| 3694276.154| 3689820|

> 

> If we add the batching part of the series, but not the bufapi:

> 

>       1     |     16     |     32     |     64     |     128    |    256    |     512    |

> 2286723.857 | 3307191.643| 3400346.571| 3452527.786| 3460766.857| 3431042.5 | 3440722.286|

> 

> And if we add the bufapi part, i.e., all the series:

> 

>       1    |     16     |     32     |     64     |     128    |     256    |     512    |    1024

> 2257970.769| 3151268.385| 3260150.538| 3379383.846| 3424028.846| 3433384.308| 3385635.231| 3406554.538

> 

> For easier treatment, all in the same table:

> 

>      1      |     16      |     32      |      64     |     128     |    256      |   512      |    1024

> ------------+-------------+-------------+-------------+-------------+-------------+------------+------------

> 2293304.308 | 3396057.769 | 3540860.615 | 3636056.077 | 3332950.846 | 3694276.154 | 3689820    |

> 2286723.857 | 3307191.643 | 3400346.571 | 3452527.786 | 3460766.857 | 3431042.5   | 3440722.286|

> 2257970.769 | 3151268.385 | 3260150.538 | 3379383.846 | 3424028.846 | 3433384.308 | 3385635.231| 3406554.538

>  

> > > # Rx

> > > # ==

> 

> The rx tests are done with pktgen injecting packets in tap interface, and testpmd in rxonly forward mode. Again, each

> column is a different value of VHOST_NET_BATCH, and each row is base, +batching, and +buf_api:

> 

> > > # pktgen results (pps)

> 

> (Didn't record extreme cases like >512 bufs batching)

> 

>    1   |   16   |   32   |   64   |   128  |  256   |   512

> -------+--------+--------+--------+--------+--------+--------

> 1223275| 1668868| 1728794| 1769261| 1808574| 1837252| 1846436

> 1456924| 1797901| 1831234| 1868746| 1877508| 1931598| 1936402

> 1368923| 1719716| 1794373| 1865170| 1884803| 1916021| 1975160

> 

> > > # Testpmd pps results

> 

>       1     |     16     |     32     |     64    |    128    |    256     |    512     |    1024    |   2048

> ------------+------------+------------+-----------+-----------+------------+------------+------------+---------

> 1222698.143 | 1670604    | 1731040.6  | 1769218   | 1811206   | 1839308.75 | 1848478.75 |

> 1450140.5   | 1799985.75 | 1834089.75 | 1871290   | 1880005.5 | 1934147.25 | 1939034    |

> 1370621     | 1721858    | 1796287.75 | 1866618.5 | 1885466.5 | 1918670.75 | 1976173.5  | 1988760.75 | 1978316

> 

> The last extreme cases (>512 bufs batched) were recorded just for the bufapi case.

> 

> Does that make sense now?

> 

> Thanks!


yes, thanks!
Eugenio Perez Martin July 20, 2020, 1:07 p.m. UTC | #12
On Mon, Jul 20, 2020 at 10:55 AM Jason Wang <jasowang@redhat.com> wrote:
>

>

> On 2020/7/17 上午1:16, Eugenio Perez Martin wrote:

> > On Fri, Jul 10, 2020 at 7:58 AM Michael S. Tsirkin <mst@redhat.com> wrote:

> >> On Fri, Jul 10, 2020 at 07:39:26AM +0200, Eugenio Perez Martin wrote:

> >>>>> How about playing with the batch size? Make it a mod parameter instead

> >>>>> of the hard coded 64, and measure for all values 1 to 64 ...

> >>>>

> >>>> Right, according to the test result, 64 seems to be too aggressive in

> >>>> the case of TX.

> >>>>

> >>> Got it, thanks both!

> >> In particular I wonder whether with batch size 1

> >> we get same performance as without batching

> >> (would indicate 64 is too aggressive)

> >> or not (would indicate one of the code changes

> >> affects performance in an unexpected way).

> >>

> >> --

> >> MST

> >>

> > Hi!

> >

> > Varying batch_size as drivers/vhost/net.c:VHOST_NET_BATCH,

>

>

> Did you mean varying the value of VHOST_NET_BATCH itself or the number

> of batched descriptors?

>

>

> > and testing

> > the pps as previous mail says. This means that we have either only

> > vhost_net batching (in base testing, like previously to apply this

> > patch) or both batching sizes the same.

> >

> > I've checked that vhost process (and pktgen) goes 100% cpu also.

> >

> > For tx: Batching decrements always the performance, in all cases. Not

> > sure why bufapi made things better the last time.

> >

> > Batching makes improvements until 64 bufs, I see increments of pps but like 1%.

> >

> > For rx: Batching always improves performance. It seems that if we

> > batch little, bufapi decreases performance, but beyond 64, bufapi is

> > much better. The bufapi version keeps improving until I set a batching

> > of 1024. So I guess it is super good to have a bunch of buffers to

> > receive.

> >

> > Since with this test I cannot disable event_idx or things like that,

> > what would be the next step for testing?

> >

> > Thanks!

> >

> > --

> > Results:

> > # Buf size: 1,16,32,64,128,256,512

> >

> > # Tx

> > # ===

> > # Base

> > 2293304.308,3396057.769,3540860.615,3636056.077,3332950.846,3694276.154,3689820

>

>

> What's the meaning of buf size in the context of "base"?

>


Hi Jason.

I think that all the previous questions have been answered in the
response to MST, please let me know if I missed something.

> And I wonder maybe perf diff can help.


Great, I will run it too.

Thanks!

>

> Thanks

>

>

> > # Batch

> > 2286723.857,3307191.643,3400346.571,3452527.786,3460766.857,3431042.5,3440722.286

> > # Batch + Bufapi

> > 2257970.769,3151268.385,3260150.538,3379383.846,3424028.846,3433384.308,3385635.231,3406554.538

> >

> > # Rx

> > # ==

> > # pktgen results (pps)

> > 1223275,1668868,1728794,1769261,1808574,1837252,1846436

> > 1456924,1797901,1831234,1868746,1877508,1931598,1936402

> > 1368923,1719716,1794373,1865170,1884803,1916021,1975160

> >

> > # Testpmd pps results

> > 1222698.143,1670604,1731040.6,1769218,1811206,1839308.75,1848478.75

> > 1450140.5,1799985.75,1834089.75,1871290,1880005.5,1934147.25,1939034

> > 1370621,1721858,1796287.75,1866618.5,1885466.5,1918670.75,1976173.5,1988760.75,1978316

> >

> > pktgen was run again for rx with 1024 and 2048 buf size, giving

> > 1988760.75 and 1978316 pps. Testpmd goes the same way.

> >

>
Jason Wang July 21, 2020, 2:55 a.m. UTC | #13
On 2020/7/20 下午7:16, Eugenio Pérez wrote:
> On Mon, Jul 20, 2020 at 11:27 AM Michael S. Tsirkin <mst@redhat.com> wrote:

>> On Thu, Jul 16, 2020 at 07:16:27PM +0200, Eugenio Perez Martin wrote:

>>> On Fri, Jul 10, 2020 at 7:58 AM Michael S. Tsirkin <mst@redhat.com> wrote:

>>>> On Fri, Jul 10, 2020 at 07:39:26AM +0200, Eugenio Perez Martin wrote:

>>>>>>> How about playing with the batch size? Make it a mod parameter instead

>>>>>>> of the hard coded 64, and measure for all values 1 to 64 ...

>>>>>> Right, according to the test result, 64 seems to be too aggressive in

>>>>>> the case of TX.

>>>>>>

>>>>> Got it, thanks both!

>>>> In particular I wonder whether with batch size 1

>>>> we get same performance as without batching

>>>> (would indicate 64 is too aggressive)

>>>> or not (would indicate one of the code changes

>>>> affects performance in an unexpected way).

>>>>

>>>> --

>>>> MST

>>>>

>>> Hi!

>>>

>>> Varying batch_size as drivers/vhost/net.c:VHOST_NET_BATCH,

>> sorry this is not what I meant.

>>

>> I mean something like this:

>>

>>

>> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c

>> index 0b509be8d7b1..b94680e5721d 100644

>> --- a/drivers/vhost/net.c

>> +++ b/drivers/vhost/net.c

>> @@ -1279,6 +1279,10 @@ static void handle_rx_net(struct vhost_work *work)

>>          handle_rx(net);

>>   }

>>

>> +MODULE_PARM_DESC(batch_num, "Number of batched descriptors. (offset from 64)");

>> +module_param(batch_num, int, 0644);

>> +static int batch_num = 0;

>> +

>>   static int vhost_net_open(struct inode *inode, struct file *f)

>>   {

>>          struct vhost_net *n;

>> @@ -1333,7 +1337,7 @@ static int vhost_net_open(struct inode *inode, struct file *f)

>>                  vhost_net_buf_init(&n->vqs[i].rxq);

>>          }

>>          vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX,

>> -                      UIO_MAXIOV + VHOST_NET_BATCH,

>> +                      UIO_MAXIOV + VHOST_NET_BATCH + batch_num,

>>                         VHOST_NET_PKT_WEIGHT, VHOST_NET_WEIGHT, true,

>>                         NULL);

>>

>>

>> then you can try tweaking batching and playing with mod parameter without

>> recompiling.

>>

>>

>> VHOST_NET_BATCH affects lots of other things.

>>

> Ok, got it. Since they were aligned from the start, I thought it was a good idea to maintain them in-sync.

>

>>> and testing

>>> the pps as previous mail says. This means that we have either only

>>> vhost_net batching (in base testing, like previously to apply this

>>> patch) or both batching sizes the same.

>>>

>>> I've checked that vhost process (and pktgen) goes 100% cpu also.

>>>

>>> For tx: Batching decrements always the performance, in all cases. Not

>>> sure why bufapi made things better the last time.

>>>

>>> Batching makes improvements until 64 bufs, I see increments of pps but like 1%.

>>>

>>> For rx: Batching always improves performance. It seems that if we

>>> batch little, bufapi decreases performance, but beyond 64, bufapi is

>>> much better. The bufapi version keeps improving until I set a batching

>>> of 1024. So I guess it is super good to have a bunch of buffers to

>>> receive.

>>>

>>> Since with this test I cannot disable event_idx or things like that,

>>> what would be the next step for testing?

>>>

>>> Thanks!

>>>

>>> --

>>> Results:

>>> # Buf size: 1,16,32,64,128,256,512

>>>

>>> # Tx

>>> # ===

>>> # Base

>>> 2293304.308,3396057.769,3540860.615,3636056.077,3332950.846,3694276.154,3689820

>>> # Batch

>>> 2286723.857,3307191.643,3400346.571,3452527.786,3460766.857,3431042.5,3440722.286

>>> # Batch + Bufapi

>>> 2257970.769,3151268.385,3260150.538,3379383.846,3424028.846,3433384.308,3385635.231,3406554.538

>>>

>>> # Rx

>>> # ==

>>> # pktgen results (pps)

>>> 1223275,1668868,1728794,1769261,1808574,1837252,1846436

>>> 1456924,1797901,1831234,1868746,1877508,1931598,1936402

>>> 1368923,1719716,1794373,1865170,1884803,1916021,1975160

>>>

>>> # Testpmd pps results

>>> 1222698.143,1670604,1731040.6,1769218,1811206,1839308.75,1848478.75

>>> 1450140.5,1799985.75,1834089.75,1871290,1880005.5,1934147.25,1939034

>>> 1370621,1721858,1796287.75,1866618.5,1885466.5,1918670.75,1976173.5,1988760.75,1978316

>>>

>>> pktgen was run again for rx with 1024 and 2048 buf size, giving

>>> 1988760.75 and 1978316 pps. Testpmd goes the same way.

>> Don't really understand what does this data mean.

>> Which number of descs is batched for each run?

>>

> Sorry, I should have explained better. I will expand here, but feel free to skip it since we are going to discard the

> data anyway. Or to propose a better way to tell them.

>

> Is a CSV with the values I've obtained, in pps, from pktgen and testpmd. This way is easy to plot them.

>

> Maybe is easier as tables, if mail readers/gmail does not misalign them.

>

>>> # Tx

>>> # ===

> Base: With the previous code, not integrating any patch. testpmd is txonly mode, tap interface is XDP_DROP everything.

> We vary VHOST_NET_BATCH (1, 16, 32, ...). As Jason put in a previous mail:

>

> TX: testpmd(txonly) -> virtio-user -> vhost_net -> XDP_DROP on TAP

>

>

>       1     |     16     |     32     |     64     |     128    |    256     |   512  |

> 2293304.308| 3396057.769| 3540860.615| 3636056.077| 3332950.846| 3694276.154| 3689820|

>

> If we add the batching part of the series, but not the bufapi:

>

>        1     |     16     |     32     |     64     |     128    |    256    |     512    |

> 2286723.857 | 3307191.643| 3400346.571| 3452527.786| 3460766.857| 3431042.5 | 3440722.286|

>

> And if we add the bufapi part, i.e., all the series:

>

>        1    |     16     |     32     |     64     |     128    |     256    |     512    |    1024

> 2257970.769| 3151268.385| 3260150.538| 3379383.846| 3424028.846| 3433384.308| 3385635.231| 3406554.538

>

> For easier treatment, all in the same table:

>

>       1      |     16      |     32      |      64     |     128     |    256      |   512      |    1024

> ------------+-------------+-------------+-------------+-------------+-------------+------------+------------

> 2293304.308 | 3396057.769 | 3540860.615 | 3636056.077 | 3332950.846 | 3694276.154 | 3689820    |

> 2286723.857 | 3307191.643 | 3400346.571 | 3452527.786 | 3460766.857 | 3431042.5   | 3440722.286|

> 2257970.769 | 3151268.385 | 3260150.538 | 3379383.846 | 3424028.846 | 3433384.308 | 3385635.231| 3406554.538

>   

>>> # Rx

>>> # ==

> The rx tests are done with pktgen injecting packets in tap interface, and testpmd in rxonly forward mode. Again, each

> column is a different value of VHOST_NET_BATCH, and each row is base, +batching, and +buf_api:

>

>>> # pktgen results (pps)

> (Didn't record extreme cases like >512 bufs batching)

>

>     1   |   16   |   32   |   64   |   128  |  256   |   512

> -------+--------+--------+--------+--------+--------+--------

> 1223275| 1668868| 1728794| 1769261| 1808574| 1837252| 1846436

> 1456924| 1797901| 1831234| 1868746| 1877508| 1931598| 1936402

> 1368923| 1719716| 1794373| 1865170| 1884803| 1916021| 1975160

>

>>> # Testpmd pps results

>        1     |     16     |     32     |     64    |    128    |    256     |    512     |    1024    |   2048

> ------------+------------+------------+-----------+-----------+------------+------------+------------+---------

> 1222698.143 | 1670604    | 1731040.6  | 1769218   | 1811206   | 1839308.75 | 1848478.75 |

> 1450140.5   | 1799985.75 | 1834089.75 | 1871290   | 1880005.5 | 1934147.25 | 1939034    |

> 1370621     | 1721858    | 1796287.75 | 1866618.5 | 1885466.5 | 1918670.75 | 1976173.5  | 1988760.75 | 1978316

>

> The last extreme cases (>512 bufs batched) were recorded just for the bufapi case.

>

> Does that make sense now?

>

> Thanks!



I wonder why we see such a huge difference between TX and RX pps. Did you
use samples/pktgen/XXX for the test? Maybe you can paste the perf
record result for the pktgen thread + vhost thread.

Thanks


>
Eugenio Perez Martin July 29, 2020, 6:37 p.m. UTC | #14
On Tue, Jul 21, 2020 at 4:55 AM Jason Wang <jasowang@redhat.com> wrote:
>

>

> On 2020/7/20 下午7:16, Eugenio Pérez wrote:

> > On Mon, Jul 20, 2020 at 11:27 AM Michael S. Tsirkin <mst@redhat.com> wrote:

> >> On Thu, Jul 16, 2020 at 07:16:27PM +0200, Eugenio Perez Martin wrote:

> >>> On Fri, Jul 10, 2020 at 7:58 AM Michael S. Tsirkin <mst@redhat.com> wrote:

> >>>> On Fri, Jul 10, 2020 at 07:39:26AM +0200, Eugenio Perez Martin wrote:

> >>>>>>> How about playing with the batch size? Make it a mod parameter instead

> >>>>>>> of the hard coded 64, and measure for all values 1 to 64 ...

> >>>>>> Right, according to the test result, 64 seems to be too aggressive in

> >>>>>> the case of TX.

> >>>>>>

> >>>>> Got it, thanks both!

> >>>> In particular I wonder whether with batch size 1

> >>>> we get same performance as without batching

> >>>> (would indicate 64 is too aggressive)

> >>>> or not (would indicate one of the code changes

> >>>> affects performance in an unexpected way).

> >>>>

> >>>> --

> >>>> MST

> >>>>

> >>> Hi!

> >>>

> >>> Varying batch_size as drivers/vhost/net.c:VHOST_NET_BATCH,

> >> sorry this is not what I meant.

> >>

> >> I mean something like this:

> >>

> >>

> >> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c

> >> index 0b509be8d7b1..b94680e5721d 100644

> >> --- a/drivers/vhost/net.c

> >> +++ b/drivers/vhost/net.c

> >> @@ -1279,6 +1279,10 @@ static void handle_rx_net(struct vhost_work *work)

> >>          handle_rx(net);

> >>   }

> >>

> >> +MODULE_PARM_DESC(batch_num, "Number of batched descriptors. (offset from 64)");

> >> +module_param(batch_num, int, 0644);

> >> +static int batch_num = 0;

> >> +

> >>   static int vhost_net_open(struct inode *inode, struct file *f)

> >>   {

> >>          struct vhost_net *n;

> >> @@ -1333,7 +1337,7 @@ static int vhost_net_open(struct inode *inode, struct file *f)

> >>                  vhost_net_buf_init(&n->vqs[i].rxq);

> >>          }

> >>          vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX,

> >> -                      UIO_MAXIOV + VHOST_NET_BATCH,

> >> +                      UIO_MAXIOV + VHOST_NET_BATCH + batch_num,

> >>                         VHOST_NET_PKT_WEIGHT, VHOST_NET_WEIGHT, true,

> >>                         NULL);

> >>

> >>

> >> then you can try tweaking batching and playing with mod parameter without

> >> recompiling.

> >>

> >>

> >> VHOST_NET_BATCH affects lots of other things.

> >>

> > Ok, got it. Since they were aligned from the start, I thought it was a good idea to maintain them in-sync.

> >

> >>> and testing

> >>> the pps as previous mail says. This means that we have either only

> >>> vhost_net batching (in base testing, like previously to apply this

> >>> patch) or both batching sizes the same.

> >>>

> >>> I've checked that vhost process (and pktgen) goes 100% cpu also.

> >>>

> >>> For tx: Batching decrements always the performance, in all cases. Not

> >>> sure why bufapi made things better the last time.

> >>>

> >>> Batching makes improvements until 64 bufs, I see increments of pps but like 1%.

> >>>

> >>> For rx: Batching always improves performance. It seems that if we

> >>> batch little, bufapi decreases performance, but beyond 64, bufapi is

> >>> much better. The bufapi version keeps improving until I set a batching

> >>> of 1024. So I guess it is super good to have a bunch of buffers to

> >>> receive.

> >>>

> >>> Since with this test I cannot disable event_idx or things like that,

> >>> what would be the next step for testing?

> >>>

> >>> Thanks!

> >>>

> >>> --

> >>> Results:

> >>> # Buf size: 1,16,32,64,128,256,512

> >>>

> >>> # Tx

> >>> # ===

> >>> # Base

> >>> 2293304.308,3396057.769,3540860.615,3636056.077,3332950.846,3694276.154,3689820

> >>> # Batch

> >>> 2286723.857,3307191.643,3400346.571,3452527.786,3460766.857,3431042.5,3440722.286

> >>> # Batch + Bufapi

> >>> 2257970.769,3151268.385,3260150.538,3379383.846,3424028.846,3433384.308,3385635.231,3406554.538

> >>>

> >>> # Rx

> >>> # ==

> >>> # pktgen results (pps)

> >>> 1223275,1668868,1728794,1769261,1808574,1837252,1846436

> >>> 1456924,1797901,1831234,1868746,1877508,1931598,1936402

> >>> 1368923,1719716,1794373,1865170,1884803,1916021,1975160

> >>>

> >>> # Testpmd pps results

> >>> 1222698.143,1670604,1731040.6,1769218,1811206,1839308.75,1848478.75

> >>> 1450140.5,1799985.75,1834089.75,1871290,1880005.5,1934147.25,1939034

> >>> 1370621,1721858,1796287.75,1866618.5,1885466.5,1918670.75,1976173.5,1988760.75,1978316

> >>>

> >>> pktgen was run again for rx with 1024 and 2048 buf size, giving

> >>> 1988760.75 and 1978316 pps. Testpmd goes the same way.

> >> Don't really understand what does this data mean.

> >> Which number of descs is batched for each run?

> >>

> > Sorry, I should have explained better. I will expand here, but feel free to skip it since we are going to discard the

> > data anyway. Or to propose a better way to tell them.

> >

> > Is a CSV with the values I've obtained, in pps, from pktgen and testpmd. This way is easy to plot them.

> >

> > Maybe is easier as tables, if mail readers/gmail does not misalign them.

> >


Hi!

Posting here the results of varying batch_num with the patch MST proposed.


> >>> # Tx

> >>> # ===

> > Base: With the previous code, not integrating any patch. testpmd is txonly mode, tap interface is XDP_DROP everything.

> > We vary VHOST_NET_BATCH (1, 16, 32, ...). As Jason put in a previous mail:

> >

> > TX: testpmd(txonly) -> virtio-user -> vhost_net -> XDP_DROP on TAP

> >

> >

> >       1     |     16     |     32     |     64     |     128    |    256     |   512  |

> > 2293304.308| 3396057.769| 3540860.615| 3636056.077| 3332950.846| 3694276.154| 3689820|

> >


    -64    |    -63    |    -32    |     0     |     64    |    192    |    448
3493152.154|3495505.462|3494803.692|3492645.692|3501892.154|3496698.846|3495192.462

As Michael said, varying VHOST_NET_BATCH affected much more than
varying only the vhost batch_num. Here we see that varying batch_num
does not affect pps, since we still have not applied the batch patch.

However, performance is worse in pps when we set VHOST_NET_BATCH to a
bigger value. Would this be a good moment to evaluate if we should
increase it?
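
For reading the column headers, here is a small sketch of how each
batch_num value maps onto the size expression in the quoted snippet
(UIO_MAXIOV + VHOST_NET_BATCH + batch_num). The two constants are my
assumption (1024 and 64, as I believe they are defined in the tree), and
treating "whatever exceeds UIO_MAXIOV" as the batch size is also an
assumption about how the series uses that limit, so take it as a reading
aid only:

```python
# Assumed constants, not stated in this thread: the values I believe these
# macros have in the kernel tree this series is based on.
UIO_MAXIOV = 1024
VHOST_NET_BATCH = 64

# batch_num values used as column headers in the tables of this mail.
BATCH_NUM_COLUMNS = [-64, -63, -32, 0, 64, 192, 448]


def iov_limit(batch_num):
    """Size expression from the quoted snippet."""
    return UIO_MAXIOV + VHOST_NET_BATCH + batch_num


def descs_above_maxiov(batch_num):
    """What is left once UIO_MAXIOV is subtracted (assumed batch size)."""
    return iov_limit(batch_num) - UIO_MAXIOV


for b in BATCH_NUM_COLUMNS:
    print(f"batch_num={b:4d}  limit={iov_limit(b):4d}  "
          f"above UIO_MAXIOV={descs_above_maxiov(b):3d}")
```

Under those assumptions, batch_num=0 is the unmodified setting and
batch_num=-64 leaves nothing above UIO_MAXIOV.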

> > If we add the batching part of the series, but not the bufapi:

> >

> >        1     |     16     |     32     |     64     |     128    |    256    |     512    |

> > 2286723.857 | 3307191.643| 3400346.571| 3452527.786| 3460766.857| 3431042.5 | 3440722.286|

> >


    -64    |  -63  |    -32    |    0    |    64     |    192    |    448
3403270.286|3420415|3423424.071|3445849.5|3452552.429|3447267.571|3429406.286

As before, adding the batching patch decreases pps, but only by a very
small margin this time.

This makes me think: Is

> > And if we add the bufapi part, i.e., all the series:

> >

> >        1    |     16     |     32     |     64     |     128    |     256    |     512    |    1024

> > 2257970.769| 3151268.385| 3260150.538| 3379383.846| 3424028.846| 3433384.308| 3385635.231| 3406554.538

> >


    -64    |    -63    |    -32    |     0     |    64     |  192  |   448
3363233.929|3409874.429|3418717.929|3422728.214|3428160.214|3416061|3428423.071

It looks like a performance decrease again, but a very small one.

> > For easier treatment, all in the same table:

> >

> >       1      |     16      |     32      |      64     |     128     |    256      |   512      |    1024

> > ------------+-------------+-------------+-------------+-------------+-------------+------------+------------

> > 2293304.308 | 3396057.769 | 3540860.615 | 3636056.077 | 3332950.846 | 3694276.154 | 3689820    |

> > 2286723.857 | 3307191.643 | 3400346.571 | 3452527.786 | 3460766.857 | 3431042.5   | 3440722.286|

> > 2257970.769 | 3151268.385 | 3260150.538 | 3379383.846 | 3424028.846 | 3433384.308 | 3385635.231| 3406554.538

> >


    -64    |    -63    |    -32    |     0     |     64    |    192    |    448
3493152.154|3495505.462|3494803.692|3492645.692|3501892.154|3496698.846|3495192.462
3403270.286|  3420415  |3423424.071| 3445849.5 |3452552.429|3447267.571|3429406.286
3363233.929|3409874.429|3418717.929|3422728.214|3428160.214|  3416061  |3428423.071
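
To put "small" in numbers, a minimal sketch that computes the change of
each Tx row relative to base, per batch_num column. The pps values are
copied from the combined table in this mail; the names (TX_PPS,
pct_vs_base) are only for illustration:

```python
# Tx pps per batch_num column, copied from the combined table in this mail.
BATCH_NUM = [-64, -63, -32, 0, 64, 192, 448]
TX_PPS = {
    "base": [3493152.154, 3495505.462, 3494803.692, 3492645.692,
             3501892.154, 3496698.846, 3495192.462],
    "batch": [3403270.286, 3420415, 3423424.071, 3445849.5,
              3452552.429, 3447267.571, 3429406.286],
    "batch+bufapi": [3363233.929, 3409874.429, 3418717.929, 3422728.214,
                     3428160.214, 3416061, 3428423.071],
}


def pct_vs_base(rows, name):
    """Percent change of row `name` against the base row, column by column."""
    return [100.0 * (v - b) / b for v, b in zip(rows[name], rows["base"])]


for name in ("batch", "batch+bufapi"):
    cells = "  ".join(f"{p:+.2f}%" for p in pct_vs_base(TX_PPS, name))
    print(f"{name:>12}: {cells}")
```

With these numbers every Tx cell comes out below base, by less than 4%.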

> >>> # Rx

> >>> # ==

> > The rx tests are done with pktgen injecting packets in tap interface, and testpmd in rxonly forward mode. Again, each

> > column is a different value of VHOST_NET_BATCH, and each row is base, +batching, and +buf_api:

> >

> >>> # pktgen results (pps)

> > (Didn't record extreme cases like >512 bufs batching)

> >

> >     1   |   16   |   32   |   64   |   128  |  256   |   512

> > -------+--------+--------+--------+--------+--------+--------

> > 1223275| 1668868| 1728794| 1769261| 1808574| 1837252| 1846436

> > 1456924| 1797901| 1831234| 1868746| 1877508| 1931598| 1936402

> > 1368923| 1719716| 1794373| 1865170| 1884803| 1916021| 1975160

> >


  -64  |  -63  |  -32  |   0   |   64  |  192  |  448
1798545|1785760|1788313|1782499|1784369|1788149|1790630
1794057|1837997|1865024|1866864|1890044|1877582|1884620
1804382|1860677|1877419|1885466|1900464|1887813|1896813

Except in the -64 case, batching and buf_api increase the pps rate,
and more so as more batching is used.

> >>> # Testpmd pps results

> >        1     |     16     |     32     |     64    |    128    |    256     |    512     |    1024    |   2048

> > ------------+------------+------------+-----------+-----------+------------+------------+------------+---------

> > 1222698.143 | 1670604    | 1731040.6  | 1769218   | 1811206   | 1839308.75 | 1848478.75 |

> > 1450140.5   | 1799985.75 | 1834089.75 | 1871290   | 1880005.5 | 1934147.25 | 1939034    |

> > 1370621     | 1721858    | 1796287.75 | 1866618.5 | 1885466.5 | 1918670.75 | 1976173.5  | 1988760.75 | 1978316

> >


    -64   |    -63   |    -32   |    0     |    64    |    192   |   448
1799920   |1786848   |1789520.25|1783995.75|1786184.5 |1790263.75|1793109.25
1796374   |1840254   |1867761   |1868076.25|1892006   |1878957.25|1886311
1805797.25|1862528.75|1879510.75|1888218.5 |1902516.25|1889216.25|1899251.25

Same trend as in the pktgen results above.


> > The last extreme cases (>512 bufs batched) were recorded just for the bufapi case.

> >

> > Does that make sense now?

> >

> > Thanks!

>

>

> I wonder why we saw huge difference between TX and RX pps. Have you used

> samples/pktgen/XXX for doing the test? Maybe you can paste the perf

> record result for the pktgen thread + vhost thread.

>


With the rx base and batch_num=0 (i.e., with no modifications):
Overhead  Command     Shared Object     Symbol
  14,40%  vhost-3904  [kernel.vmlinux]  [k] copy_user_generic_unrolled
  12,63%  vhost-3904  [tun]             [k] tun_do_read
  11,70%  vhost-3904  [vhost_net]       [k] vhost_net_buf_peek
   9,77%  vhost-3904  [kernel.vmlinux]  [k] _copy_to_iter
   6,52%  vhost-3904  [vhost_net]       [k] handle_rx
   6,29%  vhost-3904  [vhost]           [k] vhost_get_vq_desc
   4,60%  vhost-3904  [kernel.vmlinux]  [k] __check_object_size
   4,14%  vhost-3904  [kernel.vmlinux]  [k] kmem_cache_free
   4,06%  vhost-3904  [kernel.vmlinux]  [k] iov_iter_advance
   3,10%  vhost-3904  [vhost]           [k] translate_desc
   2,60%  vhost-3904  [kernel.vmlinux]  [k] __virt_addr_valid
   2,53%  vhost-3904  [kernel.vmlinux]  [k] __slab_free
   2,16%  vhost-3904  [tun]             [k] tun_recvmsg
   1,64%  vhost-3904  [kernel.vmlinux]  [k] copy_user_enhanced_fast_string
   1,31%  vhost-3904  [vhost_iotlb]     [k] vhost_iotlb_itree_subtree_search.part.2
   1,27%  vhost-3904  [kernel.vmlinux]  [k] __skb_datagram_iter
   1,12%  vhost-3904  [kernel.vmlinux]  [k] page_frag_free
   0,92%  vhost-3904  [kernel.vmlinux]  [k] skb_release_data
   0,87%  vhost-3904  [kernel.vmlinux]  [k] skb_copy_datagram_iter
   0,62%  vhost-3904  [kernel.vmlinux]  [k] simple_copy_to_iter
   0,60%  vhost-3904  [kernel.vmlinux]  [k] __free_pages_ok
   0,54%  vhost-3904  [kernel.vmlinux]  [k] skb_release_head_state
   0,53%  vhost-3904  [vhost]           [k] vhost_exceeds_weight
   0,53%  vhost-3904  [kernel.vmlinux]  [k] consume_skb
   0,52%  vhost-3904  [vhost_iotlb]     [k] vhost_iotlb_itree_first
   0,45%  vhost-3904  [vhost]           [k] vhost_signal

With rx in batch, I have a few unknown symbols, but much less
copy_user_generic. Not sure why these symbols are unknown, since they
were recorded using the exact same command. I will try to investigate
more, but here they are meanwhile.

I suspect the top unknown one will be the "copy_user_generic_unrolled":
  14,06%  vhost-5127  [tun]             [k] tun_do_read
  12,53%  vhost-5127  [vhost_net]       [k] vhost_net_buf_peek
   6,80%  vhost-5127  [kernel.vmlinux]  [k] 0xffffffff852cde46
   6,20%  vhost-5127  [vhost_net]       [k] handle_rx
   5,73%  vhost-5127  [vhost]           [k] fetch_buf
   3,77%  vhost-5127  [vhost]           [k] vhost_get_vq_desc
   2,08%  vhost-5127  [kernel.vmlinux]  [k] 0xffffffff852cde6e
   1,82%  vhost-5127  [tun]             [k] tun_recvmsg
   1,37%  vhost-5127  [vhost]           [k] translate_desc
   1,34%  vhost-5127  [kernel.vmlinux]  [k] 0xffffffff8510b0a8
   1,32%  vhost-5127  [kernel.vmlinux]  [k] 0xffffffff852cdec0
   0,94%  vhost-5127  [kernel.vmlinux]  [k] 0xffffffff85291688
   0,84%  vhost-5127  [kernel.vmlinux]  [k] 0xffffffff852cde49
   0,79%  vhost-5127  [kernel.vmlinux]  [k] 0xffffffff852cde44
   0,67%  vhost-5127  [kernel.vmlinux]  [k] 0xffffffff8529167c
   0,66%  vhost-5127  [kernel.vmlinux]  [k] 0xffffffff852cde5e
   0,64%  vhost-5127  [kernel.vmlinux]  [k] 0xffffffff8510b0b6
   0,59%  vhost-5127  [kernel.vmlinux]  [k] 0xffffffff85291663
   0,59%  vhost-5127  [vhost_iotlb]     [k] vhost_iotlb_itree_subtree_search.part.2
   0,57%  vhost-5127  [kernel.vmlinux]  [k] 0xffffffff852916c0

For tx, here we have the base, with a lot of
copy_user_generic/copy_page_from_iter:
  28,87%  vhost-3095  [kernel.vmlinux]  [k] copy_user_generic_unrolled
  16,34%  vhost-3095  [kernel.vmlinux]  [k] copy_page_from_iter
  11,53%  vhost-3095  [vhost_net]       [k] handle_tx_copy
   7,87%  vhost-3095  [vhost]           [k] vhost_get_vq_desc
   5,42%  vhost-3095  [vhost]           [k] translate_desc
   3,47%  vhost-3095  [kernel.vmlinux]  [k] copy_user_enhanced_fast_string
   3,16%  vhost-3095  [tun]             [k] tun_sendmsg
   2,72%  vhost-3095  [vhost_net]       [k] get_tx_bufs
   2,19%  vhost-3095  [vhost_iotlb]     [k] vhost_iotlb_itree_subtree_search.part.2
   1,84%  vhost-3095  [kernel.vmlinux]  [k] iov_iter_advance
   1,21%  vhost-3095  [tun]             [k] tun_xdp_act.isra.54
   1,15%  vhost-3095  [kernel.vmlinux]  [k] __netif_receive_skb_core
   1,10%  vhost-3095  [kernel.vmlinux]  [k] kmem_cache_free
   1,08%  vhost-3095  [kernel.vmlinux]  [k] __skb_flow_dissect
   0,93%  vhost-3095  [vhost_iotlb]     [k] vhost_iotlb_itree_first
   0,79%  vhost-3095  [vhost]           [k] vhost_exceeds_weight
   0,72%  vhost-3095  [kernel.vmlinux]  [k] copyin
   0,55%  vhost-3095  [vhost]           [k] vhost_signal

And, again, the batch version with unknown symbols. I expected two of
them (copy_user_generic/copy_page_from_iter), but only one unknown
symbol was found.
  21,40%  vhost-3382  [kernel.vmlinux]  [k] 0xffffffff852cde46
  11,07%  vhost-3382  [vhost_net]       [k] handle_tx_copy
   9,91%  vhost-3382  [vhost]           [k] fetch_buf
   3,81%  vhost-3382  [vhost]           [k] vhost_get_vq_desc
   3,55%  vhost-3382  [kernel.vmlinux]  [k] 0xffffffff852cde6e
   3,10%  vhost-3382  [tun]             [k] tun_sendmsg
   2,64%  vhost-3382  [vhost_net]       [k] get_tx_bufs
   2,26%  vhost-3382  [vhost]           [k] translate_desc

Do you want different reports? I will try to resolve these unknown
symbols, and to generate pktgen reports too.

Thanks!

> Thanks

>

>

> >

>