
[v8,bpf-next,00/14] mvneta: introduce XDP multi-buffer support

Message ID cover.1617885385.git.lorenzo@kernel.org
Series mvneta: introduce XDP multi-buffer support

Message

Lorenzo Bianconi April 8, 2021, 12:50 p.m. UTC
This series introduces XDP multi-buffer support. The mvneta driver is
the first to support these new "non-linear" xdp_{buff,frame}. Reviewers,
please focus on how these new types of xdp_{buff,frame} packets
traverse the different layers and on the layout design. It is on purpose
that BPF-helpers are kept simple, as we don't want to expose the
internal layout to allow later changes.

For now, to keep the design simple and to maintain performance, the XDP
BPF-prog (still) only has access to the first buffer. It is left for
later (another patchset) to add payload access across multiple buffers.
This patchset should still allow for these future extensions. The goal
is to lift the MTU restriction that comes with XDP while maintaining
the same performance as before.

The main idea for the new multi-buffer layout is to reuse the same
layout used for non-linear SKBs. We introduced an "xdp_shared_info" data
structure at the end of the first buffer to link together subsequent buffers.
xdp_shared_info aliases skb_shared_info, allowing most of the frags to be
kept in the same cache-line (whereas with skb_shared_info only the first
fragment fits in the first "shared_info" cache-line). Moreover, we introduced
some xdp_shared_info helpers aligned with the skb_frag* ones.
Converting an xdp_frame to an SKB and delivering it to the network stack is
shown in patch 07/14. When building the SKB, the xdp_shared_info structure
is converted into a skb_shared_info one.
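
To make the layout more concrete, here is a rough sketch of the idea. The
actual definition and helpers live in patch 02/14; the field names below are
illustrative only, not the final layout:

struct xdp_shared_info {
        /* Illustrative only -- see patch 02/14 for the real definition.
         * The structure sits at the tail of the first buffer, in the same
         * spot where skb_shared_info lives for an SKB, so the conversion
         * in patch 07/14 can happen in place.
         */
        u16 nr_frags;                   /* number of paged fragments */
        u16 data_length;                /* total length of the paged area */
        skb_frag_t frags[MAX_SKB_FRAGS];
};

static inline struct xdp_shared_info *
xdp_get_shared_info_from_buff(struct xdp_buff *xdp)
{
        /* xdp_data_hard_end() points at the tailroom reserved for the
         * shared_info area at the end of the first buffer.
         */
        return (struct xdp_shared_info *)xdp_data_hard_end(xdp);
}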

A multi-buffer bit (mb) has been introduced in the xdp_{buff,frame}
structures to notify the bpf/network layer whether this is an xdp
multi-buffer frame (mb = 1) or not (mb = 0).
The mb bit is set by an xdp multi-buffer capable driver only for
non-linear frames, so linear frames can still be received without any
extra cost, since the xdp_shared_info structure at the end of the first
buffer is initialized only if mb is set.
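
A minimal sketch of the RX-path logic described above (the condition and the
fill_rx_fragments() helper are illustrative, not actual mvneta code):

/* Single-buffer frames never touch the shared_info area, so the
 * common linear path pays no extra cost.
 */
xdp_init_buff(&xdp, frame_sz, &rxq);    /* mb is initialized to 0 here */
xdp_prepare_buff(&xdp, hard_start, headroom, len, false);

if (frame_spans_multiple_buffers) {     /* illustrative condition */
        struct xdp_shared_info *xdp_sinfo;

        xdp_sinfo = xdp_get_shared_info_from_buff(&xdp);
        xdp_sinfo->nr_frags = 0;
        fill_rx_fragments(xdp_sinfo);   /* illustrative: append frags[] */
        xdp.mb = 1;                     /* mark as non-linear */
}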

Typical use cases for this series are:
- Jumbo-frames
- Packet header split (please see Google’s use-case @ NetDevConf 0x14, [0])
- TSO

A new frame_length field has been introduced in the XDP ctx in order to notify
the eBPF layer about the total frame size (linear + paged parts).

The bpf_xdp_adjust_tail and bpf_xdp_copy helpers have been modified to take
xdp multi-buff frames into account.
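
A minimal XDP program sketch of how these pieces are meant to be used
together, assuming the new field is exposed to programs as
xdp_md->frame_length (per Eelco's patches in this set):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("xdp")
int xdp_mb_example(struct xdp_md *ctx)
{
        void *data     = (void *)(long)ctx->data;
        void *data_end = (void *)(long)ctx->data_end;

        /* Bytes of the first (linear) buffer, directly accessible. */
        __u32 linear_len = data_end - data;

        /* Total frame size, linear + paged parts (new field). */
        __u32 total_len = ctx->frame_length;

        /* Shrink the frame by 4 bytes: with this series the helper also
         * handles frames whose tail lives in a paged fragment.
         */
        if (total_len > linear_len && bpf_xdp_adjust_tail(ctx, -4))
                return XDP_DROP;

        return XDP_PASS;
}

char _license[] SEC("license") = "GPL";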

More info about the main idea behind this approach can be found here [1][2].

Changes since v7:
- rebase on top of bpf-next
- fix sparse warnings
- improve comments for frame_length in include/net/xdp.h

Changes since v6:
- the main difference with respect to previous versions is the new approach
  proposed by Eelco to pass the full length of the packet to the eBPF layer
  in the XDP context
- reintroduce multi-buff support to eBPF kselftests
- reintroduce multi-buff support to bpf_xdp_adjust_tail helper
- introduce multi-buffer support to bpf_xdp_copy helper
- rebase on top of bpf-next

Changes since v5:
- rebase on top of bpf-next
- initialize mb bit in xdp_init_buff() and drop per-driver initialization
- drop xdp->mb initialization in xdp_convert_zc_to_xdp_frame()
- postpone introduction of frame_length field in XDP ctx to another series
- minor changes

Changes since v4:
- rebase on top of bpf-next
- introduce xdp_shared_info to build xdp multi-buff instead of using the
  skb_shared_info struct
- introduce frame_length in xdp ctx
- drop previous bpf helpers
- fix bpf_xdp_adjust_tail for xdp multi-buff
- introduce xdp multi-buff self-tests for bpf_xdp_adjust_tail
- fix xdp_return_frame_bulk for xdp multi-buff

Changes since v3:
- rebase on top of bpf-next
- add patch 10/13 to copy back paged data from a xdp multi-buff frame to
  userspace buffer for xdp multi-buff selftests

Changes since v2:
- add throughput measurements
- drop bpf_xdp_adjust_mb_header bpf helper
- introduce selftest for xdp multibuffer
- addressed comments on bpf_xdp_get_frags_count
- introduce xdp multi-buff support to cpumaps

Changes since v1:
- Fix use-after-free in xdp_return_{buff/frame}
- Introduce bpf helpers
- Introduce xdp_mb sample program
- access skb_shared_info->nr_frags only on the last fragment

Changes since RFC:
- squash multi-buffer bit initialization into a single patch
- add mvneta non-linear XDP buff support for tx side

[0] https://netdevconf.info/0x14/session.html?talk-the-path-to-tcp-4k-mtu-and-rx-zerocopy
[1] https://github.com/xdp-project/xdp-project/blob/master/areas/core/xdp-multi-buffer01-design.org
[2] https://netdevconf.info/0x14/session.html?tutorial-add-XDP-support-to-a-NIC-driver (XDP multi-buffers section)

Eelco Chaudron (4):
  bpf: add multi-buff support to the bpf_xdp_adjust_tail() API
  bpf: add multi-buffer support to xdp copy helpers
  bpf: add new frame_length field to the XDP ctx
  bpf: update xdp_adjust_tail selftest to include multi-buffer

Lorenzo Bianconi (10):
  xdp: introduce mb in xdp_buff/xdp_frame
  xdp: add xdp_shared_info data structure
  net: mvneta: update mb bit before passing the xdp buffer to eBPF layer
  xdp: add multi-buff support to xdp_return_{buff/frame}
  net: mvneta: add multi buffer support to XDP_TX
  net: mvneta: enable jumbo frames for XDP
  net: xdp: add multi-buff support to xdp_build_skb_from_frame
  bpf: move user_size out of bpf_test_init
  bpf: introduce multibuff support to bpf_prog_test_run_xdp()
  bpf: test_run: add xdp_shared_info pointer in bpf_test_finish
    signature

 drivers/net/ethernet/marvell/mvneta.c         | 182 ++++++++++--------
 include/linux/filter.h                        |   7 +
 include/net/xdp.h                             | 105 +++++++++-
 include/uapi/linux/bpf.h                      |   1 +
 net/bpf/test_run.c                            | 109 +++++++++--
 net/core/filter.c                             | 134 ++++++++++++-
 net/core/xdp.c                                | 103 +++++++++-
 tools/include/uapi/linux/bpf.h                |   1 +
 .../bpf/prog_tests/xdp_adjust_tail.c          | 105 ++++++++++
 .../selftests/bpf/prog_tests/xdp_bpf2bpf.c    | 127 ++++++++----
 .../bpf/progs/test_xdp_adjust_tail_grow.c     |  17 +-
 .../bpf/progs/test_xdp_adjust_tail_shrink.c   |  32 ++-
 .../selftests/bpf/progs/test_xdp_bpf2bpf.c    |   3 +-
 13 files changed, 767 insertions(+), 159 deletions(-)

Comments

Lorenzo Bianconi April 9, 2021, 8:16 p.m. UTC | #1
> Lorenzo Bianconi wrote:

[...]

> 
> I just read the commit messages for v8 so far. But, I'm still wondering how
> to handle use cases where we want to put extra bytes at the end of the
> packet, or really anywhere in the general case. We can extend tail with above
> is there anyway to then write into that extra space?
> 
> I think most use cases will only want headers so we can likely make it 
> a callout to a helper. Could we add something like, xdp_get_bytes(start, end)
> to pull in the bytes?
> 
> My dumb pseudoprogram being something like,
> 
>   trailer[16] = {0,1,2,3,4,5,6,7,8,9,a,b,c,d,e}
>   trailer_size = 16;
>   old_end = xdp->length;
>   new_end = xdp->length + trailer_size;
> 
>   err = bpf_xdp_adjust_tail(xdp, trailer_size)
>   if (err) return err;
> 
>   err = xdp_get_bytes(xdp, old_end, new_end);
>   if (err) return err;
> 
>   memcpy(xdp->data, trailer, trailer_size);
> 
> Do you think that could work if we code up xdp_get_bytes()? Does the driver
> have enough context to adjust xdp to map to my get_bytes() call? I think
> so but we should check.

Hi John,

can you please give more details about how xdp_get_bytes() is expected to work?
IIUC the trailer will be pulled to the beginning of the frame after updating the
xdp_buff with the xdp_get_bytes helper, correct?
If so, I guess it is doable; it is just a matter of reserving more space when
mapping the buffer on the DMA engine with respect to XDP_PACKET_HEADROOM. If
the frame does not fit in a single buffer, it will be split over multiple
buffers.
If you are referring to adding the trailer at the end of the buffer, I guess
that is doable as well by introducing a bpf helper.
I guess both of these solutions are orthogonal to this series.

Regards,
Lorenzo

> 
> > 
> > More info about the main idea behind this approach can be found here [1][2].
> 
> Thanks for working on this!
>
Eelco Chaudron April 13, 2021, 3:16 p.m. UTC | #2
On 9 Apr 2021, at 2:56, John Fastabend wrote:

> Lorenzo Bianconi wrote:
>> This series introduce XDP multi-buffer support. The mvneta driver is
>> the first to support these new "non-linear" xdp_{buff,frame}.
>> [...]
>
> I just read the commit messages for v8 so far. But, I'm still wondering how
> to handle use cases where we want to put extra bytes at the end of the
> packet, or really anywhere in the general case. We can extend tail with above
> is there anyway to then write into that extra space?
>
> I think most use cases will only want headers so we can likely make it
> a callout to a helper. Could we add something like, xdp_get_bytes(start, end)
> to pull in the bytes?
>
> [...]
>
> Do you think that could work if we code up xdp_get_bytes()? Does the driver
> have enough context to adjust xdp to map to my get_bytes() call? I think
> so but we should check.


I was thinking of doing something like the below, but I have no cycles 
to work on it:

void *bpf_xdp_access_bytes(struct xdp_buff *xdp_md, u32 offset, int *len, void *buffer)
      Description
              This function returns a pointer to the packet data, which can be
              accessed linearly for a maximum of *len* bytes.

              *offset* marks the starting point in the packet for which you
              would like to get a data pointer.

              *len* points to an initialized integer which tells the helper
              how many bytes from *offset* you would like to access. Supplying
              a value of 0 or less will tell the helper to report back how
              many bytes are available linearly from the offset (in this case
              the value of *buffer* is ignored). On return, the helper will
              update this value with the length available to access linearly
              at the address returned.

              *buffer* points to an optional buffer which MUST be the same
              size as *\*len* and will be used to copy in the data if it's
              not available linearly.

      Return
              Returns a pointer to the packet data requested, accessible with
              a maximum length of *\*len*. NULL is returned on failure.

              Note that if a *buffer* is supplied and the data is not
              available linearly, the content is copied. In this case a
              pointer to *buffer* is returned.


int bpf_xdp_store_bytes(struct xdp_buff *xdp_md, u32 offset, const void *from, u32 len)
      Description
              Store *len* bytes from address *from* into the packet associated
              to *xdp_md*, at *offset*. This function will take care of
              copying data to multi-buffer XDP packets.

              A call to this helper is susceptible to change the underlying
              packet buffer. Therefore, at load time, all checks on pointers
              previously done by the verifier are invalidated and must be
              performed again, if the helper is used in combination with
              direct packet access.

      Return
              0 on success, or a negative error in case of failure.
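
As an illustration, John's trailer example could then look roughly like the
sketch below. This is purely hypothetical code: the two helpers are only
proposed above and do not exist yet, and frame_length is the new ctx field
from this series.

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("xdp")
int xdp_append_trailer(struct xdp_md *ctx)
{
        __u8 trailer[16] = { 0, 1, 2, 3, 4, 5, 6, 7,
                             8, 9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf };
        __u32 old_end = ctx->frame_length;

        /* Grow the frame; the new tail may end up in a paged fragment. */
        if (bpf_xdp_adjust_tail(ctx, sizeof(trailer)))
                return XDP_DROP;

        /* Write the trailer at the old end of the frame and let the
         * (proposed) helper deal with data split across buffers.
         */
        if (bpf_xdp_store_bytes(ctx, old_end, trailer, sizeof(trailer)))
                return XDP_DROP;

        return XDP_PASS;
}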

>> More info about the main idea behind this approach can be found here [1][2].
>
> Thanks for working on this!
Magnus Karlsson April 16, 2021, 2:27 p.m. UTC | #3
On Thu, Apr 8, 2021 at 2:51 PM Lorenzo Bianconi <lorenzo@kernel.org> wrote:
>
> This series introduce XDP multi-buffer support. The mvneta driver is
> the first to support these new "non-linear" xdp_{buff,frame}. Reviewers
> please focus on how these new types of xdp_{buff,frame} packets
> traverse the different layers and the layout design.
> [...]
>
> [0] https://netdevconf.info/0x14/session.html?talk-the-path-to-tcp-4k-mtu-and-rx-zerocopy
> [1] https://github.com/xdp-project/xdp-project/blob/master/areas/core/xdp-multi-buffer01-design.org
> [2] https://netdevconf.info/0x14/session.html?tutorial-add-XDP-support-to-a-NIC-driver (XDP multi-buffers section)


Took your patches for a test run with the AF_XDP sample xdpsock on an
i40e card and the throughput degradation is between 2 to 6% depending
on the setup and microbenchmark within xdpsock that is executed. And
this is without sending any multi frame packets. Just single frame
ones. Tirtha made changes to the i40e driver to support this new
interface so that is being included in the measurements.

What performance do you see with the mvneta card? How much are we
willing to pay for this feature when it is not being used or can we in
some way selectively turn it on only when needed?

Thanks: Magnus

> Eelco Chaudron (4):
>   [...]
>
> Lorenzo Bianconi (10):
>   [...]
>
>  13 files changed, 767 insertions(+), 159 deletions(-)
>
> --
> 2.30.2
Lorenzo Bianconi April 16, 2021, 9:29 p.m. UTC | #4
> Took your patches for a test run with the AF_XDP sample xdpsock on an
> i40e card and the throughput degradation is between 2 to 6% depending
> on the setup and microbenchmark within xdpsock that is executed. And
> this is without sending any multi frame packets. Just single frame
> ones. Tirtha made changes to the i40e driver to support this new
> interface so that is being included in the measurements.


Hi Magnus,

thx for working on it. Assuming the fragmented part is only initialized/accessed
if mb is set (so for multi frame packets), I would not expect any throughput
degradation in the single frame scenario. Can you please share the i40e
support added by Tirtha?

> What performance do you see with the mvneta card? How much are we
> willing to pay for this feature when it is not being used or can we in
> some way selectively turn it on only when needed?


IIRC I did not see any noticeable throughput degradation on mvneta, but I will
re-run the tests on an updated bpf-next tree.

Regards,
Lorenzo

> Thanks: Magnus
>
> > Eelco Chaudron (4):
> >   [...]
> >
> > Lorenzo Bianconi (10):
> >   [...]
> >
> >  13 files changed, 767 insertions(+), 159 deletions(-)
> >
> > --
> > 2.30.2
Daniel Borkmann April 16, 2021, 11 p.m. UTC | #5
On 4/16/21 11:29 PM, Lorenzo Bianconi wrote:
>> Took your patches for a test run with the AF_XDP sample xdpsock on an
>> i40e card and the throughput degradation is between 2 to 6% depending
>> on the setup and microbenchmark within xdpsock that is executed. And
>> this is without sending any multi frame packets. Just single frame
>> ones. Tirtha made changes to the i40e driver to support this new
>> interface so that is being included in the measurements.
>
> thx for working on it. Assuming the fragmented part is only initialized/accessed
> if mb is set (so for multi frame packets), I would not expect any throughput
> degradation in the single frame scenario. Can you please share the i40e
> support added by Tirtha?


Thanks Tirtha & Magnus for adding and testing mb support for i40e, and sharing those
data points; a degradation between 2-6% when mb is not used would definitely not be
acceptable. Would be great to root-cause and debug this further with Lorenzo, there
really should be close to /zero/ additional overhead to avoid regressing existing
performance sensitive workloads like load balancers, etc once they upgrade their
kernels/drivers.

>> What performance do you see with the mvneta card? How much are we
>> willing to pay for this feature when it is not being used or can we in
>> some way selectively turn it on only when needed?
>
> IIRC I did not see any noticeable throughput degradation on mvneta, but I will
> re-run the tests on an updated bpf-next tree.


But compared to i40e, mvneta is also only 1-2.5 Gbps so potentially less visible,
right [0]? Either way, it's definitely good to get more data points from benchmarking
given this was lacking before for higher speed NICs in particular.

Thanks everyone,
Daniel

   [0] https://doc.dpdk.org/guides/nics/mvneta.html
Jesper Dangaard Brouer April 18, 2021, 4:18 p.m. UTC | #6
On Fri, 16 Apr 2021 16:27:18 +0200
Magnus Karlsson <magnus.karlsson@gmail.com> wrote:

> On Thu, Apr 8, 2021 at 2:51 PM Lorenzo Bianconi <lorenzo@kernel.org> wrote:
> >
> > This series introduce XDP multi-buffer support. The mvneta driver is
> > the first to support these new "non-linear" xdp_{buff,frame}.
> > [...]
>
> Took your patches for a test run with the AF_XDP sample xdpsock on an
> i40e card and the throughput degradation is between 2 to 6% depending
> on the setup and microbenchmark within xdpsock that is executed. And
> this is without sending any multi frame packets. Just single frame
> ones. Tirtha made changes to the i40e driver to support this new
> interface so that is being included in the measurements.


Could you please share Tirtha's i40e support patch with me?

I would like to reproduce these results in my testlab, in order to
figure out where the throughput degradation comes from.

> What performance do you see with the mvneta card? How much are we
> willing to pay for this feature when it is not being used or can we in
> some way selectively turn it on only when needed?


Well, as Daniel says, performance wise we require close to /zero/
additional overhead, especially as you state this happens when sending
a single frame, which is a base case that we must not slow down.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
Magnus Karlsson April 19, 2021, 6:20 a.m. UTC | #7
On Sun, Apr 18, 2021 at 6:18 PM Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
>
> [...]
>
> Could you please share Tirtha's i40e support patch with me?


We will post them on the list as an RFC. Tirtha also added AF_XDP
multi-frame support on top of Lorenzo's patches so we will send that
one out as well. Will also rerun my experiments, properly document
them and send out just to be sure that I did not make any mistake.

Just note that I would really like for the multi-frame support to get
in. I have lost count of how many people have asked for it to be
added to XDP and AF_XDP. So please check our implementation and
improve it so we can get the overhead down to where we want it to be.

Thanks: Magnus

> I would like to reproduce these results in my testlab, in order to
> figure out where the throughput degradation comes from.
>
> [...]
Lorenzo Bianconi April 19, 2021, 6:55 a.m. UTC | #8
> On Sun, Apr 18, 2021 at 6:18 PM Jesper Dangaard Brouer
> <brouer@redhat.com> wrote:
> >
> > [...]
> >
> > Could you please share Tirtha's i40e support patch with me?
>
> We will post them on the list as an RFC. Tirtha also added AF_XDP
> multi-frame support on top of Lorenzo's patches so we will send that
> one out as well. Will also rerun my experiments, properly document
> them and send out just to be sure that I did not make any mistake.


ack, very cool, thx

> Just note that I would really like for the multi-frame support to get
> in. I have lost count of how many people have asked for it to be
> added to XDP and AF_XDP. So please check our implementation and
> improve it so we can get the overhead down to where we want it to be.


sure, I will do.

Regards,
Lorenzo

> Thanks: Magnus
>
> > [...]
Magnus Karlsson April 20, 2021, 1:49 p.m. UTC | #9
On Mon, Apr 19, 2021 at 8:56 AM Lorenzo Bianconi
<lorenzo.bianconi@redhat.com> wrote:
>
> > [...]
> >
> > We will post them on the list as an RFC. Tirtha also added AF_XDP
> > multi-frame support on top of Lorenzo's patches so we will send that
> > one out as well. Will also rerun my experiments, properly document
> > them and send out just to be sure that I did not make any mistake.
>
> ack, very cool, thx


I have now run a new set of experiments on a Cascade Lake server at
2.1 GHz with turbo boost disabled. Two NICs: i40e and ice. The
baseline is commit 5c507329000e ("libbpf: Clarify flags in ringbuf
helpers") and Lorenzo's and Eelco's path set is their v8. First some
runs with xdpsock (i.e. AF_XDP) in both 2-core mode (app on one core
and the driver on another) and 1-core mode using busy_poll.

xdpsock rxdrop throughput change with the multi-buffer patches without
any driver changes:
1-core i40e: -0.5 to 0%   2-cores i40e: -0.5%
1-core ice: -2%   2-cores ice: -1 to -0.5%

xdp_rxq_info -a XDP_DROP
i40e: -4%   ice: +8%

xdp_rxq_info -a XDP_TX
i40e: -10%   ice: +9%

The XDP results with xdp_rxq_info are just weird! I reran them three
times, rebuilt and rebooted in between and I always get the same
results. And I also checked that I am running on the correct NUMA node
and so on. But I have a hard time believing them. Nearly +10% and -10%
difference. Too much in my book. Jesper, could you please run the same
and see what you get? The xdpsock numbers are more in the ballpark of
what I would expect.

Tirtha and I found some optimizations in the i40e
multi-frame/multi-buffer support that we have implemented. Will test
those next, post the results and share the code.

> > [...]
Jesper Dangaard Brouer April 21, 2021, 12:47 p.m. UTC | #10
On Tue, 20 Apr 2021 15:49:44 +0200
Magnus Karlsson <magnus.karlsson@gmail.com> wrote:

> On Mon, Apr 19, 2021 at 8:56 AM Lorenzo Bianconi
> <lorenzo.bianconi@redhat.com> wrote:
>
> [...]
>
> I have now run a new set of experiments on a Cascade Lake server at
> 2.1 GHz with turbo boost disabled. Two NICs: i40e and ice. The
> baseline is commit 5c507329000e ("libbpf: Clarify flags in ringbuf
> helpers") and Lorenzo's and Eelco's patch set is their v8. First some
> runs with xdpsock (i.e. AF_XDP) in both 2-core mode (app on one core
> and the driver on another) and 1-core mode using busy_poll.
>
> xdpsock rxdrop throughput change with the multi-buffer patches without
> any driver changes:
> 1-core i40e: -0.5 to 0%   2-cores i40e: -0.5%
> 1-core ice: -2%   2-cores ice: -1 to -0.5%
>
> xdp_rxq_info -a XDP_DROP
> i40e: -4%   ice: +8%
>
> xdp_rxq_info -a XDP_TX
> i40e: -10%   ice: +9%
>
> The XDP results with xdp_rxq_info are just weird! I reran them three
> times, rebuilt and rebooted in between and I always get the same
> results. And I also checked that I am running on the correct NUMA node
> and so on. But I have a hard time believing them. Nearly +10% and -10%
> difference. Too much in my book. Jesper, could you please run the same
> and see what you get?


We of course have to find the root cause of the +-10%, but let me drill
into what the 10% represents time/cycle wise.  Using a percentage
difference is usually a really good idea as it implies a comparative
measure (something I always request people to do, as a single
performance number means nothing by itself).

For zoom-in benchmarks like these, where the amount of code executed
is very small, the effect of removing or adding code can affect the
measurement a lot.

I can only do the tests for i40e, as I don't have ice hardware (but
Intel is working on fixing that ;-)).

 xdp_rxq_info -a XDP_DROP
  i40e: 33,417,775 pps

 CPU is 100% used, so we can calculate nanosec used per packet:
  29.92 nanosec (1/33417775*10^9)
  2.1 GHz CPU =  approx 63 CPU-cycles

 You lost -4% performance in this case.  This corresponds to:
  -1.2 nanosec (29.92*0.04) slower
  (This could be the cost of a single func call overhead = 1.3 ns)

My measurement for XDP_TX:

 xdp_rxq_info -a XDP_TX
  28,278,722 pps
  35.36 ns (1/28278722*10^9)

 You lost -10% performance in this case:
  -3.54 nanosec (35.36*0.10) slower

In an XDP context 3.54 nanosec is a lot; as you can see, it is 10% in this
zoom-in benchmark.  We have to look at the details.

One detail/issue with i40e doing XDP_TX is that I cannot verify that
packets are actually transmitted... not via the exception tracepoint, not
via netstats, not via ethtool_stats.pl.  Maybe all the packets are
getting (silently) dropped in my tests...!?!
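
For reference, the per-packet arithmetic above can be reproduced with a
trivial userspace helper (plain C, nothing kernel specific):

#include <stdio.h>

/* ns/packet = 1e9 / pps; cycles/packet = ns * GHz; a relative change
 * of 'delta' then costs ns * delta per packet.
 */
static void report(const char *name, double pps, double ghz, double delta)
{
        double ns = 1e9 / pps;

        printf("%s: %.2f ns/pkt (~%.0f cycles), %+.0f%% is %.2f ns/pkt\n",
               name, ns, ns * ghz, delta * 100.0, ns * delta);
}

int main(void)
{
        report("XDP_DROP", 33417775.0, 2.1, -0.04);  /* ~29.92 ns, -1.2 ns */
        report("XDP_TX",   28278722.0, 2.1, -0.10);  /* ~35.36 ns, -3.5 ns */
        return 0;
}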


> The xdpsock numbers are more in the ballpark of
> what I would expect.
>
> Tirtha and I found some optimizations in the i40e
> multi-frame/multi-buffer support that we have implemented. Will test
> those next, post the results and share the code.
>
> [...]


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Running XDP on dev:i40e2 (ifindex:6) action:XDP_DROP options:read
XDP stats       CPU     pps         issue-pps  
XDP-RX CPU      2       33,417,775  0          
XDP-RX CPU      total   33,417,775 

RXQ stats       RXQ:CPU pps         issue-pps  
rx_queue_index    2:2   33,417,775  0          
rx_queue_index    2:sum 33,417,775 


Running XDP on dev:i40e2 (ifindex:6) action:XDP_TX options:swapmac
XDP stats       CPU     pps         issue-pps  
XDP-RX CPU      2       28,278,722  0          
XDP-RX CPU      total   28,278,722 

RXQ stats       RXQ:CPU pps         issue-pps  
rx_queue_index    2:2   28,278,726  0          
rx_queue_index    2:sum 28,278,726
Magnus Karlsson April 21, 2021, 2:12 p.m. UTC | #11
On Wed, Apr 21, 2021 at 2:48 PM Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
>
> [...]
>
> I can only do the tests for i40e, as I don't have ice hardware (but
> Intel is working on fixing that ;-)).
>
>  xdp_rxq_info -a XDP_DROP
>   i40e: 33,417,775 pps


Here I only get around 21 Mpps

>  CPU is 100% used, so we can calculate nanosec used per packet:
>   29.92 nanosec (1/33417775*10^9)
>   2.1 GHz CPU =  approx 63 CPU-cycles
>
>  You lost -4% performance in this case.  This corresponds to:
>   -1.2 nanosec (29.92*0.04) slower
>   (This could be the cost of a single func call overhead = 1.3 ns)
>
> My measurement for XDP_TX:
>
>  xdp_rxq_info -a XDP_TX
>   28,278,722 pps
>   35.36 ns (1/28278722*10^9)


And here, much lower at around 8 Mpps. But I do see correct packets
coming back on the cable for i40e but not for ice! There is likely a
bug there in the XDP_TX logic for ice. Might explain the weird results
I am getting. Will investigate.

But why do I get only a fraction of your performance? XDP_TX touches
the packet so I would expect it to be far less than what you get, but
more than I get. What CPU core do you run on? It actually looks like
your packet data gets prefetched successfully. If it had not, you
would have gotten an access to LLC which is much more expensive than
the drop you are seeing. If I run on the wrong NUMA node, I get 4
Mpps, so it is not that.

One interesting thing is that I get better results using the zero-copy
path in the driver. I start xdp_rxq_drop then tie an AF_XDP socket to
the queue id the XDP program gets its traffic from. The AF_XDP program
will get no traffic in this case, but it will force the driver to use
the zero-copy path for its XDP processing. In this case I get this:

-0.5% for XDP_DROP and +-0% for XDP_TX for i40e.

>  You lost -10% performance in this case:
>   -3.54 nanosec (35.36*0.10) slower
>
> [...]
Jesper Dangaard Brouer April 21, 2021, 3:39 p.m. UTC | #12
On Wed, 21 Apr 2021 16:12:32 +0200
Magnus Karlsson <magnus.karlsson@gmail.com> wrote:

> On Wed, Apr 21, 2021 at 2:48 PM Jesper Dangaard Brouer

> <brouer@redhat.com> wrote:

> >

> > On Tue, 20 Apr 2021 15:49:44 +0200

> > Magnus Karlsson <magnus.karlsson@gmail.com> wrote:

> >  

> > > On Mon, Apr 19, 2021 at 8:56 AM Lorenzo Bianconi

> > > <lorenzo.bianconi@redhat.com> wrote:  

> > > >  

> > > > > On Sun, Apr 18, 2021 at 6:18 PM Jesper Dangaard Brouer

> > > > > <brouer@redhat.com> wrote:  

> > > > > >

> > > > > > On Fri, 16 Apr 2021 16:27:18 +0200

> > > > > > Magnus Karlsson <magnus.karlsson@gmail.com> wrote:

> > > > > >  

> > > > > > > On Thu, Apr 8, 2021 at 2:51 PM Lorenzo Bianconi <lorenzo@kernel.org> wrote:  

> > > > > > > >

> > > > > > > > This series introduce XDP multi-buffer support. The mvneta driver is

> > > > > > > > the first to support these new "non-linear" xdp_{buff,frame}. Reviewers

> > > > > > > > please focus on how these new types of xdp_{buff,frame} packets

> > > > > > > > traverse the different layers and the layout design. It is on purpose

> > > > > > > > that BPF-helpers are kept simple, as we don't want to expose the

> > > > > > > > internal layout to allow later changes.

> > > > > > > >

> > > > > > > > For now, to keep the design simple and to maintain performance, the XDP

> > > > > > > > BPF-prog (still) only have access to the first-buffer. It is left for

> > > > > > > > later (another patchset) to add payload access across multiple buffers.

> > > > > > > > This patchset should still allow for these future extensions. The goal

> > > > > > > > is to lift the XDP MTU restriction that comes with XDP, but maintain

> > > > > > > > same performance as before.  

> > > > > > [...]  

> > > > > > > >

> > > > > > > > [0] https://netdevconf.info/0x14/session.html?talk-the-path-to-tcp-4k-mtu-and-rx-zerocopy

> > > > > > > > [1] https://github.com/xdp-project/xdp-project/blob/master/areas/core/xdp-multi-buffer01-design.org

> > > > > > > > [2] https://netdevconf.info/0x14/session.html?tutorial-add-XDP-support-to-a-NIC-driver (XDPmulti-buffers section)  

> > > > > > >

> > > > > > > Took your patches for a test run with the AF_XDP sample xdpsock on an

> > > > > > > i40e card and the throughput degradation is between 2 to 6% depending

> > > > > > > on the setup and microbenchmark within xdpsock that is executed. And

> > > > > > > this is without sending any multi frame packets. Just single frame

> > > > > > > ones. Tirtha made changes to the i40e driver to support this new

> > > > > > > interface so that is being included in the measurements.  

> > > > > >

> > > > > > Could you please share Tirtha's i40e support patch with me?  

> > > > >

> > > > > We will post them on the list as an RFC. Tirtha also added AF_XDP

> > > > > multi-frame support on top of Lorenzo's patches so we will send that

> > > > > one out as well. Will also rerun my experiments, properly document

> > > > > them and send out just to be sure that I did not make any mistake.  

> > > >

> > > > ack, very cool, thx  

> > >

> > > I have now run a new set of experiments on a Cascade Lake server at

> > > 2.1 GHz with turbo boost disabled. Two NICs: i40e and ice. The

> > > baseline is commit 5c507329000e ("libbpf: Clarify flags in ringbuf

> > > helpers") and Lorenzo's and Eelco's path set is their v8. First some

> > > runs with xdpsock (i.e. AF_XDP) in both 2-core mode (app on one core

> > > and the driver on another) and 1-core mode using busy_poll.

> > >

> > > xdpsock rxdrop throughput change with the multi-buffer patches without

> > > any driver changes:

> > > 1-core i40e: -0.5 to 0%   2-cores i40e: -0.5%

> > > 1-core ice: -2%   2-cores ice: -1 to -0.5%

> > >

> > > xdp_rxq_info -a XDP_DROP

> > > i40e: -4%   ice: +8%

> > >

> > > xdp_rxq_info -a XDP_TX

> > > i40e: -10%   ice: +9%

> > >

> > > The XDP results with xdp_rxq_info are just weird! I reran them three

> > > times, rebuilt and rebooted in between and I always get the same

> > > results. And I also checked that I am running on the correct NUMA node

> > > and so on. But I have a hard time believing them. Nearly +10% and -10%

> > > difference. Too much in my book. Jesper, could you please run the same

> > > and see what you get?  

> >

> > We of-cause have to find the root-cause of the +-10%, but let me drill

> > into what the 10% represent time/cycle wise.  Using a percentage

> > difference is usually a really good idea as it implies a comparative

> > measure (something I always request people to do, as a single

> > performance number means nothing by itself).

> >

> > For a zoom-in-benchmarks like these where the amount of code executed

> > is very small, the effect of removing or adding code can effect the

> > measurement a lot.

> >

> > I can only do the tests for i40e, as I don't have ice hardware (but

> > Intel is working on fixing that ;-)).

> >

> >  xdp_rxq_info -a XDP_DROP

> >   i40e: 33,417,775 pps  

> 

> Here I only get around 21 Mpps

> 

> >  CPU is 100% used, so we can calculate nanosec used per packet:

> >   29.92 nanosec (1/33417775*10^9)

> >   2.1 GHz CPU =  approx 63 CPU-cycles

> >

> >  You lost -4% performance in this case.  This correspond to:

> >   -1.2 nanosec (29.92*0.04) slower

> >   (This could be cost of single func call overhead = 1.3 ns)

> >

> > My measurement for XDP_TX:

> >

> >  xdp_rxq_info -a XDP_TX

> >   28,278,722 pps

> >   35.36 ns (1/28278722*10^9)  

> 

> And here, much lower at around 8 Mpps. But I do see correct packets

> coming back on the cable for i40e but not for ice! There is likely a

> bug there in the XDP_TX logic for ice. Might explain the weird results

> I am getting. Will investigate.

> 

> But why do I get only a fraction of your performance? XDP_TX touches

> the packet so I would expect it to be far less than what you get, but

> more than I get. 


I clearly have a bug in the i40e driver.  As I wrote later, I don't see
any packets transmitted for XDP_TX.  Hmm, I am using Mel Gorman's tree,
which doesn't contain the i40e/ice/ixgbe bug we fixed earlier.

The call to xdp_convert_buff_to_frame() fails, but (see below) that
error is simply converted to I40E_XDP_CONSUMED.  Thus, not even the
'trace_xdp_exception' tracepoint will be able to troubleshoot this.
You/Intel should consider making XDP_TX errors detectable (this will
also happen if the TX ring doesn't have room).

 int i40e_xmit_xdp_tx_ring(struct xdp_buff *xdp, struct i40e_ring *xdp_ring)
 {
	struct xdp_frame *xdpf = xdp_convert_buff_to_frame(xdp);

	if (unlikely(!xdpf))
		return I40E_XDP_CONSUMED;

	return i40e_xmit_xdp_ring(xdpf, xdp_ring);
 }
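
One way to make this visible (just a sketch from my side, not a tested
patch, and the exact variable names in the i40e_run_xdp() context may
differ) would be to let the caller report the failure via the
xdp_exception tracepoint:

 	case XDP_TX:
 		result = i40e_xmit_xdp_tx_ring(xdp, xdp_ring);
 		if (result == I40E_XDP_CONSUMED)
 			/* conversion failed or TX ring had no room */
 			trace_xdp_exception(rx_ring->netdev, xdp_prog, act);
 		break;

Then tools like xdp_monitor would at least show these drops instead of
them disappearing silently.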


> What CPU core do you run on? 


Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz

> It actually looks like

> your packet data gets prefetched successfully. If it had not, you

> would have gotten an access to LLC which is much more expensive than

> the drop you are seeing. If I run on the wrong NUMA node, I get 4

> Mpps, so it is not that.

> 

> One interesting thing is that I get better results using the zero-copy

> path in the driver. I start xdp_rxq_drop then tie an AF_XDP socket to

> the queue id the XDP program gets its traffic from. The AF_XDP program

> will get no traffic in this case, but it will force the driver to use

> the zero-copy path for its XDP processing. In this case I get this:

> 

> -0.5% for XDP_DROP and +-0% for XDP_TX for i40e.

> 

> >  You lost -10% performance in this case:

> >   -3.54 nanosec (35.36*0.10) slower

> >

> > In XDP context 3.54 nanosec is a lot, as you can see it is 10% in this

> > zoom-in benchmark.  We have to look at the details.

> >

> > One detail/issue with i40e doing XDP_TX, is that I cannot verify that

> > packets are actually transmitted... not via exception tracepoint, not

> > via netstats, not via ethtool_stats.pl.  Maybe all the packets are

> > getting (silently) drop in my tests...!?!

> >

> >  

> > > The xdpsock numbers are more in the ballpark of

> > > what I would expect.

> > >

> > > Tirtha and I found some optimizations in the i40e

> > > multi-frame/multi-buffer support that we have implemented. Will test

> > > those next, post the results and share the code.

> > >  

> > > > >

> > > > > Just note that I would really like for the multi-frame support to get

> > > > > in. I have lost count on how many people that have asked for it to be

> > > > > added to XDP and AF_XDP. So please check our implementation and

> > > > > improve it so we can get the overhead down to where we want it to be.  

> > > >

> > > > sure, I will do.

> > > >

> > > > Regards,

> > > > Lorenzo

> > > >  

> > > > >

> > > > > Thanks: Magnus

> > > > >  

> > > > > > I would like to reproduce these results in my testlab, in-order to

> > > > > > figure out where the throughput degradation comes from.

> > > > > >  

> > > > > > > What performance do you see with the mvneta card? How much are we

> > > > > > > willing to pay for this feature when it is not being used or can we in

> > > > > > > some way selectively turn it on only when needed?  

> > > > > >

> > > > > > Well, as Daniel says performance wise we require close to /zero/

> > > > > > additional overhead, especially as you state this happens when sending

> > > > > > a single frame, which is a base case that we must not slowdown.

> > > > > >

> > > > > > --

> > > > > > Best regards,

> > > > > >   Jesper Dangaard Brouer  

> >

> > --

> > Best regards,

> >   Jesper Dangaard Brouer

> >   MSc.CS, Principal Kernel Engineer at Red Hat

> >   LinkedIn: http://www.linkedin.com/in/brouer

> >

> >

> > Running XDP on dev:i40e2 (ifindex:6) action:XDP_DROP options:read

> > XDP stats       CPU     pps         issue-pps

> > XDP-RX CPU      2       33,417,775  0

> > XDP-RX CPU      total   33,417,775

> >

> > RXQ stats       RXQ:CPU pps         issue-pps

> > rx_queue_index    2:2   33,417,775  0

> > rx_queue_index    2:sum 33,417,775

> >

> >

> > Running XDP on dev:i40e2 (ifindex:6) action:XDP_TX options:swapmac

> > XDP stats       CPU     pps         issue-pps

> > XDP-RX CPU      2       28,278,722  0

> > XDP-RX CPU      total   28,278,722

> >

> > RXQ stats       RXQ:CPU pps         issue-pps

> > rx_queue_index    2:2   28,278,726  0

> > rx_queue_index    2:sum 28,278,726

> >

> >

> >  

> 




-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
Magnus Karlsson April 22, 2021, 10:24 a.m. UTC | #13
On Wed, Apr 21, 2021 at 5:39 PM Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
>

> On Wed, 21 Apr 2021 16:12:32 +0200

> Magnus Karlsson <magnus.karlsson@gmail.com> wrote:

>

> > On Wed, Apr 21, 2021 at 2:48 PM Jesper Dangaard Brouer

> > <brouer@redhat.com> wrote:

> > >

> > > On Tue, 20 Apr 2021 15:49:44 +0200

> > > Magnus Karlsson <magnus.karlsson@gmail.com> wrote:

> > >

> > > > On Mon, Apr 19, 2021 at 8:56 AM Lorenzo Bianconi

> > > > <lorenzo.bianconi@redhat.com> wrote:

> > > > >

> > > > > > On Sun, Apr 18, 2021 at 6:18 PM Jesper Dangaard Brouer

> > > > > > <brouer@redhat.com> wrote:

> > > > > > >

> > > > > > > On Fri, 16 Apr 2021 16:27:18 +0200

> > > > > > > Magnus Karlsson <magnus.karlsson@gmail.com> wrote:

> > > > > > >

> > > > > > > > On Thu, Apr 8, 2021 at 2:51 PM Lorenzo Bianconi <lorenzo@kernel.org> wrote:

> > > > > > > > >

> > > > > > > > > This series introduce XDP multi-buffer support. The mvneta driver is

> > > > > > > > > the first to support these new "non-linear" xdp_{buff,frame}. Reviewers

> > > > > > > > > please focus on how these new types of xdp_{buff,frame} packets

> > > > > > > > > traverse the different layers and the layout design. It is on purpose

> > > > > > > > > that BPF-helpers are kept simple, as we don't want to expose the

> > > > > > > > > internal layout to allow later changes.

> > > > > > > > >

> > > > > > > > > For now, to keep the design simple and to maintain performance, the XDP

> > > > > > > > > BPF-prog (still) only have access to the first-buffer. It is left for

> > > > > > > > > later (another patchset) to add payload access across multiple buffers.

> > > > > > > > > This patchset should still allow for these future extensions. The goal

> > > > > > > > > is to lift the XDP MTU restriction that comes with XDP, but maintain

> > > > > > > > > same performance as before.

> > > > > > > [...]

> > > > > > > > >

> > > > > > > > > [0] https://netdevconf.info/0x14/session.html?talk-the-path-to-tcp-4k-mtu-and-rx-zerocopy

> > > > > > > > > [1] https://github.com/xdp-project/xdp-project/blob/master/areas/core/xdp-multi-buffer01-design.org

> > > > > > > > > [2] https://netdevconf.info/0x14/session.html?tutorial-add-XDP-support-to-a-NIC-driver (XDPmulti-buffers section)

> > > > > > > >

> > > > > > > > Took your patches for a test run with the AF_XDP sample xdpsock on an

> > > > > > > > i40e card and the throughput degradation is between 2 to 6% depending

> > > > > > > > on the setup and microbenchmark within xdpsock that is executed. And

> > > > > > > > this is without sending any multi frame packets. Just single frame

> > > > > > > > ones. Tirtha made changes to the i40e driver to support this new

> > > > > > > > interface so that is being included in the measurements.

> > > > > > >

> > > > > > > Could you please share Tirtha's i40e support patch with me?

> > > > > >

> > > > > > We will post them on the list as an RFC. Tirtha also added AF_XDP

> > > > > > multi-frame support on top of Lorenzo's patches so we will send that

> > > > > > one out as well. Will also rerun my experiments, properly document

> > > > > > them and send out just to be sure that I did not make any mistake.

> > > > >

> > > > > ack, very cool, thx

> > > >

> > > > I have now run a new set of experiments on a Cascade Lake server at

> > > > 2.1 GHz with turbo boost disabled. Two NICs: i40e and ice. The

> > > > baseline is commit 5c507329000e ("libbpf: Clarify flags in ringbuf

> > > > helpers") and Lorenzo's and Eelco's path set is their v8. First some

> > > > runs with xdpsock (i.e. AF_XDP) in both 2-core mode (app on one core

> > > > and the driver on another) and 1-core mode using busy_poll.

> > > >

> > > > xdpsock rxdrop throughput change with the multi-buffer patches without

> > > > any driver changes:

> > > > 1-core i40e: -0.5 to 0%   2-cores i40e: -0.5%

> > > > 1-core ice: -2%   2-cores ice: -1 to -0.5%

> > > >

> > > > xdp_rxq_info -a XDP_DROP

> > > > i40e: -4%   ice: +8%

> > > >

> > > > xdp_rxq_info -a XDP_TX

> > > > i40e: -10%   ice: +9%

> > > >

> > > > The XDP results with xdp_rxq_info are just weird! I reran them three

> > > > times, rebuilt and rebooted in between and I always get the same

> > > > results. And I also checked that I am running on the correct NUMA node

> > > > and so on. But I have a hard time believing them. Nearly +10% and -10%

> > > > difference. Too much in my book. Jesper, could you please run the same

> > > > and see what you get?

> > >

> > > We of-cause have to find the root-cause of the +-10%, but let me drill

> > > into what the 10% represent time/cycle wise.  Using a percentage

> > > difference is usually a really good idea as it implies a comparative

> > > measure (something I always request people to do, as a single

> > > performance number means nothing by itself).

> > >

> > > For a zoom-in-benchmarks like these where the amount of code executed

> > > is very small, the effect of removing or adding code can effect the

> > > measurement a lot.

> > >

> > > I can only do the tests for i40e, as I don't have ice hardware (but

> > > Intel is working on fixing that ;-)).

> > >

> > >  xdp_rxq_info -a XDP_DROP

> > >   i40e: 33,417,775 pps

> >

> > Here I only get around 21 Mpps

> >

> > >  CPU is 100% used, so we can calculate nanosec used per packet:

> > >   29.92 nanosec (1/33417775*10^9)

> > >   2.1 GHz CPU =  approx 63 CPU-cycles

> > >

> > >  You lost -4% performance in this case.  This correspond to:

> > >   -1.2 nanosec (29.92*0.04) slower

> > >   (This could be cost of single func call overhead = 1.3 ns)

> > >

> > > My measurement for XDP_TX:

> > >

> > >  xdp_rxq_info -a XDP_TX

> > >   28,278,722 pps

> > >   35.36 ns (1/28278722*10^9)

> >

> > And here, much lower at around 8 Mpps. But I do see correct packets

> > coming back on the cable for i40e but not for ice! There is likely a

> > bug there in the XDP_TX logic for ice. Might explain the weird results

> > I am getting. Will investigate.

> >

> > But why do I get only a fraction of your performance? XDP_TX touches

> > the packet so I would expect it to be far less than what you get, but

> > more than I get.

>

> I clearly have a bug in the i40e driver.  As I wrote later, I don't see

> any packets transmitted for XDP_TX.  Hmm, I using Mel Gorman's tree,

> which doesn't contain the i40e/ice/ixgbe bug we fixed earlier.

>

> The call to xdp_convert_buff_to_frame() fails, but (see below) that

> error is simply converted to I40E_XDP_CONSUMED.  Thus, not even the

> 'trace_xdp_exception' will be able to troubleshoot this.  You/Intel

> should consider making XDP_TX errors detectable (this will also happen

> if TX ring don't have room).


This is not good. Will submit a fix. Thanks for reporting, Jesper.

>  int i40e_xmit_xdp_tx_ring(struct xdp_buff *xdp, struct i40e_ring *xdp_ring)

>  {

>         struct xdp_frame *xdpf = xdp_convert_buff_to_frame(xdp);

>

>         if (unlikely(!xdpf))

>                 return I40E_XDP_CONSUMED;

>

>         return i40e_xmit_xdp_ring(xdpf, xdp_ring);

>  }

>

>

> > What CPU core do you run on?

>

> Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz


So significantly higher clocked than my system. Explains your high numbers.

> > It actually looks like

> > your packet data gets prefetched successfully. If it had not, you

> > would have gotten an access to LLC which is much more expensive than

> > the drop you are seeing. If I run on the wrong NUMA node, I get 4

> > Mpps, so it is not that.

> >

> > One interesting thing is that I get better results using the zero-copy

> > path in the driver. I start xdp_rxq_drop then tie an AF_XDP socket to

> > the queue id the XDP program gets its traffic from. The AF_XDP program

> > will get no traffic in this case, but it will force the driver to use

> > the zero-copy path for its XDP processing. In this case I get this:

> >

> > -0.5% for XDP_DROP and +-0% for XDP_TX for i40e.

> >

> > >  You lost -10% performance in this case:

> > >   -3.54 nanosec (35.36*0.10) slower

> > >

> > > In XDP context 3.54 nanosec is a lot, as you can see it is 10% in this

> > > zoom-in benchmark.  We have to look at the details.

> > >

> > > One detail/issue with i40e doing XDP_TX, is that I cannot verify that

> > > packets are actually transmitted... not via exception tracepoint, not

> > > via netstats, not via ethtool_stats.pl.  Maybe all the packets are

> > > getting (silently) drop in my tests...!?!

> > >

> > >

> > > > The xdpsock numbers are more in the ballpark of

> > > > what I would expect.

> > > >

> > > > Tirtha and I found some optimizations in the i40e

> > > > multi-frame/multi-buffer support that we have implemented. Will test

> > > > those next, post the results and share the code.

> > > >

> > > > > >

> > > > > > Just note that I would really like for the multi-frame support to get

> > > > > > in. I have lost count on how many people that have asked for it to be

> > > > > > added to XDP and AF_XDP. So please check our implementation and

> > > > > > improve it so we can get the overhead down to where we want it to be.

> > > > >

> > > > > sure, I will do.

> > > > >

> > > > > Regards,

> > > > > Lorenzo

> > > > >

> > > > > >

> > > > > > Thanks: Magnus

> > > > > >

> > > > > > > I would like to reproduce these results in my testlab, in-order to

> > > > > > > figure out where the throughput degradation comes from.

> > > > > > >

> > > > > > > > What performance do you see with the mvneta card? How much are we

> > > > > > > > willing to pay for this feature when it is not being used or can we in

> > > > > > > > some way selectively turn it on only when needed?

> > > > > > >

> > > > > > > Well, as Daniel says performance wise we require close to /zero/

> > > > > > > additional overhead, especially as you state this happens when sending

> > > > > > > a single frame, which is a base case that we must not slowdown.

> > > > > > >

> > > > > > > --

> > > > > > > Best regards,

> > > > > > >   Jesper Dangaard Brouer

> > >

> > > --

> > > Best regards,

> > >   Jesper Dangaard Brouer

> > >   MSc.CS, Principal Kernel Engineer at Red Hat

> > >   LinkedIn: http://www.linkedin.com/in/brouer

> > >

> > >

> > > Running XDP on dev:i40e2 (ifindex:6) action:XDP_DROP options:read

> > > XDP stats       CPU     pps         issue-pps

> > > XDP-RX CPU      2       33,417,775  0

> > > XDP-RX CPU      total   33,417,775

> > >

> > > RXQ stats       RXQ:CPU pps         issue-pps

> > > rx_queue_index    2:2   33,417,775  0

> > > rx_queue_index    2:sum 33,417,775

> > >

> > >

> > > Running XDP on dev:i40e2 (ifindex:6) action:XDP_TX options:swapmac

> > > XDP stats       CPU     pps         issue-pps

> > > XDP-RX CPU      2       28,278,722  0

> > > XDP-RX CPU      total   28,278,722

> > >

> > > RXQ stats       RXQ:CPU pps         issue-pps

> > > rx_queue_index    2:2   28,278,726  0

> > > rx_queue_index    2:sum 28,278,726

> > >

> > >

> > >

> >

>

>

>

> --

> Best regards,

>   Jesper Dangaard Brouer

>   MSc.CS, Principal Kernel Engineer at Red Hat

>   LinkedIn: http://www.linkedin.com/in/brouer

>
Jesper Dangaard Brouer April 22, 2021, 2:42 p.m. UTC | #14
On Thu, 22 Apr 2021 12:24:32 +0200
Magnus Karlsson <magnus.karlsson@gmail.com> wrote:

> On Wed, Apr 21, 2021 at 5:39 PM Jesper Dangaard Brouer

> <brouer@redhat.com> wrote:

> >

> > On Wed, 21 Apr 2021 16:12:32 +0200

> > Magnus Karlsson <magnus.karlsson@gmail.com> wrote:

> >  

[...]
> > > more than I get.  

> >

> > I clearly have a bug in the i40e driver.  As I wrote later, I don't see

> > any packets transmitted for XDP_TX.  Hmm, I using Mel Gorman's tree,

> > which contains the i40e/ice/ixgbe bug we fixed earlier.


Something is wrong with i40e. I changed git-tree to net-next (at
commit 5d869070569a) and XDP seems to have stopped working on i40e :-(

$ uname -a
Linux broadwell 5.12.0-rc7-net-next+ #600 SMP PREEMPT Thu Apr 22 15:13:15 CEST 2021 x86_64 x86_64 x86_64 GNU/Linux

When I load any XDP prog almost no packets are let through:

 [kernel-bpf-samples]$ sudo ./xdp1 i40e2
 libbpf: elf: skipping unrecognized data section(16) .eh_frame
 libbpf: elf: skipping relo section(17) .rel.eh_frame for section(16) .eh_frame
 proto 17:          1 pkt/s
 proto 0:          0 pkt/s
 proto 17:          0 pkt/s
 proto 0:          0 pkt/s
 proto 17:          1 pkt/s

On the same system my mlx5 NIC works fine:

 [kernel-bpf-samples]$ sudo ./xdp1 mlx5p1
 libbpf: elf: skipping unrecognized data section(16) .eh_frame
 libbpf: elf: skipping relo section(17) .rel.eh_frame for section(16) .eh_frame
 proto 17:   10984608 pkt/s
 proto 17:   24374656 pkt/s
 proto 17:   24339904 pkt/s



> > The call to xdp_convert_buff_to_frame() fails, but (see below) that

> > error is simply converted to I40E_XDP_CONSUMED.  Thus, not even the

> > 'trace_xdp_exception' will be able to troubleshoot this.  You/Intel

> > should consider making XDP_TX errors detectable (this will also happen

> > if TX ring don't have room).  

> 

> This is not good. Will submit a fix. Thanks for reporting Jesper.


Usually I use this command to troubleshoot:
  sudo ./xdp_monitor --stats 

But as the i40e driver doesn't call 'trace_xdp_exception', I don't
capture any errors this way.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
Jesper Dangaard Brouer April 22, 2021, 3:05 p.m. UTC | #15
On Thu, 22 Apr 2021 16:42:23 +0200
Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> On Thu, 22 Apr 2021 12:24:32 +0200
> Magnus Karlsson <magnus.karlsson@gmail.com> wrote:
> 
> > On Wed, Apr 21, 2021 at 5:39 PM Jesper Dangaard Brouer
> > <brouer@redhat.com> wrote:  
> > >
> > > On Wed, 21 Apr 2021 16:12:32 +0200
> > > Magnus Karlsson <magnus.karlsson@gmail.com> wrote:
> > >    
> [...]
> > > > more than I get.    
> > >
> > > I clearly have a bug in the i40e driver.  As I wrote later, I don't see
> > > any packets transmitted for XDP_TX.  Hmm, I using Mel Gorman's tree,
> > > which contains the i40e/ice/ixgbe bug we fixed earlier.  
> 
> Something is wrong with i40e, I changed git-tree to net-next (at
> commit 5d869070569a) and XDP seems to have stopped working on i40e :-(

Renamed subj as this is without this patchset applied.

> $ uname -a
> Linux broadwell 5.12.0-rc7-net-next+ #600 SMP PREEMPT Thu Apr 22 15:13:15 CEST 2021 x86_64 x86_64 x86_64 GNU/Linux
> 
> When I load any XDP prog almost no packets are let through:
> 
>  [kernel-bpf-samples]$ sudo ./xdp1 i40e2
>  libbpf: elf: skipping unrecognized data section(16) .eh_frame
>  libbpf: elf: skipping relo section(17) .rel.eh_frame for section(16) .eh_frame
>  proto 17:          1 pkt/s
>  proto 0:          0 pkt/s
>  proto 17:          0 pkt/s
>  proto 0:          0 pkt/s
>  proto 17:          1 pkt/s

Trying out xdp_redirect:

 [kernel-bpf-samples]$ sudo ./xdp_redirect i40e2 i40e2
 input: 7 output: 7
 libbpf: elf: skipping unrecognized data section(20) .eh_frame
 libbpf: elf: skipping relo section(21) .rel.eh_frame for section(20) .eh_frame
 libbpf: Kernel error message: XDP program already attached
 WARN: link set xdp fd failed on 7
 ifindex 7:       7357 pkt/s
 ifindex 7:       7909 pkt/s
 ifindex 7:       7909 pkt/s
 ifindex 7:       7909 pkt/s
 ifindex 7:       7909 pkt/s
 ifindex 7:       7909 pkt/s
 ifindex 7:       6357 pkt/s

And then it crashes (see below) at page_frag_free+0x31, which calls
virt_to_head_page() with a wrong addr (I guess).  This is called by
i40e_clean_tx_irq+0xc9.

 $ ./scripts/faddr2line drivers/net/ethernet/intel/i40e/i40e.o i40e_clean_tx_irq+0xc9
 i40e_clean_tx_irq+0xc9/0x440:
 i40e_clean_tx_irq at /home/jbrouer/git/kernel/net-next/drivers/net/ethernet/intel/i40e/i40e_txrx.c:976

Which is:

 		/* unmap skb header data */
 Line:976	dma_unmap_single(tx_ring->dev,
				 dma_unmap_addr(tx_buf, dma),
				 dma_unmap_len(tx_buf, len),
				 DMA_TO_DEVICE);


[  935.781751] BUG: unable to handle page fault for address: ffffebde00000008
[  935.788630] #PF: supervisor read access in kernel mode
[  935.793766] #PF: error_code(0x0000) - not-present page
[  935.798906] PGD 0 P4D 0 
[  935.801445] Oops: 0000 [#1] PREEMPT SMP PTI
[  935.805632] CPU: 4 PID: 113 Comm: kworker/u12:9 Not tainted 5.12.0-rc7-net-next+ #600
[  935.813460] Hardware name: Supermicro Super Server/X10SRi-F, BIOS 2.0a 08/01/2016
[  935.820937] Workqueue: events_unbound call_usermodehelper_exec_work
[  935.827214] RIP: 0010:page_frag_free+0x31/0x70
[  935.831656] Code: 00 00 80 48 01 c7 72 55 48 b8 00 00 00 80 7f 77 00 00 48 01 c7 48 b8 00 00 00 00 00 ea ff ff 48 c1 ef 0c 48 c1 e7 06 48 01 c7 <48> 8b 47 08 48 8d 50 ff a8 01 48 0f 45 fa f0 ff 4f 34 74 01 c3 48
[  935.850406] RSP: 0018:ffffc900001c0e50 EFLAGS: 00010286
[  935.855629] RAX: ffffea0000000000 RBX: ffff88813b258180 RCX: 0000000000000000
[  935.862764] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffebde00000000
[  935.869895] RBP: ffff88813b258180 R08: ffffc900001c0f38 R09: 0000000000000180
[  935.877028] R10: 00000000fffffe18 R11: ffffc900001c0ff8 R12: ffff88813bc403c0
[  935.884160] R13: 000000000000003c R14: 00000000fffffe18 R15: ffff88813b15b180
[  935.891295] FS:  0000000000000000(0000) GS:ffff88887fd00000(0000) knlGS:0000000000000000
[  935.899380] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  935.905126] CR2: ffffebde00000008 CR3: 000000087e810002 CR4: 00000000003706e0
[  935.912259] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  935.919391] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  935.926526] Call Trace:
[  935.928980]  <IRQ>
[  935.930999]  i40e_clean_tx_irq+0xc9/0x440 [i40e]
[  935.935653]  i40e_napi_poll+0x101/0x410 [i40e]
[  935.940116]  __napi_poll+0x2a/0x140
[  935.943607]  net_rx_action+0x215/0x2d0
[  935.947358]  ? i40e_msix_clean_rings+0x3f/0x50 [i40e]
[  935.952449]  __do_softirq+0xe3/0x2df
[  935.956028]  irq_exit_rcu+0xa7/0xb0
[  935.959519]  common_interrupt+0x83/0xa0
[  935.963358]  </IRQ>
[  935.965465]  asm_common_interrupt+0x1e/0x40
[  935.969651] RIP: 0010:clear_page_erms+0x7/0x10
[  935.974096] Code: 48 89 47 18 48 89 47 20 48 89 47 28 48 89 47 30 48 89 47 38 48 8d 7f 40 75 d9 90 c3 0f 1f 80 00 00 00 00 b9 00 10 00 00 31 c0 <f3> aa c3 cc cc cc cc cc cc 0f 1f 44 00 00 48 85 ff 0f 84 f2 00 00
[  935.992845] RSP: 0018:ffffc900003bfbc8 EFLAGS: 00010246
[  935.998069] RAX: 0000000000000000 RBX: 0000000000000901 RCX: 0000000000000340
[  936.005202] RDX: 0000000000002dc2 RSI: 0000000000000000 RDI: ffff88813b4d8cc0
[  936.012334] RBP: ffffea0004ed3600 R08: ffff88887c625f00 R09: ffffea0004ed3600
[  936.019467] R10: ffff888000000000 R11: 0000160000000000 R12: ffffea0004ed3640
[  936.026600] R13: 0000000000000000 R14: 0000000000005c39 R15: ffffc900003bfc50
[  936.033738]  prep_new_page+0x88/0xe0
[  936.037313]  get_page_from_freelist+0x2c6/0x3d0
[  936.041847]  __alloc_pages_nodemask+0x137/0x2e0
[  936.046380]  __vmalloc_node_range+0x14f/0x270
[  936.050738]  copy_process+0x39d/0x1ad0
[  936.054491]  ? kernel_clone+0x8b/0x3c0
[  936.058244]  kernel_clone+0x8b/0x3c0
[  936.061822]  ? dequeue_entity+0xc0/0x270
[  936.065748]  kernel_thread+0x47/0x50
[  936.069329]  ? umh_complete+0x40/0x40
[  936.072992]  call_usermodehelper_exec_work+0x2f/0x90
[  936.077960]  process_one_work+0x1ad/0x380
[  936.081974]  worker_thread+0x50/0x390
[  936.085638]  ? process_one_work+0x380/0x380
[  936.089824]  kthread+0x116/0x150
[  936.093057]  ? kthread_park+0x90/0x90
[  936.096722]  ret_from_fork+0x22/0x30
[  936.100307] Modules linked in: veth nf_defrag_ipv6 nf_defrag_ipv4 tun bridge stp llc rpcrdma sunrpc rdma_ucm ib_umad coretemp rdma_cm ib_ipoib kvm_intel iw_cm ib_cm kvm mlx5_ib i40iw irqbypass ib_uverbs rapl intel_cstate intel_uncore ib_core pcspkr i2c_i801 bfq i2c_smbus acpi_ipmi wmi ipmi_si ipmi_devintf ipmi_msghandler acpi_pad sch_fq_codel sd_mod t10_pi ixgbe igb igc mdio mlx5_core i40e mlxfw nfp psample i2c_algo_bit ptp i2c_core pps_core hid_generic [last unloaded: bpfilter]
[  936.142697] CR2: ffffebde00000008
[  936.146015] ---[ end trace 1bffa979f2cccd16 ]---
[  936.156720] RIP: 0010:page_frag_free+0x31/0x70
[  936.161170] Code: 00 00 80 48 01 c7 72 55 48 b8 00 00 00 80 7f 77 00 00 48 01 c7 48 b8 00 00 00 00 00 ea ff ff 48 c1 ef 0c 48 c1 e7 06 48 01 c7 <48> 8b 47 08 48 8d 50 ff a8 01 48 0f 45 fa f0 ff 4f 34 74 01 c3 48
[  936.179919] RSP: 0018:ffffc900001c0e50 EFLAGS: 00010286
[  936.185140] RAX: ffffea0000000000 RBX: ffff88813b258180 RCX: 0000000000000000
[  936.192275] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffebde00000000
[  936.199407] RBP: ffff88813b258180 R08: ffffc900001c0f38 R09: 0000000000000180
[  936.206541] R10: 00000000fffffe18 R11: ffffc900001c0ff8 R12: ffff88813bc403c0
[  936.213673] R13: 000000000000003c R14: 00000000fffffe18 R15: ffff88813b15b180
[  936.220804] FS:  0000000000000000(0000) GS:ffff88887fd00000(0000) knlGS:0000000000000000
[  936.228893] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  936.234638] CR2: ffffebde00000008 CR3: 000000087e810002 CR4: 00000000003706e0
[  936.241771] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  936.248903] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  936.256036] Kernel panic - not syncing: Fatal exception in interrupt
[  936.262401] Kernel Offset: disabled
[  936.271867] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---
Magnus Karlsson April 23, 2021, 5:28 a.m. UTC | #16
On Thu, Apr 22, 2021 at 5:05 PM Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
>

> On Thu, 22 Apr 2021 16:42:23 +0200

> Jesper Dangaard Brouer <brouer@redhat.com> wrote:

>

> > On Thu, 22 Apr 2021 12:24:32 +0200

> > Magnus Karlsson <magnus.karlsson@gmail.com> wrote:

> >

> > > On Wed, Apr 21, 2021 at 5:39 PM Jesper Dangaard Brouer

> > > <brouer@redhat.com> wrote:

> > > >

> > > > On Wed, 21 Apr 2021 16:12:32 +0200

> > > > Magnus Karlsson <magnus.karlsson@gmail.com> wrote:

> > > >

> > [...]

> > > > > more than I get.

> > > >

> > > > I clearly have a bug in the i40e driver.  As I wrote later, I don't see

> > > > any packets transmitted for XDP_TX.  Hmm, I using Mel Gorman's tree,

> > > > which contains the i40e/ice/ixgbe bug we fixed earlier.

> >

> > Something is wrong with i40e, I changed git-tree to net-next (at

> > commit 5d869070569a) and XDP seems to have stopped working on i40e :-(


Found this out too when switching to the net tree yesterday to work on
the proper packet drop tracing you spotted/requested. The commit below
completely broke XDP support on i40e (unless you run with a zero-copy
AF_XDP socket, because that path still works). I am working on a fix
that does not just revert the patch, but fixes the original problem
without breaking XDP. Will post it and the tracing fixes as soon as I
can.

commit 12738ac4754ec92a6a45bf3677d8da780a1412b3
Author: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
Date:   Fri Mar 26 19:43:40 2021 +0100

    i40e: Fix sparse errors in i40e_txrx.c

    Remove error handling through pointers. Instead use plain int
    to return value from i40e_run_xdp(...).

    Previously:
    - sparse errors were produced during compilation:
    i40e_txrx.c:2338 i40e_run_xdp() error: (-2147483647) too low for ERR_PTR
    i40e_txrx.c:2558 i40e_clean_rx_irq() error: 'skb' dereferencing
possible ERR_PTR()

    - sk_buff* was used to return value, but it has never had valid
    pointer to sk_buff. Returned value was always int handled as
    a pointer.

    Fixes: 0c8493d90b6b ("i40e: add XDP support for pass and drop actions")
    Fixes: 2e6893123830 ("i40e: split XDP_TX tail and XDP_REDIRECT map
flushing")
    Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>

    Signed-off-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>

    Tested-by: Dave Switzer <david.switzer@intel.com>

    Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>



> Renamed subj as this is without this patchset applied.

>

> > $ uname -a

> > Linux broadwell 5.12.0-rc7-net-next+ #600 SMP PREEMPT Thu Apr 22 15:13:15 CEST 2021 x86_64 x86_64 x86_64 GNU/Linux

> >

> > When I load any XDP prog almost no packets are let through:

> >

> >  [kernel-bpf-samples]$ sudo ./xdp1 i40e2

> >  libbpf: elf: skipping unrecognized data section(16) .eh_frame

> >  libbpf: elf: skipping relo section(17) .rel.eh_frame for section(16) .eh_frame

> >  proto 17:          1 pkt/s

> >  proto 0:          0 pkt/s

> >  proto 17:          0 pkt/s

> >  proto 0:          0 pkt/s

> >  proto 17:          1 pkt/s

>

> Trying out xdp_redirect:

>

>  [kernel-bpf-samples]$ sudo ./xdp_redirect i40e2 i40e2

>  input: 7 output: 7

>  libbpf: elf: skipping unrecognized data section(20) .eh_frame

>  libbpf: elf: skipping relo section(21) .rel.eh_frame for section(20) .eh_frame

>  libbpf: Kernel error message: XDP program already attached

>  WARN: link set xdp fd failed on 7

>  ifindex 7:       7357 pkt/s

>  ifindex 7:       7909 pkt/s

>  ifindex 7:       7909 pkt/s

>  ifindex 7:       7909 pkt/s

>  ifindex 7:       7909 pkt/s

>  ifindex 7:       7909 pkt/s

>  ifindex 7:       6357 pkt/s

>

> And then it crash (see below) at page_frag_free+0x31 which calls

> virt_to_head_page() with a wrong addr (I guess).  This is called by

> i40e_clean_tx_irq+0xc9.


Did not see a crash myself, just 4 Kpps. But the rings and DMA
mappings got completely mangled by the patch above, so could be the
same cause.

>  $ ./scripts/faddr2line drivers/net/ethernet/intel/i40e/i40e.o i40e_clean_tx_irq+0xc9

>  i40e_clean_tx_irq+0xc9/0x440:

>  i40e_clean_tx_irq at /home/jbrouer/git/kernel/net-next/drivers/net/ethernet/intel/i40e/i40e_txrx.c:976

>

> Which is:

>

>                 /* unmap skb header data */

>  Line:976       dma_unmap_single(tx_ring->dev,

>                                  dma_unmap_addr(tx_buf, dma),

>                                  dma_unmap_len(tx_buf, len),

>                                  DMA_TO_DEVICE);

>

>

> [  935.781751] BUG: unable to handle page fault for address: ffffebde00000008

> [  935.788630] #PF: supervisor read access in kernel mode

> [  935.793766] #PF: error_code(0x0000) - not-present page

> [  935.798906] PGD 0 P4D 0

> [  935.801445] Oops: 0000 [#1] PREEMPT SMP PTI

> [  935.805632] CPU: 4 PID: 113 Comm: kworker/u12:9 Not tainted 5.12.0-rc7-net-next+ #600

> [  935.813460] Hardware name: Supermicro Super Server/X10SRi-F, BIOS 2.0a 08/01/2016

> [  935.820937] Workqueue: events_unbound call_usermodehelper_exec_work

> [  935.827214] RIP: 0010:page_frag_free+0x31/0x70

> [  935.831656] Code: 00 00 80 48 01 c7 72 55 48 b8 00 00 00 80 7f 77 00 00 48 01 c7 48 b8 00 00 00 00 00 ea ff ff 48 c1 ef 0c 48 c1 e7 06 48 01 c7 <48> 8b 47 08 48 8d 50 ff a8 01 48 0f 45 fa f0 ff 4f 34 74 01 c3 48

> [  935.850406] RSP: 0018:ffffc900001c0e50 EFLAGS: 00010286

> [  935.855629] RAX: ffffea0000000000 RBX: ffff88813b258180 RCX: 0000000000000000

> [  935.862764] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffebde00000000

> [  935.869895] RBP: ffff88813b258180 R08: ffffc900001c0f38 R09: 0000000000000180

> [  935.877028] R10: 00000000fffffe18 R11: ffffc900001c0ff8 R12: ffff88813bc403c0

> [  935.884160] R13: 000000000000003c R14: 00000000fffffe18 R15: ffff88813b15b180

> [  935.891295] FS:  0000000000000000(0000) GS:ffff88887fd00000(0000) knlGS:0000000000000000

> [  935.899380] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033

> [  935.905126] CR2: ffffebde00000008 CR3: 000000087e810002 CR4: 00000000003706e0

> [  935.912259] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000

> [  935.919391] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

> [  935.926526] Call Trace:

> [  935.928980]  <IRQ>

> [  935.930999]  i40e_clean_tx_irq+0xc9/0x440 [i40e]

> [  935.935653]  i40e_napi_poll+0x101/0x410 [i40e]

> [  935.940116]  __napi_poll+0x2a/0x140

> [  935.943607]  net_rx_action+0x215/0x2d0

> [  935.947358]  ? i40e_msix_clean_rings+0x3f/0x50 [i40e]

> [  935.952449]  __do_softirq+0xe3/0x2df

> [  935.956028]  irq_exit_rcu+0xa7/0xb0

> [  935.959519]  common_interrupt+0x83/0xa0

> [  935.963358]  </IRQ>

> [  935.965465]  asm_common_interrupt+0x1e/0x40

> [  935.969651] RIP: 0010:clear_page_erms+0x7/0x10

> [  935.974096] Code: 48 89 47 18 48 89 47 20 48 89 47 28 48 89 47 30 48 89 47 38 48 8d 7f 40 75 d9 90 c3 0f 1f 80 00 00 00 00 b9 00 10 00 00 31 c0 <f3> aa c3 cc cc cc cc cc cc 0f 1f 44 00 00 48 85 ff 0f 84 f2 00 00

> [  935.992845] RSP: 0018:ffffc900003bfbc8 EFLAGS: 00010246

> [  935.998069] RAX: 0000000000000000 RBX: 0000000000000901 RCX: 0000000000000340

> [  936.005202] RDX: 0000000000002dc2 RSI: 0000000000000000 RDI: ffff88813b4d8cc0

> [  936.012334] RBP: ffffea0004ed3600 R08: ffff88887c625f00 R09: ffffea0004ed3600

> [  936.019467] R10: ffff888000000000 R11: 0000160000000000 R12: ffffea0004ed3640

> [  936.026600] R13: 0000000000000000 R14: 0000000000005c39 R15: ffffc900003bfc50

> [  936.033738]  prep_new_page+0x88/0xe0

> [  936.037313]  get_page_from_freelist+0x2c6/0x3d0

> [  936.041847]  __alloc_pages_nodemask+0x137/0x2e0

> [  936.046380]  __vmalloc_node_range+0x14f/0x270

> [  936.050738]  copy_process+0x39d/0x1ad0

> [  936.054491]  ? kernel_clone+0x8b/0x3c0

> [  936.058244]  kernel_clone+0x8b/0x3c0

> [  936.061822]  ? dequeue_entity+0xc0/0x270

> [  936.065748]  kernel_thread+0x47/0x50

> [  936.069329]  ? umh_complete+0x40/0x40

> [  936.072992]  call_usermodehelper_exec_work+0x2f/0x90

> [  936.077960]  process_one_work+0x1ad/0x380

> [  936.081974]  worker_thread+0x50/0x390

> [  936.085638]  ? process_one_work+0x380/0x380

> [  936.089824]  kthread+0x116/0x150

> [  936.093057]  ? kthread_park+0x90/0x90

> [  936.096722]  ret_from_fork+0x22/0x30

> [  936.100307] Modules linked in: veth nf_defrag_ipv6 nf_defrag_ipv4 tun bridge stp llc rpcrdma sunrpc rdma_ucm ib_umad coretemp rdma_cm ib_ipoib kvm_intel iw_cm ib_cm kvm mlx5_ib i40iw irqbypass ib_uverbs rapl intel_cstate intel_uncore ib_core pcspkr i2c_i801 bfq i2c_smbus acpi_ipmi wmi ipmi_si ipmi_devintf ipmi_msghandler acpi_pad sch_fq_codel sd_mod t10_pi ixgbe igb igc mdio mlx5_core i40e mlxfw nfp psample i2c_algo_bit ptp i2c_core pps_core hid_generic [last unloaded: bpfilter]

> [  936.142697] CR2: ffffebde00000008

> [  936.146015] ---[ end trace 1bffa979f2cccd16 ]---

> [  936.156720] RIP: 0010:page_frag_free+0x31/0x70

> [  936.161170] Code: 00 00 80 48 01 c7 72 55 48 b8 00 00 00 80 7f 77 00 00 48 01 c7 48 b8 00 00 00 00 00 ea ff ff 48 c1 ef 0c 48 c1 e7 06 48 01 c7 <48> 8b 47 08 48 8d 50 ff a8 01 48 0f 45 fa f0 ff 4f 34 74 01 c3 48

> [  936.179919] RSP: 0018:ffffc900001c0e50 EFLAGS: 00010286

> [  936.185140] RAX: ffffea0000000000 RBX: ffff88813b258180 RCX: 0000000000000000

> [  936.192275] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffebde00000000

> [  936.199407] RBP: ffff88813b258180 R08: ffffc900001c0f38 R09: 0000000000000180

> [  936.206541] R10: 00000000fffffe18 R11: ffffc900001c0ff8 R12: ffff88813bc403c0

> [  936.213673] R13: 000000000000003c R14: 00000000fffffe18 R15: ffff88813b15b180

> [  936.220804] FS:  0000000000000000(0000) GS:ffff88887fd00000(0000) knlGS:0000000000000000

> [  936.228893] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033

> [  936.234638] CR2: ffffebde00000008 CR3: 000000087e810002 CR4: 00000000003706e0

> [  936.241771] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000

> [  936.248903] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

> [  936.256036] Kernel panic - not syncing: Fatal exception in interrupt

> [  936.262401] Kernel Offset: disabled

> [  936.271867] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---

>

>

> --

> Best regards,

>   Jesper Dangaard Brouer

>   MSc.CS, Principal Kernel Engineer at Red Hat

>   LinkedIn: http://www.linkedin.com/in/brouer

>
Alexander Duyck April 23, 2021, 4:43 p.m. UTC | #17
On Thu, Apr 22, 2021 at 10:28 PM Magnus Karlsson
<magnus.karlsson@gmail.com> wrote:
>

> On Thu, Apr 22, 2021 at 5:05 PM Jesper Dangaard Brouer

> <brouer@redhat.com> wrote:

> >

> > On Thu, 22 Apr 2021 16:42:23 +0200

> > Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> >

> > > On Thu, 22 Apr 2021 12:24:32 +0200

> > > Magnus Karlsson <magnus.karlsson@gmail.com> wrote:

> > >

> > > > On Wed, Apr 21, 2021 at 5:39 PM Jesper Dangaard Brouer

> > > > <brouer@redhat.com> wrote:

> > > > >

> > > > > On Wed, 21 Apr 2021 16:12:32 +0200

> > > > > Magnus Karlsson <magnus.karlsson@gmail.com> wrote:

> > > > >

> > > [...]

> > > > > > more than I get.

> > > > >

> > > > > I clearly have a bug in the i40e driver.  As I wrote later, I don't see

> > > > > any packets transmitted for XDP_TX.  Hmm, I using Mel Gorman's tree,

> > > > > which contains the i40e/ice/ixgbe bug we fixed earlier.

> > >

> > > Something is wrong with i40e, I changed git-tree to net-next (at

> > > commit 5d869070569a) and XDP seems to have stopped working on i40e :-(

>

> Found this out too when switching to the net tree yesterday to work on

> proper packet drop tracing as you spotted/requested yesterday. The

> commit below completely broke XDP support on i40e (if you do not run

> with a zero-copy AF_XDP socket because that path still works). I am

> working on a fix that does not just revert the patch, but fixes the

> original problem without breaking XDP. Will post it and the tracing

> fixes as soon as I can.

>

> commit 12738ac4754ec92a6a45bf3677d8da780a1412b3

> Author: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>

> Date:   Fri Mar 26 19:43:40 2021 +0100

>

>     i40e: Fix sparse errors in i40e_txrx.c

>

>     Remove error handling through pointers. Instead use plain int

>     to return value from i40e_run_xdp(...).

>

>     Previously:

>     - sparse errors were produced during compilation:

>     i40e_txrx.c:2338 i40e_run_xdp() error: (-2147483647) too low for ERR_PTR

>     i40e_txrx.c:2558 i40e_clean_rx_irq() error: 'skb' dereferencing

> possible ERR_PTR()

>

>     - sk_buff* was used to return value, but it has never had valid

>     pointer to sk_buff. Returned value was always int handled as

>     a pointer.

>

>     Fixes: 0c8493d90b6b ("i40e: add XDP support for pass and drop actions")

>     Fixes: 2e6893123830 ("i40e: split XDP_TX tail and XDP_REDIRECT map

> flushing")

>     Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>

>     Signed-off-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>

>     Tested-by: Dave Switzer <david.switzer@intel.com>

>     Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>


Yeah, this patch would horribly break things, especially in the
multi-buffer case. The idea behind using the skb pointer to indicate
the error is that it is persistent until we hit the EOP descriptor.
With that removed you end up mangling the entire list of frames since
it will start trying to process the next frame in the middle of a
packet.
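
To illustrate (a stand-alone sketch with made-up names, not the actual
i40e code): the marker returned for the first descriptor has to stick
around until the EOP descriptor, otherwise the remaining fragments of
the same packet get treated as the start of new frames:

 #include <stdbool.h>
 #include <stdio.h>

 #define XDP_CONSUMED ((void *)-1L)	/* stand-in for the ERR_PTR() marker */

 struct rx_desc {
 	bool eop;			/* last descriptor of this packet */
 };

 static void *run_xdp(struct rx_desc *d, int pkt)
 {
 	/* pretend XDP consumed packet 0 (e.g. XDP_TX) and passed packet 1 */
 	return pkt == 0 ? XDP_CONSUMED : (void *)"skb";
 }

 static void clean_rx_ring(struct rx_desc *ring, int n)
 {
 	void *skb = NULL;
 	int pkt = 0;

 	for (int i = 0; i < n; i++) {
 		if (!skb)
 			skb = run_xdp(&ring[i], pkt);

 		/* keep skb (or the error marker) until EOP so the rest of
 		 * this packet's fragments are not parsed as new frames */
 		if (!ring[i].eop)
 			continue;

 		if (skb != XDP_CONSUMED)
 			printf("packet %d delivered (EOP at desc %d)\n", pkt, i);
 		else
 			printf("packet %d consumed by XDP (EOP at desc %d)\n", pkt, i);

 		skb = NULL;		/* reset only at a packet boundary */
 		pkt++;
 	}
 }

 int main(void)
 {
 	/* a 3-fragment packet followed by a single-fragment one */
 	struct rx_desc ring[] = { {false}, {false}, {true}, {true} };

 	clean_rx_ring(ring, 4);
 	return 0;
 }

With a plain int that is recomputed for every descriptor, the
"consumed" state is lost after the first fragment, which is exactly
the kind of mangling described above.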

>

> > Renamed subj as this is without this patchset applied.

> >

> > > $ uname -a

> > > Linux broadwell 5.12.0-rc7-net-next+ #600 SMP PREEMPT Thu Apr 22 15:13:15 CEST 2021 x86_64 x86_64 x86_64 GNU/Linux

> > >

> > > When I load any XDP prog almost no packets are let through:

> > >

> > >  [kernel-bpf-samples]$ sudo ./xdp1 i40e2

> > >  libbpf: elf: skipping unrecognized data section(16) .eh_frame

> > >  libbpf: elf: skipping relo section(17) .rel.eh_frame for section(16) .eh_frame

> > >  proto 17:          1 pkt/s

> > >  proto 0:          0 pkt/s

> > >  proto 17:          0 pkt/s

> > >  proto 0:          0 pkt/s

> > >  proto 17:          1 pkt/s

> >

> > Trying out xdp_redirect:

> >

> >  [kernel-bpf-samples]$ sudo ./xdp_redirect i40e2 i40e2

> >  input: 7 output: 7

> >  libbpf: elf: skipping unrecognized data section(20) .eh_frame

> >  libbpf: elf: skipping relo section(21) .rel.eh_frame for section(20) .eh_frame

> >  libbpf: Kernel error message: XDP program already attached

> >  WARN: link set xdp fd failed on 7

> >  ifindex 7:       7357 pkt/s

> >  ifindex 7:       7909 pkt/s

> >  ifindex 7:       7909 pkt/s

> >  ifindex 7:       7909 pkt/s

> >  ifindex 7:       7909 pkt/s

> >  ifindex 7:       7909 pkt/s

> >  ifindex 7:       6357 pkt/s

> >

> > And then it crash (see below) at page_frag_free+0x31 which calls

> > virt_to_head_page() with a wrong addr (I guess).  This is called by

> > i40e_clean_tx_irq+0xc9.

>

> Did not see a crash myself, just 4 Kpps. But the rings and DMA

> mappings got completely mangled by the patch above, so could be the

> same cause.


Are you running with jumbo frames enabled? I would think this change
would really blow things up in the jumbo enabled case.
Magnus Karlsson April 25, 2021, 9:45 a.m. UTC | #18
On Fri, Apr 23, 2021 at 6:43 PM Alexander Duyck
<alexander.duyck@gmail.com> wrote:
>

> On Thu, Apr 22, 2021 at 10:28 PM Magnus Karlsson

> <magnus.karlsson@gmail.com> wrote:

> >

> > On Thu, Apr 22, 2021 at 5:05 PM Jesper Dangaard Brouer

> > <brouer@redhat.com> wrote:

> > >

> > > On Thu, 22 Apr 2021 16:42:23 +0200

> > > Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> > >

> > > > On Thu, 22 Apr 2021 12:24:32 +0200

> > > > Magnus Karlsson <magnus.karlsson@gmail.com> wrote:

> > > >

> > > > > On Wed, Apr 21, 2021 at 5:39 PM Jesper Dangaard Brouer

> > > > > <brouer@redhat.com> wrote:

> > > > > >

> > > > > > On Wed, 21 Apr 2021 16:12:32 +0200

> > > > > > Magnus Karlsson <magnus.karlsson@gmail.com> wrote:

> > > > > >

> > > > [...]

> > > > > > > more than I get.

> > > > > >

> > > > > > I clearly have a bug in the i40e driver.  As I wrote later, I don't see

> > > > > > any packets transmitted for XDP_TX.  Hmm, I using Mel Gorman's tree,

> > > > > > which contains the i40e/ice/ixgbe bug we fixed earlier.

> > > >

> > > > Something is wrong with i40e, I changed git-tree to net-next (at

> > > > commit 5d869070569a) and XDP seems to have stopped working on i40e :-(

> >

> > Found this out too when switching to the net tree yesterday to work on

> > proper packet drop tracing as you spotted/requested yesterday. The

> > commit below completely broke XDP support on i40e (if you do not run

> > with a zero-copy AF_XDP socket because that path still works). I am

> > working on a fix that does not just revert the patch, but fixes the

> > original problem without breaking XDP. Will post it and the tracing

> > fixes as soon as I can.

> >

> > commit 12738ac4754ec92a6a45bf3677d8da780a1412b3

> > Author: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>

> > Date:   Fri Mar 26 19:43:40 2021 +0100

> >

> >     i40e: Fix sparse errors in i40e_txrx.c

> >

> >     Remove error handling through pointers. Instead use plain int

> >     to return value from i40e_run_xdp(...).

> >

> >     Previously:

> >     - sparse errors were produced during compilation:

> >     i40e_txrx.c:2338 i40e_run_xdp() error: (-2147483647) too low for ERR_PTR

> >     i40e_txrx.c:2558 i40e_clean_rx_irq() error: 'skb' dereferencing

> > possible ERR_PTR()

> >

> >     - sk_buff* was used to return value, but it has never had valid

> >     pointer to sk_buff. Returned value was always int handled as

> >     a pointer.

> >

> >     Fixes: 0c8493d90b6b ("i40e: add XDP support for pass and drop actions")

> >     Fixes: 2e6893123830 ("i40e: split XDP_TX tail and XDP_REDIRECT map

> > flushing")

> >     Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>

> >     Signed-off-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>

> >     Tested-by: Dave Switzer <david.switzer@intel.com>

> >     Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

>

> Yeah, this patch would horribly break things, especially in the

> multi-buffer case. The idea behind using the skb pointer to indicate

> the error is that it is persistent until we hit the EOP descriptor.

> With that removed you end up mangling the entire list of frames since

> it will start trying to process the next frame in the middle of a

> packet.

>

> >

> > > Renamed subj as this is without this patchset applied.

> > >

> > > > $ uname -a

> > > > Linux broadwell 5.12.0-rc7-net-next+ #600 SMP PREEMPT Thu Apr 22 15:13:15 CEST 2021 x86_64 x86_64 x86_64 GNU/Linux

> > > >

> > > > When I load any XDP prog almost no packets are let through:

> > > >

> > > >  [kernel-bpf-samples]$ sudo ./xdp1 i40e2

> > > >  libbpf: elf: skipping unrecognized data section(16) .eh_frame

> > > >  libbpf: elf: skipping relo section(17) .rel.eh_frame for section(16) .eh_frame

> > > >  proto 17:          1 pkt/s

> > > >  proto 0:          0 pkt/s

> > > >  proto 17:          0 pkt/s

> > > >  proto 0:          0 pkt/s

> > > >  proto 17:          1 pkt/s

> > >

> > > Trying out xdp_redirect:

> > >

> > >  [kernel-bpf-samples]$ sudo ./xdp_redirect i40e2 i40e2

> > >  input: 7 output: 7

> > >  libbpf: elf: skipping unrecognized data section(20) .eh_frame

> > >  libbpf: elf: skipping relo section(21) .rel.eh_frame for section(20) .eh_frame

> > >  libbpf: Kernel error message: XDP program already attached

> > >  WARN: link set xdp fd failed on 7

> > >  ifindex 7:       7357 pkt/s

> > >  ifindex 7:       7909 pkt/s

> > >  ifindex 7:       7909 pkt/s

> > >  ifindex 7:       7909 pkt/s

> > >  ifindex 7:       7909 pkt/s

> > >  ifindex 7:       7909 pkt/s

> > >  ifindex 7:       6357 pkt/s

> > >

> > > And then it crash (see below) at page_frag_free+0x31 which calls

> > > virt_to_head_page() with a wrong addr (I guess).  This is called by

> > > i40e_clean_tx_irq+0xc9.

> >

> > Did not see a crash myself, just 4 Kpps. But the rings and DMA

> > mappings got completely mangled by the patch above, so could be the

> > same cause.

>

> Are you running with jumbo frames enabled? I would think this change

> would really blow things up in the jumbo enabled case.


I did not. Just using XDP_DROP or XDP_TX would crash the system just fine.
Lorenzo Bianconi April 27, 2021, 6:28 p.m. UTC | #19
[...]

> Took your patches for a test run with the AF_XDP sample xdpsock on an

> i40e card and the throughput degradation is between 2 to 6% depending

> on the setup and microbenchmark within xdpsock that is executed. And

> this is without sending any multi frame packets. Just single frame

> ones. Tirtha made changes to the i40e driver to support this new

> interface so that is being included in the measurements.

> 

> What performance do you see with the mvneta card? How much are we

> willing to pay for this feature when it is not being used or can we in

> some way selectively turn it on only when needed?


Hi Magnus,

Today I carried out some comparison tests between bpf-next and bpf-next +
the xdp_multibuff series on mvneta, running the xdp_rxq_info sample.
Results are basically aligned:

bpf-next:
- xdp drop ~ 665Kpps
- xdp_tx   ~ 291Kpps
- xdp_pass ~ 118Kpps

bpf-next + xdp_multibuff:
- xdp drop ~ 672Kpps
- xdp_tx   ~ 288Kpps
- xdp_pass ~ 118Kpps

I am not sure if the results are affected by the low-power CPU; I will
run some tests on an ixgbe card.

Regards,
Lorenzo

> 

> Thanks: Magnus

> 

> > Eelco Chaudron (4):

> >   bpf: add multi-buff support to the bpf_xdp_adjust_tail() API

> >   bpd: add multi-buffer support to xdp copy helpers

> >   bpf: add new frame_length field to the XDP ctx

> >   bpf: update xdp_adjust_tail selftest to include multi-buffer

> >

> > Lorenzo Bianconi (10):

> >   xdp: introduce mb in xdp_buff/xdp_frame

> >   xdp: add xdp_shared_info data structure

> >   net: mvneta: update mb bit before passing the xdp buffer to eBPF layer

> >   xdp: add multi-buff support to xdp_return_{buff/frame}

> >   net: mvneta: add multi buffer support to XDP_TX

> >   net: mvneta: enable jumbo frames for XDP

> >   net: xdp: add multi-buff support to xdp_build_skb_from_fram

> >   bpf: move user_size out of bpf_test_init

> >   bpf: introduce multibuff support to bpf_prog_test_run_xdp()

> >   bpf: test_run: add xdp_shared_info pointer in bpf_test_finish

> >     signature

> >

> >  drivers/net/ethernet/marvell/mvneta.c         | 182 ++++++++++--------

> >  include/linux/filter.h                        |   7 +

> >  include/net/xdp.h                             | 105 +++++++++-

> >  include/uapi/linux/bpf.h                      |   1 +

> >  net/bpf/test_run.c                            | 109 +++++++++--

> >  net/core/filter.c                             | 134 ++++++++++++-

> >  net/core/xdp.c                                | 103 +++++++++-

> >  tools/include/uapi/linux/bpf.h                |   1 +

> >  .../bpf/prog_tests/xdp_adjust_tail.c          | 105 ++++++++++

> >  .../selftests/bpf/prog_tests/xdp_bpf2bpf.c    | 127 ++++++++----

> >  .../bpf/progs/test_xdp_adjust_tail_grow.c     |  17 +-

> >  .../bpf/progs/test_xdp_adjust_tail_shrink.c   |  32 ++-

> >  .../selftests/bpf/progs/test_xdp_bpf2bpf.c    |   3 +-

> >  13 files changed, 767 insertions(+), 159 deletions(-)

> >

> > --

> > 2.30.2

> >
Magnus Karlsson April 28, 2021, 7:41 a.m. UTC | #20
On Tue, Apr 27, 2021 at 8:28 PM Lorenzo Bianconi <lorenzo@kernel.org> wrote:
>
> [...]
>
> > Took your patches for a test run with the AF_XDP sample xdpsock on an
> > i40e card and the throughput degradation is between 2 and 6% depending
> > on the setup and microbenchmark within xdpsock that is executed. And
> > this is without sending any multi frame packets. Just single frame
> > ones. Tirtha made changes to the i40e driver to support this new
> > interface so that is being included in the measurements.
> >
> > What performance do you see with the mvneta card? How much are we
> > willing to pay for this feature when it is not being used or can we in
> > some way selectively turn it on only when needed?
>
> Hi Magnus,
>
> Today I carried out some comparison tests between bpf-next and bpf-next +
> xdp_multibuff series on mvneta, running the xdp_rxq_info sample. The
> results are essentially the same:
>
> bpf-next:
> - xdp drop ~ 665Kpps
> - xdp_tx   ~ 291Kpps
> - xdp_pass ~ 118Kpps
>
> bpf-next + xdp_multibuff:
> - xdp drop ~ 672Kpps
> - xdp_tx   ~ 288Kpps
> - xdp_pass ~ 118Kpps
>
> I am not sure whether the results are affected by the low-power CPU; I will
> run some tests on an ixgbe card.


Thanks Lorenzo. I made some new runs, this time with the i40e driver
changes as a new data point. Same baseline as before, but with patches
[1] and [2] applied. Note that if you use net or net-next and i40e, you
need patch [3] too.

The i40e multi-buffer support will be posted on the mailing list as a
separate RFC patch so you can reproduce and review.

Note that the calculations are performed on non-truncated numbers, so 2 ns
might be 5 cycles on my 2.1 GHz machine, since 2.49 ns * 2.1 GHz =
5.229 cycles ~ 5 cycles. xdpsock is run in zero-copy mode, so it uses
the zero-copy driver data path, in contrast with xdp_rxq_info, which uses
the regular driver data path. I only ran the busy-poll 1-core case this
time. Reported numbers are the average over 3 runs.
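
To make the arithmetic explicit, here is a minimal sketch of that
conversion (illustrative only; it just hard-codes the 2.1 GHz clock and
the untruncated 2.49 ns delta quoted above):

/* ns_to_cycles.c - back-of-the-envelope ns -> cycles conversion */
#include <math.h>
#include <stdio.h>

int main(void)
{
	const double cpu_ghz = 2.1;   /* assumed core clock of the test machine */
	const double delta_ns = 2.49; /* reported as "+2 ns" once truncated */
	double cycles = delta_ns * cpu_ghz; /* 2.49 * 2.1 = 5.229 */

	printf("%.2f ns ~ %.0f cycles at %.1f GHz\n",
	       delta_ns, round(cycles), cpu_ghz);
	return 0;
}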

multi-buffer patches without any driver changes:

xdpsock rxdrop 1-core:
i40e: -4.5% in throughput / +3 ns / +6 cycles
ice: -1.5% / +1 ns / +2 cycles

xdp_rxq_info -a XDP_DROP
i40e: -2.5% / +2 ns / +3 cycles
ice: +6% / -3 ns / -7 cycles

xdp_rxq_info -a XDP_TX
i40e: -10% / +15 ns / +32 cycles
ice: -9% / +14 ns / +29 cycles

multi-buffer patches + i40e driver changes from Tirtha:

xdpsock rxdrop 1-core:
i40e: -3% / +2 ns / +3 cycles

xdp_rxq_info -a XDP_DROP
i40e: -7.5% / +5 ns / +9 cycles

xdp_rxq_info -a XDP_TX
i40e: -10% / +15 ns / +32 cycles

It would be great if someone could rerun a similar set of experiments on
i40e or ice and then report back.

[1] https://lists.osuosl.org/pipermail/intel-wired-lan/Week-of-Mon-20210419/024106.html
[2] https://lists.osuosl.org/pipermail/intel-wired-lan/Week-of-Mon-20210426/024135.html
[3] https://lists.osuosl.org/pipermail/intel-wired-lan/Week-of-Mon-20210426/024129.html

> Regards,
> Lorenzo
>
> >
> > Thanks: Magnus
> >
> > > Eelco Chaudron (4):
> > >   bpf: add multi-buff support to the bpf_xdp_adjust_tail() API
> > >   bpf: add multi-buffer support to xdp copy helpers
> > >   bpf: add new frame_length field to the XDP ctx
> > >   bpf: update xdp_adjust_tail selftest to include multi-buffer
> > >
> > > Lorenzo Bianconi (10):
> > >   xdp: introduce mb in xdp_buff/xdp_frame
> > >   xdp: add xdp_shared_info data structure
> > >   net: mvneta: update mb bit before passing the xdp buffer to eBPF layer
> > >   xdp: add multi-buff support to xdp_return_{buff/frame}
> > >   net: mvneta: add multi buffer support to XDP_TX
> > >   net: mvneta: enable jumbo frames for XDP
> > >   net: xdp: add multi-buff support to xdp_build_skb_from_frame
> > >   bpf: move user_size out of bpf_test_init
> > >   bpf: introduce multibuff support to bpf_prog_test_run_xdp()
> > >   bpf: test_run: add xdp_shared_info pointer in bpf_test_finish
> > >     signature
> > >
> > >  drivers/net/ethernet/marvell/mvneta.c         | 182 ++++++++++--------
> > >  include/linux/filter.h                        |   7 +
> > >  include/net/xdp.h                             | 105 +++++++++-
> > >  include/uapi/linux/bpf.h                      |   1 +
> > >  net/bpf/test_run.c                            | 109 +++++++++--
> > >  net/core/filter.c                             | 134 ++++++++++++-
> > >  net/core/xdp.c                                | 103 +++++++++-
> > >  tools/include/uapi/linux/bpf.h                |   1 +
> > >  .../bpf/prog_tests/xdp_adjust_tail.c          | 105 ++++++++++
> > >  .../selftests/bpf/prog_tests/xdp_bpf2bpf.c    | 127 ++++++++----
> > >  .../bpf/progs/test_xdp_adjust_tail_grow.c     |  17 +-
> > >  .../bpf/progs/test_xdp_adjust_tail_shrink.c   |  32 ++-
> > >  .../selftests/bpf/progs/test_xdp_bpf2bpf.c    |   3 +-
> > >  13 files changed, 767 insertions(+), 159 deletions(-)
> > >
> > > --
> > > 2.30.2
> > >
Jesper Dangaard Brouer April 29, 2021, 12:49 p.m. UTC | #21
On Wed, 28 Apr 2021 09:41:52 +0200
Magnus Karlsson <magnus.karlsson@gmail.com> wrote:

> On Tue, Apr 27, 2021 at 8:28 PM Lorenzo Bianconi <lorenzo@kernel.org> wrote:
> >
> > [...]
> >
> > > Took your patches for a test run with the AF_XDP sample xdpsock on an
> > > i40e card and the throughput degradation is between 2 and 6% depending
> > > on the setup and microbenchmark within xdpsock that is executed. And
> > > this is without sending any multi frame packets. Just single frame
> > > ones. Tirtha made changes to the i40e driver to support this new
> > > interface so that is being included in the measurements.
> > >
> > > What performance do you see with the mvneta card? How much are we
> > > willing to pay for this feature when it is not being used or can we in
> > > some way selectively turn it on only when needed?
> >
> > Hi Magnus,
> >
> > Today I carried out some comparison tests between bpf-next and bpf-next +
> > xdp_multibuff series on mvneta, running the xdp_rxq_info sample. The
> > results are essentially the same:
> >
> > bpf-next:
> > - xdp drop ~ 665Kpps
> > - xdp_tx   ~ 291Kpps
> > - xdp_pass ~ 118Kpps
> >
> > bpf-next + xdp_multibuff:
> > - xdp drop ~ 672Kpps
> > - xdp_tx   ~ 288Kpps
> > - xdp_pass ~ 118Kpps
> >
> > I am not sure whether the results are affected by the low-power CPU; I will
> > run some tests on an ixgbe card.
>
> Thanks Lorenzo. I made some new runs, this time with the i40e driver
> changes as a new data point. Same baseline as before, but with patches
> [1] and [2] applied. Note that if you use net or net-next and i40e, you
> need patch [3] too.
>
> The i40e multi-buffer support will be posted on the mailing list as a
> separate RFC patch so you can reproduce and review.
>
> Note that the calculations are performed on non-truncated numbers, so 2 ns
> might be 5 cycles on my 2.1 GHz machine, since 2.49 ns * 2.1 GHz =
> 5.229 cycles ~ 5 cycles. xdpsock is run in zero-copy mode, so it uses
> the zero-copy driver data path, in contrast with xdp_rxq_info, which uses
> the regular driver data path. I only ran the busy-poll 1-core case this
> time. Reported numbers are the average over 3 runs.


Yes, for i40e the xdpsock zero-copy test uses another code path; this
is something we need to keep in mind.

Also remember that we designed the central xdp_do_redirect() call to
delay creation of the xdp_frame.  This is something that AF_XDP ZC takes
advantage of. Thus, the cost of the xdp_buff to xdp_frame conversion is
not covered in the tests below, and I expect this patchset to increase
that cost...
(UPDATE: the XDP_TX tests below actually do perform the xdp_frame conversion)
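
To illustrate that point, here is a rough sketch (not the actual i40e
code; the example_* ring and xmit helpers are made up) of why XDP_TX
pays for the xdp_buff -> xdp_frame conversion on the spot, while the
redirect path can defer it:

#include <linux/filter.h>
#include <linux/netdevice.h>
#include <net/xdp.h>

struct example_ring {			/* hypothetical driver RX/TX ring */
	struct net_device *netdev;
	struct example_ring *xdp_tx_ring;
};

void example_xmit_xdp_frame(struct example_ring *ring, struct xdp_frame *xdpf);

static u32 example_run_xdp(struct example_ring *rx_ring, struct xdp_buff *xdp,
			   struct bpf_prog *prog)
{
	u32 act = bpf_prog_run_xdp(prog, xdp);

	switch (act) {
	case XDP_TX: {
		/* xdp_buff -> xdp_frame conversion happens right here, so any
		 * extra work added to the conversion (e.g. multi-buffer
		 * metadata) shows up directly in the XDP_TX numbers.
		 */
		struct xdp_frame *xdpf = xdp_convert_buff_to_frame(xdp);

		if (unlikely(!xdpf))
			return XDP_DROP;
		example_xmit_xdp_frame(rx_ring->xdp_tx_ring, xdpf);
		return XDP_TX;
	}
	case XDP_REDIRECT:
		/* The conversion is deferred here; AF_XDP zero-copy can
		 * consume the buffer without ever building an xdp_frame.
		 */
		if (xdp_do_redirect(rx_ring->netdev, xdp, prog) < 0)
			return XDP_DROP;
		return XDP_REDIRECT;
	default:
		return act;
	}
}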


> multi-buffer patches without any driver changes:


Thank you *SO* much, Magnus, for these superb tests.  I absolutely love
how comprehensive your test results are.  Thank you for catching the
performance regression in this patchset. (I for one know how time
consuming these kinds of tests are; I appreciate your effort, a lot!)

> xdpsock rxdrop 1-core:
> i40e: -4.5% in throughput / +3 ns / +6 cycles
> ice: -1.5% / +1 ns / +2 cycles
>
> xdp_rxq_info -a XDP_DROP
> i40e: -2.5% / +2 ns / +3 cycles
> ice: +6% / -3 ns / -7 cycles
>
> xdp_rxq_info -a XDP_TX
> i40e: -10% / +15 ns / +32 cycles
> ice: -9% / +14 ns / +29 cycles


This is a clear performance regression.

Looking closer at the driver, i40e_xmit_xdp_tx_ring() actually performs an
xdp_frame conversion by calling xdp_convert_buff_to_frame(xdp).

FYI: we have started an off-list thread (and a discussion on IRC with
Lorenzo) to find the root cause.  The current lead is that, as Alexei so
wisely pointed out in earlier patches, struct bitfield access is not
efficient...

As I expect we will soon need bits for HW RX checksum indication, and for
indicating whether the metadata contains a BTF-described area, I've asked
Lorenzo to consider this and look into introducing a flags member. (Then we
just have to figure out how to make flags access efficient.)
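
To sketch what that could look like (purely illustrative: the struct
and flag names below are made up and are not the series' actual
definitions), a dedicated flags word keeps the multi-buffer test to a
plain load-and-mask and leaves room for future bits such as an HW RX
checksum or BTF-metadata indication:

#include <linux/types.h>

struct xdpb_bitfield_style {	/* roughly the current style: a 1-bit field */
	void *data;
	u32 frame_sz:31;
	u32 mb:1;		/* updates are a read-modify-write of the
				 * storage unit shared with frame_sz */
};

#define XDP_FLAGS_MULTI_BUFF	(1U << 0)
#define XDP_FLAGS_HW_RX_CSUM	(1U << 1)	/* possible future bits */
#define XDP_FLAGS_BTF_METADATA	(1U << 2)

struct xdpb_flags_style {	/* flags-word alternative */
	void *data;
	u32 frame_sz;
	u32 flags;		/* owned entirely by flag bits: simple to test
				 * and cheap to extend with new indications */
};

static inline bool xdpb_is_mb(const struct xdpb_flags_style *xdp)
{
	return xdp->flags & XDP_FLAGS_MULTI_BUFF;
}

static inline void xdpb_set_mb(struct xdpb_flags_style *xdp)
{
	xdp->flags |= XDP_FLAGS_MULTI_BUFF;
}

Whether a flags word is actually cheaper than the bitfield on the hot
path is exactly the kind of thing that needs to be measured, as above.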
 

> multi-buffer patches + i40e driver changes from Tirtha:
>
> xdpsock rxdrop 1-core:
> i40e: -3% / +2 ns / +3 cycles
>
> xdp_rxq_info -a XDP_DROP
> i40e: -7.5% / +5 ns / +9 cycles
>
> xdp_rxq_info -a XDP_TX
> i40e: -10% / +15 ns / +32 cycles
>
> It would be great if someone could rerun a similar set of experiments on
> i40e or ice and then report back.
>
> [1] https://lists.osuosl.org/pipermail/intel-wired-lan/Week-of-Mon-20210419/024106.html
> [2] https://lists.osuosl.org/pipermail/intel-wired-lan/Week-of-Mon-20210426/024135.html
> [3] https://lists.osuosl.org/pipermail/intel-wired-lan/Week-of-Mon-20210426/024129.html


I'm very happy that you/we all are paying attention to keeping XDP
performance intact, as small 'paper cuts' like +32 cycles do affect
XDP in the long run. Happy performance testing, everybody :-)

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer