[v5,bpf-next,00/14] mvneta: introduce XDP multi-buffer support

Message ID cover.1607349924.git.lorenzo@kernel.org

Message

Lorenzo Bianconi Dec. 7, 2020, 4:32 p.m. UTC
This series introduces XDP multi-buffer support. The mvneta driver is
the first to support these new "non-linear" xdp_{buff,frame} types.
Reviewers, please focus on how these new types of xdp_{buff,frame}
packets traverse the different layers and on the layout design. The
BPF helpers are deliberately kept simple, as we don't want to expose
the internal layout, so that it can still be changed later.

For now, to keep the design simple and to maintain performance, the XDP
BPF program (still) only has access to the first buffer. Adding payload
access across multiple buffers is left for later (another patchset);
this patchset should still allow for such future extensions. The goal
is to lift the MTU restriction that comes with XDP while maintaining
the same performance as before.

The main idea for the new multi-buffer layout is to reuse the same
layout used for non-linear SKBs. We introduce an "xdp_shared_info" data
structure at the end of the first buffer to link together subsequent buffers.
xdp_shared_info will alias skb_shared_info, allowing most of the frags to be
kept in the same cache-line (whereas with skb_shared_info only the first
fragment is placed in the first "shared_info" cache-line). Moreover, we
introduce some xdp_shared_info helpers aligned to the skb_frag* ones.
Converting an xdp_frame to an SKB and delivering it to the network stack is
shown in the cpumap code (patch 11/14). When building the SKB, the
xdp_shared_info structure is converted into a skb_shared_info one.

A multi-buffer bit (mb) has been introduced in the xdp_{buff,frame}
structure to notify the bpf/network layer whether this is an xdp
multi-buffer frame (mb = 1) or not (mb = 0).
The mb bit will be set by an xdp multi-buffer capable driver only for
non-linear frames, maintaining the capability to receive linear frames
without any extra cost, since the xdp_shared_info structure at the end
of the first buffer is initialized only if mb is set.

Typical use cases for this series are:
- Jumbo-frames
- Packet header split (please see Google’s use-case @ NetDevConf 0x14, [0])
- TSO

A new frame_length field has been introduced in the XDP ctx in order to
notify the eBPF layer about the total frame size (linear + paged parts).

The bpf_xdp_adjust_tail helper has been modified to take into account
xdp multi-buff frames.

More info about the main idea behind this approach can be found here [1][2].

Changes since v4:
- rebase ontop of bpf-next
- introduce xdp_shared_info to build xdp multi-buff instead of using the
  skb_shared_info struct
- introduce frame_length in xdp ctx
- drop previous bpf helpers
- fix bpf_xdp_adjust_tail for xdp multi-buff
- introduce xdp multi-buff self-tests for bpf_xdp_adjust_tail
- fix xdp_return_frame_bulk for xdp multi-buff

Changes since v3:
- rebase ontop of bpf-next
- add patch 10/13 to copy back paged data from a xdp multi-buff frame to
  userspace buffer for xdp multi-buff selftests

Changes since v2:
- add throughput measurements
- drop bpf_xdp_adjust_mb_header bpf helper
- introduce selftest for xdp multibuffer
- addressed comments on bpf_xdp_get_frags_count
- introduce xdp multi-buff support to cpumaps

Changes since v1:
- Fix use-after-free in xdp_return_{buff/frame}
- Introduce bpf helpers
- Introduce xdp_mb sample program
- access skb_shared_info->nr_frags only on the last fragment

Changes since RFC:
- squash multi-buffer bit initialization in a single patch
- add mvneta non-linear XDP buff support for tx side

[0] https://netdevconf.info/0x14/session.html?talk-the-path-to-tcp-4k-mtu-and-rx-zerocopy
[1] https://github.com/xdp-project/xdp-project/blob/master/areas/core/xdp-multi-buffer01-design.org
[2] https://netdevconf.info/0x14/session.html?tutorial-add-XDP-support-to-a-NIC-driver (XDP multi-buffers section)

Eelco Chaudron (3):
  bpf: add multi-buff support to the bpf_xdp_adjust_tail() API
  bpf: add new frame_length field to the XDP ctx
  bpf: update xdp_adjust_tail selftest to include multi-buffer

Lorenzo Bianconi (11):
  xdp: introduce mb in xdp_buff/xdp_frame
  xdp: initialize xdp_buff mb bit to 0 in all XDP drivers
  xdp: add xdp_shared_info data structure
  net: mvneta: update mb bit before passing the xdp buffer to eBPF layer
  xdp: add multi-buff support to xdp_return_{buff/frame}
  net: mvneta: add multi buffer support to XDP_TX
  bpf: move user_size out of bpf_test_init
  bpf: introduce multibuff support to bpf_prog_test_run_xdp()
  bpf: test_run: add xdp_shared_info pointer in bpf_test_finish
    signature
  net: mvneta: enable jumbo frames for XDP
  bpf: cpumap: introduce xdp multi-buff support

 drivers/net/ethernet/amazon/ena/ena_netdev.c  |   1 +
 drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c |   1 +
 .../net/ethernet/cavium/thunder/nicvf_main.c  |   1 +
 .../net/ethernet/freescale/dpaa2/dpaa2-eth.c  |   1 +
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   |   1 +
 drivers/net/ethernet/intel/ice/ice_txrx.c     |   1 +
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   1 +
 .../net/ethernet/intel/ixgbevf/ixgbevf_main.c |   1 +
 drivers/net/ethernet/marvell/mvneta.c         | 181 ++++++++++--------
 .../net/ethernet/marvell/mvpp2/mvpp2_main.c   |   1 +
 drivers/net/ethernet/mellanox/mlx4/en_rx.c    |   1 +
 .../net/ethernet/mellanox/mlx5/core/en_rx.c   |   1 +
 .../ethernet/netronome/nfp/nfp_net_common.c   |   1 +
 drivers/net/ethernet/qlogic/qede/qede_fp.c    |   1 +
 drivers/net/ethernet/sfc/rx.c                 |   1 +
 drivers/net/ethernet/socionext/netsec.c       |   1 +
 drivers/net/ethernet/ti/cpsw.c                |   1 +
 drivers/net/ethernet/ti/cpsw_new.c            |   1 +
 drivers/net/hyperv/netvsc_bpf.c               |   1 +
 drivers/net/tun.c                             |   2 +
 drivers/net/veth.c                            |   1 +
 drivers/net/virtio_net.c                      |   2 +
 drivers/net/xen-netfront.c                    |   1 +
 include/net/xdp.h                             | 111 ++++++++++-
 include/uapi/linux/bpf.h                      |   1 +
 kernel/bpf/cpumap.c                           |  45 +----
 kernel/bpf/verifier.c                         |   2 +-
 net/bpf/test_run.c                            | 107 +++++++++--
 net/core/dev.c                                |   1 +
 net/core/filter.c                             | 146 ++++++++++++++
 net/core/xdp.c                                | 150 ++++++++++++++-
 tools/include/uapi/linux/bpf.h                |   1 +
 .../bpf/prog_tests/xdp_adjust_tail.c          | 105 ++++++++++
 .../bpf/progs/test_xdp_adjust_tail_grow.c     |  16 +-
 .../bpf/progs/test_xdp_adjust_tail_shrink.c   |  32 +++-
 35 files changed, 761 insertions(+), 161 deletions(-)

Comments

Maciej Fijalkowski Dec. 7, 2020, 9:37 p.m. UTC | #1
On Mon, Dec 07, 2020 at 01:15:00PM -0800, Alexander Duyck wrote:
> On Mon, Dec 7, 2020 at 8:36 AM Lorenzo Bianconi <lorenzo@kernel.org> wrote:
> >
> > Initialize multi-buffer bit (mb) to 0 in all XDP-capable drivers.
> > This is a preliminary patch to enable xdp multi-buffer support.
> >
> > Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
> 
> I'm really not a fan of this design. Having to update every driver in
> order to initialize a field that was fragmented is a pain. At a
> minimum it seems like it might be time to consider introducing some
> sort of initializer function for this so that you can update things in
> one central place the next time you have to add a new field instead of
> having to update every individual driver that supports XDP. Otherwise
> this isn't going to scale going forward.

Also, a good example of why this might be a problem for us is the fact
that in the meantime the dpaa driver got XDP support, and this patch
hasn't been updated to include the mb setting in that driver.

> 
> > ---
> >  drivers/net/ethernet/amazon/ena/ena_netdev.c        | 1 +
> >  drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c       | 1 +
> >  drivers/net/ethernet/cavium/thunder/nicvf_main.c    | 1 +
> >  drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c    | 1 +
> >  drivers/net/ethernet/intel/i40e/i40e_txrx.c         | 1 +
> >  drivers/net/ethernet/intel/ice/ice_txrx.c           | 1 +
> >  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c       | 1 +
> >  drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c   | 1 +
> >  drivers/net/ethernet/marvell/mvneta.c               | 1 +
> >  drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c     | 1 +
> >  drivers/net/ethernet/mellanox/mlx4/en_rx.c          | 1 +
> >  drivers/net/ethernet/mellanox/mlx5/core/en_rx.c     | 1 +
> >  drivers/net/ethernet/netronome/nfp/nfp_net_common.c | 1 +
> >  drivers/net/ethernet/qlogic/qede/qede_fp.c          | 1 +
> >  drivers/net/ethernet/sfc/rx.c                       | 1 +
> >  drivers/net/ethernet/socionext/netsec.c             | 1 +
> >  drivers/net/ethernet/ti/cpsw.c                      | 1 +
> >  drivers/net/ethernet/ti/cpsw_new.c                  | 1 +
> >  drivers/net/hyperv/netvsc_bpf.c                     | 1 +
> >  drivers/net/tun.c                                   | 2 ++
> >  drivers/net/veth.c                                  | 1 +
> >  drivers/net/virtio_net.c                            | 2 ++
> >  drivers/net/xen-netfront.c                          | 1 +
> >  net/core/dev.c                                      | 1 +
> >  24 files changed, 26 insertions(+)
> >
Saeed Mahameed Dec. 8, 2020, 12:22 a.m. UTC | #2
On Mon, 2020-12-07 at 17:32 +0100, Lorenzo Bianconi wrote:
> Introduce xdp_shared_info data structure to contain info about
> "non-linear" xdp frame. xdp_shared_info will alias skb_shared_info
> allowing to keep most of the frags in the same cache-line.
> Introduce some xdp_shared_info helpers aligned to skb_frag* ones
> 

is there or will be a more general purpose use to this xdp_shared_info
? other than hosting frags ?

> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
> ---
>  drivers/net/ethernet/marvell/mvneta.c | 62 +++++++++++++++--------
> ----
>  include/net/xdp.h                     | 52 ++++++++++++++++++++--
>  2 files changed, 82 insertions(+), 32 deletions(-)
> 
> diff --git a/drivers/net/ethernet/marvell/mvneta.c
> b/drivers/net/ethernet/marvell/mvneta.c
> index 1e5b5c69685a..d635463609ad 100644
> --- a/drivers/net/ethernet/marvell/mvneta.c
> +++ b/drivers/net/ethernet/marvell/mvneta.c
> @@ -2033,14 +2033,17 @@ int mvneta_rx_refill_queue(struct mvneta_port
> *pp, struct mvneta_rx_queue *rxq)
>  

[...]

>  static void
> @@ -2278,7 +2281,7 @@ mvneta_swbm_add_rx_fragment(struct mvneta_port
> *pp,
>  			    struct mvneta_rx_desc *rx_desc,
>  			    struct mvneta_rx_queue *rxq,
>  			    struct xdp_buff *xdp, int *size,
> -			    struct skb_shared_info *xdp_sinfo,
> +			    struct xdp_shared_info *xdp_sinfo,
>  			    struct page *page)
>  {
>  	struct net_device *dev = pp->dev;
> @@ -2301,13 +2304,13 @@ mvneta_swbm_add_rx_fragment(struct
> mvneta_port *pp,
>  	if (data_len > 0 && xdp_sinfo->nr_frags < MAX_SKB_FRAGS) {
>  		skb_frag_t *frag = &xdp_sinfo->frags[xdp_sinfo-
> >nr_frags++];
>  
> -		skb_frag_off_set(frag, pp->rx_offset_correction);
> -		skb_frag_size_set(frag, data_len);
> -		__skb_frag_set_page(frag, page);
> +		xdp_set_frag_offset(frag, pp->rx_offset_correction);
> +		xdp_set_frag_size(frag, data_len);
> +		xdp_set_frag_page(frag, page);
>  

why three separate setters ? why not just one 
xdp_set_frag(page, offset, size) ?

>  		/* last fragment */
>  		if (len == *size) {
> -			struct skb_shared_info *sinfo;
> +			struct xdp_shared_info *sinfo;
>  
>  			sinfo = xdp_get_shared_info_from_buff(xdp);
>  			sinfo->nr_frags = xdp_sinfo->nr_frags;
> @@ -2324,10 +2327,13 @@ static struct sk_buff *
>  mvneta_swbm_build_skb(struct mvneta_port *pp, struct mvneta_rx_queue
> *rxq,
>  		      struct xdp_buff *xdp, u32 desc_status)
>  {
> -	struct skb_shared_info *sinfo =
> xdp_get_shared_info_from_buff(xdp);
> -	int i, num_frags = sinfo->nr_frags;
> +	struct xdp_shared_info *xdp_sinfo =
> xdp_get_shared_info_from_buff(xdp);
> +	int i, num_frags = xdp_sinfo->nr_frags;
> +	skb_frag_t frag_list[MAX_SKB_FRAGS];
>  	struct sk_buff *skb;
>  
> +	memcpy(frag_list, xdp_sinfo->frags, sizeof(skb_frag_t) *
> num_frags);
> +
>  	skb = build_skb(xdp->data_hard_start, PAGE_SIZE);
>  	if (!skb)
>  		return ERR_PTR(-ENOMEM);
> @@ -2339,12 +2345,12 @@ mvneta_swbm_build_skb(struct mvneta_port *pp,
> struct mvneta_rx_queue *rxq,
>  	mvneta_rx_csum(pp, desc_status, skb);
>  
>  	for (i = 0; i < num_frags; i++) {
> -		skb_frag_t *frag = &sinfo->frags[i];
> +		struct page *page = xdp_get_frag_page(&frag_list[i]);
>  
>  		skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags,
> -				skb_frag_page(frag),
> skb_frag_off(frag),
> -				skb_frag_size(frag), PAGE_SIZE);
> -		page_pool_release_page(rxq->page_pool,
> skb_frag_page(frag));
> +				page,
> xdp_get_frag_offset(&frag_list[i]),
> +				xdp_get_frag_size(&frag_list[i]),
> PAGE_SIZE);
> +		page_pool_release_page(rxq->page_pool, page);
>  	}
>  
>  	return skb;
> @@ -2357,7 +2363,7 @@ static int mvneta_rx_swbm(struct napi_struct
> *napi,
>  {
>  	int rx_proc = 0, rx_todo, refill, size = 0;
>  	struct net_device *dev = pp->dev;
> -	struct skb_shared_info sinfo;
> +	struct xdp_shared_info xdp_sinfo;
>  	struct mvneta_stats ps = {};
>  	struct bpf_prog *xdp_prog;
>  	u32 desc_status, frame_sz;
> @@ -2368,7 +2374,7 @@ static int mvneta_rx_swbm(struct napi_struct
> *napi,
>  	xdp_buf.rxq = &rxq->xdp_rxq;
>  	xdp_buf.mb = 0;
>  
> -	sinfo.nr_frags = 0;
> +	xdp_sinfo.nr_frags = 0;
>  
>  	/* Get number of received packets */
>  	rx_todo = mvneta_rxq_busy_desc_num_get(pp, rxq);
> @@ -2412,7 +2418,7 @@ static int mvneta_rx_swbm(struct napi_struct
> *napi,
>  			}
>  
>  			mvneta_swbm_add_rx_fragment(pp, rx_desc, rxq,
> &xdp_buf,
> -						    &size, &sinfo,
> page);
> +						    &size, &xdp_sinfo,
> page);
>  		} /* Middle or Last descriptor */
>  
>  		if (!(rx_status & MVNETA_RXD_LAST_DESC))
> @@ -2420,7 +2426,7 @@ static int mvneta_rx_swbm(struct napi_struct
> *napi,
>  			continue;
>  
>  		if (size) {
> -			mvneta_xdp_put_buff(pp, rxq, &xdp_buf, &sinfo,
> -1);
> +			mvneta_xdp_put_buff(pp, rxq, &xdp_buf,
> &xdp_sinfo, -1);
>  			goto next;
>  		}
>  
> @@ -2432,7 +2438,7 @@ static int mvneta_rx_swbm(struct napi_struct
> *napi,
>  		if (IS_ERR(skb)) {
>  			struct mvneta_pcpu_stats *stats =
> this_cpu_ptr(pp->stats);
>  
> -			mvneta_xdp_put_buff(pp, rxq, &xdp_buf, &sinfo,
> -1);
> +			mvneta_xdp_put_buff(pp, rxq, &xdp_buf,
> &xdp_sinfo, -1);
>  
>  			u64_stats_update_begin(&stats->syncp);
>  			stats->es.skb_alloc_error++;
> @@ -2449,12 +2455,12 @@ static int mvneta_rx_swbm(struct napi_struct
> *napi,
>  		napi_gro_receive(napi, skb);
>  next:
>  		xdp_buf.data_hard_start = NULL;
> -		sinfo.nr_frags = 0;
> +		xdp_sinfo.nr_frags = 0;
>  	}
>  	rcu_read_unlock();
>  
>  	if (xdp_buf.data_hard_start)
> -		mvneta_xdp_put_buff(pp, rxq, &xdp_buf, &sinfo, -1);
> +		mvneta_xdp_put_buff(pp, rxq, &xdp_buf, &xdp_sinfo, -1);
>  
>  	if (ps.xdp_redirect)
>  		xdp_do_flush_map();
> diff --git a/include/net/xdp.h b/include/net/xdp.h
> index 70559720ff44..614f66d35ee8 100644
> --- a/include/net/xdp.h
> +++ b/include/net/xdp.h
> @@ -87,10 +87,54 @@ struct xdp_buff {
>  	((xdp)->data_hard_start + (xdp)->frame_sz -	\
>  	 SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
>  
> -static inline struct skb_shared_info *
> +struct xdp_shared_info {

xdp_shared_info is a bad name, we need this to have a specific purpose.
xdp_frags would be the proper name, so people will think twice before
adding weird bits to this so-called shared_info.

> +	u16 nr_frags;
> +	u16 data_length; /* paged area length */
> +	skb_frag_t frags[MAX_SKB_FRAGS];

why MAX_SKB_FRAGS ? just use a flexible array member 
skb_frag_t frags[]; 

and enforce size via the n_frags and on the construction of the
tailroom preserved buffer, which is already being done.

this is a waste of unnecessary space, at least by definition of the
struct. in your use case you do:
memcpy(frag_list, xdp_sinfo->frags, sizeof(skb_frag_t) * num_frags);
And the tailroom space was already preserved for a full skb_shinfo,
so i don't see why you need this array to be of a fixed MAX_SKB_FRAGS
size.

> +};
> +
> +static inline struct xdp_shared_info *
>  xdp_get_shared_info_from_buff(struct xdp_buff *xdp)
>  {
> -	return (struct skb_shared_info *)xdp_data_hard_end(xdp);
> +	BUILD_BUG_ON(sizeof(struct xdp_shared_info) >
> +		     sizeof(struct skb_shared_info));
> +	return (struct xdp_shared_info *)xdp_data_hard_end(xdp);
> +}
> +

Back to my first comment, do we have plans to use this tail room buffer
for other than frag_list use cases ? what will be the buffer format
then ? should we push all new fields to the end of the xdp_shared_info
struct ? or deal with this tailroom buffer as a stack ? 
my main concern is that for drivers that don't support frag list and
still want to utilize the tailroom buffer for other usecases they will
have to skip the first sizeof(xdp_shared_info) so they won't break the
stack.

> +static inline struct page *xdp_get_frag_page(const skb_frag_t *frag)
> +{
> +	return frag->bv_page;
> +}
> +
> +static inline unsigned int xdp_get_frag_offset(const skb_frag_t
> *frag)
> +{
> +	return frag->bv_offset;
> +}
> +
> +static inline unsigned int xdp_get_frag_size(const skb_frag_t *frag)
> +{
> +	return frag->bv_len;
> +}
> +
> +static inline void *xdp_get_frag_address(const skb_frag_t *frag)
> +{
> +	return page_address(xdp_get_frag_page(frag)) +
> +	       xdp_get_frag_offset(frag);
> +}
> +
> +static inline void xdp_set_frag_page(skb_frag_t *frag, struct page
> *page)
> +{
> +	frag->bv_page = page;
> +}
> +
> +static inline void xdp_set_frag_offset(skb_frag_t *frag, u32 offset)
> +{
> +	frag->bv_offset = offset;
> +}
> +
> +static inline void xdp_set_frag_size(skb_frag_t *frag, u32 size)
> +{
> +	frag->bv_len = size;
>  }
>  
>  struct xdp_frame {
> @@ -120,12 +164,12 @@ static __always_inline void
> xdp_frame_bulk_init(struct xdp_frame_bulk *bq)
>  	bq->xa = NULL;
>  }
>  
> -static inline struct skb_shared_info *
> +static inline struct xdp_shared_info *
>  xdp_get_shared_info_from_frame(struct xdp_frame *frame)
>  {
>  	void *data_hard_start = frame->data - frame->headroom -
> sizeof(*frame);
>  
> -	return (struct skb_shared_info *)(data_hard_start + frame-
> >frame_sz -
> +	return (struct xdp_shared_info *)(data_hard_start + frame-
> >frame_sz -
>  				SKB_DATA_ALIGN(sizeof(struct
> skb_shared_info)));
>  }
>  

need a comment here why we preserve the size of skb_shared_info, yet
the usable buffer is of type xdp_shared_info.
Lorenzo Bianconi Dec. 8, 2020, 10:31 a.m. UTC | #3
> On Mon, 2020-12-07 at 22:37 +0100, Maciej Fijalkowski wrote:
> > On Mon, Dec 07, 2020 at 01:15:00PM -0800, Alexander Duyck wrote:
> > > On Mon, Dec 7, 2020 at 8:36 AM Lorenzo Bianconi <lorenzo@kernel.org
> > > > wrote:
> > > > Initialize multi-buffer bit (mb) to 0 in all XDP-capable drivers.
> > > > This is a preliminary patch to enable xdp multi-buffer support.
> > > >
> > > > Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
> > >
> > > I'm really not a fan of this design. Having to update every driver in
> > > order to initialize a field that was fragmented is a pain. At a
> > > minimum it seems like it might be time to consider introducing some
> > > sort of initializer function for this so that you can update things in
> > > one central place the next time you have to add a new field instead of
> > > having to update every individual driver that supports XDP. Otherwise
> > > this isn't going to scale going forward.
> >
> > Also, a good example of why this might be bothering for us is a fact that
> > in the meantime the dpaa driver got XDP support and this patch hasn't been
> > updated to include mb setting in that driver.
> >
> something like
> init_xdp_buff(hard_start, headroom, len, frame_sz, rxq);
>
> would work for most of the drivers.
>

ack, agree. I will add init_xdp_buff() in v6.

Regards,
Lorenzo
Lorenzo Bianconi Dec. 8, 2020, 11:01 a.m. UTC | #4
> On Mon, 2020-12-07 at 17:32 +0100, Lorenzo Bianconi wrote:
> > Introduce xdp_shared_info data structure to contain info about
> > "non-linear" xdp frame. xdp_shared_info will alias skb_shared_info
> > allowing to keep most of the frags in the same cache-line.
> > Introduce some xdp_shared_info helpers aligned to skb_frag* ones
> >
>
> is there or will be a more general purpose use to this xdp_shared_info
> ? other than hosting frags ?

I do not have other use-cases at the moment other than multi-buff but in
theory it is possible I guess.
The reason we introduced it is to have most of the frags in the first
shared_info cache-line to avoid cache-misses.

>
> > Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
> > ---
> >  drivers/net/ethernet/marvell/mvneta.c | 62 +++++++++++++++--------
> > ----
> >  include/net/xdp.h                     | 52 ++++++++++++++++++++--
> >  2 files changed, 82 insertions(+), 32 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/marvell/mvneta.c
> > b/drivers/net/ethernet/marvell/mvneta.c
> > index 1e5b5c69685a..d635463609ad 100644
> > --- a/drivers/net/ethernet/marvell/mvneta.c
> > +++ b/drivers/net/ethernet/marvell/mvneta.c
> > @@ -2033,14 +2033,17 @@ int mvneta_rx_refill_queue(struct mvneta_port
> > *pp, struct mvneta_rx_queue *rxq)
> >
>
> [...]
>
> >  static void
> > @@ -2278,7 +2281,7 @@ mvneta_swbm_add_rx_fragment(struct mvneta_port
> > *pp,
> >  			    struct mvneta_rx_desc *rx_desc,
> >  			    struct mvneta_rx_queue *rxq,
> >  			    struct xdp_buff *xdp, int *size,
> > -			    struct skb_shared_info *xdp_sinfo,
> > +			    struct xdp_shared_info *xdp_sinfo,
> >  			    struct page *page)
> >  {
> >  	struct net_device *dev = pp->dev;
> > @@ -2301,13 +2304,13 @@ mvneta_swbm_add_rx_fragment(struct
> > mvneta_port *pp,
> >  	if (data_len > 0 && xdp_sinfo->nr_frags < MAX_SKB_FRAGS) {
> >  		skb_frag_t *frag = &xdp_sinfo->frags[xdp_sinfo-
> > >nr_frags++];
> >
> > -		skb_frag_off_set(frag, pp->rx_offset_correction);
> > -		skb_frag_size_set(frag, data_len);
> > -		__skb_frag_set_page(frag, page);
> > +		xdp_set_frag_offset(frag, pp->rx_offset_correction);
> > +		xdp_set_frag_size(frag, data_len);
> > +		xdp_set_frag_page(frag, page);
> >
>
> why three separate setters ? why not just one
> xdp_set_frag(page, offset, size) ?

to be aligned with skb_frags helpers, but I guess we can have a single helper,
I do not have a strong opinion on it

>
> >  		/* last fragment */
> >  		if (len == *size) {
> > -			struct skb_shared_info *sinfo;
> > +			struct xdp_shared_info *sinfo;
> >
> >  			sinfo = xdp_get_shared_info_from_buff(xdp);
> >  			sinfo->nr_frags = xdp_sinfo->nr_frags;
> > @@ -2324,10 +2327,13 @@ static struct sk_buff *
> >  mvneta_swbm_build_skb(struct mvneta_port *pp, struct mvneta_rx_queue
> > *rxq,
> >  		      struct xdp_buff *xdp, u32 desc_status)
> >  {

[...]

> >
> > -static inline struct skb_shared_info *
> > +struct xdp_shared_info {
>
> xdp_shared_info is a bad name, we need this to have a specific purpose
> xdp_frags should the proper name, so people will think twice before
> adding weird bits to this so called shared_info.

I named the struct xdp_shared_info to recall skb_shared_info but I guess
xdp_frags is fine too. Agree?

> > +	u16 nr_frags;
> > +	u16 data_length; /* paged area length */
> > +	skb_frag_t frags[MAX_SKB_FRAGS];
>
> why MAX_SKB_FRAGS ? just use a flexible array member
> skb_frag_t frags[];
>
> and enforce size via the n_frags and on the construction of the
> tailroom preserved buffer, which is already being done.
>
> this is waste of unnecessary space, at lease by definition of the
> struct, in your use case you do:
> memcpy(frag_list, xdp_sinfo->frags, sizeof(skb_frag_t) * num_frags);
> And the tailroom space was already preserved for a full skb_shinfo.
> so i don't see why you need this array to be of a fixed MAX_SKB_FRAGS
> size.

In order to avoid cache-misses, xdp_shared_info is built as a variable
on the mvneta_rx_swbm() stack and it is written to the "shared_info" area
only on the last fragment in mvneta_swbm_add_rx_fragment(). I used
MAX_SKB_FRAGS to be aligned with the skb_shared_info struct, but probably
we can use even a smaller value. Another approach would be to define two
different structs, e.g.

struct xdp_frag_metadata {
	u16 nr_frags;
	u16 data_length; /* paged area length */
};

struct xdp_frags {
	skb_frag_t frags[MAX_SKB_FRAGS];
};

and then define xdp_shared_info as

struct xdp_shared_info {
	struct xdp_frag_metadata meta;
	skb_frag_t frags[];
};

In this way we can probably optimize the space. What do you think?

>
> > +};
> > +
> > +static inline struct xdp_shared_info *
> >  xdp_get_shared_info_from_buff(struct xdp_buff *xdp)
> >  {
> > -	return (struct skb_shared_info *)xdp_data_hard_end(xdp);
> > +	BUILD_BUG_ON(sizeof(struct xdp_shared_info) >
> > +		     sizeof(struct skb_shared_info));
> > +	return (struct xdp_shared_info *)xdp_data_hard_end(xdp);
> > +}
> > +
>
> Back to my first comment, do we have plans to use this tail room buffer
> for other than frag_list use cases ? what will be the buffer format
> then ? should we push all new fields to the end of the xdp_shared_info
> struct ? or deal with this tailroom buffer as a stack ?
> my main concern is that for drivers that don't support frag list and
> still want to utilize the tailroom buffer for other usecases they will
> have to skip the first sizeof(xdp_shared_info) so they won't break the
> stack.

for the moment I do not know if this area is used for other purposes.
Do you think there are other use-cases for it?

> > +static inline struct page *xdp_get_frag_page(const skb_frag_t *frag)
> > +{
> > +	return frag->bv_page;
> > +}
> > +
> > +static inline unsigned int xdp_get_frag_offset(const skb_frag_t
> > *frag)
> > +{
> > +	return frag->bv_offset;
> > +}
> > +
> > +static inline unsigned int xdp_get_frag_size(const skb_frag_t *frag)
> > +{
> > +	return frag->bv_len;
> > +}
> > +
> > +static inline void *xdp_get_frag_address(const skb_frag_t *frag)
> > +{
> > +	return page_address(xdp_get_frag_page(frag)) +
> > +	       xdp_get_frag_offset(frag);
> > +}
> > +
> > +static inline void xdp_set_frag_page(skb_frag_t *frag, struct page
> > *page)
> > +{
> > +	frag->bv_page = page;
> > +}
> > +
> > +static inline void xdp_set_frag_offset(skb_frag_t *frag, u32 offset)
> > +{
> > +	frag->bv_offset = offset;
> > +}
> > +
> > +static inline void xdp_set_frag_size(skb_frag_t *frag, u32 size)
> > +{
> > +	frag->bv_len = size;
> >  }
> >
> >  struct xdp_frame {
> > @@ -120,12 +164,12 @@ static __always_inline void
> > xdp_frame_bulk_init(struct xdp_frame_bulk *bq)
> >  	bq->xa = NULL;
> >  }
> >
> > -static inline struct skb_shared_info *
> > +static inline struct xdp_shared_info *
> >  xdp_get_shared_info_from_frame(struct xdp_frame *frame)
> >  {
> >  	void *data_hard_start = frame->data - frame->headroom -
> > sizeof(*frame);
> >
> > -	return (struct skb_shared_info *)(data_hard_start + frame-
> > >frame_sz -
> > +	return (struct xdp_shared_info *)(data_hard_start + frame-
> > >frame_sz -
> >  				SKB_DATA_ALIGN(sizeof(struct
> > skb_shared_info)));
> >  }
> >
>
> need a comment here why we preserve the size of skb_shared_info, yet
> the usable buffer is of type xdp_shared_info.

ack, I will add it in v6.

Regards,
Lorenzo

>
Jesper Dangaard Brouer Dec. 8, 2020, 1:29 p.m. UTC | #5
On Tue, 8 Dec 2020 11:31:03 +0100
Lorenzo Bianconi <lorenzo.bianconi@redhat.com> wrote:

> > On Mon, 2020-12-07 at 22:37 +0100, Maciej Fijalkowski wrote:
> > > On Mon, Dec 07, 2020 at 01:15:00PM -0800, Alexander Duyck wrote:
> > > > On Mon, Dec 7, 2020 at 8:36 AM Lorenzo Bianconi <lorenzo@kernel.org
> > > > > wrote:
> > > > > Initialize multi-buffer bit (mb) to 0 in all XDP-capable drivers.
> > > > > This is a preliminary patch to enable xdp multi-buffer support.
> > > > >
> > > > > Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
> > > >
> > > > I'm really not a fan of this design. Having to update every driver in
> > > > order to initialize a field that was fragmented is a pain. At a
> > > > minimum it seems like it might be time to consider introducing some
> > > > sort of initializer function for this so that you can update things in
> > > > one central place the next time you have to add a new field instead of
> > > > having to update every individual driver that supports XDP. Otherwise
> > > > this isn't going to scale going forward.

+1

> > > Also, a good example of why this might be bothering for us is a fact that
> > > in the meantime the dpaa driver got XDP support and this patch hasn't been
> > > updated to include mb setting in that driver.
> > >
> > something like
> > init_xdp_buff(hard_start, headroom, len, frame_sz, rxq);
> >
> > would work for most of the drivers.
> >
>
> ack, agree. I will add init_xdp_buff() in v6.


I do like the idea of an initialize helper function.
Remember this is fast-path code and likely need to be inlined.

Furthermore, remember that drivers can and do optimize the number of
writes they do to xdp_buff.   There are a number of fields in xdp_buff
that only need to be initialized once per NAPI.  E.g. rxq and frame_sz
(some driver do change frame_sz per packet).  Thus, you likely need two
inlined helpers for init.

Again, remember that C-compiler will generate an expensive operation
(rep stos) for clearing a struct if it is initialized like this, where
all member are not initialized (do NOT do this):

 struct xdp_buff xdp = {
   .rxq = rxq,
   .frame_sz = PAGE_SIZE,
 };

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
Shay Agroskin Dec. 19, 2020, 2:53 p.m. UTC | #6
Lorenzo Bianconi <lorenzo.bianconi@redhat.com> writes:

>> On Mon, 2020-12-07 at 17:32 +0100, Lorenzo Bianconi wrote:
>> > Introduce xdp_shared_info data structure to contain info about
>> > "non-linear" xdp frame. xdp_shared_info will alias skb_shared_info
>> > allowing to keep most of the frags in the same cache-line.
[...]
>>
>> > +	u16 nr_frags;
>> > +	u16 data_length; /* paged area length */
>> > +	skb_frag_t frags[MAX_SKB_FRAGS];
>>
>> why MAX_SKB_FRAGS ? just use a flexible array member
>> skb_frag_t frags[];
>>
>> and enforce size via the n_frags and on the construction of the
>> tailroom preserved buffer, which is already being done.
>>
>> this is waste of unnecessary space, at least by definition of the
>> struct, in your use case you do:
>> memcpy(frag_list, xdp_sinfo->frags, sizeof(skb_frag_t) * num_frags);
>> And the tailroom space was already preserved for a full skb_shinfo.
>> so i don't see why you need this array to be of a fixed
>> MAX_SKB_FRAGS size.
>
> In order to avoid cache-misses, xdp_shared_info is built as a variable
> on the mvneta_rx_swbm() stack and it is written to the "shared_info"
> area only on the last fragment in mvneta_swbm_add_rx_fragment(). I used
> MAX_SKB_FRAGS to be aligned with the skb_shared_info struct but
> probably we can use even a smaller value.
> Another approach would be to define two different structs, e.g.
>
> struct xdp_frag_metadata {
> 	u16 nr_frags;
> 	u16 data_length; /* paged area length */
> };
>
> struct xdp_frags {
> 	skb_frag_t frags[MAX_SKB_FRAGS];
> };
>
> and then define xdp_shared_info as
>
> struct xdp_shared_info {
> 	struct xdp_frag_metadata meta;
> 	skb_frag_t frags[];
> };
>
> In this way we can probably optimize the space. What do you think?


We're still reserving ~sizeof(skb_shared_info) bytes at the end of 
the first buffer and it seems like in mvneta code you keep 
updating all three fields (frags, nr_frags and data_length).
Can you explain how the space is optimized by splitting the 
structs please?

>>
>> > +};
>> > +
>> > +static inline struct xdp_shared_info *
>> >  xdp_get_shared_info_from_buff(struct xdp_buff *xdp)
>> >  {
>> > -	return (struct skb_shared_info *)xdp_data_hard_end(xdp);
>> > +	BUILD_BUG_ON(sizeof(struct xdp_shared_info) >
>> > +		     sizeof(struct skb_shared_info));
>> > +	return (struct xdp_shared_info *)xdp_data_hard_end(xdp);
>> > +}
>> > +
>>
>> Back to my first comment, do we have plans to use this tailroom
>> buffer for other than frag_list use cases ? what will be the buffer
>> format then ? should we push all new fields to the end of the
>> xdp_shared_info struct ? or deal with this tailroom buffer as a stack ?
>> my main concern is that for drivers that don't support frag list and
>> still want to utilize the tailroom buffer for other usecases they
>> will have to skip the first sizeof(xdp_shared_info) so they won't
>> break the stack.
>
> for the moment I do not know if this area is used for other purposes.
> Do you think there are other use-cases for it?
>


Saeed, the stack receives skb_shared_info when the frames are passed to
the stack (skb_add_rx_frag is used to add the whole information to the
skb's shared info), and for the XDP_REDIRECT use case it doesn't seem
like all drivers check the page's tailroom for more information anyway
(ena doesn't, at least).
Can you please explain what you mean by "break the stack"?

Thanks, Shay

>>
[...]
>
>>
Jamal Hadi Salim Dec. 19, 2020, 3:30 p.m. UTC | #7
On 2020-12-19 9:53 a.m., Shay Agroskin wrote:
> Lorenzo Bianconi <lorenzo.bianconi@redhat.com> writes:
>
>> for the moment I do not know if this area is used for other purposes.
>> Do you think there are other use-cases for it?

Sorry to interject:
Does it make sense to use it to store arbitrary metadata or a scratchpad
in this space? Something equivalent to skb->cb which is lacking in
XDP.

cheers,
jamal
Shay Agroskin Dec. 19, 2020, 3:56 p.m. UTC | #8
Lorenzo Bianconi <lorenzo@kernel.org> writes:

> Introduce the capability to map non-linear xdp buffer running
> mvneta_xdp_submit_frame() for XDP_TX and XDP_REDIRECT
>
> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
> ---
>  drivers/net/ethernet/marvell/mvneta.c | 94 ++++++++++++++++-----------
>  1 file changed, 56 insertions(+), 38 deletions(-)

[...]
>  			if (napi && buf->type == MVNETA_TYPE_XDP_TX)
>  				xdp_return_frame_rx_napi(buf->xdpf);
>  			else
> @@ -2054,45 +2054,64 @@ mvneta_xdp_put_buff(struct mvneta_port *pp, struct mvneta_rx_queue *rxq,
>  
>  static int
>  mvneta_xdp_submit_frame(struct mvneta_port *pp, struct mvneta_tx_queue *txq,
> -			struct xdp_frame *xdpf, bool dma_map)
> +			struct xdp_frame *xdpf, int *nxmit_byte, bool dma_map)
>  {
> -	struct mvneta_tx_desc *tx_desc;
> -	struct mvneta_tx_buf *buf;
> -	dma_addr_t dma_addr;
> +	struct xdp_shared_info *xdp_sinfo = xdp_get_shared_info_from_frame(xdpf);
> +	int i, num_frames = xdpf->mb ? xdp_sinfo->nr_frags + 1 : 1;
> +	struct mvneta_tx_desc *tx_desc = NULL;
> +	struct page *page;
>  
> -	if (txq->count >= txq->tx_stop_threshold)
> +	if (txq->count + num_frames >= txq->size)
>  		return MVNETA_XDP_DROPPED;
>  
> -	tx_desc = mvneta_txq_next_desc_get(txq);
> +	for (i = 0; i < num_frames; i++) {
> +		struct mvneta_tx_buf *buf = &txq->buf[txq->txq_put_index];
> +		skb_frag_t *frag = i ? &xdp_sinfo->frags[i - 1] : NULL;
> +		int len = frag ? xdp_get_frag_size(frag) : xdpf->len;

nit, from a branch prediction point of view, maybe it would be better
to write
     int len = i ? xdp_get_frag_size(frag) : xdpf->len;

since the value of i is checked one line above.
Disclaimer: I'm far from a compiler expert, and don't know whether the
compiler would know to group these two assignments together into a
single branch prediction decision, but it feels like using 'i' would
make this decision easier for it.

Thanks,
Shay

[...]
Shay Agroskin Dec. 19, 2020, 5:46 p.m. UTC | #9
Lorenzo Bianconi <lorenzo@kernel.org> writes:

> Introduce __xdp_build_skb_from_frame and xdp_build_skb_from_frame
> utility routines to build the skb from xdp_frame.
> Add xdp multi-buff support to cpumap
>
> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
> ---
>  include/net/xdp.h   |  5 ++++
>  kernel/bpf/cpumap.c | 45 +---------------------------
>  net/core/xdp.c      | 73 +++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 79 insertions(+), 44 deletions(-)
>

[...]
> diff --git a/net/core/xdp.c b/net/core/xdp.c
> index 6c8e743ad03a..55f3e9c69427 100644
> --- a/net/core/xdp.c
> +++ b/net/core/xdp.c
> @@ -597,3 +597,76 @@ void xdp_warn(const char *msg, const char *func, const int line)
>  	WARN(1, "XDP_WARN: %s(line:%d): %s\n", func, line, msg);
>  };
>  EXPORT_SYMBOL_GPL(xdp_warn);
> +
> +struct sk_buff *__xdp_build_skb_from_frame(struct xdp_frame *xdpf,
> +					   struct sk_buff *skb,
> +					   struct net_device *dev)
> +{
> +	unsigned int headroom = sizeof(*xdpf) + xdpf->headroom;
> +	void *hard_start = xdpf->data - headroom;
> +	skb_frag_t frag_list[MAX_SKB_FRAGS];
> +	struct xdp_shared_info *xdp_sinfo;
> +	int i, num_frags = 0;
> +
> +	xdp_sinfo = xdp_get_shared_info_from_frame(xdpf);
> +	if (unlikely(xdpf->mb)) {
> +		num_frags = xdp_sinfo->nr_frags;
> +		memcpy(frag_list, xdp_sinfo->frags,
> +		       sizeof(skb_frag_t) * num_frags);
> +	}

nit, can you please move the xdp_sinfo assignment inside this 'if'?
This would help to emphasize that regarding the xdp_frame tailroom as
an xdp_shared_info struct (rather than a skb_shared_info) is correct
only when the mb bit is set.

thanks,
Shay

> +
> +	skb = build_skb_around(skb, hard_start, xdpf->frame_sz);
> +	if (unlikely(!skb))
> +		return NULL;

[...]
Lorenzo Bianconi Dec. 20, 2020, 5:52 p.m. UTC | #10
>
> Lorenzo Bianconi <lorenzo.bianconi@redhat.com> writes:
>
> >> On Mon, 2020-12-07 at 17:32 +0100, Lorenzo Bianconi wrote:
> >> > Introduce xdp_shared_info data structure to contain info about
> >> > "non-linear" xdp frame. xdp_shared_info will alias skb_shared_info
> >> > allowing to keep most of the frags in the same cache-line.
> [...]
> >>
> >> > +  u16 nr_frags;
> >> > +  u16 data_length; /* paged area length */
> >> > +  skb_frag_t frags[MAX_SKB_FRAGS];
> >>
> >> why MAX_SKB_FRAGS ? just use a flexible array member
> >> skb_frag_t frags[];
> >>
> >> and enforce size via the n_frags and on the construction of the
> >> tailroom preserved buffer, which is already being done.
> >>
> >> this is waste of unnecessary space, at least by definition of the
> >> struct, in your use case you do:
> >> memcpy(frag_list, xdp_sinfo->frags, sizeof(skb_frag_t) * num_frags);
> >> And the tailroom space was already preserved for a full skb_shinfo.
> >> so i don't see why you need this array to be of a fixed
> >> MAX_SKB_FRAGS size.
> >
> > In order to avoid cache-misses, xdp_shared_info is built as a
> > variable on the mvneta_rx_swbm() stack and it is written to the
> > "shared_info" area only on the last fragment in
> > mvneta_swbm_add_rx_fragment(). I used MAX_SKB_FRAGS to be aligned
> > with the skb_shared_info struct but probably we can use even a
> > smaller value.
> > Another approach would be to define two different structs, e.g.
> >
> > struct xdp_frag_metadata {
> >       u16 nr_frags;
> >       u16 data_length; /* paged area length */
> > };
> >
> > struct xdp_frags {
> >       skb_frag_t frags[MAX_SKB_FRAGS];
> > };
> >
> > and then define xdp_shared_info as
> >
> > struct xdp_shared_info {
> >       struct xdp_frag_metadata meta;
> >       skb_frag_t frags[];
> > };
> >
> > In this way we can probably optimize the space. What do you think?
>
> We're still reserving ~sizeof(skb_shared_info) bytes at the end of
> the first buffer and it seems like in mvneta code you keep updating
> all three fields (frags, nr_frags and data_length).
> Can you explain how the space is optimized by splitting the structs
> please?


using the xdp_shared_info struct we will have the first 3 fragments in
the same cache-line as nr_frags, while using the skb_shared_info struct
only the first fragment will be in the same cache-line as nr_frags.
Moreover, skb_shared_info has multiple fields that are unused by XDP.

Regards,
Lorenzo

>
> >>
> >> > +};
> >> > +
> >> > +static inline struct xdp_shared_info *
> >> >  xdp_get_shared_info_from_buff(struct xdp_buff *xdp)
> >> >  {
> >> > -  return (struct skb_shared_info *)xdp_data_hard_end(xdp);
> >> > +  BUILD_BUG_ON(sizeof(struct xdp_shared_info) >
> >> > +               sizeof(struct skb_shared_info));
> >> > +  return (struct xdp_shared_info *)xdp_data_hard_end(xdp);
> >> > +}
> >> > +
> >>
> >> Back to my first comment, do we have plans to use this tailroom
> >> buffer for other than frag_list use cases ? what will be the buffer
> >> format then ? should we push all new fields to the end of the
> >> xdp_shared_info struct ? or deal with this tailroom buffer as a
> >> stack ?
> >> my main concern is that for drivers that don't support frag list
> >> and still want to utilize the tailroom buffer for other usecases
> >> they will have to skip the first sizeof(xdp_shared_info) so they
> >> won't break the stack.
> >
> > for the moment I do not know if this area is used for other
> > purposes.
> > Do you think there are other use-cases for it?
> >
>
> Saeed, the stack receives skb_shared_info when the frames are passed
> to the stack (skb_add_rx_frag is used to add the whole information to
> the skb's shared info), and for the XDP_REDIRECT use case it doesn't
> seem like all drivers check the page's tailroom for more information
> anyway (ena doesn't, at least).
> Can you please explain what you mean by "break the stack"?
>
> Thanks, Shay
>
> >>
> [...]
> >
> >>
>
Lorenzo Bianconi Dec. 20, 2020, 5:56 p.m. UTC | #11
>
> Lorenzo Bianconi <lorenzo@kernel.org> writes:
>
> > Introduce __xdp_build_skb_from_frame and xdp_build_skb_from_frame
> > utility routines to build the skb from xdp_frame.
> > Add xdp multi-buff support to cpumap
> >
> > Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
> > ---
> >  include/net/xdp.h   |  5 ++++
> >  kernel/bpf/cpumap.c | 45 +---------------------------
> >  net/core/xdp.c      | 73 +++++++++++++++++++++++++++++++++++++++++++++
> >  3 files changed, 79 insertions(+), 44 deletions(-)
> >
> [...]
> > diff --git a/net/core/xdp.c b/net/core/xdp.c
> > index 6c8e743ad03a..55f3e9c69427 100644
> > --- a/net/core/xdp.c
> > +++ b/net/core/xdp.c
> > @@ -597,3 +597,76 @@ void xdp_warn(const char *msg, const char *func, const int line)
> >       WARN(1, "XDP_WARN: %s(line:%d): %s\n", func, line, msg);
> >  };
> >  EXPORT_SYMBOL_GPL(xdp_warn);
> > +
> > +struct sk_buff *__xdp_build_skb_from_frame(struct xdp_frame *xdpf,
> > +                                        struct sk_buff *skb,
> > +                                        struct net_device *dev)
> > +{
> > +     unsigned int headroom = sizeof(*xdpf) + xdpf->headroom;
> > +     void *hard_start = xdpf->data - headroom;
> > +     skb_frag_t frag_list[MAX_SKB_FRAGS];
> > +     struct xdp_shared_info *xdp_sinfo;
> > +     int i, num_frags = 0;
> > +
> > +     xdp_sinfo = xdp_get_shared_info_from_frame(xdpf);
> > +     if (unlikely(xdpf->mb)) {
> > +             num_frags = xdp_sinfo->nr_frags;
> > +             memcpy(frag_list, xdp_sinfo->frags,
> > +                    sizeof(skb_frag_t) * num_frags);
> > +     }
>
> nit, can you please move the xdp_sinfo assignment inside this 'if'?
> This would help to emphasize that regarding the xdp_frame tailroom as
> an xdp_shared_info struct (rather than a skb_shared_info) is correct
> only when the mb bit is set.
>
> thanks,
> Shay


ack, will do in v6.

Regards,
Lorenzo

>
> > +
> > +     skb = build_skb_around(skb, hard_start, xdpf->frame_sz);
> > +     if (unlikely(!skb))
> > +             return NULL;
> [...]
>
Lorenzo Bianconi Dec. 20, 2020, 6:06 p.m. UTC | #12
On Sat, Dec 19, 2020 at 4:56 PM Shay Agroskin <shayagr@amazon.com> wrote:
>
> Lorenzo Bianconi <lorenzo@kernel.org> writes:
>
> > Introduce the capability to map non-linear xdp buffer running
> > mvneta_xdp_submit_frame() for XDP_TX and XDP_REDIRECT
> >
> > Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
> > ---
> >  drivers/net/ethernet/marvell/mvneta.c | 94 ++++++++++++++++-----------
> >  1 file changed, 56 insertions(+), 38 deletions(-)
> [...]
> >                       if (napi && buf->type == MVNETA_TYPE_XDP_TX)
> >                               xdp_return_frame_rx_napi(buf->xdpf);
> >                       else
> > @@ -2054,45 +2054,64 @@ mvneta_xdp_put_buff(struct mvneta_port *pp, struct mvneta_rx_queue *rxq,
> >
> >  static int
> >  mvneta_xdp_submit_frame(struct mvneta_port *pp, struct mvneta_tx_queue *txq,
> > -                     struct xdp_frame *xdpf, bool dma_map)
> > +                     struct xdp_frame *xdpf, int *nxmit_byte, bool dma_map)
> >  {
> > -     struct mvneta_tx_desc *tx_desc;
> > -     struct mvneta_tx_buf *buf;
> > -     dma_addr_t dma_addr;
> > +     struct xdp_shared_info *xdp_sinfo = xdp_get_shared_info_from_frame(xdpf);
> > +     int i, num_frames = xdpf->mb ? xdp_sinfo->nr_frags + 1 : 1;
> > +     struct mvneta_tx_desc *tx_desc = NULL;
> > +     struct page *page;
> >
> > -     if (txq->count >= txq->tx_stop_threshold)
> > +     if (txq->count + num_frames >= txq->size)
> >               return MVNETA_XDP_DROPPED;
> >
> > -     tx_desc = mvneta_txq_next_desc_get(txq);
> > +     for (i = 0; i < num_frames; i++) {
> > +             struct mvneta_tx_buf *buf = &txq->buf[txq->txq_put_index];
> > +             skb_frag_t *frag = i ? &xdp_sinfo->frags[i - 1] : NULL;
> > +             int len = frag ? xdp_get_frag_size(frag) : xdpf->len;
>
> nit, from a branch prediction point of view, maybe it would be better
> to write
>      int len = i ? xdp_get_frag_size(frag) : xdpf->len;
>


ack, I will fix it in v6.

Regards,
Lorenzo

> since the value of i is checked one line above
>
> Disclaimer: I'm far from a compiler expert, and don't know whether
> the compiler would know to group these two assignments together into
> a single branch prediction decision, but it feels like using 'i'
> would make this decision easier for it.
>
> Thanks,
> Shay
>
> [...]
>
Jesper Dangaard Brouer Dec. 21, 2020, 9:01 a.m. UTC | #13
On Sat, 19 Dec 2020 10:30:57 -0500
Jamal Hadi Salim <jhs@mojatatu.com> wrote:

> On 2020-12-19 9:53 a.m., Shay Agroskin wrote:
> > Lorenzo Bianconi <lorenzo.bianconi@redhat.com> writes:
>
> >> for the moment I do not know if this area is used for other purposes.
> >> Do you think there are other use-cases for it?


Yes, all the same use-cases as the SKB has.  I wanted to keep this the
same as skb_shared_info, but Lorenzo chose to take John's advice and it
is going in this direction (which is fine, we can always change and
adjust this later).


> Sorry to interject:
> Does it make sense to use it to store arbitrary metadata or a scratchpad
> in this space? Something equivalent to skb->cb which is lacking in
> XDP.


Well, XDP has the data_meta area.  But it is difficult to rely on
because a lot of drivers don't implement it.  And Saeed and I plan to
use this area and populate it with driver info from the RX-descriptor.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
Jamal Hadi Salim Dec. 21, 2020, 1 p.m. UTC | #14
On 2020-12-21 4:01 a.m., Jesper Dangaard Brouer wrote:
> On Sat, 19 Dec 2020 10:30:57 -0500
>
> >> Sorry to interject:
> >> Does it make sense to use it to store arbitrary metadata or a
> >> scratchpad in this space? Something equivalent to skb->cb which is
> >> lacking in XDP.
>
> > Well, XDP has the data_meta area.  But it is difficult to rely on
> > because a lot of drivers don't implement it.  And Saeed and I plan
> > to use this area and populate it with driver info from the
> > RX-descriptor.
>


What I was thinking of is some scratchpad that I can write to within
an XDP prog (not the driver); for example, in a prog array map the
scratchpad is written by one program in the array and read by another
one later on. skb->cb allows for that. Unless you mean I can already
write to some XDP data_meta area?

cheers,
jamal
Shay Agroskin Dec. 21, 2020, 8:55 p.m. UTC | #15
Lorenzo Bianconi <lorenzo.bianconi@redhat.com> writes:

>>
>> Lorenzo Bianconi <lorenzo.bianconi@redhat.com> writes:
>>
>> >> On Mon, 2020-12-07 at 17:32 +0100, Lorenzo Bianconi wrote:
>> >> > Introduce xdp_shared_info data structure to contain info about
>> >> > "non-linear" xdp frame. xdp_shared_info will alias skb_shared_info
>> >> > allowing to keep most of the frags in the same cache-line.
>> [...]
>> >>
>> >> > +  u16 nr_frags;
>> >> > +  u16 data_length; /* paged area length */
>> >> > +  skb_frag_t frags[MAX_SKB_FRAGS];
>> >>
>> >> why MAX_SKB_FRAGS ? just use a flexible array member
>> >> skb_frag_t frags[];
>> >>
>> >> and enforce size via the n_frags and on the construction of the
>> >> tailroom preserved buffer, which is already being done.
>> >>
>> >> this is waste of unnecessary space, at least by definition of the
>> >> struct, in your use case you do:
>> >> memcpy(frag_list, xdp_sinfo->frags, sizeof(skb_frag_t) * num_frags);
>> >> And the tailroom space was already preserved for a full skb_shinfo.
>> >> so i don't see why you need this array to be of a fixed
>> >> MAX_SKB_FRAGS size.
>> >
>> > In order to avoid cache-misses, xdp_shared_info is built as a
>> > variable on the mvneta_rx_swbm() stack and it is written to the
>> > "shared_info" area only on the last fragment in
>> > mvneta_swbm_add_rx_fragment(). I used MAX_SKB_FRAGS to be aligned
>> > with the skb_shared_info struct but probably we can use even a
>> > smaller value.
>> > Another approach would be to define two different structs, e.g.
>> >
>> > struct xdp_frag_metadata {
>> >       u16 nr_frags;
>> >       u16 data_length; /* paged area length */
>> > };
>> >
>> > struct xdp_frags {
>> >       skb_frag_t frags[MAX_SKB_FRAGS];
>> > };
>> >
>> > and then define xdp_shared_info as
>> >
>> > struct xdp_shared_info {
>> >       struct xdp_frag_metadata meta;
>> >       skb_frag_t frags[];
>> > };
>> >
>> > In this way we can probably optimize the space. What do you think?
>>
>> We're still reserving ~sizeof(skb_shared_info) bytes at the end of
>> the first buffer and it seems like in mvneta code you keep updating
>> all three fields (frags, nr_frags and data_length).
>> Can you explain how the space is optimized by splitting the structs
>> please?
>
> using the xdp_shared_info struct we will have the first 3 fragments
> in the same cache-line as nr_frags, while using the skb_shared_info
> struct only the first fragment will be in the same cache-line as
> nr_frags. Moreover, skb_shared_info has multiple fields that are
> unused by XDP.
>
> Regards,
> Lorenzo
>


Thanks for your reply. I was actually referring to your suggestion 
to Saeed. Namely, defining

struct xdp_shared_info {
       struct xdp_frag_metadata meta;
       skb_frag_t frags[];
}

I don't see what benefits there are to this scheme compared to the 
original patch

Thanks,
Shay

>>
>> >>
>> >> > +};
>> >> > +
[...]
>>
>> Saeed, the stack receives skb_shared_info when the frames are passed
>> to the stack (skb_add_rx_frag is used to add the whole information
>> to the skb's shared info), and for the XDP_REDIRECT use case it
>> doesn't seem like all drivers check the page's tailroom for more
>> information anyway (ena doesn't, at least).
>> Can you please explain what you mean by "break the stack"?
>>
>> Thanks, Shay
>>
>> >>
>> [...]
>> >
>> >>