
[PATCHv14,bpf-next,0/6] xdp: add a new helper for dev map multicast support

Message ID 20210114142321.2594697-1-liuhangbin@gmail.com
Series: xdp: add a new helper for dev map multicast support


Hangbin Liu Jan. 14, 2021, 2:23 p.m. UTC
This patch set adds XDP multicast support, which has been discussed before[0].
The goal is to be able to implement an OVS-like data plane in XDP, i.e.,
a software switch that can forward XDP frames to multiple ports.

To achieve this, an application needs to specify a group of interfaces
to forward a packet to. It is also common to want to exclude one or more
physical interfaces from the forwarding operation - e.g., to forward a
packet to all interfaces in the multicast group except the interface it
arrived on. While this could be done simply by adding more groups, this
quickly leads to a combinatorial explosion in the number of groups an
application has to maintain.

To avoid the combinatorial explosion, we propose to include the ability
to specify an "exclude group" as part of the forwarding operation. This
needs to be a group (instead of just a single port index), because there
may be multiple interfaces you want to exclude.

Thus, the logical forwarding operation becomes a "set difference"
operation, i.e. "forward to all ports in group A that are not also in
group B". This series implements such an operation using device maps to
represent the groups. This means that the XDP program specifies two
device maps, one containing the list of netdevs to redirect to, and the
other containing the exclude list.
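The "set difference" semantics described above can be sketched as a small
userspace C model. This is only an illustration of the forwarding logic,
not the kernel implementation; the groups are modeled as plain arrays of
ifindex values, and all names (set_diff_forward, in_group) are invented
for the sketch:

```c
#include <stddef.h>
#include <stdbool.h>

/* Return true if ifindex is present in group[0..n). */
static bool in_group(const int *group, size_t n, int ifindex)
{
	for (size_t i = 0; i < n; i++)
		if (group[i] == ifindex)
			return true;
	return false;
}

/* "Forward to all ports in fwd that are not also in ex", optionally
 * skipping the ingress interface (modeling BPF_F_EXCLUDE_INGRESS).
 * Writes the resulting targets to out[] and returns their count. */
size_t set_diff_forward(const int *fwd, size_t n_fwd,
			const int *ex, size_t n_ex,
			int ingress_ifindex, int *out)
{
	size_t n_out = 0;

	for (size_t i = 0; i < n_fwd; i++) {
		int ifindex = fwd[i];

		if (ifindex == ingress_ifindex)	/* exclude-ingress flag */
			continue;
		if (in_group(ex, n_ex, ifindex))	/* exclude group */
			continue;
		out[n_out++] = ifindex;
	}
	return n_out;
}
```

With a forwarding group {1, 2, 3, 4}, exclude group {3}, and ingress
ifindex 2, the result is {1, 4}: exactly group A minus group B minus the
ingress device.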

To achieve this, this series adds a new helper, bpf_redirect_map_multi(),
which accepts two maps: the forwarding map and the exclude map. If users
don't want to use an exclude map and simply want to stop redirecting back
to the ingress device, they can use the flag BPF_F_EXCLUDE_INGRESS.

The 1st patch is Jesper's change to run the devmap xdp_prog later, in the
bulking step.
The 2nd patch adds a new bpf arg to allow a NULL map pointer.
The 3rd patch adds the new bpf_redirect_map_multi() helper.
Patches 4-6 add a usage sample and tests.

I ran perf tests with the following topology:

---------------------             ---------------------
| Host A (i40e 10G) |  ---------- | eno1(i40e 10G)    |
---------------------             |                   |
                                  |   Host B          |
---------------------             |                   |
| Host C (i40e 10G) |  ---------- | eno2(i40e 10G)    |
---------------------    vlan2    |          -------- |
                                  | veth1 -- | veth0| |
                                  |          -------- |
                                  ---------------------
On Host A:
# pktgen/pktgen_sample03_burst_single_flow.sh -i eno1 -d $dst_ip -m $dst_mac -s 64

On Host B(Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz, 128G Memory):
Use xdp_redirect_map and xdp_redirect_map_multi in samples/bpf for testing.
The veth0 in the netns loads a dummy drop program. The forward_map
max_entries in xdp_redirect_map_multi is modified to 4.

Here is the perf result with 5.10 rc6:

There is about +/- 0.1M deviation for the native tests.
Version             | Test                                    | Generic | Native | Native + 2nd
5.10 rc6            | xdp_redirect_map        i40e->i40e      |    2.0M |   9.1M |  8.0M
5.10 rc6            | xdp_redirect_map        i40e->veth      |    1.7M |  11.0M |  9.7M
5.10 rc6 + patch1   | xdp_redirect_map        i40e->i40e      |    2.0M |   9.5M |  7.5M
5.10 rc6 + patch1   | xdp_redirect_map        i40e->veth      |    1.7M |  11.6M |  9.1M
5.10 rc6 + patch1-6 | xdp_redirect_map        i40e->i40e      |    2.0M |   9.5M |  7.5M
5.10 rc6 + patch1-6 | xdp_redirect_map        i40e->veth      |    1.7M |  11.6M |  9.1M
5.10 rc6 + patch1-6 | xdp_redirect_map_multi  i40e->i40e      |    1.7M |   7.8M |  6.4M
5.10 rc6 + patch1-6 | xdp_redirect_map_multi  i40e->veth      |    1.4M |   9.3M |  7.5M
5.10 rc6 + patch1-6 | xdp_redirect_map_multi  i40e->i40e+veth |    1.0M |   3.2M |  2.7M

Last but not least, thanks a lot to Toke, Jesper, Jiri and Eelco for
suggestions and help on implementation.

[0] https://xdp-project.net/#Handling-multicast

v14:
No code update; just rebased onto the latest bpf-next.

v13:
Pass in xdp_prog through __xdp_enqueue() for patch 01. Update related
code in patch 03.

v12:
Add Jesper's xdp_prog patch and rebase my work on it and the latest bpf-next.
Add a 2nd xdp_prog test to the sample and selftests.

v11:
Fix bpf_redirect_map_multi() helper description typo.
Add loop limit for devmap_get_next_obj() and dev_map_redirect_multi().

v10:
Rebase the code to latest bpf-next.
Update helper bpf_xdp_redirect_map_multi()
- No need to check map pointer as we will do the check in verifier.

v9:
Update helper bpf_xdp_redirect_map_multi()
- Use ARG_CONST_MAP_PTR_OR_NULL for helper arg2

v8:
a) Update function dev_in_exclude_map():
   - remove duplicate ex_map map_type check in
   - lookup the element in dev map by obj dev index directly instead
     of looping all the map

v7:
a) Fix helper flag check
b) Limit the *ex_map* to use DEVMAP_HASH only and update function
   dev_in_exclude_map() to get better performance.

v6: converted helper return types from int to long

v5:
a) Check devmap_get_next_key() return value.
b) Pass through flags to __bpf_tx_xdp_map() instead of bool value.
c) In function dev_map_enqueue_multi(), consume xdpf for the last
   obj instead of the first one.
d) Update helper description and code comments to explain that we
   use NULL target value to distinguish multicast and unicast
   forwarding.
e) Update memory model, memory id and frame_sz in xdpf_clone().
f) Split the tests from sample and add a bpf kernel selftest patch.

v4: Fix bpf_xdp_redirect_map_multi_proto arg2_type typo

v3: Based on Toke's suggestion, do the following update
a) Update bpf_redirect_map_multi() description in bpf.h.
b) Fix exclude_ifindex checking order in dev_in_exclude_map().
c) Fix one more xdpf clone in dev_map_enqueue_multi().
d) Go to the next interface in dev_map_enqueue_multi() if an interface
   is not able to forward, instead of aborting the whole loop.
e) Remove READ_ONCE/WRITE_ONCE for ex_map.

v2: Add new syscall bpf_xdp_redirect_map_multi() which could accept
include/exclude maps directly.

Hangbin Liu (5):
  bpf: add a new bpf argument type ARG_CONST_MAP_PTR_OR_NULL
  xdp: add a new helper for dev map multicast support
  sample/bpf: add xdp_redirect_map_multicast test
  selftests/bpf: Add verifier tests for bpf arg
    ARG_CONST_MAP_PTR_OR_NULL
  selftests/bpf: add xdp_redirect_multi test

Jesper Dangaard Brouer (1):
  bpf: run devmap xdp_prog on flush instead of bulk enqueue

 include/linux/bpf.h                           |  21 ++
 include/linux/filter.h                        |   1 +
 include/net/xdp.h                             |   1 +
 include/uapi/linux/bpf.h                      |  27 ++
 kernel/bpf/devmap.c                           | 235 +++++++++++---
 kernel/bpf/verifier.c                         |  16 +-
 net/core/filter.c                             | 118 ++++++-
 net/core/xdp.c                                |  29 ++
 samples/bpf/Makefile                          |   3 +
 samples/bpf/xdp_redirect_map_multi_kern.c     |  96 ++++++
 samples/bpf/xdp_redirect_map_multi_user.c     | 301 ++++++++++++++++++
 tools/include/uapi/linux/bpf.h                |  27 ++
 tools/testing/selftests/bpf/Makefile          |   3 +-
 .../bpf/progs/xdp_redirect_multi_kern.c       | 120 +++++++
 tools/testing/selftests/bpf/test_verifier.c   |  22 +-
 .../selftests/bpf/test_xdp_redirect_multi.sh  | 208 ++++++++++++
 .../testing/selftests/bpf/verifier/map_ptr.c  |  70 ++++
 .../selftests/bpf/xdp_redirect_multi.c        | 252 +++++++++++++++
 18 files changed, 1502 insertions(+), 48 deletions(-)
 create mode 100644 samples/bpf/xdp_redirect_map_multi_kern.c
 create mode 100644 samples/bpf/xdp_redirect_map_multi_user.c
 create mode 100644 tools/testing/selftests/bpf/progs/xdp_redirect_multi_kern.c
 create mode 100755 tools/testing/selftests/bpf/test_xdp_redirect_multi.sh
 create mode 100644 tools/testing/selftests/bpf/xdp_redirect_multi.c

Comments

John Fastabend Jan. 17, 2021, 10:57 p.m. UTC | #1
Hangbin Liu wrote:
> From: Jesper Dangaard Brouer <brouer@redhat.com>

> 

> This changes the devmap XDP program support to run the program when the

> bulk queue is flushed instead of before the frame is enqueued. This has

> a couple of benefits:

> 

> - It "sorts" the packets by destination devmap entry, and then runs the

>   same BPF program on all the packets in sequence. This ensures that we

>   keep the XDP program and destination device properties hot in I-cache.

> 

> - It makes the multicast implementation simpler because it can just

>   enqueue packets using bq_enqueue() without having to deal with the

>   devmap program at all.

> 

> The drawback is that if the devmap program drops the packet, the enqueue

> step is redundant. However, arguably this is mostly visible in a

> micro-benchmark, and with more mixed traffic the I-cache benefit should

> win out. The performance impact of just this patch is as follows:

> 

> Using xdp_redirect_map(with a 2nd xdp_prog patch[1]) in sample/bpf and send

> pkts via pktgen cmd:

> ./pktgen_sample03_burst_single_flow.sh -i eno1 -d $dst_ip -m $dst_mac -t 10 -s 64

> 

> There are about +/- 0.1M deviation for native testing, the performance

> improved for the base-case, but some drop back with xdp devmap prog attached.

> 

> Version          | Test                           | Generic | Native | Native + 2nd xdp_prog

> 5.10 rc6         | xdp_redirect_map   i40e->i40e  |    2.0M |   9.1M |  8.0M

> 5.10 rc6         | xdp_redirect_map   i40e->veth  |    1.7M |  11.0M |  9.7M

> 5.10 rc6 + patch | xdp_redirect_map   i40e->i40e  |    2.0M |   9.5M |  7.5M

> 5.10 rc6 + patch | xdp_redirect_map   i40e->veth  |    1.7M |  11.6M |  9.1M

> 

> [1] https://patchwork.ozlabs.org/project/netdev/patch/20201208120159.2278277-1-liuhangbin@gmail.com/

> 

> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>

> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>

> 

> --

> v14: no update, only rebase the code

> v13: pass in xdp_prog through __xdp_enqueue()

> v2-v12: no this patch

> ---

>  kernel/bpf/devmap.c | 115 +++++++++++++++++++++++++++-----------------

>  1 file changed, 72 insertions(+), 43 deletions(-)

> 

> diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c

> index f6e9c68afdd4..84fe15950e44 100644

> --- a/kernel/bpf/devmap.c

> +++ b/kernel/bpf/devmap.c

> @@ -57,6 +57,7 @@ struct xdp_dev_bulk_queue {

>  	struct list_head flush_node;

>  	struct net_device *dev;

>  	struct net_device *dev_rx;

> +	struct bpf_prog *xdp_prog;

>  	unsigned int count;

>  };

>  

> @@ -327,40 +328,92 @@ bool dev_map_can_have_prog(struct bpf_map *map)

>  	return false;

>  }

>  

> +static int dev_map_bpf_prog_run(struct bpf_prog *xdp_prog,

> +				struct xdp_frame **frames, int n,

> +				struct net_device *dev)

> +{

> +	struct xdp_txq_info txq = { .dev = dev };

> +	struct xdp_buff xdp;

> +	int i, nframes = 0;

> +

> +	for (i = 0; i < n; i++) {

> +		struct xdp_frame *xdpf = frames[i];

> +		u32 act;

> +		int err;

> +

> +		xdp_convert_frame_to_buff(xdpf, &xdp);


Hi, a slightly higher-level question about the design: how come we have
to bounce the frame back and forth between xdp_buff <-> xdp_frame?
Seems a bit wasteful.

> +		xdp.txq = &txq;

> +

> +		act = bpf_prog_run_xdp(xdp_prog, &xdp);

> +		switch (act) {

> +		case XDP_PASS:

> +			err = xdp_update_frame_from_buff(&xdp, xdpf);


xdp_update_frame_from_buff will then convert it back from the xdp_buff?

struct xdp_buff {
	void *data;
	void *data_end;
	void *data_meta;
	void *data_hard_start;
	struct xdp_rxq_info *rxq;
	struct xdp_txq_info *txq;
	u32 frame_sz; /* frame size to deduce data_hard_end/reserved tailroom*/
};

struct xdp_frame {
	void *data;
	u16 len;
	u16 headroom;
	u32 metasize:8;
	u32 frame_sz:24;
	/* Lifetime of xdp_rxq_info is limited to NAPI/enqueue time,
	 * while mem info is valid on remote CPU.
	 */
	struct xdp_mem_info mem;
	struct net_device *dev_rx; /* used by cpumap */
};


It looks like we could embed xdp_buff in xdp_frame and then keep the metadata
at the end.

Because you are working performance here wdyt? <- @Jesper as well.


> +			if (unlikely(err < 0))

> +				xdp_return_frame_rx_napi(xdpf);

> +			else

> +				frames[nframes++] = xdpf;

> +			break;

> +		default:

> +			bpf_warn_invalid_xdp_action(act);

> +			fallthrough;

> +		case XDP_ABORTED:

> +			trace_xdp_exception(dev, xdp_prog, act);

> +			fallthrough;

> +		case XDP_DROP:

> +			xdp_return_frame_rx_napi(xdpf);

> +			break;

> +		}

> +	}

> +	return n - nframes; /* dropped frames count */

> +}
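The loop in dev_map_bpf_prog_run() above compacts the surviving frames to
the front of the same array (frames[nframes++] = xdpf) so the caller can
xmit a contiguous prefix, and returns the dropped count as n - nframes.
A minimal runnable model of that in-place compaction pattern, with a
trivial keep-predicate standing in for the XDP verdict (names are
illustrative, not the kernel API):

```c
/* Compact "passed" entries to the front of frames[], in order, and
 * return how many entries were dropped, mirroring the patch's
 * `return n - nframes` contract. */
int compact_kept(int *frames, int n)
{
	int nframes = 0;

	for (int i = 0; i < n; i++) {
		int verdict_pass = (frames[i] % 2 == 0); /* stand-in for XDP_PASS */

		if (verdict_pass)
			frames[nframes++] = frames[i];
		/* a dropped frame would be returned to the allocator here */
	}
	return n - nframes; /* dropped count */
}
```

The caller then only looks at the first n - dropped entries, which stay
in their original relative order.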

> +

>  static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)

>  {

>  	struct net_device *dev = bq->dev;

>  	int sent = 0, drops = 0, err = 0;

> +	unsigned int cnt = bq->count;

> +	unsigned int xdp_drop;

>  	int i;

>  

> -	if (unlikely(!bq->count))

> +	if (unlikely(!cnt))

>  		return;

>  

> -	for (i = 0; i < bq->count; i++) {

> +	for (i = 0; i < cnt; i++) {

>  		struct xdp_frame *xdpf = bq->q[i];

>  

>  		prefetch(xdpf);

>  	}

>  

> -	sent = dev->netdev_ops->ndo_xdp_xmit(dev, bq->count, bq->q, flags);

> +	if (unlikely(bq->xdp_prog)) {


What's the rationale for making the above unlikely()? It seems that for
users it's not unlikely. Can you measure a performance increase/decrease
here? I think it's probably fine to just let the compiler/prefetcher do
its thing here. Or maybe I'm not reading this right, but it seems users
of bq->xdp_prog would disagree with the unlikely case?

Either way a comment might be nice to give us some insight in 6 months
why we decided this is unlikely.
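For context on what is being debated: the kernel's likely()/unlikely()
macros are only branch-layout hints to the compiler via
__builtin_expect() and never change the result of the condition, so the
question above is purely about codegen, not correctness. A userspace
sketch of the macros (the function name is invented for the example):

```c
/* Userspace equivalents of the kernel's hint macros. The double
 * negation normalizes any truthy value to 1 before the hint. */
#define likely(x)	__builtin_expect(!!(x), 1)
#define unlikely(x)	__builtin_expect(!!(x), 0)

int count_nonnull(void *const *ptrs, int n)
{
	int hits = 0;

	for (int i = 0; i < n; i++) {
		/* Same semantics as `if (ptrs[i])`; only the generated
		 * branch layout may differ. */
		if (unlikely(ptrs[i]))
			hits++;
	}
	return hits;
}
```

A wrong hint costs a mispredicted-branch-friendly code layout on the hot
path, which is why reviewers push back when the hinted case is actually
common for real users.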

> +		xdp_drop = dev_map_bpf_prog_run(bq->xdp_prog, bq->q, cnt, dev);

> +		cnt -= xdp_drop;

> +		if (!cnt) {



if dev_map_bpf_prog_run() returned sent packets this would read better
imo.

  sent = dev_map_bpf_prog_run(...)
  if (!sent)
        goto out;

> +			sent = 0;

> +			drops = xdp_drop;

> +			goto out;

> +		}

> +	}

> +

> +	sent = dev->netdev_ops->ndo_xdp_xmit(dev, cnt, bq->q, flags);


And,    sent = dev->netdev_ops->ndo_xdp_xmit(dev, sent, bq->q, flags);

>  	if (sent < 0) {

>  		err = sent;

>  		sent = 0;

>  		goto error;

>  	}

> -	drops = bq->count - sent;

> +	drops = (cnt - sent) + xdp_drop;


With the above 'sent' logic, drops will still be just drops = bq->count - sent;
move the calculation below the out label and I think you can clean up the
above as well. Did I miss something?

>  out:

>  	bq->count = 0;

>  

>  	trace_xdp_devmap_xmit(bq->dev_rx, dev, sent, drops, err);

>  	bq->dev_rx = NULL;

> +	bq->xdp_prog = NULL;

>  	__list_del_clearprev(&bq->flush_node);

>  	return;

>  error:

>  	/* If ndo_xdp_xmit fails with an errno, no frames have been

>  	 * xmit'ed and it's our responsibility to them free all.

>  	 */

> -	for (i = 0; i < bq->count; i++) {

> +	for (i = 0; i < cnt; i++) {

>  		struct xdp_frame *xdpf = bq->q[i];


Patch looks overall good to me, but cleaning up the logic a bit seems like
a plus.

Thanks,
John
John Fastabend Jan. 18, 2021, 12:10 a.m. UTC | #2
Hangbin Liu wrote:
> This patch is for xdp multicast support. which has been discussed
> before[0], The goal is to be able to implement an OVS-like data plane in
> XDP, i.e., a software switch that can forward XDP frames to multiple ports.
> 
> To achieve this, an application needs to specify a group of interfaces
> to forward a packet to. It is also common to want to exclude one or more
> physical interfaces from the forwarding operation - e.g., to forward a
> packet to all interfaces in the multicast group except the interface it
> arrived on. While this could be done simply by adding more groups, this
> quickly leads to a combinatorial explosion in the number of groups an
> application has to maintain.
> 
> To avoid the combinatorial explosion, we propose to include the ability
> to specify an "exclude group" as part of the forwarding operation. This
> needs to be a group (instead of just a single port index), because a
> physical interface can be part of a logical grouping, such as a bond
> device.
> 
> Thus, the logical forwarding operation becomes a "set difference"
> operation, i.e. "forward to all ports in group A that are not also in
> group B". This series implements such an operation using device maps to
> represent the groups. This means that the XDP program specifies two
> device maps, one containing the list of netdevs to redirect to, and the
> other containing the exclude list.
> 
> To achieve this, I re-implement a new helper bpf_redirect_map_multi()
> to accept two maps, the forwarding map and exclude map. The forwarding
> map could be DEVMAP or DEVMAP_HASH, but the exclude map *must* be
> DEVMAP_HASH to get better performace. If user don't want to use exclude
> map and just want simply stop redirecting back to ingress device, they
> can use flag BPF_F_EXCLUDE_INGRESS.
> 
> As both bpf_xdp_redirect_map() and this new helpers are using struct
> bpf_redirect_info, I add a new ex_map and set tgt_value to NULL in the
> new helper to make a difference with bpf_xdp_redirect_map().
> 
> Also I keep the general data path in net/core/filter.c, the native data
> path in kernel/bpf/devmap.c so we can use direct calls to get better
> performace.

[...]

> diff --git a/include/net/xdp.h b/include/net/xdp.h
> index 0cf3976ce77c..0e6468cd0ab9 100644
> --- a/include/net/xdp.h
> +++ b/include/net/xdp.h
> @@ -164,6 +164,7 @@ void xdp_warn(const char *msg, const char *func, const int line);
>  #define XDP_WARN(msg) xdp_warn(msg, __func__, __LINE__)
>  
>  struct xdp_frame *xdp_convert_zc_to_xdp_frame(struct xdp_buff *xdp);
> +struct xdp_frame *xdpf_clone(struct xdp_frame *xdpf);
>  
>  static inline
>  void xdp_convert_frame_to_buff(struct xdp_frame *frame, struct xdp_buff *xdp)
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index a1ad32456f89..ecf5d117b96a 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -3830,6 +3830,27 @@ union bpf_attr {
>   *	Return
>   *		A pointer to a struct socket on success or NULL if the file is
>   *		not a socket.
> + *
> + * long bpf_redirect_map_multi(struct bpf_map *map, struct bpf_map *ex_map, u64 flags)
> + * 	Description
> + * 		This is a multicast implementation for XDP redirect. It will
> + * 		redirect the packet to ALL the interfaces in *map*, but
> + * 		exclude the interfaces in *ex_map*.
> + *
> + * 		The forwarding *map* could be either BPF_MAP_TYPE_DEVMAP or
> + * 		BPF_MAP_TYPE_DEVMAP_HASH. But the *ex_map* must be
> + * 		BPF_MAP_TYPE_DEVMAP_HASH to get better performance.

Would be good to add a note that ex_map _must_ be keyed by ifindex for the
helper to work. It's the obvious way to key a hashmap, but not required,
iirc.

> + *
> + * 		Currently the *flags* only supports *BPF_F_EXCLUDE_INGRESS*,
> + * 		which additionally excludes the current ingress device.
> + *
> + * 		See also bpf_redirect_map() as a unicast implementation,
> + * 		which supports redirecting packet to a specific ifindex
> + * 		in the map. As both helpers use struct bpf_redirect_info
> + * 		to store the redirect info, we will use a a NULL tgt_value
> + * 		to distinguish multicast and unicast redirecting.
> + * 	Return
> + * 		**XDP_REDIRECT** on success, or **XDP_ABORTED** on error.
>   */

[...]

> +
> +int dev_map_enqueue_multi(struct xdp_buff *xdp, struct net_device *dev_rx,
> +			  struct bpf_map *map, struct bpf_map *ex_map,
> +			  u32 flags)
> +{
> +	struct bpf_dtab_netdev *obj = NULL, *next_obj = NULL;
> +	struct xdp_frame *xdpf, *nxdpf;
> +	bool last_one = false;
> +	int ex_ifindex;
> +	u32 key, next_key;
> +
> +	ex_ifindex = flags & BPF_F_EXCLUDE_INGRESS ? dev_rx->ifindex : 0;
> +
> +	/* Find first available obj */
> +	obj = devmap_get_next_obj(xdp, map, ex_map, NULL, &key, ex_ifindex);
> +	if (!obj)
> +		return 0;
> +
> +	xdpf = xdp_convert_buff_to_frame(xdp);
> +	if (unlikely(!xdpf))
> +		return -EOVERFLOW;
> +
> +	for (;;) {
> +		/* Check if we still have one more available obj */
> +		next_obj = devmap_get_next_obj(xdp, map, ex_map, &key,
> +					       &next_key, ex_ifindex);
> +		if (!next_obj)
> +			last_one = true;
> +
> +		if (last_one) {
> +			bq_enqueue(obj->dev, xdpf, dev_rx, obj->xdp_prog);
> +			return 0;
> +		}

Just collapse above to

  if (!next_obj) {
        bq_enqueue()
        return
  }

'last_one' is a bit pointless here.
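The collapsed shape John suggests can be modeled in runnable userspace C:
look ahead for the next target, clone the frame for every target except
the last, and let the last target consume the original frame. This is
only a sketch of the control flow; the struct, counters, and mock
bq_enqueue() are invented for the example and are not the kernel API:

```c
struct frame { int data; int is_clone; };

static int enqueued;	/* frames handed to the (mock) bulk queue */
static int clones;	/* how many of them were clones */

static void mock_bq_enqueue(const struct frame *f)
{
	enqueued++;
	if (f->is_clone)
		clones++;
}

int multicast_enqueue(struct frame *orig, int n_targets)
{
	for (int i = 0; i < n_targets; i++) {
		if (i == n_targets - 1) {
			/* no next target: the last one consumes the
			 * original frame, no 'last_one' flag needed */
			mock_bq_enqueue(orig);
			return 0;
		}

		struct frame clone = *orig;	/* stand-in for xdpf_clone() */
		clone.is_clone = 1;
		mock_bq_enqueue(&clone);
	}
	return 0;	/* zero targets: nothing to enqueue */
}
```

For n targets this enqueues n frames, of which n - 1 are clones, which is
the minimum number of copies for multicast delivery.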

> +
> +		nxdpf = xdpf_clone(xdpf);
> +		if (unlikely(!nxdpf)) {
> +			xdp_return_frame_rx_napi(xdpf);
> +			return -ENOMEM;
> +		}
> +
> +		bq_enqueue(obj->dev, nxdpf, dev_rx, obj->xdp_prog);
> +
> +		/* Deal with next obj */
> +		obj = next_obj;
> +		key = next_key;
> +	}
> +}
> +
>  int dev_map_generic_redirect(struct bpf_dtab_netdev *dst, struct sk_buff *skb,
>  			     struct bpf_prog *xdp_prog)
>  {
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 3e4b5d9fce78..2139398057cf 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -4420,6 +4420,7 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
>  	case BPF_MAP_TYPE_DEVMAP:
>  	case BPF_MAP_TYPE_DEVMAP_HASH:
>  		if (func_id != BPF_FUNC_redirect_map &&
> +		    func_id != BPF_FUNC_redirect_map_multi &&
>  		    func_id != BPF_FUNC_map_lookup_elem)
>  			goto error;
>  		break;
> @@ -4524,6 +4525,11 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
>  		    map->map_type != BPF_MAP_TYPE_XSKMAP)
>  			goto error;
>  		break;
> +	case BPF_FUNC_redirect_map_multi:
> +		if (map->map_type != BPF_MAP_TYPE_DEVMAP &&
> +		    map->map_type != BPF_MAP_TYPE_DEVMAP_HASH)
> +			goto error;
> +		break;
>  	case BPF_FUNC_sk_redirect_map:
>  	case BPF_FUNC_msg_redirect_map:
>  	case BPF_FUNC_sock_map_update:
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 9ab94e90d660..123efaf4ab88 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -3924,12 +3924,19 @@ static const struct bpf_func_proto bpf_xdp_adjust_meta_proto = {
>  };
>  
>  static int __bpf_tx_xdp_map(struct net_device *dev_rx, void *fwd,
> -			    struct bpf_map *map, struct xdp_buff *xdp)
> +			    struct bpf_map *map, struct xdp_buff *xdp,
> +			    struct bpf_map *ex_map, u32 flags)
>  {
>  	switch (map->map_type) {
>  	case BPF_MAP_TYPE_DEVMAP:
>  	case BPF_MAP_TYPE_DEVMAP_HASH:
> -		return dev_map_enqueue(fwd, xdp, dev_rx);
> +		/* We use a NULL fwd value to distinguish multicast
> +		 * and unicast forwarding
> +		 */
> +		if (fwd)
> +			return dev_map_enqueue(fwd, xdp, dev_rx);
> +		else
> +			return dev_map_enqueue_multi(xdp, dev_rx, map, ex_map, flags);
>  	case BPF_MAP_TYPE_CPUMAP:
>  		return cpu_map_enqueue(fwd, xdp, dev_rx);
>  	case BPF_MAP_TYPE_XSKMAP:
> @@ -3986,12 +3993,14 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
>  {
>  	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
>  	struct bpf_map *map = READ_ONCE(ri->map);
> +	struct bpf_map *ex_map = ri->ex_map;

READ_ONCE(ri->ex_map)?

>  	u32 index = ri->tgt_index;
>  	void *fwd = ri->tgt_value;
>  	int err;
>  
>  	ri->tgt_index = 0;
>  	ri->tgt_value = NULL;
> +	ri->ex_map = NULL;

WRITE_ONCE(ri->ex_map)?

>  	WRITE_ONCE(ri->map, NULL);

So we needed write_once, read_once pairs for ri->map do we also need them in
the ex_map case?

>  
>  	if (unlikely(!map)) {
> @@ -4003,7 +4012,7 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
>  
>  		err = dev_xdp_enqueue(fwd, xdp, dev);
>  	} else {
> -		err = __bpf_tx_xdp_map(dev, fwd, map, xdp);
> +		err = __bpf_tx_xdp_map(dev, fwd, map, xdp, ex_map, ri->flags);
>  	}
>  
>  	if (unlikely(err))
> @@ -4017,6 +4026,62 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
>  }
>  EXPORT_SYMBOL_GPL(xdp_do_redirect);

[...]

> +BPF_CALL_3(bpf_xdp_redirect_map_multi, struct bpf_map *, map,
> +	   struct bpf_map *, ex_map, u64, flags)
> +{
> +	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
> +
> +	/* Limit ex_map type to DEVMAP_HASH to get better performance */
> +	if (unlikely((ex_map && ex_map->map_type != BPF_MAP_TYPE_DEVMAP_HASH) ||
> +		     flags & ~BPF_F_EXCLUDE_INGRESS))
> +		return XDP_ABORTED;
> +
> +	ri->tgt_index = 0;
> +	/* Set the tgt_value to NULL to distinguish with bpf_xdp_redirect_map */
> +	ri->tgt_value = NULL;
> +	ri->flags = flags;
> +	ri->ex_map = ex_map;

WRITE_ONCE?

> +
> +	WRITE_ONCE(ri->map, map);
> +
> +	return XDP_REDIRECT;
> +}
> +
> +static const struct bpf_func_proto bpf_xdp_redirect_map_multi_proto = {
> +	.func           = bpf_xdp_redirect_map_multi,
> +	.gpl_only       = false,
> +	.ret_type       = RET_INTEGER,
> +	.arg1_type      = ARG_CONST_MAP_PTR,
> +	.arg2_type      = ARG_CONST_MAP_PTR_OR_NULL,
> +	.arg3_type      = ARG_ANYTHING,
> +};
> +

Thanks,
John
Hangbin Liu Jan. 18, 2021, 8:44 a.m. UTC | #3
Hi John,

Thanks for the reviewing.

On Sun, Jan 17, 2021 at 04:10:40PM -0800, John Fastabend wrote:
> > + * 		The forwarding *map* could be either BPF_MAP_TYPE_DEVMAP or

> > + * 		BPF_MAP_TYPE_DEVMAP_HASH. But the *ex_map* must be

> > + * 		BPF_MAP_TYPE_DEVMAP_HASH to get better performance.

> 

> Would be good to add a note ex_map _must_ be keyed by ifindex for the

> helper to work. Its the obvious way to key a hashmap, but not required

> iirc.


OK, I will.
> > +		if (!next_obj)

> > +			last_one = true;

> > +

> > +		if (last_one) {

> > +			bq_enqueue(obj->dev, xdpf, dev_rx, obj->xdp_prog);

> > +			return 0;

> > +		}

> 

> Just collapse above to

> 

>   if (!next_obj) {

>         bq_enqueue()

>         return

>   }

> 

> 'last_one' is a bit pointless here.


Yes, thanks.

> > @@ -3986,12 +3993,14 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,

> >  {

> >  	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);

> >  	struct bpf_map *map = READ_ONCE(ri->map);

> > +	struct bpf_map *ex_map = ri->ex_map;

> 

> READ_ONCE(ri->ex_map)?

> 

> >  	u32 index = ri->tgt_index;

> >  	void *fwd = ri->tgt_value;

> >  	int err;

> >  

> >  	ri->tgt_index = 0;

> >  	ri->tgt_value = NULL;

> > +	ri->ex_map = NULL;

> 

> WRITE_ONCE(ri->ex_map)?

> 

> >  	WRITE_ONCE(ri->map, NULL);

> 

> So we needed write_once, read_once pairs for ri->map do we also need them in

> the ex_map case?


Toke said there is no need for READ_ONCE/WRITE_ONCE here, as there is
already one on ri->map.

https://lore.kernel.org/bpf/87r1wd2bqu.fsf@toke.dk/

Thanks
Hangbin
Hangbin Liu Jan. 18, 2021, 10:07 a.m. UTC | #4
On Sun, Jan 17, 2021 at 02:57:02PM -0800, John Fastabend wrote:
[...]
> It looks like we could embed xdp_buff in xdp_frame and then keep the metadata

> at the end.

> 

> Because you are working performance here wdyt? <- @Jesper as well.


Leave this question to Jesper.

> >  

> > -	sent = dev->netdev_ops->ndo_xdp_xmit(dev, bq->count, bq->q, flags);

> > +	if (unlikely(bq->xdp_prog)) {

> 

> Whats the rational for making above unlikely()? Seems for users its not

> unlikely. Can you measure a performance increase/decrease here? I think

> its probably fine to just let compiler/prefetcher do its thing here. Or

> I'm not reading this right, but seems users of bq->xdp_prog would disagree

> on unlikely case?

> 

> Either way a comment might be nice to give us some insight in 6 months

> why we decided this is unlikely.


I agree that there is no need to use unlikely() here.
> 

> > +		xdp_drop = dev_map_bpf_prog_run(bq->xdp_prog, bq->q, cnt, dev);

> > +		cnt -= xdp_drop;

> > +		if (!cnt) {

> 

> 

> if dev_map_bpf_prog_run() returned sent packets this would read better

> imo.

> 

>   sent = dev_map_bpf_prog_run(...)

>   if (!sent)

>         goto out;

> 

> > +			sent = 0;

> > +			drops = xdp_drop;

> > +			goto out;

> > +		}

> > +	}

> > +

> > +	sent = dev->netdev_ops->ndo_xdp_xmit(dev, cnt, bq->q, flags);

> 

> And,    sent = dev->netdev_ops->ndo_xdp_xmit(dev, sent, bq->q, flags);

> 

> >  	if (sent < 0) {

> >  		err = sent;

> >  		sent = 0;

> >  		goto error;

> >  	}

> > -	drops = bq->count - sent;

> > +	drops = (cnt - sent) + xdp_drop;

> 

> With about 'sent' logic then drops will still be just, drops = bq->count - sent

> and move the calculation below the out label and I think you clean up above


If we use the 'sent' logic, we should also back up the drops value before
the xmit, as the error label also needs it.

> as well. Did I miss something...

> 

> >  out:

> >  	bq->count = 0;

> >  

> >  	trace_xdp_devmap_xmit(bq->dev_rx, dev, sent, drops, err);

> >  	bq->dev_rx = NULL;

> > +	bq->xdp_prog = NULL;

> >  	__list_del_clearprev(&bq->flush_node);

> >  	return;

> >  error:

> >  	/* If ndo_xdp_xmit fails with an errno, no frames have been

> >  	 * xmit'ed and it's our responsibility to them free all.

> >  	 */

> > -	for (i = 0; i < bq->count; i++) {

> > +	for (i = 0; i < cnt; i++) {

> >  		struct xdp_frame *xdpf = bq->q[i];


Here it will be "for (i = 0; i < cnt - drops; i++)" to free the
non-xmit'ed frames.

To make the logic more clear, here is the full code:

	[...]
        if (bq->xdp_prog) {
                sent = dev_map_bpf_prog_run(bq->xdp_prog, bq->q, cnt, dev);
                if (!sent)
                        goto out;
        }

	/* Back up the drops value before xmit as we may need it in the error label */
        drops = cnt - sent;
        sent = dev->netdev_ops->ndo_xdp_xmit(dev, sent, bq->q, flags);
        if (sent < 0) {
                err = sent;
                sent = 0;
                goto error;
        }
out:
        drops = cnt - sent;
        bq->count = 0;

        trace_xdp_devmap_xmit(bq->dev_rx, dev, sent, drops, err);
        bq->dev_rx = NULL;
        bq->xdp_prog = NULL;
        __list_del_clearprev(&bq->flush_node);
        return;
error:
        /* If ndo_xdp_xmit fails with an errno, no frames have been
         * xmit'ed and it's our responsibility to free them all.
         */
        for (i = 0; i < cnt - drops; i++) {
                struct xdp_frame *xdpf = bq->q[i];
                xdp_return_frame_rx_napi(xdpf);
        }
        goto out;
}
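The accounting that this rework has to preserve can be captured in a
runnable model: the devmap prog may drop some frames, then the driver
xmit may send only part of what remains (or fail entirely), and on every
path sent + drops must equal the original bq->count. This is a sketch of
the bookkeeping only; bq_xmit_model() and its parameters are invented
stand-ins for the kernel code above:

```c
struct result { int sent; int drops; };

/* count:       frames initially in the bulk queue
 * prog_passed: frames surviving the devmap prog (count if no prog)
 * xmit_ret:    driver ndo_xdp_xmit() return (frames sent, or < 0) */
struct result bq_xmit_model(int count, int prog_passed, int xmit_ret)
{
	struct result r;
	int sent = prog_passed;

	if (!sent)
		goto out;	/* prog dropped everything: skip xmit */

	if (xmit_ret < 0) {	/* xmit error: driver sent nothing */
		sent = 0;
		goto out;
	}
	sent = xmit_ret;	/* driver may send fewer than asked */
out:
	r.sent = sent;
	r.drops = count - sent;	/* everything not sent counts as a drop */
	return r;
}
```

Computing drops once, after the out label, is what makes the single
`drops = cnt - sent` cover the prog-drop, partial-xmit, and error paths
alike.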

Thanks
hangbin
Toke Høiland-Jørgensen Jan. 18, 2021, 10:47 a.m. UTC | #5
Hangbin Liu <liuhangbin@gmail.com> writes:

> Hi John,
>
> Thanks for the reviewing.
>
> On Sun, Jan 17, 2021 at 04:10:40PM -0800, John Fastabend wrote:
>> > + * 		The forwarding *map* could be either BPF_MAP_TYPE_DEVMAP or
>> > + * 		BPF_MAP_TYPE_DEVMAP_HASH. But the *ex_map* must be
>> > + * 		BPF_MAP_TYPE_DEVMAP_HASH to get better performance.
>> 
>> Would be good to add a note ex_map _must_ be keyed by ifindex for the
>> helper to work. Its the obvious way to key a hashmap, but not required
>> iirc.
>
> OK, I will.
>
>> > +		if (!next_obj)
>> > +			last_one = true;
>> > +
>> > +		if (last_one) {
>> > +			bq_enqueue(obj->dev, xdpf, dev_rx, obj->xdp_prog);
>> > +			return 0;
>> > +		}
>> 
>> Just collapse above to
>> 
>>   if (!next_obj) {
>>         bq_enqueue()
>>         return
>>   }
>> 
>> 'last_one' is a bit pointless here.
>
> Yes, thanks.
>
>> > @@ -3986,12 +3993,14 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
>> >  {
>> >  	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
>> >  	struct bpf_map *map = READ_ONCE(ri->map);
>> > +	struct bpf_map *ex_map = ri->ex_map;
>> 
>> READ_ONCE(ri->ex_map)?
>> 
>> >  	u32 index = ri->tgt_index;
>> >  	void *fwd = ri->tgt_value;
>> >  	int err;
>> >  
>> >  	ri->tgt_index = 0;
>> >  	ri->tgt_value = NULL;
>> > +	ri->ex_map = NULL;
>> 
>> WRITE_ONCE(ri->ex_map)?
>> 
>> >  	WRITE_ONCE(ri->map, NULL);
>> 
>> So we needed write_once, read_once pairs for ri->map do we also need them in
>> the ex_map case?
>
> Toke said this is no need for this read/write_once as there is already one.
>
> https://lore.kernel.org/bpf/87r1wd2bqu.fsf@toke.dk/

And then I corrected that after I figured out the real reason :)

https://lore.kernel.org/bpf/878si2h3sb.fsf@toke.dk/ - Quote:

> The READ_ONCE() is not needed because the ex_map field is only ever read
> from or written to by the CPU owning the per-cpu pointer. Whereas the
> 'map' field is manipulated by remote CPUs in bpf_clear_redirect_map().
> So you need neither READ_ONCE() nor WRITE_ONCE() on ex_map, just like
> there are none on tgt_index and tgt_value.

-Toke
John Fastabend Jan. 18, 2021, 3:14 p.m. UTC | #6
Toke Høiland-Jørgensen wrote:
> Hangbin Liu <liuhangbin@gmail.com> writes:
> 
> > Hi John,
> >
> > Thanks for the reviewing.
> >
> > On Sun, Jan 17, 2021 at 04:10:40PM -0800, John Fastabend wrote:
> >> > + * 		The forwarding *map* could be either BPF_MAP_TYPE_DEVMAP or
> >> > + * 		BPF_MAP_TYPE_DEVMAP_HASH. But the *ex_map* must be
> >> > + * 		BPF_MAP_TYPE_DEVMAP_HASH to get better performance.
> >> 
> >> Would be good to add a note ex_map _must_ be keyed by ifindex for the
> >> helper to work. Its the obvious way to key a hashmap, but not required
> >> iirc.
> >
> > OK, I will.

[...]

> >> WRITE_ONCE(ri->ex_map)?
> >> 
> >> >  	WRITE_ONCE(ri->map, NULL);
> >> 
> >> So we needed write_once, read_once pairs for ri->map do we also need them in
> >> the ex_map case?
> >
> > Toke said this is no need for this read/write_once as there is already one.
> >
> > https://lore.kernel.org/bpf/87r1wd2bqu.fsf@toke.dk/
> 
> And then I corrected that after I figured out the real reason :)
> 
> https://lore.kernel.org/bpf/878si2h3sb.fsf@toke.dk/ - Quote:
> 
> > The READ_ONCE() is not needed because the ex_map field is only ever read
> > from or written to by the CPU owning the per-cpu pointer. Whereas the
> > 'map' field is manipulated by remote CPUs in bpf_clear_redirect_map().
> > So you need neither READ_ONCE() nor WRITE_ONCE() on ex_map, just like
> > there are none on tgt_index and tgt_value.
> 
> -Toke

Hi Hangbin, please add a comment above that code block to remind us
why the READ_ONCE/WRITE_ONCE is not needed or add it in the commit
message so we don't lose it. It seems we've hashed it over already,
but I forgot after the holidays/break so presumably I'll forget next
time I read this code as well and commit-msg or comment will help.

Thanks,
John
Hangbin Liu Jan. 20, 2021, 2:25 a.m. UTC | #7
This patch set is for XDP multicast support, which has been discussed before[0].
The goal is to be able to implement an OVS-like data plane in XDP, i.e.,
a software switch that can forward XDP frames to multiple ports.

To achieve this, an application needs to specify a group of interfaces
to forward a packet to. It is also common to want to exclude one or more
physical interfaces from the forwarding operation - e.g., to forward a
packet to all interfaces in the multicast group except the interface it
arrived on. While this could be done simply by adding more groups, this
quickly leads to a combinatorial explosion in the number of groups an
application has to maintain.

To avoid the combinatorial explosion, we propose to include the ability
to specify an "exclude group" as part of the forwarding operation. This
needs to be a group (instead of just a single port index), because there
may be multiple interfaces you want to exclude.

Thus, the logical forwarding operation becomes a "set difference"
operation, i.e. "forward to all ports in group A that are not also in
group B". This series implements such an operation using device maps to
represent the groups. This means that the XDP program specifies two
device maps, one containing the list of netdevs to redirect to, and the
other containing the exclude list.
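As a userspace sketch (not the kernel implementation, which walks BPF
device maps), the set-difference operation amounts to the following;
the function names and plain arrays here are illustrative stand-ins
for the devmap lookups:

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns true if 'ifindex' is present in the group. In the kernel this
 * corresponds to a lookup in the exclude devmap, keyed by ifindex. */
static bool in_group(const int *group, size_t len, int ifindex)
{
	for (size_t i = 0; i < len; i++)
		if (group[i] == ifindex)
			return true;
	return false;
}

/* "Forward to all ports in group A that are not also in group B":
 * writes the selected ifindexes into 'out' and returns how many
 * interfaces the frame would be forwarded to. */
static size_t forward_set_difference(const int *fwd, size_t fwd_len,
				     const int *excl, size_t excl_len,
				     int *out)
{
	size_t n = 0;

	for (size_t i = 0; i < fwd_len; i++)
		if (!in_group(excl, excl_len, fwd[i]))
			out[n++] = fwd[i];
	return n;
}
```

With a forwarding group {2, 3, 4, 5} and an exclude group {3} (e.g. the
ingress ifindex), the frame is cloned to interfaces 2, 4 and 5.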

To achieve this, I implemented a new helper, bpf_redirect_map_multi(),
which accepts two maps: the forwarding map and the exclude map. If users
don't want to use the exclude map and simply want to stop redirecting back
to the ingress device, they can use the flag BPF_F_EXCLUDE_INGRESS.
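A rough sketch of how an XDP program might call the helper (the map
names, sizes, and section layout below are my illustration, not taken
from the patches themselves):

```c
/* Illustrative only: a minimal XDP program using the proposed helper.
 * Requires libbpf's bpf_helpers.h; map names/sizes are assumptions. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_DEVMAP_HASH);
	__uint(key_size, sizeof(int));
	__uint(value_size, sizeof(int));
	__uint(max_entries, 32);
} forward_map SEC(".maps");

struct {
	__uint(type, BPF_MAP_TYPE_DEVMAP_HASH);
	__uint(key_size, sizeof(int));
	__uint(value_size, sizeof(int));
	__uint(max_entries, 32);
} exclude_map SEC(".maps");

SEC("xdp")
int xdp_multicast(struct xdp_md *ctx)
{
	/* Forward to every port in forward_map not present in exclude_map. */
	return bpf_redirect_map_multi(&forward_map, &exclude_map, 0);

	/* Or, to only skip the ingress device without an exclude map:
	 * return bpf_redirect_map_multi(&forward_map, NULL,
	 *                               BPF_F_EXCLUDE_INGRESS);
	 */
}

char _license[] SEC("license") = "GPL";
```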

The 1st patch is Jesper's change to run the devmap xdp_prog later, in the
bulking step.
The 2nd patch adds a new bpf arg to allow a NULL map pointer.
The 3rd patch adds the new bpf_redirect_map_multi() helper.
Patches 4-6 are usage samples and tests.

I did the same perf tests with the following topo:

---------------------             ---------------------
| Host A (i40e 10G) |  ---------- | eno1(i40e 10G)    |
---------------------             |                   |
                                  |   Host B          |
---------------------             |                   |
| Host C (i40e 10G) |  ---------- | eno2(i40e 10G)    |
---------------------    vlan2    |          -------- |
                                  | veth1 -- | veth0| |
                                  |          -------- |
                                  ---------------------
On Host A:
# pktgen/pktgen_sample03_burst_single_flow.sh -i eno1 -d $dst_ip -m $dst_mac -s 64

On Host B (Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz, 128G memory):
Use xdp_redirect_map and xdp_redirect_map_multi in samples/bpf for testing.
The veth0 in the netns loads a dummy drop program. The forward_map
max_entries in xdp_redirect_map_multi is modified to 4.

Here is the perf result with 5.10 rc6:

There is about +/- 0.1M deviation for native testing.
Version             | Test                                    | Generic | Native | Native + 2nd
5.10 rc6            | xdp_redirect_map        i40e->i40e      |    2.0M |   9.1M |  8.0M
5.10 rc6            | xdp_redirect_map        i40e->veth      |    1.7M |  11.0M |  9.7M
5.10 rc6 + patch1   | xdp_redirect_map        i40e->i40e      |    2.0M |   9.5M |  7.5M
5.10 rc6 + patch1   | xdp_redirect_map        i40e->veth      |    1.7M |  11.6M |  9.1M
5.10 rc6 + patch1-6 | xdp_redirect_map        i40e->i40e      |    2.0M |   9.5M |  7.5M
5.10 rc6 + patch1-6 | xdp_redirect_map        i40e->veth      |    1.7M |  11.6M |  9.1M
5.10 rc6 + patch1-6 | xdp_redirect_map_multi  i40e->i40e      |    1.7M |   7.8M |  6.4M
5.10 rc6 + patch1-6 | xdp_redirect_map_multi  i40e->veth      |    1.4M |   9.3M |  7.5M
5.10 rc6 + patch1-6 | xdp_redirect_map_multi  i40e->i40e+veth |    1.0M |   3.2M |  2.7M

Last but not least, thanks a lot to Toke, Jesper, Jiri and Eelco for
suggestions and help on implementation.

[0] https://xdp-project.net/#Handling-multicast

v15:
Update bq_xmit_all() logic for patch 01.
Add some comments and remove useless variable for patch 03.
Use bpf_object__find_program_by_title() for patch 04 and 06.

v14:
No code update, just rebase the code on latest bpf-next

v13:
Pass in xdp_prog through __xdp_enqueue() for patch 01. Update related
code in patch 03.

v12:
Add Jesper's xdp_prog patch, rebase my work on this and the latest bpf-next.
Add 2nd xdp_prog test on the sample and selftests.

v11:
Fix bpf_redirect_map_multi() helper description typo.
Add loop limit for devmap_get_next_obj() and dev_map_redirect_multi().

v10:
Rebase the code to latest bpf-next.
Update helper bpf_xdp_redirect_map_multi()
- No need to check map pointer as we will do the check in verifier.

v9:
Update helper bpf_xdp_redirect_map_multi()
- Use ARG_CONST_MAP_PTR_OR_NULL for helper arg2

v8:
a) Update function dev_in_exclude_map():
   - remove duplicate ex_map map_type check in
   - lookup the element in dev map by obj dev index directly instead
     of looping all the map

v7:
a) Fix helper flag check
b) Limit the *ex_map* to use DEVMAP_HASH only and update function
   dev_in_exclude_map() to get better performance.

v6: converted helper return types from int to long

v5:
a) Check devmap_get_next_key() return value.
b) Pass through flags to __bpf_tx_xdp_map() instead of bool value.
c) In function dev_map_enqueue_multi(), consume xdpf for the last
   obj instead of the first one.
d) Update helper description and code comments to explain that we
   use NULL target value to distinguish multicast and unicast
   forwarding.
e) Update memory model, memory id and frame_sz in xdpf_clone().
f) Split the tests from sample and add a bpf kernel selftest patch.

v4: Fix bpf_xdp_redirect_map_multi_proto arg2_type typo

v3: Based on Toke's suggestion, do the following update
a) Update bpf_redirect_map_multi() description in bpf.h.
b) Fix exclude_ifindex checking order in dev_in_exclude_map().
c) Fix one more xdpf clone in dev_map_enqueue_multi().
d) Go find next one in dev_map_enqueue_multi() if the interface is not
   able to forward instead of abort the whole loop.
e) Remove READ_ONCE/WRITE_ONCE for ex_map.

v2: Add new syscall bpf_xdp_redirect_map_multi() which could accept
include/exclude maps directly.

Hangbin Liu (5):
  bpf: add a new bpf argument type ARG_CONST_MAP_PTR_OR_NULL
  xdp: add a new helper for dev map multicast support
  sample/bpf: add xdp_redirect_map_multicast test
  selftests/bpf: Add verifier tests for bpf arg
    ARG_CONST_MAP_PTR_OR_NULL
  selftests/bpf: add xdp_redirect_multi test

Jesper Dangaard Brouer (1):
  bpf: run devmap xdp_prog on flush instead of bulk enqueue

 include/linux/bpf.h                           |  21 ++
 include/linux/filter.h                        |   1 +
 include/net/xdp.h                             |   1 +
 include/uapi/linux/bpf.h                      |  28 ++
 kernel/bpf/devmap.c                           | 232 +++++++++++---
 kernel/bpf/verifier.c                         |  16 +-
 net/core/filter.c                             | 124 ++++++-
 net/core/xdp.c                                |  29 ++
 samples/bpf/Makefile                          |   3 +
 samples/bpf/xdp_redirect_map_multi_kern.c     |  87 +++++
 samples/bpf/xdp_redirect_map_multi_user.c     | 302 ++++++++++++++++++
 tools/include/uapi/linux/bpf.h                |  28 ++
 tools/testing/selftests/bpf/Makefile          |   3 +-
 .../bpf/progs/xdp_redirect_multi_kern.c       | 111 +++++++
 tools/testing/selftests/bpf/test_verifier.c   |  22 +-
 .../selftests/bpf/test_xdp_redirect_multi.sh  | 208 ++++++++++++
 .../testing/selftests/bpf/verifier/map_ptr.c  |  70 ++++
 .../selftests/bpf/xdp_redirect_multi.c        | 252 +++++++++++++++
 18 files changed, 1488 insertions(+), 50 deletions(-)
 create mode 100644 samples/bpf/xdp_redirect_map_multi_kern.c
 create mode 100644 samples/bpf/xdp_redirect_map_multi_user.c
 create mode 100644 tools/testing/selftests/bpf/progs/xdp_redirect_multi_kern.c
 create mode 100755 tools/testing/selftests/bpf/test_xdp_redirect_multi.sh
 create mode 100644 tools/testing/selftests/bpf/xdp_redirect_multi.c
Jesper Dangaard Brouer Jan. 21, 2021, 2:33 p.m. UTC | #8
On Mon, 18 Jan 2021 18:07:17 +0800
Hangbin Liu <liuhangbin@gmail.com> wrote:

> On Sun, Jan 17, 2021 at 02:57:02PM -0800, John Fastabend wrote:
> [...]
> > It looks like we could embed xdp_buff in xdp_frame and then keep the metadata
> > at the end.
> > 
> > Because you are working performance here wdyt? <- @Jesper as well.  
> 
> Leave this question to Jesper.

The struct xdp_buff is larger than struct xdp_frame, and the size of
xdp_frame matters: it is a reserved area at the top of the frame.
An XDP BPF program cannot access this area (which also limits headroom
growth). This is why this code works, as the xdp_frame is still valid
afterwards. Looking at the code, xdp_update_frame_from_buff() does seem
to update more fields than actually needed.

> > >  
> > > -	sent = dev->netdev_ops->ndo_xdp_xmit(dev, bq->count, bq->q, flags);
> > > +	if (unlikely(bq->xdp_prog)) {  
> > 
> > Whats the rational for making above unlikely()? Seems for users its not
> > unlikely. Can you measure a performance increase/decrease here? I think
> > its probably fine to just let compiler/prefetcher do its thing here. Or
> > I'm not reading this right, but seems users of bq->xdp_prog would disagree
> > on unlikely case?
> > 
> > Either way a comment might be nice to give us some insight in 6 months
> > why we decided this is unlikely.  
> 
> I agree that there is no need to use unlikely() here.

I added the unlikely() to preserve the baseline performance when not
having the 2nd prog loaded.  But I'm fine with removing that.
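For context, unlikely() is a branch-prediction hint built on GCC/Clang's
__builtin_expect(); it never changes what a condition evaluates to, only
how the compiler lays out the generated code, so dropping it can only
affect performance, never results. A toy userspace illustration (names
and drop behavior are hypothetical, not from the patch):

```c
#include <stddef.h>

/* Stand-in for the kernel's unlikely() macro. */
#define my_unlikely(x) __builtin_expect(!!(x), 0)

static int xmit_count;

/* Toy stand-in for bq_xmit_all(): with no 2nd prog all frames are sent;
 * with a prog attached, pretend it drops half of them. */
static int bulk_flush(const void *xdp_prog, int count)
{
	if (my_unlikely(xdp_prog != NULL))
		count /= 2;
	xmit_count += count;
	return count;
}
```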

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer