
[bpf-next,v2,0/8] Introduce bpf_redirect_xsk() helper

Message ID 20210119155013.154808-1-bjorn.topel@gmail.com

Message

Björn Töpel Jan. 19, 2021, 3:50 p.m. UTC
This series extends bind() for XDP sockets, so that the bound socket
is added to the netdev_rx_queue _rx array in the netdevice. We refer
to this as registering the socket. To redirect packets to a registered
socket, a new BPF helper is used: bpf_redirect_xsk().

For shared XDP sockets, only the first bound socket is
registered. Users that need a more complex setup have to use XSKMAP
and bpf_redirect_map().

Now, why would one use bpf_redirect_xsk() over the regular
bpf_redirect_map() helper?

* Better performance!
* Convenience; Most users use one socket per queue. This is the
  scenario that registered sockets support. There is no need to create
  an XSKMAP. This can also reduce complexity in containerized setups,
  where users might want to use XDP sockets without CAP_SYS_ADMIN
  capabilities.
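
For illustration, an XDP program using the proposed helper could be as
small as the sketch below. This assumes the series is applied: the
helper does not exist in released kernels, and the declaration shown
(including the helper id) is hypothetical, mirroring the uapi comment
from patch 4.

```c
// SPDX-License-Identifier: GPL-2.0
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Proposed in this series; declaration shown for illustration only.
 * With the series applied, libbpf's bpf_helper_defs.h would carry it.
 */
static long (*bpf_redirect_xsk)(struct xdp_md *xdp_md, __u64 action) =
	(void *)999; /* hypothetical helper id */

SEC("xdp")
int rx_to_xsk(struct xdp_md *ctx)
{
	/* Redirect to the AF_XDP socket registered on this queue at
	 * bind(); if none is registered, fall back to XDP_PASS.
	 */
	return bpf_redirect_xsk(ctx, XDP_PASS);
}

char _license[] SEC("license") = "GPL";
```

Note that no XSKMAP is created or pinned; the equivalent
bpf_redirect_map() program needs the map definition and a populated
entry per queue.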

The first patch restructures xdp_do_redirect() a bit, to make it
easier to add the new helper. This restructuring also gives us a
slight performance benefit. The following three patches extend bind()
and add the new helper. After that, two libbpf patches select the XDP
program based on what kernel is running. Finally, selftests for the
new functionality are added.

Note that the libbpf "auto-selection" is based on kernel version, so
it is hard coded to the "-next" version (5.12). If you would like to
try this out, you will need to change the libbpf patch locally!

Thanks to Maciej and Magnus for the internal review/comments!

Performance (rxdrop, zero-copy)

Baseline
Two cores:                   21.3 Mpps
One core:                    24.5 Mpps

Patched
Two cores, bpf_redirect_map: 21.7 Mpps + 2%
One core, bpf_redirect_map:  24.9 Mpps + 2%

Two cores, bpf_redirect_xsk: 24.0 Mpps +13%
One core, bpf_redirect_xsk:  25.5 Mpps + 4%

Thanks!
Björn

v1->v2: 
  * Added missing XDP programs to selftests.
  * Fixed checkpatch warning in selftests.

Björn Töpel (8):
  xdp: restructure redirect actions
  xsk: remove explicit_free parameter from __xsk_rcv()
  xsk: fold xp_assign_dev and __xp_assign_dev
  xsk: register XDP sockets at bind(), and add new AF_XDP BPF helper
  libbpf, xsk: select AF_XDP BPF program based on kernel version
  libbpf, xsk: select bpf_redirect_xsk(), if supported
  selftest/bpf: add XDP socket tests for bpf_redirect_{xsk, map}()
  selftest/bpf: remove a lot of ifobject casting in xdpxceiver

 include/linux/filter.h                        |  10 +
 include/linux/netdevice.h                     |   1 +
 include/net/xdp_sock.h                        |  12 +
 include/net/xsk_buff_pool.h                   |   2 +-
 include/trace/events/xdp.h                    |  46 ++--
 include/uapi/linux/bpf.h                      |   7 +
 net/core/filter.c                             | 205 ++++++++++--------
 net/xdp/xsk.c                                 | 112 ++++++++--
 net/xdp/xsk_buff_pool.c                       |  12 +-
 tools/include/uapi/linux/bpf.h                |   7 +
 tools/lib/bpf/libbpf.c                        |   2 +-
 tools/lib/bpf/libbpf_internal.h               |   2 +
 tools/lib/bpf/libbpf_probes.c                 |  16 --
 tools/lib/bpf/xsk.c                           |  83 ++++++-
 .../selftests/bpf/progs/xdpxceiver_ext1.c     |  15 ++
 .../selftests/bpf/progs/xdpxceiver_ext2.c     |   9 +
 tools/testing/selftests/bpf/test_xsk.sh       |  48 ++++
 tools/testing/selftests/bpf/xdpxceiver.c      | 164 +++++++++-----
 tools/testing/selftests/bpf/xdpxceiver.h      |   2 +
 19 files changed, 554 insertions(+), 201 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/progs/xdpxceiver_ext1.c
 create mode 100644 tools/testing/selftests/bpf/progs/xdpxceiver_ext2.c


base-commit: 95204c9bfa48d2f4d3bab7df55c1cc823957ff81

Comments

Toke Høiland-Jørgensen Jan. 20, 2021, 12:44 p.m. UTC | #1
Björn Töpel <bjorn.topel@gmail.com> writes:

> From: Björn Töpel <bjorn.topel@intel.com>
>
> The XDP_REDIRECT implementations for maps and non-maps are fairly
> similar, but obviously need to take different code paths depending on
> if the target is using a map or not. Today, the redirect targets for
> XDP either uses a map, or is based on ifindex.
>
> Future commits will introduce yet another redirect target via the a
> new helper, bpf_redirect_xsk(). To pave the way for that, we introduce
> an explicit redirect type to bpf_redirect_info. This makes the code
> easier to follow, and makes it easier to add new redirect targets.
>
> Further, using an explicit type in bpf_redirect_info has a slight
> positive performance impact by avoiding a pointer indirection for the
> map type lookup, and instead use the hot cacheline for
> bpf_redirect_info.
>
> The bpf_redirect_info flags member is not used by XDP, and not
> read/written any more. The map member is only written to when
> required/used, and not unconditionally.

I like the simplification. However, the handling of map clearing becomes
a bit murky with this change:

You're not changing anything in bpf_clear_redirect_map(), and you're
removing most of the reads and writes of ri->map. Instead,
bpf_xdp_redirect_map() will store the bpf_dtab_netdev pointer in
ri->tgt_value, which xdp_do_redirect() will just read and use without
checking. But if the map element (or the entire map) has been freed in
the meantime that will be a dangling pointer. I *think* the RCU callback
in dev_map_delete_elem() and the rcu_barrier() in dev_map_free()
protects against this, but that is by no means obvious. So confirming
this, and explaining it in a comment would be good.

Also, as far as I can tell after this, ri->map is only used for the
tracepoint. So how about just storing the map ID and getting rid of the
READ/WRITE_ONCE() entirely?

(Oh, and related to this I think this patch set will conflict with
Hangbin's multi-redirect series, so maybe you two ought to coordinate? :))

-Toke
Toke Høiland-Jørgensen Jan. 20, 2021, 12:50 p.m. UTC | #2
Björn Töpel <bjorn.topel@gmail.com> writes:

> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index c001766adcbc..bbc7d9a57262 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -3836,6 +3836,12 @@ union bpf_attr {
>   *	Return
>   *		A pointer to a struct socket on success or NULL if the file is
>   *		not a socket.
> + *
> + * long bpf_redirect_xsk(struct xdp_buff *xdp_md, u64 action)
> + *	Description
> + *		Redirect to the registered AF_XDP socket.
> + *	Return
> + *		**XDP_REDIRECT** on success, otherwise the action parameter is returned.
>   */


I think it would be better to make the second argument a 'flags'
argument and make values > XDP_TX invalid (like we do in
bpf_xdp_redirect_map() now). By allowing any value as return you lose
the ability to turn it into a flags argument later...

-Toke
Maxim Mikityanskiy Jan. 20, 2021, 1:15 p.m. UTC | #3
On 2021-01-19 17:50, Björn Töpel wrote:
> This series extends bind() for XDP sockets, so that the bound socket
> is added to the netdev_rx_queue _rx array in the netdevice. We call
> this to register the socket. To redirect packets to the registered
> socket, a new BPF helper is used: bpf_redirect_xsk().
>
> For shared XDP sockets, only the first bound socket is
> registered. Users that need more complex setup has to use XSKMAP and
> bpf_redirect_map().
>
> Now, why would one use bpf_redirect_xsk() over the regular
> bpf_redirect_map() helper?
>
> * Better performance!
> * Convenience; Most user use one socket per queue. This scenario is
>    what registered sockets support. There is no need to create an
>    XSKMAP. This can also reduce complexity from containerized setups,
>    where users might what to use XDP sockets without CAP_SYS_ADMIN
>    capabilities.
>
> The first patch restructures xdp_do_redirect() a bit, to make it
> easier to add the new helper. This restructure also give us a slight
> performance benefit. The following three patches extends bind() and
> adds the new helper. After that, two libbpf patches that selects XDP
> program based on what kernel is running. Finally, selftests for the new
> functionality is added.
>
> Note that the libbpf "auto-selection" is based on kernel version, so
> it is hard coded to the "-next" version (5.12). If you would like to
> try this is out, you will need to change the libbpf patch locally!
>
> Thanks to Maciej and Magnus for the internal review/comments!
>
> Performance (rxdrop, zero-copy)
>
> Baseline
> Two cores:                   21.3 Mpps
> One core:                    24.5 Mpps

Two cores is slower? It used to be faster all the time, didn't it?

> Patched
> Two cores, bpf_redirect_map: 21.7 Mpps + 2%
> One core, bpf_redirect_map:  24.9 Mpps + 2%
>
> Two cores, bpf_redirect_xsk: 24.0 Mpps +13%

Nice, impressive improvement!

> One core, bpf_redirect_xsk:  25.5 Mpps + 4%
>
> Thanks!
> Björn
>
> v1->v2:
>    * Added missing XDP programs to selftests.
>    * Fixed checkpatch warning in selftests.
>
> Björn Töpel (8):
>    xdp: restructure redirect actions
>    xsk: remove explicit_free parameter from __xsk_rcv()
>    xsk: fold xp_assign_dev and __xp_assign_dev
>    xsk: register XDP sockets at bind(), and add new AF_XDP BPF helper
>    libbpf, xsk: select AF_XDP BPF program based on kernel version
>    libbpf, xsk: select bpf_redirect_xsk(), if supported
>    selftest/bpf: add XDP socket tests for bpf_redirect_{xsk, map}()
>    selftest/bpf: remove a lot of ifobject casting in xdpxceiver
>
>   include/linux/filter.h                        |  10 +
>   include/linux/netdevice.h                     |   1 +
>   include/net/xdp_sock.h                        |  12 +
>   include/net/xsk_buff_pool.h                   |   2 +-
>   include/trace/events/xdp.h                    |  46 ++--
>   include/uapi/linux/bpf.h                      |   7 +
>   net/core/filter.c                             | 205 ++++++++++--------
>   net/xdp/xsk.c                                 | 112 ++++++++--
>   net/xdp/xsk_buff_pool.c                       |  12 +-
>   tools/include/uapi/linux/bpf.h                |   7 +
>   tools/lib/bpf/libbpf.c                        |   2 +-
>   tools/lib/bpf/libbpf_internal.h               |   2 +
>   tools/lib/bpf/libbpf_probes.c                 |  16 --
>   tools/lib/bpf/xsk.c                           |  83 ++++++-
>   .../selftests/bpf/progs/xdpxceiver_ext1.c     |  15 ++
>   .../selftests/bpf/progs/xdpxceiver_ext2.c     |   9 +
>   tools/testing/selftests/bpf/test_xsk.sh       |  48 ++++
>   tools/testing/selftests/bpf/xdpxceiver.c      | 164 +++++++++-----
>   tools/testing/selftests/bpf/xdpxceiver.h      |   2 +
>   19 files changed, 554 insertions(+), 201 deletions(-)
>   create mode 100644 tools/testing/selftests/bpf/progs/xdpxceiver_ext1.c
>   create mode 100644 tools/testing/selftests/bpf/progs/xdpxceiver_ext2.c
>
>
> base-commit: 95204c9bfa48d2f4d3bab7df55c1cc823957ff81
>
Björn Töpel Jan. 20, 2021, 1:25 p.m. UTC | #4
On 2021-01-20 13:50, Toke Høiland-Jørgensen wrote:
> Björn Töpel <bjorn.topel@gmail.com> writes:
>
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index c001766adcbc..bbc7d9a57262 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -3836,6 +3836,12 @@ union bpf_attr {
>>    *	Return
>>    *		A pointer to a struct socket on success or NULL if the file is
>>    *		not a socket.
>> + *
>> + * long bpf_redirect_xsk(struct xdp_buff *xdp_md, u64 action)
>> + *	Description
>> + *		Redirect to the registered AF_XDP socket.
>> + *	Return
>> + *		**XDP_REDIRECT** on success, otherwise the action parameter is returned.
>>    */
>
> I think it would be better to make the second argument a 'flags'
> argument and make values > XDP_TX invalid (like we do in
> bpf_xdp_redirect_map() now). By allowing any value as return you lose
> the ability to turn it into a flags argument later...
>


Yes, but that adds a run-time check. I prefer this non-checked version,
even though it is a bit less futureproof.


Björn
Björn Töpel Jan. 20, 2021, 1:27 p.m. UTC | #5
On 2021-01-20 14:15, Maxim Mikityanskiy wrote:
> On 2021-01-19 17:50, Björn Töpel wrote:
>> This series extends bind() for XDP sockets, so that the bound socket
>> is added to the netdev_rx_queue _rx array in the netdevice. We call
>> this to register the socket. To redirect packets to the registered
>> socket, a new BPF helper is used: bpf_redirect_xsk().
>>
>> For shared XDP sockets, only the first bound socket is
>> registered. Users that need more complex setup has to use XSKMAP and
>> bpf_redirect_map().
>>
>> Now, why would one use bpf_redirect_xsk() over the regular
>> bpf_redirect_map() helper?
>>
>> * Better performance!
>> * Convenience; Most user use one socket per queue. This scenario is
>>    what registered sockets support. There is no need to create an
>>    XSKMAP. This can also reduce complexity from containerized setups,
>>    where users might what to use XDP sockets without CAP_SYS_ADMIN
>>    capabilities.
>>
>> The first patch restructures xdp_do_redirect() a bit, to make it
>> easier to add the new helper. This restructure also give us a slight
>> performance benefit. The following three patches extends bind() and
>> adds the new helper. After that, two libbpf patches that selects XDP
>> program based on what kernel is running. Finally, selftests for the new
>> functionality is added.
>>
>> Note that the libbpf "auto-selection" is based on kernel version, so
>> it is hard coded to the "-next" version (5.12). If you would like to
>> try this is out, you will need to change the libbpf patch locally!
>>
>> Thanks to Maciej and Magnus for the internal review/comments!
>>
>> Performance (rxdrop, zero-copy)
>>
>> Baseline
>> Two cores:                   21.3 Mpps
>> One core:                    24.5 Mpps
>
> Two cores is slower? It used to be faster all the time, didn't it?
>


Up until busy-polling. Note that I'm using a busy-poll napi budget of 512.


>> Patched
>> Two cores, bpf_redirect_map: 21.7 Mpps + 2%
>> One core, bpf_redirect_map:  24.9 Mpps + 2%
>>
>> Two cores, bpf_redirect_xsk: 24.0 Mpps +13%
>
> Nice, impressive improvement!
>


Thanks! Getting rid of the queue/netdev checks really paid off!


Björn
Björn Töpel Jan. 20, 2021, 1:40 p.m. UTC | #6
On 2021-01-20 13:44, Toke Høiland-Jørgensen wrote:
> Björn Töpel <bjorn.topel@gmail.com> writes:
>
>> From: Björn Töpel <bjorn.topel@intel.com>
>>
>> The XDP_REDIRECT implementations for maps and non-maps are fairly
>> similar, but obviously need to take different code paths depending on
>> if the target is using a map or not. Today, the redirect targets for
>> XDP either uses a map, or is based on ifindex.
>>
>> Future commits will introduce yet another redirect target via the a
>> new helper, bpf_redirect_xsk(). To pave the way for that, we introduce
>> an explicit redirect type to bpf_redirect_info. This makes the code
>> easier to follow, and makes it easier to add new redirect targets.
>>
>> Further, using an explicit type in bpf_redirect_info has a slight
>> positive performance impact by avoiding a pointer indirection for the
>> map type lookup, and instead use the hot cacheline for
>> bpf_redirect_info.
>>
>> The bpf_redirect_info flags member is not used by XDP, and not
>> read/written any more. The map member is only written to when
>> required/used, and not unconditionally.
>
> I like the simplification. However, the handling of map clearing becomes
> a bit murky with this change:
>
> You're not changing anything in bpf_clear_redirect_map(), and you're
> removing most of the reads and writes of ri->map. Instead,
> bpf_xdp_redirect_map() will store the bpf_dtab_netdev pointer in
> ri->tgt_value, which xdp_do_redirect() will just read and use without
> checking. But if the map element (or the entire map) has been freed in
> the meantime that will be a dangling pointer. I *think* the RCU callback
> in dev_map_delete_elem() and the rcu_barrier() in dev_map_free()
> protects against this, but that is by no means obvious. So confirming
> this, and explaining it in a comment would be good.
>


Yes, *most* of the READ_ONCE(ri->map) are removed, it's pretty much only 
the bpf_redirect_map(), and as you write, the tracepoints.

The content/element of the map is RCU protected, and actually even the
map will be around until the XDP processing is complete. Note the
synchronize_rcu() that follows all the bpf_clear_redirect_map() calls.

I'll try to make it clearer in the commit message! Thanks for pointing 
that out!

> Also, as far as I can tell after this, ri->map is only used for the
> tracepoint. So how about just storing the map ID and getting rid of the
> READ/WRITE_ONCE() entirely?
>


...and the bpf_redirect_map() helper. Don't you think the current
READ_ONCE(ri->map) scheme is more obvious/clear?


> (Oh, and related to this I think this patch set will conflict with
> Hangbin's multi-redirect series, so maybe you two ought to coordinate? :))
>


Yeah, good idea! I would guess Hangbin's would go in before this, so I
would need to adapt.


Thanks for taking a look at the series, Toke! Much appreciated!


Björn
Toke Høiland-Jørgensen Jan. 20, 2021, 2:52 p.m. UTC | #7
Björn Töpel <bjorn.topel@intel.com> writes:

> On 2021-01-20 13:44, Toke Høiland-Jørgensen wrote:
>> Björn Töpel <bjorn.topel@gmail.com> writes:
>>
>>> From: Björn Töpel <bjorn.topel@intel.com>
>>>
>>> The XDP_REDIRECT implementations for maps and non-maps are fairly
>>> similar, but obviously need to take different code paths depending on
>>> if the target is using a map or not. Today, the redirect targets for
>>> XDP either uses a map, or is based on ifindex.
>>>
>>> Future commits will introduce yet another redirect target via the a
>>> new helper, bpf_redirect_xsk(). To pave the way for that, we introduce
>>> an explicit redirect type to bpf_redirect_info. This makes the code
>>> easier to follow, and makes it easier to add new redirect targets.
>>>
>>> Further, using an explicit type in bpf_redirect_info has a slight
>>> positive performance impact by avoiding a pointer indirection for the
>>> map type lookup, and instead use the hot cacheline for
>>> bpf_redirect_info.
>>>
>>> The bpf_redirect_info flags member is not used by XDP, and not
>>> read/written any more. The map member is only written to when
>>> required/used, and not unconditionally.
>>
>> I like the simplification. However, the handling of map clearing becomes
>> a bit murky with this change:
>>
>> You're not changing anything in bpf_clear_redirect_map(), and you're
>> removing most of the reads and writes of ri->map. Instead,
>> bpf_xdp_redirect_map() will store the bpf_dtab_netdev pointer in
>> ri->tgt_value, which xdp_do_redirect() will just read and use without
>> checking. But if the map element (or the entire map) has been freed in
>> the meantime that will be a dangling pointer. I *think* the RCU callback
>> in dev_map_delete_elem() and the rcu_barrier() in dev_map_free()
>> protects against this, but that is by no means obvious. So confirming
>> this, and explaining it in a comment would be good.
>>
>
> Yes, *most* of the READ_ONCE(ri->map) are removed, it's pretty much only
> the bpf_redirect_map(), and as you write, the tracepoints.
>
> The content/element of the map is RCU protected, and actually even the
> map will be around until the XDP processing is complete. Note the
> synchronize_rcu() followed after all bpf_clear_redirect_map() calls.
>
> I'll try to make it clearer in the commit message! Thanks for pointing
> that out!
>
>> Also, as far as I can tell after this, ri->map is only used for the
>> tracepoint. So how about just storing the map ID and getting rid of the
>> READ/WRITE_ONCE() entirely?
>>
>
> ...and the bpf_redirect_map() helper. Don't you think the current
> READ_ONCE(ri->map) scheme is more obvious/clear?


Yeah, after your patch we WRITE_ONCE() the pointer in
bpf_redirect_map(), but the only place it is actually *read* is in the
tracepoint. So the only purpose of bpf_clear_redirect_map() is to ensure
that an invalid pointer is not read in the tracepoint function. Which
seems a bit excessive when we could just store the map ID for direct use
in the tracepoint and get rid of bpf_clear_redirect_map() entirely, no?

Besides, from a UX point of view, having the tracepoint display the map
ID even if that map ID is no longer valid seems to me like it makes more
sense than just displaying a map ID of 0 and leaving it up to the user
to figure out that this is because the map was cleared. I mean, at the
time the redirect was made, that *was* the map ID that was used...
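
Concretely, the suggestion amounts to something like this in the
helper (a sketch; the map_id field does not exist at this point in the
series and is hypothetical):

```c
/* In bpf_xdp_redirect_map(), instead of WRITE_ONCE(ri->map, map): */
ri->map_id = map->id;		/* plain u32 store, survives map teardown */
ri->map_type = map->map_type;

/* The tracepoint then reports ri->map_id directly, and
 * bpf_clear_redirect_map() can be dropped entirely.
 */
```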

Oh, and as you say due to the synchronize_rcu() call in dev_map_free() I
think this whole discussion is superfluous anyway, since it can't
actually happen that the map gets freed between the setting and reading
of ri->map, no?

>> (Oh, and related to this I think this patch set will conflict with
>> Hangbin's multi-redirect series, so maybe you two ought to coordinate? :))
>>
>
> Yeah, good idea! I would guess Hangbin's would go in before this, so I
> would need to adapt.
>
>
> Thanks for taking of look at the series, Toke! Much appreciated!


You're welcome :)

-Toke
Toke Høiland-Jørgensen Jan. 20, 2021, 2:54 p.m. UTC | #8
Björn Töpel <bjorn.topel@intel.com> writes:

> On 2021-01-20 13:50, Toke Høiland-Jørgensen wrote:
>> Björn Töpel <bjorn.topel@gmail.com> writes:
>>
>>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>>> index c001766adcbc..bbc7d9a57262 100644
>>> --- a/include/uapi/linux/bpf.h
>>> +++ b/include/uapi/linux/bpf.h
>>> @@ -3836,6 +3836,12 @@ union bpf_attr {
>>>    *	Return
>>>    *		A pointer to a struct socket on success or NULL if the file is
>>>    *		not a socket.
>>> + *
>>> + * long bpf_redirect_xsk(struct xdp_buff *xdp_md, u64 action)
>>> + *	Description
>>> + *		Redirect to the registered AF_XDP socket.
>>> + *	Return
>>> + *		**XDP_REDIRECT** on success, otherwise the action parameter is returned.
>>>    */
>>
>> I think it would be better to make the second argument a 'flags'
>> argument and make values > XDP_TX invalid (like we do in
>> bpf_xdp_redirect_map() now). By allowing any value as return you lose
>> the ability to turn it into a flags argument later...
>>
>
> Yes, but that adds a run-time check. I prefer this non-checked version,
> even though it is a bit less futureproof.


That...seems a bit short-sighted? :)
Can you actually see a difference in your performance numbers?

-Toke
Björn Töpel Jan. 20, 2021, 3:18 p.m. UTC | #9
On 2021-01-20 15:54, Toke Høiland-Jørgensen wrote:
> Björn Töpel <bjorn.topel@intel.com> writes:
>
>> On 2021-01-20 13:50, Toke Høiland-Jørgensen wrote:
>>> Björn Töpel <bjorn.topel@gmail.com> writes:
>>>
>>>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>>>> index c001766adcbc..bbc7d9a57262 100644
>>>> --- a/include/uapi/linux/bpf.h
>>>> +++ b/include/uapi/linux/bpf.h
>>>> @@ -3836,6 +3836,12 @@ union bpf_attr {
>>>>     *	Return
>>>>     *		A pointer to a struct socket on success or NULL if the file is
>>>>     *		not a socket.
>>>> + *
>>>> + * long bpf_redirect_xsk(struct xdp_buff *xdp_md, u64 action)
>>>> + *	Description
>>>> + *		Redirect to the registered AF_XDP socket.
>>>> + *	Return
>>>> + *		**XDP_REDIRECT** on success, otherwise the action parameter is returned.
>>>>     */
>>>
>>> I think it would be better to make the second argument a 'flags'
>>> argument and make values > XDP_TX invalid (like we do in
>>> bpf_xdp_redirect_map() now). By allowing any value as return you lose
>>> the ability to turn it into a flags argument later...
>>>
>>
>> Yes, but that adds a run-time check. I prefer this non-checked version,
>> even though it is a bit less futureproof.
>
> That...seems a bit short-sighted? :)
> Can you actually see a difference in your performance numbers?
>


I would rather add an additional helper *if* we see the need for flags,
instead of paying for that upfront. For me, BPF is about being able to
specialize, and not having one call with tons of checks.

(Related; Going forward, the growing switch() for redirect targets in
xdp_do_redirect() is a concern for me...)

And yes, even with all those fancy branch predictors, fewer
instructions are still fewer. :-) (It shows in my ubenchs.)


Björn
Björn Töpel Jan. 20, 2021, 3:49 p.m. UTC | #10
On 2021-01-20 15:52, Toke Høiland-Jørgensen wrote:
> Björn Töpel <bjorn.topel@intel.com> writes:
>
>> On 2021-01-20 13:44, Toke Høiland-Jørgensen wrote:
>>> Björn Töpel <bjorn.topel@gmail.com> writes:
>>>
>>>> From: Björn Töpel <bjorn.topel@intel.com>
>>>>
>>>> The XDP_REDIRECT implementations for maps and non-maps are fairly
>>>> similar, but obviously need to take different code paths depending on
>>>> if the target is using a map or not. Today, the redirect targets for
>>>> XDP either uses a map, or is based on ifindex.
>>>>
>>>> Future commits will introduce yet another redirect target via the a
>>>> new helper, bpf_redirect_xsk(). To pave the way for that, we introduce
>>>> an explicit redirect type to bpf_redirect_info. This makes the code
>>>> easier to follow, and makes it easier to add new redirect targets.
>>>>
>>>> Further, using an explicit type in bpf_redirect_info has a slight
>>>> positive performance impact by avoiding a pointer indirection for the
>>>> map type lookup, and instead use the hot cacheline for
>>>> bpf_redirect_info.
>>>>
>>>> The bpf_redirect_info flags member is not used by XDP, and not
>>>> read/written any more. The map member is only written to when
>>>> required/used, and not unconditionally.
>>>
>>> I like the simplification. However, the handling of map clearing becomes
>>> a bit murky with this change:
>>>
>>> You're not changing anything in bpf_clear_redirect_map(), and you're
>>> removing most of the reads and writes of ri->map. Instead,
>>> bpf_xdp_redirect_map() will store the bpf_dtab_netdev pointer in
>>> ri->tgt_value, which xdp_do_redirect() will just read and use without
>>> checking. But if the map element (or the entire map) has been freed in
>>> the meantime that will be a dangling pointer. I *think* the RCU callback
>>> in dev_map_delete_elem() and the rcu_barrier() in dev_map_free()
>>> protects against this, but that is by no means obvious. So confirming
>>> this, and explaining it in a comment would be good.
>>>
>>
>> Yes, *most* of the READ_ONCE(ri->map) are removed, it's pretty much only
>> the bpf_redirect_map(), and as you write, the tracepoints.
>>
>> The content/element of the map is RCU protected, and actually even the
>> map will be around until the XDP processing is complete. Note the
>> synchronize_rcu() followed after all bpf_clear_redirect_map() calls.
>>
>> I'll try to make it clearer in the commit message! Thanks for pointing
>> that out!
>>
>>> Also, as far as I can tell after this, ri->map is only used for the
>>> tracepoint. So how about just storing the map ID and getting rid of the
>>> READ/WRITE_ONCE() entirely?
>>>
>>
>> ...and the bpf_redirect_map() helper. Don't you think the current
>> READ_ONCE(ri->map) scheme is more obvious/clear?
>
> Yeah, after your patch we WRITE_ONCE() the pointer in
> bpf_redirect_map(), but the only place it is actually *read* is in the
> tracepoint. So the only purpose of bpf_clear_redirect_map() is to ensure
> that an invalid pointer is not read in the tracepoint function. Which
> seems a bit excessive when we could just store the map ID for direct use
> in the tracepoint and get rid of bpf_clear_redirect_map() entirely, no?
>
> Besides, from a UX point of view, having the tracepoint display the map
> ID even if that map ID is no longer valid seems to me like it makes more
> sense than just displaying a map ID of 0 and leaving it up to the user
> to figure out that this is because the map was cleared. I mean, at the
> time the redirect was made, that *was* the map ID that was used...
>


Convinced! Getting rid of bpf_clear_redirect_map() would be good! I'll
take a stab at this for v3!


> Oh, and as you say due to the synchronize_rcu() call in dev_map_free() I
> think this whole discussion is superfluous anyway, since it can't
> actually happen that the map gets freed between the setting and reading
> of ri->map, no?
>


It can't be freed, but ri->map can be cleared via
bpf_clear_redirect_map(). So, between the helper (setting) and the
tracepoint in xdp_do_redirect() it can be cleared (say, if the XDP
program is swapped out prior to running xdp_do_redirect()).

Moving to the scheme you suggested, does make the discussion
superfluous. :-)


Thanks for the input!
Björn
Jesper Dangaard Brouer Jan. 20, 2021, 3:57 p.m. UTC | #11
On Wed, 20 Jan 2021 15:15:22 +0200
Maxim Mikityanskiy <maximmi@nvidia.com> wrote:

> On 2021-01-19 17:50, Björn Töpel wrote:
> > This series extends bind() for XDP sockets, so that the bound socket
> > is added to the netdev_rx_queue _rx array in the netdevice. We call
> > this to register the socket. To redirect packets to the registered
> > socket, a new BPF helper is used: bpf_redirect_xsk().
> >
> > For shared XDP sockets, only the first bound socket is
> > registered. Users that need more complex setup has to use XSKMAP and
> > bpf_redirect_map().
> >
> > Now, why would one use bpf_redirect_xsk() over the regular
> > bpf_redirect_map() helper?
> >
> > * Better performance!
> > * Convenience; Most user use one socket per queue. This scenario is
> >    what registered sockets support. There is no need to create an
> >    XSKMAP. This can also reduce complexity from containerized setups,
> >    where users might what to use XDP sockets without CAP_SYS_ADMIN
> >    capabilities.

I'm buying into the convenience and reduced complexity, and XDP sockets
without CAP_SYS_ADMIN in containers.

People might be surprised that I'm actually NOT buying into the better
performance argument here.  At these speeds we are basically comparing
how close we are to zero (and have to use nanosec time scale for our
comparisons), more below.

 
> > The first patch restructures xdp_do_redirect() a bit, to make it
> > easier to add the new helper. This restructure also give us a slight
> > performance benefit. The following three patches extends bind() and
> > adds the new helper. After that, two libbpf patches that selects XDP
> > program based on what kernel is running. Finally, selftests for the new
> > functionality is added.
> >
> > Note that the libbpf "auto-selection" is based on kernel version, so
> > it is hard coded to the "-next" version (5.12). If you would like to
> > try this is out, you will need to change the libbpf patch locally!
> >
> > Thanks to Maciej and Magnus for the internal review/comments!
> >
> > Performance (rxdrop, zero-copy)
> >
> > Baseline
> > Two cores:                   21.3 Mpps
> > One core:                    24.5 Mpps
>
> Two cores is slower? It used to be faster all the time, didn't it?
>
> > Patched
> > Two cores, bpf_redirect_map: 21.7 Mpps + 2%
> > One core, bpf_redirect_map:  24.9 Mpps + 2%
> >
> > Two cores, bpf_redirect_xsk: 24.0 Mpps +13%
>
> Nice, impressive improvement!


I do appreciate you work and performance optimizations at this level,
because when we are using this few CPU cycles per packet, then it is
really hard to find new ways to reduce cycles further.

Thank you for saying +13% instead of saying +2.7 Mpps.
It *is* impressive to basically reduce cycles by 13%.

 21.3 Mpps = 46.94 nanosec per packet
 24.0 Mpps = 41.66 nanosec per packet

 21.3 Mpps -> 24.0 Mpps = 5.28 nanosec saved

On my 3.60GHz testlab machine that gives me 19 cycles.
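The conversion above can be sketched as a throwaway helper (illustrative only, not kernel code):

```c
/* Nanoseconds per packet at a given rate in Mpps:
 * 1e9 ns/s divided by (mpps * 1e6) pkt/s. */
static double ns_per_packet(double mpps)
{
	return 1000.0 / mpps;
}

/* A time delta expressed in CPU cycles at the given clock in GHz. */
static double ns_to_cycles(double ns, double ghz)
{
	return ns * ghz;
}

/* ns_per_packet(21.3) = 46.94 ns, ns_per_packet(24.0) = 41.67 ns;
 * the 5.28 ns saved is ~19 cycles on a 3.60 GHz machine. */
```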

 
> > One core, bpf_redirect_xsk:  25.5 Mpps + 4%

> >


 24.5 Mpps -> 25.5 Mpps = 1.6 nanosec saved

At this point, saving 1.6 ns is around the cost of a function call (1.3 ns).


We still need these optimizations in the kernel, but end-users in
userspace are very quickly going to waste the 19 cycles we found.
I still support/believe that the OS needs to have as little overhead as
possible, but for me 42 nanosec overhead is close to zero overhead. For
comparison, I just ran a udp_sink[3] test, and it "cost" 625 ns for
delivery of UDP packets into a socket (almost 15 times slower).

I guess my point is that with XDP we have already achieved and exceeded
(my original) performance goals, making it even faster is just showing off ;-P
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_sample.c
[2] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_memset.c
[3] https://github.com/netoptimizer/network-testing/blob/master/src/udp_sink.c
Maciej Fijalkowski Jan. 20, 2021, 4:19 p.m. UTC | #12
On Wed, Jan 20, 2021 at 04:57:08PM +0100, Jesper Dangaard Brouer wrote:
> On Wed, 20 Jan 2021 15:15:22 +0200

> Maxim Mikityanskiy <maximmi@nvidia.com> wrote:

> 

> > On 2021-01-19 17:50, Björn Töpel wrote:

> > > This series extends bind() for XDP sockets, so that the bound socket

> > > is added to the netdev_rx_queue _rx array in the netdevice. We call

> > > this to register the socket. To redirect packets to the registered

> > > socket, a new BPF helper is used: bpf_redirect_xsk().

> > > 

> > > For shared XDP sockets, only the first bound socket is

> > > registered. Users that need more complex setup has to use XSKMAP and

> > > bpf_redirect_map().

> > > 

> > > Now, why would one use bpf_redirect_xsk() over the regular

> > > bpf_redirect_map() helper?

> > > 

> > > * Better performance!

> > > * Convenience; Most user use one socket per queue. This scenario is

> > >    what registered sockets support. There is no need to create an

> > >    XSKMAP. This can also reduce complexity from containerized setups,

> > >    where users might what to use XDP sockets without CAP_SYS_ADMIN

> > >    capabilities.

> 

> I'm buying into the convenience and reduce complexity, and XDP sockets

> without CAP_SYS_ADMIN into containers.

> 

> People might be surprised that I'm actually NOT buying into the better

> performance argument here.  At these speeds we are basically comparing

> how close we are to zero (and have to use nanosec time scale for our

> comparisons), more below.

> 

>  

> > > The first patch restructures xdp_do_redirect() a bit, to make it

> > > easier to add the new helper. This restructure also give us a slight

> > > performance benefit. The following three patches extends bind() and

> > > adds the new helper. After that, two libbpf patches that selects XDP

> > > program based on what kernel is running. Finally, selftests for the new

> > > functionality is added.

> > > 

> > > Note that the libbpf "auto-selection" is based on kernel version, so

> > > it is hard coded to the "-next" version (5.12). If you would like to

> > > try this is out, you will need to change the libbpf patch locally!

> > > 

> > > Thanks to Maciej and Magnus for the internal review/comments!

> > > 

> > > Performance (rxdrop, zero-copy)

> > > 

> > > Baseline

> > > Two cores:                   21.3 Mpps

> > > One core:                    24.5 Mpps  

> > 

> > Two cores is slower? It used to be faster all the time, didn't it?

> > 

> > > Patched

> > > Two cores, bpf_redirect_map: 21.7 Mpps + 2%

> > > One core, bpf_redirect_map:  24.9 Mpps + 2%

> > > 

> > > Two cores, bpf_redirect_xsk: 24.0 Mpps +13%  

> > 

> > Nice, impressive improvement!

> 

> I do appreciate you work and performance optimizations at this level,

> because when we are using this few CPU cycles per packet, then it is

> really hard to find new ways to reduce cycles further.

> 

> Thank for you saying +13% instead of saying +2.7 Mpps.

> It *is* impressive to basically reduce cycles with 13%.

> 

>  21.3 Mpps = 46.94 nanosec per packet

>  24.0 Mpps = 41.66 nanosec per packet

> 

>  21.3 Mpps -> 24.0 Mpps = 5.28 nanosec saved

> 

> On my 3.60GHz testlab machine that gives me 19 cycles.

> 

>  

> > > One core, bpf_redirect_xsk:  25.5 Mpps + 4%

> > >

> 

>  24.5 Mpps -> 25.5 Mpps = 1.6 nanosec saved

> 

> At this point with saving 1.6 ns this is around the cost of a function call 1.3 ns.

> 

> 

> We still need these optimization in the kernel, but end-users in

> userspace are very quickly going to waste the 19 cycles we found.

> I still support/believe that the OS need to have a little overhead as

> possible, but for me 42 nanosec overhead is close to zero overhead. For

> comparison, I just ran a udp_sink[3] test, and it "cost" 625 ns for

> delivery of UDP packets into socket (almost 15 times slower).

> 

> I guess my point is that with XDP we have already achieved and exceeded

> (my original) performance goals, making it even faster is just showing off ;-P


Even though I'll let Björn elaborate on this, we're talking here about
AF_XDP, which is a bit of a different pair of shoes to me in terms of
performance. AFAIK we still have a gap when compared to DPDK's numbers. So
I'm really not sure why better performance bothers you? :)

Let's rather be harsher on changes that actually decrease the
performance, not the other way around. And I suppose you were the one
that was always bringing up the 'death by a 1000 paper cuts' of XDP. So
yeah, I'm a bit confused by your statement.

> -- 

> Best regards,

>   Jesper Dangaard Brouer

>   MSc.CS, Principal Kernel Engineer at Red Hat

>   LinkedIn: http://www.linkedin.com/in/brouer

> 

> [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_sample.c

> [2] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_memset.c

> [3] https://github.com/netoptimizer/network-testing/blob/master/src/udp_sink.c

>
Toke Høiland-Jørgensen Jan. 20, 2021, 4:30 p.m. UTC | #13
Björn Töpel <bjorn.topel@intel.com> writes:

> On 2021-01-20 15:52, Toke Høiland-Jørgensen wrote:

>> Björn Töpel <bjorn.topel@intel.com> writes:

>> 

>>> On 2021-01-20 13:44, Toke Høiland-Jørgensen wrote:

>>>> Björn Töpel <bjorn.topel@gmail.com> writes:

>>>>

>>>>> From: Björn Töpel <bjorn.topel@intel.com>

>>>>>

>>>>> The XDP_REDIRECT implementations for maps and non-maps are fairly

>>>>> similar, but obviously need to take different code paths depending on

>>>>> if the target is using a map or not. Today, the redirect targets for

>>>>> XDP either uses a map, or is based on ifindex.

>>>>>

>>>>> Future commits will introduce yet another redirect target via the a

>>>>> new helper, bpf_redirect_xsk(). To pave the way for that, we introduce

>>>>> an explicit redirect type to bpf_redirect_info. This makes the code

>>>>> easier to follow, and makes it easier to add new redirect targets.

>>>>>

>>>>> Further, using an explicit type in bpf_redirect_info has a slight

>>>>> positive performance impact by avoiding a pointer indirection for the

>>>>> map type lookup, and instead use the hot cacheline for

>>>>> bpf_redirect_info.

>>>>>

>>>>> The bpf_redirect_info flags member is not used by XDP, and not

>>>>> read/written any more. The map member is only written to when

>>>>> required/used, and not unconditionally.

>>>>

>>>> I like the simplification. However, the handling of map clearing becomes

>>>> a bit murky with this change:

>>>>

>>>> You're not changing anything in bpf_clear_redirect_map(), and you're

>>>> removing most of the reads and writes of ri->map. Instead,

>>>> bpf_xdp_redirect_map() will store the bpf_dtab_netdev pointer in

>>>> ri->tgt_value, which xdp_do_redirect() will just read and use without

>>>> checking. But if the map element (or the entire map) has been freed in

>>>> the meantime that will be a dangling pointer. I *think* the RCU callback

>>>> in dev_map_delete_elem() and the rcu_barrier() in dev_map_free()

>>>> protects against this, but that is by no means obvious. So confirming

>>>> this, and explaining it in a comment would be good.

>>>>

>>>

>>> Yes, *most* of the READ_ONCE(ri->map) are removed, it's pretty much only

>>> the bpf_redirect_map(), and as you write, the tracepoints.

>>>

>>> The content/element of the map is RCU protected, and actually even the

>>> map will be around until the XDP processing is complete. Note the

>>> synchronize_rcu() followed after all bpf_clear_redirect_map() calls.

>>>

>>> I'll try to make it clearer in the commit message! Thanks for pointing

>>> that out!

>>>

>>>> Also, as far as I can tell after this, ri->map is only used for the

>>>> tracepoint. So how about just storing the map ID and getting rid of the

>>>> READ/WRITE_ONCE() entirely?

>>>>

>>>

>>> ...and the bpf_redirect_map() helper. Don't you think the current

>>> READ_ONCE(ri->map) scheme is more obvious/clear?

>> 

>> Yeah, after your patch we WRITE_ONCE() the pointer in

>> bpf_redirect_map(), but the only place it is actually *read* is in the

>> tracepoint. So the only purpose of bpf_clear_redirect_map() is to ensure

>> that an invalid pointer is not read in the tracepoint function. Which

>> seems a bit excessive when we could just store the map ID for direct use

>> in the tracepoint and get rid of bpf_clear_redirect_map() entirely, no?

>> 

>> Besides, from a UX point of view, having the tracepoint display the map

>> ID even if that map ID is no longer valid seems to me like it makes more

>> sense than just displaying a map ID of 0 and leaving it up to the user

>> to figure out that this is because the map was cleared. I mean, at the

>> time the redirect was made, that *was* the map ID that was used...

>>

>

> Convinced! Getting rid of bpf_clear_redirect_map() would be good! I'll

> take a stab at this for v3!


Cool!
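A hypothetical userspace model of the map-ID scheme agreed on above (field
and function names are illustrative, not the actual kernel structures):

```c
/* Store the map ID, a plain value, instead of the map pointer in the
 * redirect info. The tracepoint then never dereferences a possibly-freed
 * map, so no READ_ONCE()/WRITE_ONCE() dance or clearing is needed. */
struct redirect_info {
	unsigned int map_id;	/* 0 means "no map used" */
	unsigned int tgt_index;
};

static void record_redirect(struct redirect_info *ri,
			    unsigned int map_id, unsigned int tgt_index)
{
	ri->map_id = map_id;	/* plain store; no pointer lifetime issue */
	ri->tgt_index = tgt_index;
}

/* What a tracepoint would report, even if the map was since freed:
 * the ID that *was* used at redirect time. */
static unsigned int trace_map_id(const struct redirect_info *ri)
{
	return ri->map_id;
}
```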

>> Oh, and as you say due to the synchronize_rcu() call in dev_map_free() I

>> think this whole discussion is superfluous anyway, since it can't

>> actually happen that the map gets freed between the setting and reading

>> of ri->map, no?

>>

>

> It can't be free'd but, ri->map can be cleared via

> bpf_clear_redirect_map(). So, between the helper (setting) and the

> tracepoint in xdp_do_redirect() it can be cleared (say if the XDP

> program is swapped out, prior running xdp_do_redirect()).


But xdp_do_redirect() should be called on driver flush before exiting
the NAPI cycle, so how can the XDP program be swapped out?

> Moving to the scheme you suggested, does make the discussion

> superfluous. :-)


Yup, awesome :)

-Toke
Björn Töpel Jan. 20, 2021, 5:26 p.m. UTC | #14
On 2021-01-20 17:30, Toke Høiland-Jørgensen wrote:
> Björn Töpel <bjorn.topel@intel.com> writes:

> 

[...]
>>

>> It can't be free'd but, ri->map can be cleared via

>> bpf_clear_redirect_map(). So, between the helper (setting) and the

>> tracepoint in xdp_do_redirect() it can be cleared (say if the XDP

>> program is swapped out, prior running xdp_do_redirect()).

> 

> But xdp_do_redirect() should be called on driver flush before exiting

> the NAPI cycle, so how can the XDP program be swapped out?

> 


To clarify; xdp_do_redirect() is called for each packet in the NAPI poll
loop.

Yeah, you're right. xdp_do_redirect() is within the RCU scope, so the
program won't be destroyed (but it can be swapped, though!).

Hmm, so IOW bpf_clear_redirect_map() is not needed anymore...


Björn

[...]
Toke Høiland-Jørgensen Jan. 20, 2021, 5:29 p.m. UTC | #15
Björn Töpel <bjorn.topel@intel.com> writes:

> On 2021-01-20 15:54, Toke Høiland-Jørgensen wrote:

>> Björn Töpel <bjorn.topel@intel.com> writes:

>> 

>>> On 2021-01-20 13:50, Toke Høiland-Jørgensen wrote:

>>>> Björn Töpel <bjorn.topel@gmail.com> writes:

>>>>

>>>>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h

>>>>> index c001766adcbc..bbc7d9a57262 100644

>>>>> --- a/include/uapi/linux/bpf.h

>>>>> +++ b/include/uapi/linux/bpf.h

>>>>> @@ -3836,6 +3836,12 @@ union bpf_attr {

>>>>>     *	Return

>>>>>     *		A pointer to a struct socket on success or NULL if the file is

>>>>>     *		not a socket.

>>>>> + *

>>>>> + * long bpf_redirect_xsk(struct xdp_buff *xdp_md, u64 action)

>>>>> + *	Description

>>>>> + *		Redirect to the registered AF_XDP socket.

>>>>> + *	Return

>>>>> + *		**XDP_REDIRECT** on success, otherwise the action parameter is returned.

>>>>>     */

>>>>

>>>> I think it would be better to make the second argument a 'flags'

>>>> argument and make values > XDP_TX invalid (like we do in

>>>> bpf_xdp_redirect_map() now). By allowing any value as return you lose

>>>> the ability to turn it into a flags argument later...

>>>>

>>>

>>> Yes, but that adds a run-time check. I prefer this non-checked version,

>>> even though it is a bit less futureproof.

>> 

>> That...seems a bit short-sighted? :)

>> Can you actually see a difference in your performance numbers?

>>

>

> I would rather add an additional helper *if* we see the need for flags,

> instead of paying for that upfront. For me, BPF is about being able to

> specialize, and not having one call with tons of checks.


I get that, I'm just pushing back because omitting a 'flags' argument is
literally among the most frequent reasons for having to replace a
syscall (see e.g., [0]) instead of extending it. And yeah, I do realise
that the performance implications are different for XDP than for
syscalls, but maintainability of the API is also important; it's all a
tradeoff. This will be the third redirect helper variant for XDP and I'd
hate for the fourth one to have to be bpf_redirect_xsk_flags() because
it did turn out to be needed...

(One potential concrete reason for this: I believe Magnus was talking
about an API that would allow a BPF program to redirect a packet into
more than one socket (cloning it in the process), or to redirect to a
socket+another target. How would you do that with this new helper?)

[0] https://lwn.net/Articles/585415/

> (Related; Going forward, the growing switch() for redirect targets in

> xdp_do_redirect() is a concern for me...)

>

> And yes, even with all those fancy branch predictors, less instructions

> is still less. :-) (It shows in my ubenchs.)


Right, I do agree that the run-time performance hit of checking the flag
sucks (along with being hard to check for, cf. our parallel discussion
about version checks). So ideally this would be fixed by having the
verifier enforce the argument ranges instead; but if we merge this
without the runtime check now we can't add that later without
potentially breaking programs... :(

-Toke
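The trade-off being debated can be modelled in userspace (illustrative
sketch only; not the code proposed in the series):

```c
enum xdp_action { XDP_ABORTED, XDP_DROP, XDP_PASS, XDP_TX, XDP_REDIRECT };

/* Proposed semantics: return the caller-supplied action verbatim when no
 * socket is registered. Any u64 is accepted, so the argument can never
 * later be reinterpreted as a flags field. */
static int redirect_xsk_action(int have_socket, unsigned long long action)
{
	return have_socket ? XDP_REDIRECT : (int)action;
}

/* Flags-style variant: values above XDP_TX are rejected up front, which
 * reserves the upper bits for future flags, at the cost of one extra
 * run-time check per packet. */
static int redirect_xsk_flags(int have_socket, unsigned long long flags)
{
	if (flags > XDP_TX)
		return XDP_ABORTED;
	return have_socket ? XDP_REDIRECT : (int)flags;
}
```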
Björn Töpel Jan. 20, 2021, 6:22 p.m. UTC | #16
On 2021-01-20 18:29, Toke Høiland-Jørgensen wrote:
> Björn Töpel <bjorn.topel@intel.com> writes:

> 

>> On 2021-01-20 15:54, Toke Høiland-Jørgensen wrote:

>>> Björn Töpel <bjorn.topel@intel.com> writes:

>>>

>>>> On 2021-01-20 13:50, Toke Høiland-Jørgensen wrote:

>>>>> Björn Töpel <bjorn.topel@gmail.com> writes:

>>>>>

>>>>>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h

>>>>>> index c001766adcbc..bbc7d9a57262 100644

>>>>>> --- a/include/uapi/linux/bpf.h

>>>>>> +++ b/include/uapi/linux/bpf.h

>>>>>> @@ -3836,6 +3836,12 @@ union bpf_attr {

>>>>>>      *	Return

>>>>>>      *		A pointer to a struct socket on success or NULL if the file is

>>>>>>      *		not a socket.

>>>>>> + *

>>>>>> + * long bpf_redirect_xsk(struct xdp_buff *xdp_md, u64 action)

>>>>>> + *	Description

>>>>>> + *		Redirect to the registered AF_XDP socket.

>>>>>> + *	Return

>>>>>> + *		**XDP_REDIRECT** on success, otherwise the action parameter is returned.

>>>>>>      */

>>>>>

>>>>> I think it would be better to make the second argument a 'flags'

>>>>> argument and make values > XDP_TX invalid (like we do in

>>>>> bpf_xdp_redirect_map() now). By allowing any value as return you lose

>>>>> the ability to turn it into a flags argument later...

>>>>>

>>>>

>>>> Yes, but that adds a run-time check. I prefer this non-checked version,

>>>> even though it is a bit less futureproof.

>>>

>>> That...seems a bit short-sighted? :)

>>> Can you actually see a difference in your performance numbers?

>>>

>>

>> I would rather add an additional helper *if* we see the need for flags,

>> instead of paying for that upfront. For me, BPF is about being able to

>> specialize, and not having one call with tons of checks.

> 

> I get that, I'm just pushing back because omitting a 'flags' argument is

> literally among the most frequent reasons for having to replace a

> syscall (see e.g., [0]) instead of extending it. And yeah, I do realise

> that the performance implications are different for XDP than for

> syscalls, but maintainability of the API is also important; it's all a

> tradeoff. This will be the third redirect helper variant for XDP and I'd

> hate for the fourth one to have to be bpf_redirect_xsk_flags() because

> it did turn out to be needed...

> 

> (One potential concrete reason for this: I believe Magnus was talking

> about an API that would allow a BPF program to redirect a packet into

> more than one socket (cloning it in the process), or to redirect to a

> socket+another target. How would you do that with this new helper?)

> 

> [0] https://lwn.net/Articles/585415/

>


I have a bit of a different view. One of the really nice parts about BPF
is exactly specialization. A user can tailor the kernel to do a specific
thing. I *don't* see an issue with yet another helper, if that is needed
in the future. I think that is better than bloated helpers trying to
cope with all scenarios. I don't mean we should just add helpers all over
the place, but I do look more lightly on adding helpers than on adding
syscalls.

Elaborating a bit on this: many device drivers try to handle all the
things in the fast path. I see BPF as one way forward for moving away
from that. Set up what you need, and only run what you currently need,
instead of the current "Is bleh on, then baz? Is this on, then that."

So, I would like to avoid "future proofing" the helpers, if that makes
sense. Use what you need. That's why BPF is so good (one of the things)!

As for bpf_redirect_xsk() it's a leaner version of bpf_redirect_map().
You want flags/shared sockets/...? Well go use bpf_redirect_map() and
XSKMAP. bpf_redirect_xsk() is not for you.

A lot of back-and-forth for *one* if-statement, but it's kind of a
design thing for me. ;-)


Björn


>> (Related; Going forward, the growing switch() for redirect targets in

>> xdp_do_redirect() is a concern for me...)

>>

>> And yes, even with all those fancy branch predictors, less instructions

>> is still less. :-) (It shows in my ubenchs.)

> 

> Right, I do agree that the run-time performance hit of checking the flag

> sucks (along with being hard to check for, cf. our parallel discussion

> about version checks). So ideally this would be fixed by having the

> verifier enforce the argument ranges instead; but if we merge this

> without the runtime check now we can't add that later without

> potentially breaking programs... :(

>

> -Toke

>
Toke Høiland-Jørgensen Jan. 20, 2021, 8:26 p.m. UTC | #17
Björn Töpel <bjorn.topel@intel.com> writes:

> On 2021-01-20 18:29, Toke Høiland-Jørgensen wrote:

>> Björn Töpel <bjorn.topel@intel.com> writes:

>> 

>>> On 2021-01-20 15:54, Toke Høiland-Jørgensen wrote:

>>>> Björn Töpel <bjorn.topel@intel.com> writes:

>>>>

>>>>> On 2021-01-20 13:50, Toke Høiland-Jørgensen wrote:

>>>>>> Björn Töpel <bjorn.topel@gmail.com> writes:

>>>>>>

>>>>>>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h

>>>>>>> index c001766adcbc..bbc7d9a57262 100644

>>>>>>> --- a/include/uapi/linux/bpf.h

>>>>>>> +++ b/include/uapi/linux/bpf.h

>>>>>>> @@ -3836,6 +3836,12 @@ union bpf_attr {

>>>>>>>      *	Return

>>>>>>>      *		A pointer to a struct socket on success or NULL if the file is

>>>>>>>      *		not a socket.

>>>>>>> + *

>>>>>>> + * long bpf_redirect_xsk(struct xdp_buff *xdp_md, u64 action)

>>>>>>> + *	Description

>>>>>>> + *		Redirect to the registered AF_XDP socket.

>>>>>>> + *	Return

>>>>>>> + *		**XDP_REDIRECT** on success, otherwise the action parameter is returned.

>>>>>>>      */

>>>>>>

>>>>>> I think it would be better to make the second argument a 'flags'

>>>>>> argument and make values > XDP_TX invalid (like we do in

>>>>>> bpf_xdp_redirect_map() now). By allowing any value as return you lose

>>>>>> the ability to turn it into a flags argument later...

>>>>>>

>>>>>

>>>>> Yes, but that adds a run-time check. I prefer this non-checked version,

>>>>> even though it is a bit less futureproof.

>>>>

>>>> That...seems a bit short-sighted? :)

>>>> Can you actually see a difference in your performance numbers?

>>>>

>>>

>>> I would rather add an additional helper *if* we see the need for flags,

>>> instead of paying for that upfront. For me, BPF is about being able to

>>> specialize, and not having one call with tons of checks.

>> 

>> I get that, I'm just pushing back because omitting a 'flags' argument is

>> literally among the most frequent reasons for having to replace a

>> syscall (see e.g., [0]) instead of extending it. And yeah, I do realise

>> that the performance implications are different for XDP than for

>> syscalls, but maintainability of the API is also important; it's all a

>> tradeoff. This will be the third redirect helper variant for XDP and I'd

>> hate for the fourth one to have to be bpf_redirect_xsk_flags() because

>> it did turn out to be needed...

>> 

>> (One potential concrete reason for this: I believe Magnus was talking

>> about an API that would allow a BPF program to redirect a packet into

>> more than one socket (cloning it in the process), or to redirect to a

>> socket+another target. How would you do that with this new helper?)

>> 

>> [0] https://lwn.net/Articles/585415/

>>

>

> I have a bit of different view. One of the really nice parts about BPF

> is exactly specialization. A user can tailor the kernel do a specific

> thing. I *don't* see an issue with yet another helper, if that is needed

> in the future. I think that is better than bloated helpers trying to

> cope for all scenarios. I don't mean we should just add helpers all over

> the place, but I do see more lightly on adding helpers, than adding

> syscalls.

>

> Elaborating a bit on this: many device drivers try to handle all the

> things in the fast-path. I see BPF as one way forward to moving away

> from that. Setup what you need, and only run what you currently need,

> instead of the current "Is bleh on, then baz? Is this on, then that."

>

> So, I would like to avoid "future proofing" the helpers, if that makes

> sense. Use what you need. That's why BPF is so good (one of the

> things)!


Well, it's a tradeoff. We're still defining an API that should not be
(too) confusing...

> As for bpf_redirect_xsk() it's a leaner version of bpf_redirect_map().

> You want flags/shared sockets/...? Well go use bpf_redirect_map() and

> XSKMAP. bpf_redirect_xsk() is not for you.


This argument, however, I buy: bpf_redirect() is the single-purpose
helper for redirecting to an ifindex, bpf_redirect_xsk() is the
single-purpose helper for redirecting to an XSK, and bpf_redirect_map()
is the generic one that does both of those and more. Fair enough,
consider me convinced :)

> A lot of back-and-forth for *one* if-statement, but it's kind of a

> design thing for me. ;-)


Surely you don't mean to imply that you have *better* things to do with
your time than have a 10-emails-long argument over a single if
statement? ;)

-Toke
Alexei Starovoitov Jan. 20, 2021, 9:15 p.m. UTC | #18
On Wed, Jan 20, 2021 at 12:26 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>

> This argument, however, I buy: bpf_redirect() is the single-purpose

> helper for redirecting to an ifindex, bpf_redirect_xsk() is the

> single-purpose helper for redirecting to an XSK, and bpf_redirect_map()

> is the generic one that does both of those and more. Fair enough,

> consider me convinced :)

>

> > A lot of back-and-forth for *one* if-statement, but it's kind of a

> > design thing for me. ;-)

>

> Surely you don't mean to imply that you have *better* things to do with

> your time than have a 10-emails-long argument over a single if

> statement? ;)


After reading this thread I think I have to pour cold water on the design.

The performance blip comes from hard coded assumptions:
+ queue_id = xdp->rxq->queue_index;
+ xs = READ_ONCE(dev->_rx[queue_id].xsk);

bpf can have specialized helpers, but imo this is beyond what's reasonable.
Please move such things into the program and try to make
bpf_redirect_map faster.

Making af_xdp non-root is orthogonal. If there is actual need for that
it has to be designed thoroughly and not presented as "this helper may
help to do that".
I don't think "may" will materialize unless people actually work
toward the goal of non-root.
Björn Töpel Jan. 21, 2021, 8:18 a.m. UTC | #19
On 2021-01-20 22:15, Alexei Starovoitov wrote:
> On Wed, Jan 20, 2021 at 12:26 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:

>>

>> This argument, however, I buy: bpf_redirect() is the single-purpose

>> helper for redirecting to an ifindex, bpf_redirect_xsk() is the

>> single-purpose helper for redirecting to an XSK, and bpf_redirect_map()

>> is the generic one that does both of those and more. Fair enough,

>> consider me convinced :)

>>

>>> A lot of back-and-forth for *one* if-statement, but it's kind of a

>>> design thing for me. ;-)

>>

>> Surely you don't mean to imply that you have *better* things to do with

>> your time than have a 10-emails-long argument over a single if

>> statement? ;)

> 

> After reading this thread I think I have to pour cold water on the design.

> 

> The performance blip comes from hard coded assumptions:

> + queue_id = xdp->rxq->queue_index;

> + xs = READ_ONCE(dev->_rx[queue_id].xsk);

>


Yes, one can see this as a constrained map:

* The map belongs to a certain netdev.
* Each entry corresponds to a certain queue id.

I.e. if we do a successful (non-NULL) lookup, we *know* that all sockets
in that map belong to the netdev and have the correct queue id.

By doing that we can get rid of two run-time checks: "Is the socket
bound to this netdev?" and "Is this the correct queue id?".

> bpf can have specialized helpers, but imo this is beyond what's reasonable.

> Please move such things into the program and try to make

> bpf_redirect_map faster.

>


I obviously prefer this path, ideally combined with a way to specialize
xdp_do_redirect() even more. Then again, you are the maintainer! :-)

Maybe an alternative would be adding a new type of XSKMAP, constrained in
a similar way as above, and continuing with bpf_redirect_map(), but with
this new map the index argument would be ignored. Unfortunately the BPF
context (xdp_buff in this case) is not passed to bpf_redirect_map(), so
getting the actual queue_id in the helper is hard. Adding the context as
an additional argument would mean a new helper...

I'll need to think a bit more about it. Input/ideas are welcome!


> Making af_xdp non-root is orthogonal. If there is actual need for that

> it has to be designed thoroughly and not presented as "this helper may

> help to do that".

> I don't think "may" will materialize unless people actually work

> toward the goal of non-root.

> 


Fair enough! The same goal could be reached using the existing map approach.


Björn
Jesper Dangaard Brouer Jan. 21, 2021, 5:01 p.m. UTC | #20
On Wed, 20 Jan 2021 17:19:31 +0100
Maciej Fijalkowski <maciej.fijalkowski@intel.com> wrote:

> On Wed, Jan 20, 2021 at 04:57:08PM +0100, Jesper Dangaard Brouer wrote:

> > On Wed, 20 Jan 2021 15:15:22 +0200

> > Maxim Mikityanskiy <maximmi@nvidia.com> wrote:

> >   

> > > On 2021-01-19 17:50, Björn Töpel wrote:  

> > > > This series extends bind() for XDP sockets, so that the bound socket

> > > > is added to the netdev_rx_queue _rx array in the netdevice. We call

> > > > this to register the socket. To redirect packets to the registered

> > > > socket, a new BPF helper is used: bpf_redirect_xsk().

> > > > 

> > > > For shared XDP sockets, only the first bound socket is

> > > > registered. Users that need more complex setup has to use XSKMAP and

> > > > bpf_redirect_map().

> > > > 

> > > > Now, why would one use bpf_redirect_xsk() over the regular

> > > > bpf_redirect_map() helper?

> > > > 

> > > > * Better performance!

> > > > * Convenience; Most user use one socket per queue. This scenario is

> > > >    what registered sockets support. There is no need to create an

> > > >    XSKMAP. This can also reduce complexity from containerized setups,

> > > >    where users might what to use XDP sockets without CAP_SYS_ADMIN

> > > >    capabilities.  

> > 

> > I'm buying into the convenience and reduce complexity, and XDP sockets

> > without CAP_SYS_ADMIN into containers.

> > 

> > People might be surprised that I'm actually NOT buying into the better

> > performance argument here.  At these speeds we are basically comparing

> > how close we are to zero (and have to use nanosec time scale for our

> > comparisons), more below.

> > 

> >    

> > > > The first patch restructures xdp_do_redirect() a bit, to make it

> > > > easier to add the new helper. This restructure also give us a slight

> > > > performance benefit. The following three patches extends bind() and

> > > > adds the new helper. After that, two libbpf patches that select the XDP
> > > > program based on what kernel is running. Finally, selftests for the new
> > > > functionality are added.
> > > >
> > > > Note that the libbpf "auto-selection" is based on kernel version, so
> > > > it is hard coded to the "-next" version (5.12). If you would like to
> > > > try this out, you will need to change the libbpf patch locally!
> > > >
> > > > Thanks to Maciej and Magnus for the internal review/comments!
> > > >
> > > > Performance (rxdrop, zero-copy)
> > > >
> > > > Baseline
> > > > Two cores:                   21.3 Mpps
> > > > One core:                    24.5 Mpps
> > >
> > > Two cores is slower? It used to be faster all the time, didn't it?
> > >
> > > > Patched
> > > > Two cores, bpf_redirect_map: 21.7 Mpps + 2%
> > > > One core, bpf_redirect_map:  24.9 Mpps + 2%
> > > >
> > > > Two cores, bpf_redirect_xsk: 24.0 Mpps +13%
> > >
> > > Nice, impressive improvement!
> >
> > I do appreciate your work and performance optimizations at this level,
> > because when we are down to this few CPU cycles per packet, it is
> > really hard to find new ways to reduce cycles further.
> >
> > Thank you for saying +13% instead of saying +2.7 Mpps.
> > It *is* impressive to basically reduce cycles by 13%.
> >
> >  21.3 Mpps = 46.94 nanosec per packet
> >  24.0 Mpps = 41.66 nanosec per packet
> >
> >  21.3 Mpps -> 24.0 Mpps = 5.28 nanosec saved
> >
> > On my 3.60GHz testlab machine that gives me 19 cycles.
> >
> > > > One core, bpf_redirect_xsk:  25.5 Mpps + 4%
> >
> >  24.5 Mpps -> 25.5 Mpps = 1.6 nanosec saved
> >
> > At this point, saving 1.6 ns is around the cost of a function call (1.3 ns).
> >
> > We still need these optimizations in the kernel, but end-users in
> > userspace are very quickly going to waste the 19 cycles we found.
> > I still support/believe that the OS needs to have as little overhead as
> > possible, but to me 42 nanosec of overhead is close to zero. For
> > comparison, I just ran a udp_sink[3] test, and it "cost" 625 ns for
> > delivery of UDP packets into a socket (almost 15 times slower).
> >
> > I guess my point is that with XDP we have already achieved and exceeded
> > (my original) performance goals; making it even faster is just showing off ;-P
>
> Even though I'll let Bjorn elaborate on this, we're talking here about
> AF_XDP, which is a bit of a different pair of shoes to me in terms of
> performance. AFAIK we still have a gap when compared to DPDK's numbers. So

AFAIK the gap on this i40e hardware is DPDK=36 Mpps and, let's say,
AF_XDP=25.5 Mpps. That looks like a large gap, +10.5 Mpps, but it is
only 11.4 nanosec.

> I'm really not sure why better performance bothers you? :)


Finding those 11.4 nanosec is going to be really difficult and take a
huge effort. My fear is also that some of these optimizations will make
the code harder to maintain and understand.

If you are ready to do this huge effort, then I'm actually ready to
provide ideas on what you can optimize.

E.g. you can get 2-4 nanosec if we prefetch to L2 in driver (assuming
data is DDIO in L3 already).  (CPU supports L2 prefetch, but kernel
have no users of that feature).

The base design of the XDP redirect API, which has a number of function
calls per packet, in itself becomes a limiting factor.  Even the cost
of getting the per-CPU pointer to bpf_redirect_info becomes something
to look at.


> Let's rather be more harsh on changes that actually decrease the
> performance, not the other way around. And I suppose you were the one that
> always was bringing up the 'death by a 1000 paper cuts' of XDP. So yeah,
> I'm a bit confused with your statement.


Yes, we really need to protect the existing performance.

I think I'm mostly demotivated by the 'death by a 1000 paper cuts'.
People with good intentions are adding features, and performance slowly
disappears.  I've been spending weeks/months finding nanosec-level
improvements, which I don't have time to "defend". Recently I realized
that the 2nd XDP prog on egress, even when no prog is attached, is
slowing down the performance of base redirect (I don't fully understand
why; maybe it is simply the I-cache increase, which I didn't notice when
reviewing the code).

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
Magnus Karlsson Jan. 22, 2021, 8:59 a.m. UTC | #21
On Thu, Jan 21, 2021 at 6:07 PM Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
>
> [...]
>
> E.g. you can get 2-4 nanosec if we prefetch to L2 in driver (assuming
> data is DDIO in L3 already).  (CPU supports L2 prefetch, but kernel
> have no users of that feature).


In my experience, I would caution starting to use L2 prefetch at this
point since it is somewhat finicky. It is possible to get better
performance using it on one platform, but significantly harder to get
it to work also on a three year older one, not to mention other
platforms and architectures.

Instead, I would start with optimizations that always work such as not
having to execute code (or removing code) that is not necessary for
the functionality that is provided. One example of this is having to
execute all the bpf_redirect code when we are using packet steering in
HW to AF_XDP sockets and other potential consumers. In this scenario,
the core bpf_redirect code (xdp_do_redirect + bpf_xdp_redirect_map +
__xsk_map_redirect for the XSKMAP case) provides no value and amounts
to 19% of the total execution time on my machine (application plus
kernel, and the XDP program itself is another 7% on top of the 19%).
By removing this, you could cut down a large chunk of those 11.4 ns we
would need to find in our optimization work. Just to be clear, there
are plenty of other scenarios where the bpf_redirect functionality
provides great value, so it is really good to have. I just think it is
weird that it has to be executed all the time even when it is not
needed. The big question is just how to achieve this in an
architecturally sound way. Any ideas?

Just note that there is plenty that could be done in other areas too.
For example, the Rx part of the driver is around 22% of the execution
time and I am sure we can optimize this part too and increase the
readability and maintainability of the code at the same time. But not
trim it down to 0% as some of this is actually necessary work in all
cases where traffic is received :-). xp_alloc is another big time
consumer of cycles, but for this one I do have a patch set that cuts
it down from 21% to 8% that I plan on submitting in February.

perf top data:
 21.58%  [ice]                      [k] ice_clean_rx_irq_zc
 21.21%  [kernel]                   [k] xp_alloc
 13.93%  [kernel]                   [k] __xsk_rcv_zc
  8.53%  [kernel]                   [k] xdp_do_redirect
  8.15%  [kernel]                   [k] xsk_rcv
  6.66%  [kernel]                   [k] bpf_xdp_redirect_map
  5.08%  bpf_prog_992d9ddc835e5629  [k] bpf_prog_992d9ddc835e5629
  3.73%  [ice]                      [k] ice_alloc_rx_bufs_zc
  3.36%  [kernel]                   [k] __xsk_map_redirect
  2.62%  xdpsock                    [.] main
  0.96%  [kernel]                   [k] rcu_read_unlock_strict
  0.90%  bpf_dispatcher_xdp         [k] bpf_dispatcher_xdp
  0.87%  [kernel]                   [k] bpf_dispatcher_xdp_func
  0.51%  [kernel]                   [k] xsk_flush
  0.38%  [kernel]                   [k] lapic_next_deadline
  0.13%  [ice]                      [k] ice_napi_poll

Maciej Fijalkowski Jan. 22, 2021, 9:45 a.m. UTC | #22
On Fri, Jan 22, 2021 at 09:59:53AM +0100, Magnus Karlsson wrote:
> [...]

Just to add to what Magnus wrote below: when comparing technologies, XDP
is put up against the kernel's netstack, and as we know, the difference
is really big and it's no secret who wins in terms of performance in
specific scenarios. So, I'm not saying I'm super OK with that, but we
can allow ourselves a cool feature that boosts our control plane but
degrades performance a bit, as you found out with the egress XDP prog.
We are still going to be fine.

For AF_XDP the case is the opposite, meaning we are the chaser in this
scenario. Performance work is still the top priority for AF_XDP in my
eyes.

> In my experience, I would caution starting to use L2 prefetch at this
> point since it is somewhat finicky. It is possible to get better
> performance using it on one platform, but significantly harder to get
> it to work also on a three year older one, not to mention other
> platforms and architectures.
>
> [...]