
[pull request][net-next 00/10] mlx5 Multi packet tx descriptors for SKBs

Message ID 20200903210022.22774-1-saeedm@nvidia.com

Message

Saeed Mahameed Sept. 3, 2020, 9 p.m. UTC
Hi Dave & Jakub,

This series adds support for Multi packet tx descriptors for SKBs.
For more information, please see the tag log below.

One note worth mentioning here: Maxim had to do some manual function
inlining in the tx.c file to avoid a performance drop caused by the
refactoring and the extraction of functions for reuse. I hope this is not
a big deal, as the alternative is to avoid code reuse, which makes the
mlx5 tx path __uglier__.

Please pull and let me know if there is any problem.

Thanks,
Saeed.

---
The following changes since commit 08aaa0819d5cce78a10c2fcea17057d07698691f:

  Merge branch 'l2tp-miscellaneous-cleanups' (2020-09-03 12:19:04 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git tags/mlx5-updates-2020-09-03

for you to fetch changes up to b960a70f9cf90524f2b4e67d202e7240c3ef6928:

  net/mlx5e: Enhanced TX MPWQE for SKBs (2020-09-03 13:56:10 -0700)

----------------------------------------------------------------
mlx5-updates-2020-09-03

Multi packet TX descriptor support for SKBs.

This series introduces some refactoring of the regular TX data path in
mlx5 and adds the Enhanced TX MPWQE feature support. MPWQE stands for
multi-packet work queue element, and it can serve multiple packets,
reducing the PCI bandwidth spent on control traffic. It should improve
performance in scenarios where PCI is the bottleneck, and xmit_more is
signaled by the kernel. The refactoring done in this series also
improves the packet rate on its own.
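
As a quick illustration of how xmit_more comes into play: the driver rings
the doorbell only when the stack signals that no more packets are queued,
so one MMIO write can cover several posted WQEs. A sketch of the pattern,
mirroring a hunk quoted in the review thread below:

  send_doorbell = __netdev_tx_sent_queue(sq->txq, attr->num_bytes,
                                         xmit_more);
  if (send_doorbell)
          /* a single PCI write posts everything since the last ring */
          mlx5e_notify_hw(wq, sq->pc, sq->uar_map, cseg);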

MPWQE is already implemented in the XDP TX path; this series adds MPWQE
support for the regular kernel SKB TX path.

MPWQE is supported on ConnectX-5 and onward; for older devices, the driver
keeps backward compatibility with the regular (single-packet) WQE
descriptors.

MPWQE is not compatible with certain offloads and features, such as TLS
offload, TSO, and nonlinear SKBs. If such incompatible features are in use,
the driver gracefully falls back to posting a regular (non-MPWQE) WQE per
SKB.
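
As a rough illustration of that fallback decision (the helper name below
is hypothetical; the real checks in the driver are more involved and also
cover the TLS offload state):

  /* Sketch only: decide whether an SKB may join an MPWQE session. */
  static bool mlx5e_tx_skb_mpwqe_eligible(const struct sk_buff *skb)
  {
          /* TSO and nonlinear SKBs fall back to a regular WQE */
          return !skb_is_gso(skb) && !skb_is_nonlinear(skb);
  }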

Prior to the final patch, "net/mlx5e: Enhanced TX MPWQE for SKBs", which
adds the actual support, Maxim refactored the TX data path to split it
into stages and smaller helper functions that can be reused by both the
legacy path and the new MPWQE feature.

Due to this refactoring and the increased number of helper functions,
Maxim had to manually tune the inlining of these functions in the tx.c
file to get the maximum performance and the expected gains from MPWQE.

Performance effect:

All of the changes below were tested with UDP packet-rate and pktgen
tests; no performance impact was seen in the TCP single-stream test or
the XDP_TX single-stream test.

CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz (x86_64)
NIC: Mellanox ConnectX-6 Dx

1)  Refactoring #1: Refactor xmit functions & manual inlining
This change has no performance impact in the TCP single-stream test or
the XDP_TX single-stream test.

UDP pktgen (burst 32), single stream:
  Packet rate: 17.55 Mpps -> 19.23 Mpps
  Instructions per packet: 420 -> 360
  Cycles per packet: 165 -> 142

2) Refactoring #2: Support multiple SKBs in a TX WQE
The first building block needed to support multiple SKBs in the downstream
MPWQE patch.

UDP pktgen (burst 32), single stream:
  Packet rate: 19.23 Mpps -> 19.12 Mpps
  Instructions per packet: 360 -> 354
  Cycles per packet: 142 -> 140

3) MPWQE Feature for SKBs (final patch)

UDP pktgen, 64-byte packets, single stream, MPWQE off:
  Packet rate: 19.12 Mpps -> 20.02 Mpps
  Instructions per packet: 354 -> 347
  Cycles per packet: 140 -> 129

UDP pktgen, 64-byte packets, single stream, MPWQE on:
  Packet rate: 19.12 Mpps -> 20.67 Mpps
  Instructions per packet: 354 -> 335
  Cycles per packet: 140 -> 124

Enabling MPWQE can reduce PCI bandwidth usage:
  PCI Gen2, pktgen at fixed rate of 36864000 pps on 24 CPU cores:
    Inbound PCI utilization with MPWQE off: 81.3%
    Inbound PCI utilization with MPWQE on: 59.3%
  PCI Gen3, pktgen at fixed rate of 56064005 pps on 24 CPU cores:
    Inbound PCI utilization with MPWQE off: 65.8%
    Inbound PCI utilization with MPWQE on: 49.2%

Enabling MPWQE can also reduce CPU load, increasing the packet rate when
the CPU is the bottleneck:
  PCI Gen2, pktgen at full rate on 24 CPU cores:
    Packet rate with MPWQE off: 37.4 Mpps
    Packet rate with MPWQE on: 49.1 Mpps
  PCI Gen3, pktgen at full rate on 24 CPU cores:
    Packet rate with MPWQE off: 56.2 Mpps
    Packet rate with MPWQE on: 67.0 Mpps

Burst size in all pktgen tests is 32.

To avoid performance degradation when MPWQE is off, manual optimizations
of function inlining were performed. It's especially important to keep
mlx5e_sq_xmit_mpwqe noinline: otherwise gcc inlines it automatically,
which bloats mlx5e_xmit and slows it down, reducing the maximum gain seen
with MPWQE. To avoid this, we had two options (see the sketch after this
list):
1. Drop the refactoring and duplicate the TX data path into two huge
functions.
2. Refactor with code reuse and manual inlining, as we did in this
series.
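
A minimal compilable sketch of the noinline idea (generic code, not the
driver; in the kernel the noinline keyword comes from <linux/compiler.h>,
here the underlying GCC attribute is spelled out):

  #include <stdio.h>

  #define noinline __attribute__((noinline))

  /* Stands in for mlx5e_sq_xmit_mpwqe: a large, less frequently taken
   * path that must not be merged into its hot caller. */
  static noinline void bulky_path(int x)
  {
          printf("bulky path: %d\n", x);
  }

  /* Stands in for mlx5e_xmit: kept small so it stays hot in the I-cache. */
  static void hot_xmit(int x)
  {
          if (x & 1)
                  bulky_path(x); /* inlining this would bloat hot_xmit */
  }

  int main(void)
  {
          for (int i = 0; i < 4; i++)
                  hot_xmit(i);
          return 0;
  }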

-Saeed.

----------------------------------------------------------------
Maxim Mikityanskiy (10):
      net/mlx5e: Refactor inline header size calculation in the TX path
      net/mlx5e: Refactor xmit functions
      net/mlx5e: Small improvements for XDP TX MPWQE logic
      net/mlx5e: Unify constants for WQE_EMPTY_DS_COUNT
      net/mlx5e: Move the TLS resync check out of the function
      net/mlx5e: Support multiple SKBs in a TX WQE
      net/mlx5e: Generalize TX MPWQE checks for full session
      net/mlx5e: Rename xmit-related structs to generalize them
      net/mlx5e: Move TX code into functions to be used by MPWQE
      net/mlx5e: Enhanced TX MPWQE for SKBs

 drivers/net/ethernet/mellanox/mlx5/core/en.h       |  30 +-
 drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h  | 102 ++--
 drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c   |  33 +-
 drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h   |  60 +-
 .../net/ethernet/mellanox/mlx5/core/en/xsk/tx.c    |   2 +-
 .../mellanox/mlx5/core/en_accel/en_accel.h         |  32 +-
 .../ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c |   3 -
 .../mellanox/mlx5/core/en_accel/ktls_txrx.h        |  20 +-
 .../mellanox/mlx5/core/en_accel/tls_rxtx.c         |   8 +-
 .../net/ethernet/mellanox/mlx5/core/en_ethtool.c   |  15 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |  18 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_stats.c |   6 +
 drivers/net/ethernet/mellanox/mlx5/core/en_stats.h |   4 +
 drivers/net/ethernet/mellanox/mlx5/core/en_tx.c    | 653 +++++++++++++++------
 14 files changed, 659 insertions(+), 327 deletions(-)

Comments

Willem de Bruijn Sept. 4, 2020, 3:06 p.m. UTC | #1
On Thu, Sep 3, 2020 at 11:01 PM Saeed Mahameed <saeedm@nvidia.com> wrote:
>
> From: Maxim Mikityanskiy <maximmi@mellanox.com>
>
> mlx5e_txwqe_complete performs some actions that can be taken to separate
> functions:
>
> 1. Update the flags needed for hardware timestamping.
>
> 2. Stop the TX queue if it's full.
>
> Take these actions into separate functions to be reused by the MPWQE
> code in the following commit and to maintain clear responsibilities of
> functions.
>
> Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
> ---
>  .../net/ethernet/mellanox/mlx5/core/en_tx.c   | 23 ++++++++++++++-----
>  1 file changed, 17 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
> index 9ced350150b3..3b68c8333875 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
> @@ -311,6 +311,20 @@ static inline void mlx5e_sq_calc_wqe_attr(struct sk_buff *skb,
>         };
>  }
>
> +static inline void mlx5e_tx_skb_update_hwts_flags(struct sk_buff *skb)
> +{
> +       if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP))
> +               skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS;
> +}

Subjective, but this helper adds a level of indirection and introduces
code churn without simplifying anything, imho.

> +static inline void mlx5e_tx_check_stop(struct mlx5e_txqsq *sq)
> +{
> +       if (unlikely(!mlx5e_wqc_has_room_for(&sq->wq, sq->cc, sq->pc, sq->stop_room))) {
> +               netif_tx_stop_queue(sq->txq);
> +               sq->stats->stopped++;
> +       }
> +}
> +
>  static inline void
>  mlx5e_txwqe_complete(struct mlx5e_txqsq *sq, struct sk_buff *skb,
>                      const struct mlx5e_tx_attr *attr,
> @@ -332,14 +346,11 @@ mlx5e_txwqe_complete(struct mlx5e_txqsq *sq, struct sk_buff *skb,
>         cseg->opmod_idx_opcode = cpu_to_be32((sq->pc << 8) | attr->opcode);
>         cseg->qpn_ds           = cpu_to_be32((sq->sqn << 8) | wqe_attr->ds_cnt);
>
> -       if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP))
> -               skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS;
> +       mlx5e_tx_skb_update_hwts_flags(skb);
>
>         sq->pc += wi->num_wqebbs;
> -       if (unlikely(!mlx5e_wqc_has_room_for(wq, sq->cc, sq->pc, sq->stop_room))) {
> -               netif_tx_stop_queue(sq->txq);
> -               sq->stats->stopped++;
> -       }
> +
> +       mlx5e_tx_check_stop(sq);
>
>         send_doorbell = __netdev_tx_sent_queue(sq->txq, attr->num_bytes, xmit_more);
>         if (send_doorbell)
> --
> 2.26.2
>
Maxim Mikityanskiy Sept. 8, 2020, 8:59 a.m. UTC | #2
On 2020-09-04 18:05, Willem de Bruijn wrote:
> On Thu, Sep 3, 2020 at 11:01 PM Saeed Mahameed <saeedm@nvidia.com> wrote:
>>
>> From: Maxim Mikityanskiy <maximmi@mellanox.com>
>>
>> A constant for the number of DS in an empty WQE (i.e. a WQE without data
>> segments) is needed in multiple places (normal TX data path, MPWQE in
>> XDP), but currently we have a constant for XDP and an inline formula in
>> normal TX. This patch introduces a common constant.
>>
>> Additionally, mlx5e_xdp_mpwqe_session_start is converted to use struct
>> assignment, because the code nearby is touched.
>>
>> Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
>> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
>> ---
>>   .../net/ethernet/mellanox/mlx5/core/en/txrx.h |  2 ++
>>   .../net/ethernet/mellanox/mlx5/core/en/xdp.c  | 13 +++++++-----
>>   .../net/ethernet/mellanox/mlx5/core/en/xdp.h  | 21 +++++++------------
>>   .../net/ethernet/mellanox/mlx5/core/en_tx.c   |  2 +-
>>   4 files changed, 19 insertions(+), 19 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h b/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
>> index d4ee22789ab0..155b89998891 100644
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
>> @@ -7,6 +7,8 @@
>>   #include "en.h"
>>   #include <linux/indirect_call_wrapper.h>
>>
>> +#define MLX5E_TX_WQE_EMPTY_DS_COUNT (sizeof(struct mlx5e_tx_wqe) / MLX5_SEND_WQE_DS)
>> +
> 
> Out of curiosity, what is the logic for dividing this struct by 16?

The hardware needs the size of a WQE in DS units (16 bytes). An empty
WQE takes 2 DS (for the ctrl and eth segments), and this macro is the
initial size of an empty WQE (2 DS). As data segments are added to the
WQE, it grows, and its size in DS also grows.
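
A minimal standalone sketch of that arithmetic, with the segment structs
reduced to their sizes (both the ctrl and eth segments are 16 bytes in
the hardware layout, and one DS unit is 16 bytes):

  #include <stdio.h>
  #include <stdint.h>

  struct mlx5_wqe_ctrl_seg { uint8_t raw[16]; };
  struct mlx5_wqe_eth_seg  { uint8_t raw[16]; };
  struct mlx5_wqe_data_seg { uint8_t raw[16]; };

  struct mlx5e_tx_wqe {
          struct mlx5_wqe_ctrl_seg ctrl;
          struct mlx5_wqe_eth_seg  eth;
          struct mlx5_wqe_data_seg data[]; /* grows as segments are added */
  };

  #define MLX5_SEND_WQE_DS 16
  #define MLX5E_TX_WQE_EMPTY_DS_COUNT \
          (sizeof(struct mlx5e_tx_wqe) / MLX5_SEND_WQE_DS)

  int main(void)
  {
          /* (16 + 16) / 16 = 2 DS for an empty WQE */
          printf("empty WQE DS count: %zu\n", MLX5E_TX_WQE_EMPTY_DS_COUNT);
          return 0;
  }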

> struct mlx5e_tx_wqe {
>          struct mlx5_wqe_ctrl_seg ctrl;
>          struct mlx5_wqe_eth_seg  eth;
>          struct mlx5_wqe_data_seg data[0];
> };
> 
>>   #define INL_HDR_START_SZ (sizeof(((struct mlx5_wqe_eth_seg *)NULL)->inline_hdr.start))
>>
>>   enum mlx5e_icosq_wqe_type {
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
>> index 7fccd2ea7dc9..81cd9a04bcb0 100644
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
>> @@ -196,16 +196,19 @@ static void mlx5e_xdp_mpwqe_session_start(struct mlx5e_xdpsq *sq)
>>   {
>>          struct mlx5e_xdp_mpwqe *session = &sq->mpwqe;
>>          struct mlx5e_xdpsq_stats *stats = sq->stats;
>> +       struct mlx5e_tx_wqe *wqe;
>>          u16 pi;
>>
>>          pi = mlx5e_xdpsq_get_next_pi(sq, MLX5E_XDP_MPW_MAX_WQEBBS);
>> -       session->wqe = MLX5E_TX_FETCH_WQE(sq, pi);
>> -
>> +       wqe = MLX5E_TX_FETCH_WQE(sq, pi);
>>          net_prefetchw(session->wqe->data);
> 
> Is this prefetch still valid?

It should be:

net_prefetchw(wqe->data);

Probably a bad rebase.

> And is the temporary variable wqe still
> needed at all?

We want to prefetch as early as possible (before filling *session).
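
I.e. the corrected ordering would look roughly like this (a sketch based
on the hunk above, with the prefetch fix applied):

  pi = mlx5e_xdpsq_get_next_pi(sq, MLX5E_XDP_MPW_MAX_WQEBBS);
  wqe = MLX5E_TX_FETCH_WQE(sq, pi);
  /* prefetch through the local pointer, as early as possible */
  net_prefetchw(wqe->data);

  *session = (struct mlx5e_xdp_mpwqe) {
          .wqe = wqe,
          .ds_count = MLX5E_TX_WQE_EMPTY_DS_COUNT,
          .pkt_count = 0,
          .inline_on = mlx5e_xdp_get_inline_state(sq, session->inline_on),
  };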

> 
>> -       session->ds_count  = MLX5E_XDP_TX_EMPTY_DS_COUNT;
>> -       session->pkt_count = 0;
>>
>> -       mlx5e_xdp_update_inline_state(sq);
>> +       *session = (struct mlx5e_xdp_mpwqe) {
>> +               .wqe = wqe,
>> +               .ds_count = MLX5E_TX_WQE_EMPTY_DS_COUNT,
>> +               .pkt_count = 0,
>> +               .inline_on = mlx5e_xdp_get_inline_state(sq, session->inline_on),
>> +       };
>>
>>          stats->mpwqe++;
>>   }
Willem de Bruijn Sept. 8, 2020, 9:06 a.m. UTC | #3
On Tue, Sep 8, 2020 at 10:59 AM Maxim Mikityanskiy <maximmi@nvidia.com> wrote:
>
> On 2020-09-04 18:05, Willem de Bruijn wrote:
> > On Thu, Sep 3, 2020 at 11:01 PM Saeed Mahameed <saeedm@nvidia.com> wrote:
> >>
> >> From: Maxim Mikityanskiy <maximmi@mellanox.com>
> >>
> >> A constant for the number of DS in an empty WQE (i.e. a WQE without data
> >> segments) is needed in multiple places (normal TX data path, MPWQE in
> >> XDP), but currently we have a constant for XDP and an inline formula in
> >> normal TX. This patch introduces a common constant.
> >>
> >> Additionally, mlx5e_xdp_mpwqe_session_start is converted to use struct
> >> assignment, because the code nearby is touched.
> >>
> >> Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
> >> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
> >> ---
> >>   .../net/ethernet/mellanox/mlx5/core/en/txrx.h |  2 ++
> >>   .../net/ethernet/mellanox/mlx5/core/en/xdp.c  | 13 +++++++-----
> >>   .../net/ethernet/mellanox/mlx5/core/en/xdp.h  | 21 +++++++------------
> >>   .../net/ethernet/mellanox/mlx5/core/en_tx.c   |  2 +-
> >>   4 files changed, 19 insertions(+), 19 deletions(-)
> >>
> >> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h b/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
> >> index d4ee22789ab0..155b89998891 100644
> >> --- a/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
> >> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
> >> @@ -7,6 +7,8 @@
> >>   #include "en.h"
> >>   #include <linux/indirect_call_wrapper.h>
> >>
> >> +#define MLX5E_TX_WQE_EMPTY_DS_COUNT (sizeof(struct mlx5e_tx_wqe) / MLX5_SEND_WQE_DS)
> >> +
> >
> > Out of curiosity, what is the logic for dividing this struct by 16?
>
> The hardware needs the size of a WQE in DS units (16 bytes). An empty
> WQE takes 2 DS (for the ctrl and eth segments), and this macro is the
> initial size of an empty WQE (2 DS). As data segments are added to the
> WQE, it grows, and its size in DS also grows.
>
> > struct mlx5e_tx_wqe {
> >          struct mlx5_wqe_ctrl_seg ctrl;
> >          struct mlx5_wqe_eth_seg  eth;
> >          struct mlx5_wqe_data_seg data[0];
> > };

Thanks. It was not obvious to me that the first two are the same size as
the data segs. But that actually is pretty logical. Ack.