
[v5,net-next,00/36] nvme-tcp receive and transmit offloads

Message ID 20210722110325.371-1-borisp@nvidia.com
Series nvme-tcp receive and transmit offloads

Message

Boris Pismenny July 22, 2021, 11:02 a.m. UTC
From: Boris Pismenny <borisp@mellanox.com>

Changes since v4:
=========================================
* Add transmit offload patches
* Use one feature bit for both receive and transmit offload

Changes since v3:
=========================================
* Use DDP_TCP ifdefs in iov_iter and skb iterators to minimize impact
when compiled out (Christoph)
* Simplify netdev references and reduce the use of
get_netdev_for_sock (Sagi)
* Avoid "static" in it's own line, move it one line down (Christoph)
* Pass (queue, skb, *offset) and retrieve the pdu_seq in
nvme_tcp_resync_response (Sagi)
* Add missing assignment of offloading_netdev to null in offload_limits
error case (Sagi)
* Set req->offloaded = false once -- the lifetime rules are:
set to false on cmd_setup / set to true when ddp setup succeeds (Sagi)
* Replace pr_info_ratelimited with dev_info_ratelimited (Sagi)
* Add nvme_tcp_complete_request and invoke it from two similar call
sites (Sagi)
* Introduce nvme_tcp_req_map_sg earlier in the series (Sagi)
* Add nvme_tcp_consume_skb and put into it a hunk from
nvme_tcp_recv_data to handle copy with and without offload

Changes since v2:
=========================================
* Use skb->ddp_crc for copy offload to avoid skb_condense
* Default mellanox driver support to no (experimental feature)
* In iov_iter use non-ddp functions for kvec and iovec
* Remove typecasting in nvme-tcp

Changes since v1:
=========================================
* Rework iov_iter copy skip if src==dst to be less intrusive (David Ahern)
* Add tcp-ddp documentation (David Ahern)
* Refactor mellanox driver patches into more patches (Saeed Mahameed)
* Avoid pointer casting (David Ahern)
* Rename nvme-tcp offload flags (Shai Malin)
* Update cover-letter according to the above

Changes since RFC v1:
=========================================
* Split mlx5 driver patches to several commits
* Fix nvme-tcp handling of recovery flows. In particular, move queue offload
  init/teardown to the start/stop functions.

# Overview
=========================================
This series adds support for nvme-tcp receive and transmit offloads
which do not mandate the offload of the network stack to the device.
Instead, these work together with TCP to offload:
1. copy from SKB to the block layer buffers
2. CRC calculation and verification for received PDUs

The series implements these as a generic offload infrastructure for storage
protocols, which we call TCP Direct Data Placement (TCP_DDP) and TCP DDP CRC,
respectively. We use this infrastructure to implement NVMe-TCP offload for copy
and CRC. Future implementations can reuse the same infrastructure for other
protocols such as iSCSI.

Note:
These offloads are similar in nature to the packet-based NIC TLS offloads,
which are already upstream (see net/tls/tls_device.c).
You can read more about TLS offload here:
https://www.kernel.org/doc/html/latest/networking/tls-offload.html

# Initialization and teardown:
=========================================
The offload for IO queues is initialized after the handshake of the
NVMe-TCP protocol is finished by calling `nvme_tcp_offload_socket`
with the TCP socket of the nvme_tcp_queue. This operation sets up all
relevant contexts in hardware. If it fails, the IO queue proceeds as usual
with no offload. If it succeeds, `nvme_tcp_setup_ddp` and
`nvme_tcp_teardown_ddp` may be called to perform copy offload, and CRC
offload will be used.
This initialization does not change the normal operation of nvme-tcp in any
way besides adding the option to call the above mentioned NDO operations.
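
As a rough sketch, the IO queue start path then looks as follows (the
function names follow this series, but the surrounding logic and
return-value handling here are illustrative only, not the exact patch):

static int nvme_tcp_example_start_io_queue(struct nvme_tcp_queue *queue)
{
        /* Called once the NVMe-TCP ICReq/ICResp handshake has completed. */
        if (nvme_tcp_offload_socket(queue)) {
                /* No offload available: continue as a plain software queue. */
                return 0;
        }

        /*
         * From here on, nvme_tcp_setup_ddp()/nvme_tcp_teardown_ddp() may be
         * called per IO, and received PDUs carry HW-verified CRC state.
         */
        return 0;
}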

For the admin queue, nvme-tcp does not initialize the offload.
Instead, nvme-tcp calls the driver to configure limits for the controller,
such as max_hw_sectors and max_segments; these must be limited to accommodate
potential HW resource limits, and to improve performance.

If an error occurs and the IO queue must be closed or reconnected, then the
offload is torn down and initialized again. Additionally, we handle netdev
down events via the existing error recovery flow.

# Copy offload works as follows:
=========================================
The nvme-tcp layer calls the NIC driver to map block layer buffers to a
command ID (ccid) using `nvme_tcp_setup_ddp` before sending the read request.
When the response is received, the NIC HW writes the PDU payload directly into
the designated buffer and builds an SKB such that it points into the destination
buffer; this SKB represents the entire packet received on the wire, but it
points to the block layer buffers. When nvme-tcp attempts to copy data from
this SKB to the block layer buffer, it can skip the copy by checking in the
copying function (memcpy_to_page):
if (src == dst) -> skip copy
Finally, when the PDU has been processed to completion, the nvme-tcp layer
releases the NIC HW context by calling `nvme_tcp_teardown_ddp` which
asynchronously unmaps the buffers from NIC HW.

As the copy skip change is in a sensitive function, we are careful to avoid
changing it. To that end, we create alternative skb copy and hash iterators
that skip copy/hash if (src == dst). Nvme-tcp is the first user for these.
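
For illustration only, the skip amounts to the following check inside a copy
helper (a sketch in the spirit of memcpy_to_page, not the series' actual
iterator code):

static void example_memcpy_to_page(struct page *page, size_t offset,
                                   const char *src, size_t len)
{
        char *to = kmap_local_page(page);

        /* The NIC already placed the payload here: nothing to copy. */
        if (to + offset != src)
                memcpy(to + offset, src, len);

        kunmap_local(to);
}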

# Asynchronous completion:
=========================================
The NIC must release its mapping between command IDs and the target buffers.
This mapping is released when NVMe-TCP calls the NIC
driver (`nvme_tcp_teardown_ddp`).
As completing IOs is performance critical, we introduce asynchronous
completions for NVMe-TCP, i.e. NVMe-TCP calls the NIC, which will later
call NVMe-TCP to complete the IO (`nvme_tcp_ddp_teardown_done`).
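
A minimal sketch of the completion callback named above (the body, and the
assumption that ddp_ctx carries the request, are illustrative, not the exact
patch):

static void nvme_tcp_ddp_teardown_done(void *ddp_ctx)
{
        struct request *rq = ddp_ctx;

        /*
         * The NIC has released the ccid-to-buffer mapping; the IO can now
         * be completed safely.
         */
        nvme_complete_rq(rq);
}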

An alternative approach is to move all the functions related to copying from
SKBs to the block layer buffers inside the nvme-tcp code - about 200 LOC.

# CRC receive offload works as follows:
=========================================
After offload is initialized, we use the SKB's ddp_crc bit to indicate that:
"there was no problem with the verification of all CRC fields in this packet's
payload". The bit is set to zero if there was an error, or if HW skipped
offload for some reason. If *any* SKB in a PDU has (ddp_crc != 1), then software
must compute and check the CRC. We perform this check, and the
accompanying software fallback, at the end of processing a received PDU.
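
For illustration, the end-of-PDU decision reduces to something like this
(a sketch with hypothetical names, not the actual nvme-tcp code):

static bool example_pdu_ddgst_ok(bool hw_verified_all_skbs,
                                 u32 sw_ddgst, u32 recv_ddgst)
{
        /* Every skb of this PDU carried ddp_crc == 1: trust the NIC. */
        if (hw_verified_all_skbs)
                return true;

        /*
         * Fallback: compare the software-computed digest with the DDGST
         * field received on the wire.
         */
        return sw_ddgst == recv_ddgst;
}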

# CRC transmit offload works as follows:
=========================================
The sending layer (e.g., nvme-tcp) sets the MSG_DDP_CRC flag when sending
messages down to TCP. TCP then marks the corresponding SKBs with the ddp_crc
bit. This ensures that CRC offload takes place for these packets, similarly
to how skb->decrypted is used for TLS.
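
As a sketch, sending a PDU with the new flag could look as follows
(MSG_DDP_CRC is the flag added by this series; the helper itself is
hypothetical):

static int example_send_pdu(struct socket *sock, void *pdu, size_t len)
{
        struct msghdr msg = { .msg_flags = MSG_DONTWAIT | MSG_DDP_CRC };
        struct kvec iov = { .iov_base = pdu, .iov_len = len };

        /*
         * TCP propagates the flag by setting skb->ddp_crc on the skbs that
         * carry this payload, enabling CRC offload on transmit.
         */
        return kernel_sendmsg(sock, &msg, &iov, 1, len);
}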

Additionally, nvme-tcp maintains a mapping between TCP sequence numbers
and PDU data which allows the driver to handle reordered/retransmitted
packet offload by resynchronizing the device's CRC state.

# SKB changes:
=========================================
The CRC offload requires an additional bit in the SKB, which is used to
prevent the coalescing of SKBs with different CRC offload values. This bit
is similar in concept to the "decrypted" bit.
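
For example, a coalescing guard along the lines of the existing decrypted
check (a sketch, not the exact TCP code touched by the series):

static bool example_can_coalesce(const struct sk_buff *to,
                                 const struct sk_buff *from)
{
#ifdef CONFIG_ULP_DDP
        /* Never mix HW-verified and unverified payload in one skb. */
        if (to->ddp_crc != from->ddp_crc)
                return false;
#endif
        return true;
}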

# Performance:
=========================================
The expected performance gain from this offload varies with the block size.
We perform a CPU cycles breakdown of the copy/CRC operations in nvme-tcp
fio random read workloads:
For 4K blocks we see up to 11% improvement for a 100% read fio workload,
while for 128K blocks we see up to 52%. If we run nvme-tcp and skip these
operations, then we observe a gain of about 1.1x and 2x, respectively.

# Resynchronization:
=========================================
The resynchronization flow is performed to reset the hardware tracking of
NVMe-TCP PDUs within the TCP stream. The flow consists of a request from
the driver regarding a possible location of a PDU header, followed by
a response from the nvme-tcp layer.

This flow is rare, and it should happen only after packet loss or
reordering events that involve nvme-tcp PDU headers.
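
A minimal sketch of that exchange, using hypothetical names and fields for
illustration:

struct example_queue {
        u32 resync_req_seq;     /* pending resync question from the driver */
};

/* Driver -> nvme-tcp: "does a PDU header start at TCP sequence 'seq'?" */
static void example_resync_request(struct example_queue *q, u32 seq)
{
        q->resync_req_seq = seq;
}

/*
 * nvme-tcp, on parsing the next PDU header at 'pdu_seq': answer yes/no so
 * the device can re-lock its PDU tracking onto the stream.
 */
static bool example_resync_confirmed(struct example_queue *q, u32 pdu_seq)
{
        return q->resync_req_seq == pdu_seq;
}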

# The patches are organized as follows:
=========================================
Patches 1,3     the infrastructure for the TCP DDP
                and TCP DDP CRC offloads, respectively.
Patch 2         the iov_iter change to skip copy if (src == dst).
Patch 4         exposes the get_netdev_for_sock function from TLS.
Patch 5         NVMe-TCP changes to call the NIC driver on queue init/teardown.
Patch 6         NVMe-TCP changes to call the NIC driver on IO operation
                setup/teardown, and to support async completions.
Patch 7         NVMe-TCP changes to support CRC offload on receive.
                This patch also moves CRC calculation to the end of the PDU
                in case offload requires a software fallback.
Patch 8         NVMe-TCP handling of netdev events: stop the offload if
                the netdev is going down.
Patches 9-19    implement support for NVMe-TCP copy and CRC offload in
                the mlx5 NIC driver as the first user.
Patch 20        Document TCP DDP offload.
Patches 21-24   Net core support for transmit offload.
Patches 25-26   NVMe-TCP transmit offload support.
Patches 27-36   Mellanox NVMe-TCP transmit offload support.

Testing:
=========================================
This series was tested using fio with various configurations of IO sizes,
depths, MTUs, and with both the SPDK and kernel NVMe-TCP targets.
Also, we have used QEMU and gate-level simulation to verify these patches.

Future work:
=========================================
A follow-up series will introduce support for TLS in NVMe-TCP and combining the
two offloads.

Ben Ben-Ishay (8):
  net/mlx5e: NVMEoTCP offload initialization
  net/mlx5e: KLM UMR helper macros
  net/mlx5e: NVMEoTCP use KLM UMRs
  net/mlx5e: NVMEoTCP queue init/teardown
  net/mlx5e: NVMEoTCP async ddp invalidation
  net/mlx5e: NVMEoTCP ddp setup and resync
  net/mlx5e: NVMEoTCP, data-path for DDP+DDGST offload
  net/mlx5e: NVMEoTCP statistics

Ben Ben-ishay (2):
  net/mlx5: Header file changes for nvme-tcp offload
  net/mlx5: Add 128B CQE for NVMEoTCP offload

Boris Pismenny (8):
  net: Introduce direct data placement tcp offload
  iov_iter: DDP copy to iter/pages
  net: skb copy(+hash) iterators for DDP offloads
  net/tls: expose get_netdev_for_sock
  nvme-tcp: Add DDP offload control path
  nvme-tcp: Add DDP data-path
  net/mlx5e: TCP flow steering for nvme-tcp
  Documentation: add ULP DDP offload documentation

Or Gerlitz (1):
  nvme-tcp: Deal with netdevice DOWN events

Yoray Zack (17):
  nvme-tcp: RX DDGST offload
  net: drop ULP DDP HW offload feature if no CSUM offload feature
  net: Add ulp_ddp_pdu_info struct
  net: Add to ulp_ddp support for fallback flow
  net: Add MSG_DDP_CRC flag
  nvme-tcp: TX DDGST offload
  nvme-tcp: Mapping between Tx NVMEoTCP pdu and TCP sequence
  mlx5e: make preparation in TLS code for NVMEoTCP CRC Tx offload
  mlx5: Add sq state test bit for nvmeotcp
  mlx5: Add support to NETIF_F_HW_TCP_DDP_CRC_TX feature
  net/mlx5e: NVMEoTCP DDGST TX offload TIS
  net/mlx5e: NVMEoTCP DDGST Tx offload queue init/teardown
  net/mlx5e: NVMEoTCP DDGST TX BSF and PSV
  net/mlx5e: NVMEoTCP DDGST TX Data path
  net/mlx5e: NVMEoTCP DDGST TX handle OOO packets
  net/mlx5e: NVMEoTCP DDGST TX offload optimization
  net/mlx5e: NVMEoTCP DDGST TX statistics

 Documentation/networking/index.rst            |    1 +
 Documentation/networking/ulp-ddp-offload.rst  |  415 +++++
 .../net/ethernet/mellanox/mlx5/core/Kconfig   |   10 +
 .../net/ethernet/mellanox/mlx5/core/Makefile  |    2 +
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |   36 +-
 .../net/ethernet/mellanox/mlx5/core/en/fs.h   |    4 +-
 .../ethernet/mellanox/mlx5/core/en/params.c   |   11 +-
 .../ethernet/mellanox/mlx5/core/en/params.h   |    3 +
 .../net/ethernet/mellanox/mlx5/core/en/txrx.h |   20 +-
 .../ethernet/mellanox/mlx5/core/en/xsk/rx.c   |    1 +
 .../ethernet/mellanox/mlx5/core/en/xsk/rx.h   |    1 +
 .../mellanox/mlx5/core/en_accel/en_accel.h    |   22 +-
 .../mellanox/mlx5/core/en_accel/fs_tcp.c      |   10 +
 .../mellanox/mlx5/core/en_accel/fs_tcp.h      |    2 +-
 .../mellanox/mlx5/core/en_accel/ktls_tx.c     |   16 +-
 .../mellanox/mlx5/core/en_accel/nvmeotcp.c    | 1555 +++++++++++++++++
 .../mellanox/mlx5/core/en_accel/nvmeotcp.h    |  138 ++
 .../mlx5/core/en_accel/nvmeotcp_rxtx.c        |  264 +++
 .../mlx5/core/en_accel/nvmeotcp_rxtx.h        |   43 +
 .../mlx5/core/en_accel/nvmeotcp_utils.h       |   80 +
 .../net/ethernet/mellanox/mlx5/core/en_main.c |   30 +-
 .../net/ethernet/mellanox/mlx5/core/en_rx.c   |   66 +-
 .../ethernet/mellanox/mlx5/core/en_stats.c    |   74 +
 .../ethernet/mellanox/mlx5/core/en_stats.h    |   47 +
 .../net/ethernet/mellanox/mlx5/core/en_tx.c   |   11 +
 .../net/ethernet/mellanox/mlx5/core/en_txrx.c |   17 +
 drivers/net/ethernet/mellanox/mlx5/core/fw.c  |    6 +
 drivers/nvme/host/tcp.c                       |  567 +++++-
 include/linux/mlx5/device.h                   |   44 +-
 include/linux/mlx5/mlx5_ifc.h                 |  104 +-
 include/linux/mlx5/qp.h                       |    1 +
 include/linux/netdev_features.h               |    3 +-
 include/linux/netdevice.h                     |    5 +
 include/linux/skbuff.h                        |   13 +
 include/linux/socket.h                        |    1 +
 include/linux/uio.h                           |   17 +
 include/net/inet_connection_sock.h            |    4 +
 include/net/sock.h                            |   23 +
 include/net/ulp_ddp.h                         |  192 ++
 lib/iov_iter.c                                |   55 +
 net/Kconfig                                   |   10 +
 net/core/Makefile                             |    1 +
 net/core/datagram.c                           |   48 +
 net/core/dev.c                                |    2 +
 net/core/skbuff.c                             |    8 +-
 net/core/sock.c                               |    7 +
 net/core/ulp_ddp.c                            |  235 +++
 net/ethtool/common.c                          |    1 +
 net/ipv4/tcp.c                                |    6 +
 net/ipv4/tcp_input.c                          |    8 +
 net/ipv4/tcp_ipv4.c                           |    3 +
 net/ipv4/tcp_offload.c                        |    3 +
 net/tls/tls_device.c                          |   20 +-
 53 files changed, 4192 insertions(+), 74 deletions(-)
 create mode 100644 Documentation/networking/ulp-ddp-offload.rst
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp_rxtx.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp_rxtx.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp_utils.h
 create mode 100644 include/net/ulp_ddp.h
 create mode 100644 net/core/ulp_ddp.c

Comments

Eric Dumazet July 22, 2021, 11:26 a.m. UTC | #1
On Thu, Jul 22, 2021 at 1:04 PM Boris Pismenny <borisp@nvidia.com> wrote:
>
> From: Boris Pismenny <borisp@mellanox.com>
>
>
...

>  };
>
>  const char
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index e6ca5a1f3b59..4a7160bba09b 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -5149,6 +5149,9 @@ tcp_collapse(struct sock *sk, struct sk_buff_head *list, struct rb_root *root,
>                 memcpy(nskb->cb, skb->cb, sizeof(skb->cb));
>  #ifdef CONFIG_TLS_DEVICE
>                 nskb->decrypted = skb->decrypted;
> +#endif
> +#ifdef CONFIG_ULP_DDP
> +               nskb->ddp_crc = skb->ddp_crc;

Probably you do not want to attempt any collapse if skb->ddp_crc is
set right there.

>  #endif
>                 TCP_SKB_CB(nskb)->seq = TCP_SKB_CB(nskb)->end_seq = start;
>                 if (list)
> @@ -5182,6 +5185,11 @@ tcp_collapse(struct sock *sk, struct sk_buff_head *list, struct rb_root *root,
>  #ifdef CONFIG_TLS_DEVICE
>                                 if (skb->decrypted != nskb->decrypted)
>                                         goto end;
> +#endif
> +#ifdef CONFIG_ULP_DDP
> +
> +                               if (skb->ddp_crc != nskb->ddp_crc)

This checks only the second, third, and remaining skbs.

> +                                       goto end;
>  #endif
>                         }
>                 }


tcp_collapse() is copying data from small skbs to pack it to bigger
skb (one page of payload), in case
of memory emergency/pressure (socket queues are full)

If your changes are trying to avoid 'needless'  copies, maybe you
should reconsider and let the emergency packing be done.

If the copy is not _possible_, you should rephrase your changelog to
clearly state the kernel _cannot_ access this memory in any way.
Eric Dumazet July 22, 2021, 1:10 p.m. UTC | #2
On Thu, Jul 22, 2021 at 2:18 PM Boris Pismenny <borispismenny@gmail.com> wrote:
>
> On 22/07/2021 14:26, Eric Dumazet wrote:
> > On Thu, Jul 22, 2021 at 1:04 PM Boris Pismenny <borisp@nvidia.com> wrote:
> >>
> >> From: Boris Pismenny <borisp@mellanox.com>
> >>
> >>
> > ...
> >
> >>  };
> >>
> >>  const char
> >> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> >> index e6ca5a1f3b59..4a7160bba09b 100644
> >> --- a/net/ipv4/tcp_input.c
> >> +++ b/net/ipv4/tcp_input.c
> >> @@ -5149,6 +5149,9 @@ tcp_collapse(struct sock *sk, struct sk_buff_head *list, struct rb_root *root,
> >>                 memcpy(nskb->cb, skb->cb, sizeof(skb->cb));
> >>  #ifdef CONFIG_TLS_DEVICE
> >>                 nskb->decrypted = skb->decrypted;
> >> +#endif
> >> +#ifdef CONFIG_ULP_DDP
> >> +               nskb->ddp_crc = skb->ddp_crc;
> >
> > Probably you do not want to attempt any collapse if skb->ddp_crc is
> > set right there.
> >
>
> Right.
>
> >>  #endif
> >>                 TCP_SKB_CB(nskb)->seq = TCP_SKB_CB(nskb)->end_seq = start;
> >>                 if (list)
> >> @@ -5182,6 +5185,11 @@ tcp_collapse(struct sock *sk, struct sk_buff_head *list, struct rb_root *root,
> >>  #ifdef CONFIG_TLS_DEVICE
> >>                                 if (skb->decrypted != nskb->decrypted)
> >>                                         goto end;
> >> +#endif
> >> +#ifdef CONFIG_ULP_DDP
> >> +
> >> +                               if (skb->ddp_crc != nskb->ddp_crc)
> >
> > This checks only the second, third, and remaining skbs.
> >
>
> Right, as we handle the head skb above. Could you clarify?

I was simply saying you missed the first skb.

>
> >> +                                       goto end;
> >>  #endif
> >>                         }
> >>                 }
> >
> >
> > tcp_collapse() is copying data from small skbs to pack it to bigger
> > skb (one page of payload), in case
> > of memory emergency/pressure (socket queues are full)
> >
> > If your changes are trying to avoid 'needless'  copies, maybe you
> > should reconsider and let the emergency packing be done.
> >
> > If the copy is not _possible_, you should rephrase your changelog to
> > clearly state the kernel _cannot_ access this memory in any way.
> >
>
> The issue is that skb_condense also gets called on many skbs in
> tcp_add_backlog and it will identify skbs that went through DDP as ideal
> for packing, even though they are not small and packing is
> counter-productive as data already resides in its destination.
>
> As mentioned above, it is possible to copy, but it is counter-productive
> in this case. If there was a real need to access this memory, then it is
> allowed.

Standard GRO packets from high perf drivers have no room in their
skb->head (ie skb_tailroom() should be 0)

If you have a driver using GRO and who pulled some payload in
skb->head, it is already too late for DDP.

So I think you are trying to add code in TCP that should not be
needed. Perhaps mlx5 driver is doing something it should not ?
(If this is ' copybreak'  this has been documented as being
suboptimal, transports have better strategies)

Secondly, tcp_collapse() should absolutely not be called under regular
workloads.

Trying to optimize this last-resort thing is a lost cause:
If an application is dumb enough to send small packets back-to-back,
it should be fixed (sender side has this thing called autocork, for
applications that do not know about MSG_MORE or TC_CORK.)

(tcp_collapse is a severe source of latencies)
Boris Pismenny July 22, 2021, 1:33 p.m. UTC | #3
On 22/07/2021 16:10, Eric Dumazet wrote:
> On Thu, Jul 22, 2021 at 2:18 PM Boris Pismenny <borispismenny@gmail.com> wrote:
>>
>> On 22/07/2021 14:26, Eric Dumazet wrote:
>>> On Thu, Jul 22, 2021 at 1:04 PM Boris Pismenny <borisp@nvidia.com> wrote:
>>>>
>>>> From: Boris Pismenny <borisp@mellanox.com>
>>>>
>>>>
>>> ...
>>>
>>>>  };
>>>>
>>>>  const char
>>>> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
>>>> index e6ca5a1f3b59..4a7160bba09b 100644
>>>> --- a/net/ipv4/tcp_input.c
>>>> +++ b/net/ipv4/tcp_input.c
>>>> @@ -5149,6 +5149,9 @@ tcp_collapse(struct sock *sk, struct sk_buff_head *list, struct rb_root *root,
>>>>                 memcpy(nskb->cb, skb->cb, sizeof(skb->cb));
>>>>  #ifdef CONFIG_TLS_DEVICE
>>>>                 nskb->decrypted = skb->decrypted;
>>>> +#endif
>>>> +#ifdef CONFIG_ULP_DDP
>>>> +               nskb->ddp_crc = skb->ddp_crc;
>>>
>>> Probably you do not want to attempt any collapse if skb->ddp_crc is
>>> set right there.
>>>
>>
>> Right.
>>
>>>>  #endif
>>>>                 TCP_SKB_CB(nskb)->seq = TCP_SKB_CB(nskb)->end_seq = start;
>>>>                 if (list)
>>>> @@ -5182,6 +5185,11 @@ tcp_collapse(struct sock *sk, struct sk_buff_head *list, struct rb_root *root,
>>>>  #ifdef CONFIG_TLS_DEVICE
>>>>                                 if (skb->decrypted != nskb->decrypted)
>>>>                                         goto end;
>>>> +#endif
>>>> +#ifdef CONFIG_ULP_DDP
>>>> +
>>>> +                               if (skb->ddp_crc != nskb->ddp_crc)
>>>
>>> This checks only the second, third, and remaining skbs.
>>>
>>
>> Right, as we handle the head skb above. Could you clarify?
> 
> I was simply saying you missed the first skb.
> 

But, the first SKB got handled in the change above. The code here is the
same for TLS, if it is wrong, then we already have an issue here.

>>
>>>> +                                       goto end;
>>>>  #endif
>>>>                         }
>>>>                 }
>>>
>>>
>>> tcp_collapse() is copying data from small skbs to pack it to bigger
>>> skb (one page of payload), in case
>>> of memory emergency/pressure (socket queues are full)
>>>
>>> If your changes are trying to avoid 'needless'  copies, maybe you
>>> should reconsider and let the emergency packing be done.
>>>
>>> If the copy is not _possible_, you should rephrase your changelog to
>>> clearly state the kernel _cannot_ access this memory in any way.
>>>
>>
>> The issue is that skb_condense also gets called on many skbs in
>> tcp_add_backlog and it will identify skbs that went through DDP as ideal
>> for packing, even though they are not small and packing is
>> counter-productive as data already resides in its destination.
>>
>> As mentioned above, it is possible to copy, but it is counter-productive
>> in this case. If there was a real need to access this memory, then it is
>> allowed.
> 
> Standard GRO packets from high perf drivers have no room in their
> skb->head (ie skb_tailroom() should be 0)
> 
> If you have a driver using GRO and who pulled some payload in
> skb->head, it is already too late for DDP.
> 
> So I think you are trying to add code in TCP that should not be
> needed. Perhaps mlx5 driver is doing something it should not ?
> (If this is ' copybreak'  this has been documented as being
> suboptimal, transports have better strategies)
> 
> Secondly, tcp_collapse() should absolutely not be called under regular
> workloads.
> 
> Trying to optimize this last-resort thing is a lost cause:
> If an application is dumb enough to send small packets back-to-back,
> it should be fixed (sender side has this thing called autocork, for
> applications that do not know about MSG_MORE or TC_CORK.)
> 
> (tcp_collapse is a severe source of latencies)
> 


Sorry. My response above was about skb_condense which I've confused with
tcp_collapse.

In tcp_collapse, we could allow the copy, but the problem is CRC, which
like TLS's skb->decrypted marks that the data passed the digest
validation in the NIC. If we allow collapsing SKBs with mixed marks, we
will need to force software copy+crc verification. As TCP collapse is
indeed rare and the offload is opportunistic in nature, we can make this
change and submit another version, but I'm confused; why was it OK for
TLS, while it is not OK for DDP+CRC?
Boris Pismenny July 22, 2021, 2:02 p.m. UTC | #4
On 22/07/2021 16:39, Eric Dumazet wrote:
> On Thu, Jul 22, 2021 at 3:33 PM Boris Pismenny <borispismenny@gmail.com> wrote:
> 
>> Sorry. My response above was about skb_condense which I've confused with
>> tcp_collapse.
>>
>> In tcp_collapse, we could allow the copy, but the problem is CRC, which
>> like TLS's skb->decrypted marks that the data passed the digest
>> validation in the NIC. If we allow collapsing SKBs with mixed marks, we
>> will need to force software copy+crc verification. As TCP collapse is
>> indeed rare and the offload is opportunistic in nature, we can make this
>> change and submit another version, but I'm confused; why was it OK for
>> TLS, while it is not OK for DDP+CRC?
>>
> 
> Ah.... I guess I was focused on the DDP part, while all your changes
> are really about the CRC part.
> 
> Perhaps having an accessor to express the CRC status (and not be
> confused by the DDP part)
>  could help the intent of the code.
> 

An accessor function sounds like a great idea for readability, thanks Eric!

We will re-spin the series and add it to v6.
Christoph Hellwig July 23, 2021, 5:56 a.m. UTC | #5
On Thu, Jul 22, 2021 at 02:02:49PM +0300, Boris Pismenny wrote:
> From: Boris Pismenny <borisp@mellanox.com>
>
> Changes since v4:
> =========================================

Explaining what the series does should go before the changelog.

> * Add transmit offload patches

But to be honest, the previous one was already mostly unreviewable,
but this is now far beyond this.   Please try to get anything that
is generally useful first in smaller series and then come back with
a somewhat reviewable series.  That also means that at least for the
code I care about (nvme) the patches should be grouped together,
and actually provide meaningful functionality in each patch.  Right
now even trying to understand what you add to the nvme code requires
me to jump all over a gigantic series.
Christoph Hellwig July 23, 2021, 6:09 a.m. UTC | #6
On Thu, Jul 22, 2021 at 02:03:12PM +0300, Boris Pismenny wrote:
>  	/* NIC driver informs the ulp that ddp teardown is done - used for async completions*/
>  	void (*ddp_teardown_done)(void *ddp_ctx);
> +	/* NIC request ulp to calculate the ddgst and store it in pdu_info->ddgst */
> +	void (*ddp_ddgst_fallback)(struct ulp_ddp_pdu_info *pdu_info);

Overly long line.  More importantly this whole struct should probably
use a kerneldoc comment anyway.

>  } EXPORT_SYMBOL(ulp_ddp_get_pdu_info);

> +	if (!pdu_info || !between(seq, pdu_info->start_seq, pdu_info->end_seq - 1)) {

More overly long lines.  Please make sure to stick to 80 character lines
unless you have a really good reason to go over that.

> +	//check if this skb contains ddgst field

Please avoid //-style comments.

> +	return ulp_ddp_fallback_skb(ctx, skb, sk);
> +} EXPORT_SYMBOL(ulp_ddp_validate_xmit_skb);

This is not how EXPORT_SYMBOLs are placed.  Also please export any
such deep internal interfaces using EXPORT_SYMBOL_GPL.
Sagi Grimberg July 23, 2021, 7:58 p.m. UTC | #7
>> * Add transmit offload patches
>
> But to be honest, the previous one was already mostly unreviewable,
> but this is now far beyond this.   Please try to get anything that
> is generally useful first in smaller series and then come back with
> a somewhat reviewable series.  That also means that at least for the
> code I care about (nvme) the patches should be grouped together,
> and actually provide meaningful functionality in each patch.  Right
> now even trying to understand what you add to the nvme code requires
> me to jump all over a gigantic series.

I agree as well. It is difficult to review.
The order should be:
1. ulp_ddp interface
2. nvme-tcp changes
3. mlx5e changes

Also even beyond grouping patches together I have 2 requests:
1. Please consolidate ddp routines under a single ifdef (also minimize
the ifdef in call-sites).
2. When consolidating functions, try to do this as prep patches
documenting in the change log that it is preparing to add ddp. It's
difficult digesting both at times.
Or Gerlitz Aug. 4, 2021, 1:51 p.m. UTC | #8
On Fri, Jul 23, 2021 at 10:59 PM Sagi Grimberg <sagi@grimberg.me> wrote:

> [.. ] It is difficult to review.
> The order should be:
> 1. ulp_ddp interface
> 2. nvme-tcp changes
> 3. mlx5e changes

.. and this is exactly how the series is organized, for v6 we will drop the
TX offload part and stick to completing the review on the RX offload part.

> Also even beyond grouping patches together I have 2 requests:
> 1. Please consolidate ddp routines under a single ifdef (also minimize
> the ifdef in call-sites).

ok, will make an effort to be better in that respect

> 2. When consolidating functions, try to do this as prep patches
> documenting in the change log that it is preparing to add ddp. It's
> difficult digesting both at times.


to clarify, you would like patch #5 "nvme-tcp: Add DDP offload control path"
to only add the call sites and if-not-deffed implementation for the added knobs:

nvme_tcp_offload_socket
nvme_tcp_unoffload_socket
nvme_tcp_offload_limits
nvme_tcp_resync_response

and a 2nd patch to add the if-yes-deffed implementation?

This makes sense, however IMHO repeating this prep exercise for
the data-path patch (#6 "nvme-tcp: Add DDP data-path") doesn't
seem to provide notable value  b/c you will only see two call sites
for the two added empty knobs:

nvme_tcp_setup_ddp
nvme_tcp_teardown_ddp

but whatever you prefer, so.. let us know
Sagi Grimberg Aug. 6, 2021, 7:46 p.m. UTC | #9
On 8/4/21 6:51 AM, Or Gerlitz wrote:
> On Fri, Jul 23, 2021 at 10:59 PM Sagi Grimberg <sagi@grimberg.me> wrote:
>
>> [.. ] It is difficult to review.
>> The order should be:
>> 1. ulp_ddp interface
>> 2. nvme-tcp changes
>> 3. mlx5e changes
>
> .. and this is exactly how the series is organized, for v6 we will drop the
> TX offload part and stick to completing the review on the RX offload part.
>
>> Also even beyond grouping patches together I have 2 requests:
>> 1. Please consolidate ddp routines under a single ifdef (also minimize
>> the ifdef in call-sites).
>
> ok, will make an effort to be better in that respect
>
>> 2. When consolidating functions, try to do this as prep patches
>> documenting in the change log that it is preparing to add ddp. It's
>> difficult digesting both at times.
>
> to clarify, you would like patch #5 "nvme-tcp: Add DDP offload control path"
> to only add the call sites and if-not-deffed implementation for the added knobs:
>
> nvme_tcp_offload_socket
> nvme_tcp_unoffload_socket
> nvme_tcp_offload_limits
> nvme_tcp_resync_response
>
> and a 2nd patch to add the if-yes-deffed implementation?
>
> This makes sense, however IMHO repeating this prep exercise for
> the data-path patch (#6 "nvme-tcp: Add DDP data-path") doesn't
> seem to provide notable value  b/c you will only see two call sites
> for the two added empty knobs:
>
> nvme_tcp_setup_ddp
> nvme_tcp_teardown_ddp
>
> but whatever you prefer, so.. let us know


I was more referring to routines that now grew the ddp path
and changed at the same time, like:
nvme_tcp_complete_request
nvme_tcp_consume_skb
etc..
Or Gerlitz Aug. 10, 2021, 1:37 p.m. UTC | #10
On Fri, Aug 6, 2021 at 10:46 PM Sagi Grimberg <sagi@grimberg.me> wrote:
> On 8/4/21 6:51 AM, Or Gerlitz wrote:
> > On Fri, Jul 23, 2021 at 10:59 PM Sagi Grimberg <sagi@grimberg.me> wrote:
> >> [.. ] It is difficult to review.
> >> The order should be:
> >> 1. ulp_ddp interface
> >> 2. nvme-tcp changes
> >> 3. mlx5e changes
> >
> > .. and this is exactly how the series is organized, for v6 we will drop the
> > TX offload part and stick to completing the review on the RX offload part.
> >
> >> Also even beyond grouping patches together I have 2 requests:
> >> 1. Please consolidate ddp routines under a single ifdef (also minimize
> >> the ifdef in call-sites).
> >
> > ok, will make an effort to be better in that respect
> >
> >> 2. When consolidating functions, try to do this as prep patches
> >> documenting in the change log that it is preparing to add ddp. It's
> >> difficult digesting both at times.
> >
> > to clarify, you would like patch #5 "nvme-tcp: Add DDP offload control path"
> > to only add the call sites and if-not-deffed implementation for the added knobs:
> >
> > nvme_tcp_offload_socket
> > nvme_tcp_unoffload_socket
> > nvme_tcp_offload_limits
> > nvme_tcp_resync_response
> >
> > and a 2nd patch to add the if-yes-deffed implementation?
> >
> > This makes sense, however IMHO repeating this prep exercise for
> > the data-path patch (#6 "nvme-tcp: Add DDP data-path") doesn't
> > seem to provide notable value  b/c you will only see two call sites
> > for the two added empty knobs:
> >
> > nvme_tcp_setup_ddp
> > nvme_tcp_teardown_ddp
> >
> > but whatever you prefer, so.. let us know
>
> I was more referring to routines that now grew the ddp path
> and changed at the same time, like:
>
> nvme_tcp_complete_request

not sure I follow on this one.. It's added in patch #6 "nvme-tcp: Add
DDP data-path" and then used twice in the same patch, replacing calls to
nvme_try_complete_req and then to nvme_complete_rq -- so how would you want
this to be broken into prep/usage?

> nvme_tcp_consume_skb


this routine was born due to the ddp_ prefix addition to the iov copy
iter functions, which we are now removing due to feedback from Al

> etc..