[RFC,v2,00/11] Device Memory TCP

Message ID 20230810015751.3297321-1-almasrymina@google.com

Message

Mina Almasry Aug. 10, 2023, 1:57 a.m. UTC
Changes in RFC v2:
------------------

The sticking point in RFC v1[1] was the dma-buf pages approach we used to
deliver the device memory to the TCP stack. RFC v2 is a proof-of-concept
that attempts to resolve this by implementing scatterlist support in the
networking stack, such that we can import the dma-buf scatterlist
directly. This is the approach proposed at a high level here[2].

Detailed changes:
1. Replaced dma-buf pages approach with importing scatterlist into the
   page pool.
2. Replaced the dma-buf pages-centric API with a netlink API.
3. Removed the TX path implementation - there is no issue with
   implementing the TX path with scatterlist approach, but leaving
   out the TX path makes it easier to review.
4. Functionality is tested with this proposal, but I have not conducted
   perf testing yet. I'm not sure there are regressions, but I removed
   perf claims from the cover letter until they can be re-confirmed.
5. Added Signed-off-by tags from contributors to the implementation.
6. Fixed some bugs with the RX path since RFC v1.

Any feedback welcome, but specifically the biggest pending questions
needing feedback IMO are:

1. Feedback on the scatterlist-based approach in general.
2. Netlink API (Patch 1 & 2).
3. Approach to handle all the drivers that expect to receive pages from
   the page pool (Patch 6).

[1] https://lore.kernel.org/netdev/dfe4bae7-13a0-3c5d-d671-f61b375cb0b4@gmail.com/T/
[2] https://lore.kernel.org/netdev/CAHS8izPm6XRS54LdCDZVd0C75tA1zHSu6jLVO8nzTLXCc=H7Nw@mail.gmail.com/

----------------------

* TL;DR:

Device memory TCP (devmem TCP) is a proposal for transferring data to and/or
from device memory efficiently, without bouncing the data to a host memory
buffer.

* Problem:

A large number of data transfers have device memory as the source and/or
destination. Accelerators have drastically increased the volume of such
transfers. Some examples include:
- ML accelerators transferring large amounts of training data from storage into
  GPU/TPU memory. In some cases ML training setup time can be as long as 50% of
  TPU compute time; improving data transfer throughput & efficiency can help
  improve GPU/TPU utilization.

- Distributed training, where ML accelerators, such as GPUs on different hosts,
  exchange data among them.

- Distributed raw block storage applications transferring large amounts of data
  to and from remote SSDs; much of this data does not require host processing.

Today, the majority of Device-to-Device data transfers over the network are
implemented as the following low-level operations: Device-to-Host copy,
Host-to-Host network transfer, and Host-to-Device copy.

This implementation is suboptimal, especially for bulk data transfers, and can
put significant strain on system resources such as host memory bandwidth and
PCIe bandwidth. One important reason behind the current state is the kernel's
lack of semantics to express device-to-network transfers.

* Proposal:

In this patch series we attempt to optimize this use case by implementing
socket APIs that enable the user to:

1. send device memory across the network directly, and
2. receive incoming network packets directly into device memory.

Packet _payloads_ go directly from the NIC to device memory for receive and from
device memory to NIC for transmit.
Packet _headers_ go to/from host memory and are processed by the TCP/IP stack
normally. The NIC _must_ support header split to achieve this.

Advantages:

- Alleviate host memory bandwidth pressure, compared to existing
  network-transfer + device-copy semantics.

- Alleviate PCIe BW pressure, by limiting data transfer to the lowest level
  of the PCIe tree, compared to the traditional path, which sends data through
  the root complex.

* Patch overview:

** Part 1: netlink API

Gives the user the ability to bind a dma-buf to an RX queue.

** Part 2: scatterlist support

Currently the standard for device memory sharing is DMABUF, which doesn't
generate struct pages. The networking stack (skbs, drivers, and the page
pool), on the other hand, operates on pages. We have 2 options:

1. Generate struct pages for dmabuf device memory, or,
2. Modify the networking stack to process scatterlists.

Approach #1 was attempted in RFC v1. RFC v2 implements approach #2.
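
For readers less familiar with the dma-buf side, a rough kernel-space sketch of
what importing the scatterlist boils down to is below. Only the standard
dma-buf/scatterlist helpers are real; the function name and the debug print are
illustrative, and the actual binding code lives in the patches themselves.

#include <linux/dma-buf.h>
#include <linux/dma-direction.h>
#include <linux/scatterlist.h>

/* Illustrative only: walk a dma-buf's scatterlist and pull out the DMA
 * address/length pairs that the rest of the series hands to the page pool.
 * The struct pages inside the scatterlist are never touched.
 */
static int devmem_sketch_import(struct device *dev, int dmabuf_fd)
{
    struct dma_buf *dmabuf;
    struct dma_buf_attachment *attach;
    struct sg_table *sgt;
    struct scatterlist *sg;
    int i, err = 0;

    dmabuf = dma_buf_get(dmabuf_fd);
    if (IS_ERR(dmabuf))
        return PTR_ERR(dmabuf);

    attach = dma_buf_attach(dmabuf, dev);
    if (IS_ERR(attach)) {
        err = PTR_ERR(attach);
        goto out_put;
    }

    sgt = dma_buf_map_attachment(attach, DMA_FROM_DEVICE);
    if (IS_ERR(sgt)) {
        err = PTR_ERR(sgt);
        goto out_detach;
    }

    for_each_sgtable_dma_sg(sgt, sg, i) {
        dma_addr_t addr = sg_dma_address(sg);

        pr_debug("chunk %d: addr %pad len %u\n", i, &addr, sg_dma_len(sg));
    }

    dma_buf_unmap_attachment(attach, sgt, DMA_FROM_DEVICE);
out_detach:
    dma_buf_detach(dmabuf, attach);
out_put:
    dma_buf_put(dmabuf);
    return err;
}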

** Part 3: page pool support

We piggyback on the page pool memory providers proposal:
https://github.com/kuba-moo/linux/tree/pp-providers

It allows the page pool to define a memory provider that handles the page
allocation and freeing. This helps abstract most of the device memory TCP
changes away from the driver.
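
The exact hooks are defined in that branch rather than in this series, but
conceptually a provider is a small ops table the page pool calls instead of the
normal page allocator; an illustrative sketch (the names here are made up, not
the real API) might look like:

/* Conceptual sketch only; the real ops table and hook names are in the
 * pp-providers branch linked above.
 */
struct sketch_pp_memory_provider_ops {
    int  (*init)(struct page_pool *pool);       /* pool created with provider */
    void (*destroy)(struct page_pool *pool);    /* pool torn down */

    /* Hand out / take back pool memory. For devmem TCP these would return
     * chunks of the bound dma-buf instead of ordinary host pages.
     */
    struct page *(*alloc_pages)(struct page_pool *pool, gfp_t gfp);
    bool (*release_page)(struct page_pool *pool, struct page *page);
};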

** Part 4: support for unreadable skb frags

Page pool iovs are not accessible by the host; we implement changes
throughout the networking stack to correctly handle skbs with unreadable
frags.

** Part 5: recvmsg() APIs

We define APIs for the user to send and receive device memory.
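
Concretely, the receive side is expected to work roughly as follows: recvmsg()
returns, via cmsg, descriptors (offset/size into the bound dma-buf) for payload
that landed in device memory, and the application later hands those buffers
back with the SO_DEVMEM_DONTNEED setsockopt added in patch 10. A rough
userspace sketch is below; the cmsg handling pattern is standard, but the
descriptor layout and option usage shown are placeholders, and the
authoritative uapi is in the patches and in ncdevmem.c.

#include <string.h>
#include <sys/socket.h>
/* SO_DEVMEM_DONTNEED and the real cmsg definitions come from the patched
 * uapi headers in this series; the names/layout below are placeholders.
 */

struct devmem_desc {                /* placeholder layout */
    unsigned long offset;           /* offset of the payload in the dma-buf */
    unsigned int size;
    unsigned int token;             /* returned via SO_DEVMEM_DONTNEED */
};

static void devmem_rx_once(int fd)
{
    char ctrl[CMSG_SPACE(16 * sizeof(struct devmem_desc))];
    struct msghdr msg = {
        .msg_control = ctrl,
        .msg_controllen = sizeof(ctrl),
    };
    struct cmsghdr *cm;

    if (recvmsg(fd, &msg, 0) < 0)
        return;

    for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
        struct devmem_desc d;

        memcpy(&d, CMSG_DATA(cm), sizeof(d));
        /* Payload lives at [d.offset, d.offset + d.size) inside the bound
         * dma-buf; hand that range to the GPU/accelerator here, then
         * release the buffer back to the kernel.
         */
        setsockopt(fd, SOL_SOCKET, SO_DEVMEM_DONTNEED,
                   &d.token, sizeof(d.token));
    }
}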

Not included with this RFC is the GVE devmem TCP support, just to
simplify the review. Code available here if desired:
https://github.com/mina/linux/tree/tcpdevmem

This RFC is built on top of net-next with Jakub's pp-providers changes
cherry-picked.

* NIC dependencies:

1. (strict) Devmem TCP requires the NIC to support header split, i.e. the
   capability to split incoming packets into a header + payload and to put
   each into a separate buffer. Devmem TCP works by using device memory
   for the packet payload, and host memory for the packet headers.

2. (optional) Devmem TCP works better with flow steering & RSS support, i.e.
   the NIC's ability to steer flows into certain rx queues. This allows the
   sysadmin to enable devmem TCP on a subset of the rx queues, steer devmem TCP
   traffic onto these queues, and keep non-devmem-TCP traffic elsewhere.

The NIC I have access to with these properties is the GVE with DQO support
running in Google Cloud, but any NIC that supports these features would suffice.
I may be able to help reviewers bring up devmem TCP on their NICs.

* Testing:

The series includes a udmabuf kselftest that shows a simple use case of
devmem TCP and validates the entire data path end to end, without a
dependency on a specific dmabuf provider.
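
For reference, the selftest's dma-buf is backed by host memory through the
stock udmabuf driver; creating one is roughly the following sketch (error
handling trimmed, see ncdevmem.c in the series for the real code):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <linux/udmabuf.h>

/* Back a dma-buf with host memory via /dev/udmabuf. size must be a
 * multiple of the page size; error handling is omitted for brevity.
 */
static int create_udmabuf(size_t size)
{
    struct udmabuf_create create = { 0 };
    int memfd, devfd, dmabuf_fd;

    memfd = memfd_create("devmem-test", MFD_ALLOW_SEALING);
    ftruncate(memfd, size);
    fcntl(memfd, F_ADD_SEALS, F_SEAL_SHRINK);   /* udmabuf requires this seal */

    devfd = open("/dev/udmabuf", O_RDWR);
    create.memfd  = memfd;
    create.offset = 0;
    create.size   = size;
    dmabuf_fd = ioctl(devfd, UDMABUF_CREATE, &create);  /* returns a dma-buf fd */

    close(devfd);
    close(memfd);
    return dmabuf_fd;   /* this fd is what gets bound to an RX queue */
}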

** Test Setup

Kernel: net-next with this RFC and memory provider API cherry-picked
locally.

Hardware: Google Cloud A3 VMs.

NIC: GVE with header split & RSS & flow steering support.

Mina Almasry (11):
  net: add netdev netlink api to bind dma-buf to a net device
  netdev: implement netlink api to bind dma-buf to netdevice
  netdev: implement netdevice devmem allocator
  memory-provider: updates to core provider API for devmem TCP
  memory-provider: implement dmabuf devmem memory provider
  page-pool: add device memory support
  net: support non paged skb frags
  net: add support for skbs with unreadable frags
  tcp: implement recvmsg() RX path for devmem TCP
  net: add SO_DEVMEM_DONTNEED setsockopt to release RX pages
  selftests: add ncdevmem, netcat for devmem TCP

 Documentation/netlink/specs/netdev.yaml |  27 ++
 include/linux/netdevice.h               |  61 +++
 include/linux/skbuff.h                  |  54 ++-
 include/linux/socket.h                  |   1 +
 include/net/page_pool.h                 | 186 ++++++++-
 include/net/sock.h                      |   2 +
 include/net/tcp.h                       |   5 +-
 include/uapi/asm-generic/socket.h       |   6 +
 include/uapi/linux/netdev.h             |  10 +
 include/uapi/linux/uio.h                |  10 +
 net/core/datagram.c                     |   6 +
 net/core/dev.c                          | 214 ++++++++++
 net/core/gro.c                          |   2 +-
 net/core/netdev-genl-gen.c              |  14 +
 net/core/netdev-genl-gen.h              |   1 +
 net/core/netdev-genl.c                  | 103 +++++
 net/core/page_pool.c                    | 171 ++++++--
 net/core/skbuff.c                       |  80 +++-
 net/core/sock.c                         |  36 ++
 net/ipv4/tcp.c                          | 196 ++++++++-
 net/ipv4/tcp_input.c                    |  13 +-
 net/ipv4/tcp_ipv4.c                     |   7 +
 net/ipv4/tcp_output.c                   |   5 +-
 net/packet/af_packet.c                  |   4 +-
 tools/include/uapi/linux/netdev.h       |  10 +
 tools/net/ynl/generated/netdev-user.c   |  41 ++
 tools/net/ynl/generated/netdev-user.h   |  46 ++
 tools/testing/selftests/net/.gitignore  |   1 +
 tools/testing/selftests/net/Makefile    |   5 +
 tools/testing/selftests/net/ncdevmem.c  | 534 ++++++++++++++++++++++++
 30 files changed, 1787 insertions(+), 64 deletions(-)
 create mode 100644 tools/testing/selftests/net/ncdevmem.c

Comments

Jason Gunthorpe Aug. 10, 2023, 4:06 p.m. UTC | #1
On Thu, Aug 10, 2023 at 12:29:08PM +0200, Christian König wrote:
> Am 10.08.23 um 03:57 schrieb Mina Almasry:
> > Changes in RFC v2:
> > ------------------
> > 
> > The sticking point in RFC v1[1] was the dma-buf pages approach we used to
> > deliver the device memory to the TCP stack. RFC v2 is a proof-of-concept
> > that attempts to resolve this by implementing scatterlist support in the
> > networking stack, such that we can import the dma-buf scatterlist
> > directly.
> 
> Impressive work, I didn't thought that this would be possible that "easily".
> 
> Please note that we have considered replacing scatterlists with simple
> arrays of DMA-addresses in the DMA-buf framework to avoid people trying to
> access the struct page inside the scatterlist.
> 
> It might be a good idea to push for that first before this here is finally
> implemented.
> 
> GPU drivers already convert the scatterlist used to arrays of DMA-addresses
> as soon as they get them. This leaves RDMA and V4L as the other two main
> users which would need to be converted.

Oh that would be a nightmare for RDMA.

We need a standard based way to have scalable lists of DMA addresses :(

> > 2. Netlink API (Patch 1 & 2).
> 
> How does netlink manage the lifetime of objects?

And access control..

Jason
Jason Gunthorpe Aug. 10, 2023, 6:58 p.m. UTC | #2
On Thu, Aug 10, 2023 at 11:44:53AM -0700, Mina Almasry wrote:

> Someone will correct me if I'm wrong but I'm not sure netlink itself
> will do (sufficient) access control. However I meant for the netlink
> API to bind dma-bufs to be a CAP_NET_ADMIN API, and I forgot to add
> this check in this proof-of-concept, sorry. I'll add a CAP_NET_ADMIN
> check in netdev_bind_dmabuf_to_queue() in the next iteration.

Can some other process that does not have the netlink fd manage to
recv packets that were stored into the dmabuf?

Jason
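
The check referred to above amounts to a one-line guard in the bind handler; a
sketch, with the handler name taken from the thread and its signature assumed:

/* Sketch of the planned privilege check; only the capable() guard is the
 * point, the handler signature here is assumed.
 */
static int netdev_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx,
                                       int dmabuf_fd)
{
    if (!capable(CAP_NET_ADMIN))
        return -EPERM;

    /* ... look up the dma-buf, map it, and attach it to rxq_idx ... */
    return 0;
}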
Mina Almasry Aug. 11, 2023, 1:56 a.m. UTC | #3
On Thu, Aug 10, 2023 at 11:58 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Thu, Aug 10, 2023 at 11:44:53AM -0700, Mina Almasry wrote:
>
> > Someone will correct me if I'm wrong but I'm not sure netlink itself
> > will do (sufficient) access control. However I meant for the netlink
> > API to bind dma-bufs to be a CAP_NET_ADMIN API, and I forgot to add
> > this check in this proof-of-concept, sorry. I'll add a CAP_NET_ADMIN
> > check in netdev_bind_dmabuf_to_queue() in the next iteration.
>
> Can some other process that does not have the netlink fd manage to
> recv packets that were stored into the dmabuf?
>

The process needs to have the dma-buf fd to receive packets, and not
necessarily the netlink fd. It should be possible for:

1. a CAP_NET_ADMIN process to create a dma-buf, bind it using a
netlink fd, then share the dma-buf with another normal process that
receives packets on it.
2. a normal process to create a dma-buf, share it with a privileged
CAP_NET_ADMIN process that creates the binding via the netlink fd,
then the owner of the dma-buf can receive data on the dma-buf fd.
3. a CAP_NET_ADMIN creates the dma-buf and creates the binding itself
and receives data.

We in particular plan to use devmem TCP in the first mode, but this
detail is specific to us, so I've largely omitted it from the cover
letter. In case our setup is of interest:
the CAP_NET_ADMIN process I describe in #1 is a 'tcpdevmem daemon'
which allocates the GPU memory, creates a dma-buf, creates an RX queue
binding, and shares the dma-buf with the ML application(s) running on
our instance. The ML application receives data on the dma-buf via
recvmsg().

The 'tcpdevmem daemon' takes care of the binding but also configures
RSS & flow steering. The dma-buf fd sharing happens over a unix domain
socket.
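
The fd sharing step is ordinary SCM_RIGHTS passing over the unix domain
socket; a minimal sketch of the daemon-side send, with socket setup and error
handling omitted:

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Pass the dma-buf fd from the CAP_NET_ADMIN daemon to the ML application
 * over an already-connected unix domain socket.
 */
static int send_dmabuf_fd(int unix_sock, int dmabuf_fd)
{
    char ctrl[CMSG_SPACE(sizeof(int))] = { 0 };
    char byte = 'F';                  /* sendmsg needs at least one data byte */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    struct msghdr msg = {
        .msg_iov = &iov,
        .msg_iovlen = 1,
        .msg_control = ctrl,
        .msg_controllen = sizeof(ctrl),
    };
    struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);

    cm->cmsg_level = SOL_SOCKET;
    cm->cmsg_type = SCM_RIGHTS;
    cm->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cm), &dmabuf_fd, sizeof(int));

    return sendmsg(unix_sock, &msg, 0) == 1 ? 0 : -1;
}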
Christian König Aug. 11, 2023, 11:02 a.m. UTC | #4
Am 10.08.23 um 20:44 schrieb Mina Almasry:
> On Thu, Aug 10, 2023 at 3:29 AM Christian König
> <christian.koenig@amd.com> wrote:
>> Am 10.08.23 um 03:57 schrieb Mina Almasry:
>>> Changes in RFC v2:
>>> ------------------
>>>
>>> The sticking point in RFC v1[1] was the dma-buf pages approach we used to
>>> deliver the device memory to the TCP stack. RFC v2 is a proof-of-concept
>>> that attempts to resolve this by implementing scatterlist support in the
>>> networking stack, such that we can import the dma-buf scatterlist
>>> directly.
>> Impressive work, I didn't thought that this would be possible that "easily".
>>
>> Please note that we have considered replacing scatterlists with simple
>> arrays of DMA-addresses in the DMA-buf framework to avoid people trying
>> to access the struct page inside the scatterlist.
>>
> FWIW, I'm not doing anything with the struct pages inside the
> scatterlist. All I need from the scatterlist are the
> sg_dma_address(sg) and the sg_dma_len(sg), and I'm guessing the array
> you're describing will provide exactly those, but let me know if I
> misunderstood.

Your understanding is perfectly correct.

>
>> It might be a good idea to push for that first before this here is
>> finally implemented.
>>
>> GPU drivers already convert the scatterlist used to arrays of
>> DMA-addresses as soon as they get them. This leaves RDMA and V4L as the
>> other two main users which would need to be converted.
>>
>>>    This is the approach proposed at a high level here[2].
>>>
>>> Detailed changes:
>>> 1. Replaced dma-buf pages approach with importing scatterlist into the
>>>      page pool.
>>> 2. Replace the dma-buf pages centric API with a netlink API.
>>> 3. Removed the TX path implementation - there is no issue with
>>>      implementing the TX path with scatterlist approach, but leaving
>>>      out the TX path makes it easier to review.
>>> 4. Functionality is tested with this proposal, but I have not conducted
>>>      perf testing yet. I'm not sure there are regressions, but I removed
>>>      perf claims from the cover letter until they can be re-confirmed.
>>> 5. Added Signed-off-by: contributors to the implementation.
>>> 6. Fixed some bugs with the RX path since RFC v1.
>>>
>>> Any feedback welcome, but specifically the biggest pending questions
>>> needing feedback IMO are:
>>>
>>> 1. Feedback on the scatterlist-based approach in general.
>> As far as I can see this sounds like the right thing to do in general.
>>
>> Question is rather if we should stick with scatterlist, use array of
>> DMA-addresses or maybe even come up with a completely new structure.
>>
> As far as I can tell, it should be trivial to switch this device
> memory TCP implementation to anything that provides:
>
> 1. DMA-addresses (sg_dma_address() equivalent)
> 2. lengths (sg_dma_len() equivalent)
>
> if you go that route. Specifically, I think it will be pretty much a
> localized change to netdev_bind_dmabuf_to_queue() implemented in this
> patch:
> https://lore.kernel.org/netdev/ZNULIDzuVVyfyMq2@ziepe.ca/T/#m2d344b08f54562cc9155c3f5b018cbfaed96036f

Thanks, that's exactly what I wanted to hear.

>
>>> 2. Netlink API (Patch 1 & 2).
>> How does netlink manage the lifetime of objects?
>>
> Netlink itself doesn't handle the lifetime of the binding. However,
> the API I implemented unbinds the dma-buf when the netlink socket is
> destroyed. I do this so that even if the user process crashes or
> forgets to unbind, the dma-buf will still be unbound once the netlink
> socket is closed on the process exit. Details in this patch:
> https://lore.kernel.org/netdev/ZNULIDzuVVyfyMq2@ziepe.ca/T/#m2d344b08f54562cc9155c3f5b018cbfaed96036f

I need to double check the details, but at least off hand that sounds
sufficient for the lifetime requirements of DMA-buf.

Thanks,
Christian.

David Ahern Aug. 14, 2023, 1:12 a.m. UTC | #5
On 8/9/23 7:57 PM, Mina Almasry wrote:
> Changes in RFC v2:
> ------------------
...
> ** Test Setup
> 
> Kernel: net-next with this RFC and memory provider API cherry-picked
> locally.
> 
> Hardware: Google Cloud A3 VMs.
> 
> NIC: GVE with header split & RSS & flow steering support.

This set seems to depend on Jakub's memory provider patches and a netdev
driver change which is not included. For the testing mentioned here, you
must have a tree + branch with all of the patches. Is it publicly available?

It would be interesting to see how well (easy) this integrates with
io_uring. Besides avoiding all of the syscalls for receiving the iov and
releasing the buffers back to the pool, io_uring also brings in the
ability to seed a page_pool with registered buffers which provides a
means to get simpler Rx ZC for host memory.

Overall I like the intent and possibilities for extensions, but a lot of
details are missing - perhaps some are answered by seeing an end-to-end
implementation.
Mina Almasry Aug. 14, 2023, 2:11 a.m. UTC | #6
On Sun, Aug 13, 2023 at 6:12 PM David Ahern <dsahern@kernel.org> wrote:
>
> On 8/9/23 7:57 PM, Mina Almasry wrote:
> > Changes in RFC v2:
> > ------------------
...
> > ** Test Setup
> >
> > Kernel: net-next with this RFC and memory provider API cherry-picked
> > locally.
> >
> > Hardware: Google Cloud A3 VMs.
> >
> > NIC: GVE with header split & RSS & flow steering support.
>
> This set seems to depend on Jakub's memory provider patches and a netdev
> driver change which is not included. For the testing mentioned here, you
> must have a tree + branch with all of the patches. Is it publicly available?
>

Yes, the net-next based branch is right here:
https://github.com/mina/linux/tree/tcpdevmem

Here is the git log of that branch:
https://github.com/mina/linux/commits/tcpdevmem

FWIW, it's already linked from the (long) cover letter, at the end of
the '* Patch overview:' section.

The branch includes all you mentioned above. The netdev driver I'm
using in the GVE. It also includes patches to implement header split &
flow steering for GVE (being upstreamed separately), and some debug
changes.

> It would be interesting to see how well (easy) this integrates with
> io_uring. Besides avoiding all of the syscalls for receiving the iov and
> releasing the buffers back to the pool, io_uring also brings in the
> ability to seed a page_pool with registered buffers which provides a
> means to get simpler Rx ZC for host memory.
>
> Overall I like the intent and possibilities for extensions, but a lot of
> details are missing - perhaps some are answered by seeing an end-to-end
> implementation.
David Laight Aug. 15, 2023, 1:38 p.m. UTC | #7
From: Mina Almasry
> Sent: 10 August 2023 02:58
...
> * TL;DR:
> 
> Device memory TCP (devmem TCP) is a proposal for transferring data to and/or
> from device memory efficiently, without bouncing the data to a host memory
> buffer.

Doesn't that really require peer-to-peer PCIe transfers?
IIRC these aren't supported by many root hubs and have
fundamental flow control and/or TLP credit problems.

I'd guess they are also pretty incompatible with IOMMU?

I can see how you might manage to transmit frames from
some external memory (eg after encryption) but surely
processing receive data that way needs the packets
be filtered by both IP addresses and port numbers before
being redirected to the (presumably limited) external
memory.

OTOH isn't the kernel going to need to run code before
the packet is actually sent and just after it is received?
So all you might gain is a bit of latency?
And a bit less utilisation of host memory??
But if your system is really limited by cpu-memory bandwidth
you need more cache :-)

So how much benefit is there over efficient use of host
memory bounce buffers??

	David

Willem de Bruijn Aug. 15, 2023, 2:41 p.m. UTC | #8
On Tue, Aug 15, 2023 at 9:38 AM David Laight <David.Laight@aculab.com> wrote:
>
> From: Mina Almasry
> > Sent: 10 August 2023 02:58
> ...
> > * TL;DR:
> >
> > Device memory TCP (devmem TCP) is a proposal for transferring data to and/or
> > from device memory efficiently, without bouncing the data to a host memory
> > buffer.
>
> Doesn't that really require peer-to-peer PCIe transfers?
> IIRC these aren't supported by many root hubs and have
> fundamental flow control and/or TLP credit problems.
>
> I'd guess they are also pretty incompatible with IOMMU?

Yes, this is a form of PCI_P2PDMA and all the limitations of that apply.

> I can see how you might manage to transmit frames from
> some external memory (eg after encryption) but surely
> processing receive data that way needs the packets
> be filtered by both IP addresses and port numbers before
> being redirected to the (presumably limited) external
> memory.

This feature depends on NIC receive header split. The TCP/IP headers
are stored to host memory, the payload to device memory.

Optionally, on devices that do not support explicit header-split, but
do support scatter-gather I/O, if the header size is constant and
known, that can be used as a weak substitute. This has additional
caveats wrt unexpected traffic for which payload must be host visible
(e.g., ICMP).

> OTOH isn't the kernel going to need to run code before
> the packet is actually sent and just after it is received?
> So all you might gain is a bit of latency?
> And a bit less utilisation of host memory??
> But if your system is really limited by cpu-memory bandwidth
> you need more cache :-)
>
>
> So how much benefit is there over efficient use of host
> memory bounce buffers??

Among other things, on a PCIe tree this makes it possible to load up
machines with many NICs + GPUs.
Pavel Begunkov Aug. 17, 2023, 6 p.m. UTC | #9
On 8/14/23 02:12, David Ahern wrote:
> On 8/9/23 7:57 PM, Mina Almasry wrote:
>> Changes in RFC v2:
>> ------------------
...
>> ** Test Setup
>>
>> Kernel: net-next with this RFC and memory provider API cherry-picked
>> locally.
>>
>> Hardware: Google Cloud A3 VMs.
>>
>> NIC: GVE with header split & RSS & flow steering support.
> 
> This set seems to depend on Jakub's memory provider patches and a netdev
> driver change which is not included. For the testing mentioned here, you
> must have a tree + branch with all of the patches. Is it publicly available?
> 
> It would be interesting to see how well (easy) this integrates with
> io_uring. Besides avoiding all of the syscalls for receiving the iov and
> releasing the buffers back to the pool, io_uring also brings in the
> ability to seed a page_pool with registered buffers which provides a
> means to get simpler Rx ZC for host memory.

The patchset sounds pretty interesting. I've been working with David Wei
(CC'ing) on io_uring zc rx (prototype polishing stage), which is based on a
similar approach of allocating an rx queue. It targets host memory, with
device memory as an extra feature; the uapi is different, and lifetimes are
managed by and bound to io_uring. Completions/buffers are returned to the
user via a separate queue instead of cmsg, and pushed back granularly to the
kernel via another queue. I'll leave it to David to elaborate.

It sounds like we have space for collaboration here, if not merging then
reusing internals as much as we can, but we'd need to look into the
details deeper.

> Overall I like the intent and possibilities for extensions, but a lot of
> details are missing - perhaps some are answered by seeing an end-to-end
> implementation.
Mina Almasry Aug. 17, 2023, 10:18 p.m. UTC | #10
On Thu, Aug 17, 2023 at 11:04 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> On 8/14/23 02:12, David Ahern wrote:
> > On 8/9/23 7:57 PM, Mina Almasry wrote:
> >> Changes in RFC v2:
> >> ------------------
> ...
> >> ** Test Setup
> >>
> >> Kernel: net-next with this RFC and memory provider API cherry-picked
> >> locally.
> >>
> >> Hardware: Google Cloud A3 VMs.
> >>
> >> NIC: GVE with header split & RSS & flow steering support.
> >
> > This set seems to depend on Jakub's memory provider patches and a netdev
> > driver change which is not included. For the testing mentioned here, you
> > must have a tree + branch with all of the patches. Is it publicly available?
> >
> > It would be interesting to see how well (easy) this integrates with
> > io_uring. Besides avoiding all of the syscalls for receiving the iov and
> > releasing the buffers back to the pool, io_uring also brings in the
> > ability to seed a page_pool with registered buffers which provides a
> > means to get simpler Rx ZC for host memory.
>
> The patchset sounds pretty interesting. I've been working with David Wei
> (CC'ing) on io_uring zc rx (prototype polishing stage) all that is old
> similar approaches based on allocating an rx queue. It targets host
> memory and device memory as an extra feature, uapi is different, lifetimes
> are managed/bound to io_uring. Completions/buffers are returned to user via
> a separate queue instead of cmsg, and pushed back granularly to the kernel
> via another queue. I'll leave it to David to elaborate
>
> It sounds like we have space for collaboration here, if not merging then
> reusing internals as much as we can, but we'd need to look into the
> details deeper.
>

I'm happy to look at your implementation and collaborate on something
that works for both use cases. Feel free to share unpolished prototype
so I can start having a general idea if possible.

> > Overall I like the intent and possibilities for extensions, but a lot of
> > details are missing - perhaps some are answered by seeing an end-to-end
> > implementation.
>
> --
> Pavel Begunkov
David Wei Aug. 23, 2023, 10:52 p.m. UTC | #11
On 17/08/2023 15:18, Mina Almasry wrote:
> On Thu, Aug 17, 2023 at 11:04 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>>
>> On 8/14/23 02:12, David Ahern wrote:
>>> On 8/9/23 7:57 PM, Mina Almasry wrote:
>>>> Changes in RFC v2:
>>>> ------------------
>> ...
>>>> ** Test Setup
>>>>
>>>> Kernel: net-next with this RFC and memory provider API cherry-picked
>>>> locally.
>>>>
>>>> Hardware: Google Cloud A3 VMs.
>>>>
>>>> NIC: GVE with header split & RSS & flow steering support.
>>>
>>> This set seems to depend on Jakub's memory provider patches and a netdev
>>> driver change which is not included. For the testing mentioned here, you
>>> must have a tree + branch with all of the patches. Is it publicly available?
>>>
>>> It would be interesting to see how well (easy) this integrates with
>>> io_uring. Besides avoiding all of the syscalls for receiving the iov and
>>> releasing the buffers back to the pool, io_uring also brings in the
>>> ability to seed a page_pool with registered buffers which provides a
>>> means to get simpler Rx ZC for host memory.
>>
>> The patchset sounds pretty interesting. I've been working with David Wei
>> (CC'ing) on io_uring zc rx (prototype polishing stage) all that is old
>> similar approaches based on allocating an rx queue. It targets host
>> memory and device memory as an extra feature, uapi is different, lifetimes
>> are managed/bound to io_uring. Completions/buffers are returned to user via
>> a separate queue instead of cmsg, and pushed back granularly to the kernel
>> via another queue. I'll leave it to David to elaborate
>>
>> It sounds like we have space for collaboration here, if not merging then
>> reusing internals as much as we can, but we'd need to look into the
>> details deeper.
>>
> 
> I'm happy to look at your implementation and collaborate on something
> that works for both use cases. Feel free to share unpolished prototype
> so I can start having a general idea if possible.

Hi, I'm David and I'm working with Pavel on this. We will have something to
share with you on the mailing list before the end of the week.

I'm also preparing a submission for NetDev conf. I wonder if you and others at
Google plan to present there as well? If so, then we may want to coordinate our
submissions and talks (if accepted).

Please let me know this week, thanks!

> 
>>> Overall I like the intent and possibilities for extensions, but a lot of
>>> details are missing - perhaps some are answered by seeing an end-to-end
>>> implementation.
>>
>> --
>> Pavel Begunkov
David Ahern Aug. 24, 2023, 3:35 a.m. UTC | #12
On 8/23/23 3:52 PM, David Wei wrote:
> I'm also preparing a submission for NetDev conf. I wonder if you and others at
> Google plan to present there as well? If so, then we may want to coordinate our
> submissions and talks (if accepted).

Personally, I see them as related but separate topics: Mina's proposal is
infra that io_uring builds on. Both are interesting and needed discussions.