Message ID: 20240305020153.2787423-1-almasrymina@google.com
Series: Device Memory TCP
On Tue, Mar 5, 2024 at 4:55 AM Yunsheng Lin <linyunsheng@huawei.com> wrote: > > On 2024/3/5 10:01, Mina Almasry wrote: > > ... > > > > > The netdev_dmabuf_binding struct is refcounted, and releases its > > resources only when all the refs are released. > > > > Signed-off-by: Willem de Bruijn <willemb@google.com> > > Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com> > > Signed-off-by: Mina Almasry <almasrymina@google.com> > > > > --- > > > > RFC v6: > > - Validate rx queue index > > - Refactor new functions into devmem.c (Pavel) > > It seems odd that the functions or stucts in a file called devmem.c > are named after 'dmabuf' instead of 'devmem'. > So my intention with this naming that devmem.c contains all the functions for all devmem tcp specific support. Currently the only devmem we support is dmabuf. In the future, other devmem may be supported and it can fit nicely in devmem.c. For example, if we want to extend devmem TCP to support NVMe devices, we need to add support for p2pdma, maybe, and we can add that support under the devmem.c umbrella rather than add new files. But I can rename to dmabuf.c if there is strong objection to the current name. > > > > ... > > > diff --git a/include/net/netmem.h b/include/net/netmem.h > > index d8b810245c1d..72e932a1a948 100644 > > --- a/include/net/netmem.h > > +++ b/include/net/netmem.h > > @@ -8,6 +8,16 @@ > > #ifndef _NET_NETMEM_H > > #define _NET_NETMEM_H > > > > +#include <net/devmem.h> > > + > > +/* net_iov */ > > + > > +struct net_iov { > > + struct dmabuf_genpool_chunk_owner *owner; > > +}; > > + > > +/* netmem */ > > + > > /** > > * typedef netmem_ref - a nonexistent type marking a reference to generic > > * network memory. > > diff --git a/net/core/Makefile b/net/core/Makefile > > index 821aec06abf1..592f955c1241 100644 > > --- a/net/core/Makefile > > +++ b/net/core/Makefile > > @@ -13,7 +13,7 @@ obj-y += dev.o dev_addr_lists.o dst.o netevent.o \ > > neighbour.o rtnetlink.o utils.o link_watch.o filter.o \ > > sock_diag.o dev_ioctl.o tso.o sock_reuseport.o \ > > fib_notifier.o xdp.o flow_offload.o gro.o \ > > - netdev-genl.o netdev-genl-gen.o gso.o > > + netdev-genl.o netdev-genl-gen.o gso.o devmem.o > > > > obj-$(CONFIG_NETDEV_ADDR_LIST_TEST) += dev_addr_lists_test.o > > > > diff --git a/net/core/dev.c b/net/core/dev.c > > index fe054cbd41e9..bbea1b252529 100644 > > --- a/net/core/dev.c > > +++ b/net/core/dev.c > > @@ -155,6 +155,9 @@ > > #include <net/netdev_rx_queue.h> > > #include <net/page_pool/types.h> > > #include <net/page_pool/helpers.h> > > +#include <linux/genalloc.h> > > +#include <linux/dma-buf.h> > > +#include <net/devmem.h> > > > > #include "dev.h" > > #include "net-sysfs.h" > > diff --git a/net/core/devmem.c b/net/core/devmem.c > > new file mode 100644 > > index 000000000000..779ad990971e > > --- /dev/null > > +++ b/net/core/devmem.c > > @@ -0,0 +1,293 @@ > > +// SPDX-License-Identifier: GPL-2.0-or-later > > +/* > > + * Devmem TCP > > + * > > + * Authors: Mina Almasry <almasrymina@google.com> > > + * Willem de Bruijn <willemdebruijn.kernel@gmail.com> > > + * Kaiyuan Zhang <kaiyuanz@google.com > > + */ > > + > > +#include <linux/types.h> > > +#include <linux/mm.h> > > +#include <linux/netdevice.h> > > +#include <trace/events/page_pool.h> > > +#include <net/netdev_rx_queue.h> > > +#include <net/page_pool/types.h> > > +#include <net/page_pool/helpers.h> > > +#include <linux/genalloc.h> > > +#include <linux/dma-buf.h> > > +#include <net/devmem.h> > > + > > +/* Device memory support */ > > + > > +#ifdef CONFIG_DMA_SHARED_BUFFER 
> > I still think it is worth adding its own config for devmem or dma-buf > for networking, thinking about the embeded system. > FWIW Willem did weigh on this previously and said he prefers to have it unguarded by a CONFIG, but I will submit to whatever the consensus here. It shouldn't be a huge deal to add a CONFIG technically speaking. > > +static void netdev_dmabuf_free_chunk_owner(struct gen_pool *genpool, > > + struct gen_pool_chunk *chunk, > > + void *not_used) > > It seems odd to still keep the netdev_ prefix as it is not really related > to netdev, perhaps use 'net_' or something better. > Yes, thanks for catching. I can change to net_devmem_ maybe or net_dmabuf_*. > > +{ > > + struct dmabuf_genpool_chunk_owner *owner = chunk->owner; > > + > > + kvfree(owner->niovs); > > + kfree(owner); > > +} > > + > > +void __netdev_dmabuf_binding_free(struct netdev_dmabuf_binding *binding) > > +{ > > + size_t size, avail; > > + > > + gen_pool_for_each_chunk(binding->chunk_pool, > > + netdev_dmabuf_free_chunk_owner, NULL); > > + > > + size = gen_pool_size(binding->chunk_pool); > > + avail = gen_pool_avail(binding->chunk_pool); > > + > > + if (!WARN(size != avail, "can't destroy genpool. size=%lu, avail=%lu", > > + size, avail)) > > + gen_pool_destroy(binding->chunk_pool); > > + > > + dma_buf_unmap_attachment(binding->attachment, binding->sgt, > > + DMA_BIDIRECTIONAL); > > For now DMA_FROM_DEVICE seems enough as tx is not supported yet. > Yes, good catch. I suspect we want to reuse this code for TX path. But for now, I'll test with DMA_FROM_DEVICE and if I see no issues I'll apply this change. > > + dma_buf_detach(binding->dmabuf, binding->attachment); > > + dma_buf_put(binding->dmabuf); > > + xa_destroy(&binding->bound_rxq_list); > > + kfree(binding); > > +} > > + > > +static int netdev_restart_rx_queue(struct net_device *dev, int rxq_idx) > > +{ > > + void *new_mem; > > + void *old_mem; > > + int err; > > + > > + if (!dev || !dev->netdev_ops) > > + return -EINVAL; > > + > > + if (!dev->netdev_ops->ndo_queue_stop || > > + !dev->netdev_ops->ndo_queue_mem_free || > > + !dev->netdev_ops->ndo_queue_mem_alloc || > > + !dev->netdev_ops->ndo_queue_start) > > + return -EOPNOTSUPP; > > + > > + new_mem = dev->netdev_ops->ndo_queue_mem_alloc(dev, rxq_idx); > > + if (!new_mem) > > + return -ENOMEM; > > + > > + err = dev->netdev_ops->ndo_queue_stop(dev, rxq_idx, &old_mem); > > + if (err) > > + goto err_free_new_mem; > > + > > + err = dev->netdev_ops->ndo_queue_start(dev, rxq_idx, new_mem); > > + if (err) > > + goto err_start_queue; > > + > > + dev->netdev_ops->ndo_queue_mem_free(dev, old_mem); > > + > > + return 0; > > + > > +err_start_queue: > > + dev->netdev_ops->ndo_queue_start(dev, rxq_idx, old_mem); > > It might worth mentioning why queue start with old_mem will always > success here as the return value seems to be ignored here. > So the old queue, we stopped it, and if we fail to bring up the new queue, then we want to start the old queue back up to get the queue back to a workable state. I don't see what we can do to recover if restarting the old queue fails. Seems like it should be a requirement that the driver tries as much as possible to keep the old queue restartable. I can improve this by at least logging or warning if restarting the old queue fails. 
> > + > > +err_free_new_mem: > > + dev->netdev_ops->ndo_queue_mem_free(dev, new_mem); > > + > > + return err; > > +} > > + > > +/* Protected by rtnl_lock() */ > > +static DEFINE_XARRAY_FLAGS(netdev_dmabuf_bindings, XA_FLAGS_ALLOC1); > > + > > +void netdev_unbind_dmabuf(struct netdev_dmabuf_binding *binding) > > +{ > > + struct netdev_rx_queue *rxq; > > + unsigned long xa_idx; > > + unsigned int rxq_idx; > > + > > + if (!binding) > > + return; > > + > > + if (binding->list.next) > > + list_del(&binding->list); > > The above does not seems to be a good pattern to delete a entry, is > there any reason having a checking before the list_del()? seems like > defensive programming? > I think I needed to apply this condition to handle the case where netdev_unbind_dmabuf() is called when binding->list is not initialized or is empty. netdev_nl_bind_rx_doit() will call unbind to free a partially allocated binding in error paths, so, netdev_unbind_dmabuf() may be called with a partially initialized binding. This is why we check for binding->list is initialized here and check that rxq->binding == binding below. The main point is that netdev_unbind_dmabuf() may be asked to unbind a partially bound dmabuf due to error paths. Maybe a comment here will test this better. I will double confirm the check is needed for the error paths in netdev_nl_bind_rx_doit(). > > + > > + xa_for_each(&binding->bound_rxq_list, xa_idx, rxq) { > > + if (rxq->binding == binding) { > > It seems like defensive programming here too? > > > + /* We hold the rtnl_lock while binding/unbinding > > + * dma-buf, so we can't race with another thread that > > + * is also modifying this value. However, the driver > > + * may read this config while it's creating its > > + * rx-queues. WRITE_ONCE() here to match the > > + * READ_ONCE() in the driver. > > + */ > > + WRITE_ONCE(rxq->binding, NULL); > > + > > + rxq_idx = get_netdev_rx_queue_index(rxq); > > + > > + netdev_restart_rx_queue(binding->dev, rxq_idx); > > + } > > + } > > + > > + xa_erase(&netdev_dmabuf_bindings, binding->id); > > + > > + netdev_dmabuf_binding_put(binding); > > +} > > + >
On Tue, Mar 5, 2024, at 21:00, Mina Almasry wrote: > On Tue, Mar 5, 2024 at 1:05 AM Arnd Bergmann <arnd@arndb.de> wrote: >> On Tue, Mar 5, 2024, at 03:01, Mina Almasry wrote: > > A key goal of this patch series is that the kernel does not try to > parse the skb frags that reside in the dma-buf for that precise > reason. This is achieved using patch "net: add support for skbs with > unreadable frags" which prevents the kernel from touching the payload in > these skbs, and "tcp: RX path for devmem TCP" which implements a uapi > where the kernel hands the data in the dmabuf to userspace via a > cmsg that gives the user a pointer to the data in the dmabuf (offset + > size). > > So really AFAICT the only restriction here is that the NIC should be > able to DMA into the dmabuf that we're attaching, and dma_buf_attach() > fails in this scenario so we're covered there. Ok, makes sense. Thanks for the clarification. Arnd
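To make the uapi shape described above concrete, here is a rough userspace sketch of consuming such a cmsg, assuming only what the paragraph states (the kernel reports an offset and size into the bound dmabuf). DEVMEM_CMSG_LEVEL, DEVMEM_CMSG_TYPE and struct devmem_frag are invented placeholder names, not the identifiers the series actually defines:

/* Placeholder identifiers: the series defines its own cmsg level/type and
 * payload struct; these names are invented purely to illustrate the
 * offset + size model described above.
 */
#include <stdint.h>
#include <stdio.h>
#include <sys/socket.h>

#define DEVMEM_CMSG_LEVEL       SOL_SOCKET      /* placeholder */
#define DEVMEM_CMSG_TYPE        0x10000         /* placeholder */

struct devmem_frag {                            /* placeholder payload */
        uint64_t frag_offset;   /* where the payload sits inside the dmabuf */
        uint32_t frag_size;     /* how many bytes live at that offset */
};

static void handle_devmem_cmsgs(struct msghdr *msg)
{
        struct cmsghdr *cm;

        for (cm = CMSG_FIRSTHDR(msg); cm; cm = CMSG_NXTHDR(msg, cm)) {
                struct devmem_frag *frag;

                if (cm->cmsg_level != DEVMEM_CMSG_LEVEL ||
                    cm->cmsg_type != DEVMEM_CMSG_TYPE)
                        continue;

                frag = (struct devmem_frag *)CMSG_DATA(cm);
                /* The payload stays in device memory; the kernel only reports
                 * its location inside the dmabuf.
                 */
                printf("payload at dmabuf offset %llu, %u bytes\n",
                       (unsigned long long)frag->frag_offset, frag->frag_size);
        }
}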
On 2024/3/6 3:38, Mina Almasry wrote: > On Tue, Mar 5, 2024 at 4:54 AM Yunsheng Lin <linyunsheng@huawei.com> wrote: >> >> On 2024/3/5 10:01, Mina Almasry wrote: >> >> ... >> >>> >>> Perf - page-pool benchmark: >>> --------------------------- >>> >>> bench_page_pool_simple.ko tests with and without these changes: >>> https://pastebin.com/raw/ncHDwAbn >>> >>> AFAIK the number that really matters in the perf tests is the >>> 'tasklet_page_pool01_fast_path Per elem'. This one measures at about 8 >>> cycles without the changes but there is some 1 cycle noise in some >>> results. >>> >>> With the patches this regresses to 9 cycles with the changes but there >>> is 1 cycle noise occasionally running this test repeatedly. >>> >>> Lastly I tried disable the static_branch_unlikely() in >>> netmem_is_net_iov() check. To my surprise disabling the >>> static_branch_unlikely() check reduces the fast path back to 8 cycles, >>> but the 1 cycle noise remains. >>> >> >> The last sentence seems to be suggesting the above 1 ns regresses is caused >> by the static_branch_unlikely() checking? > > Note it's not a 1ns regression, it's looks like maybe a 1 cycle > regression (slightly less than 1ns if I'm reading the output of the > test correctly): > > # clean net-next > time_bench: Type:tasklet_page_pool01_fast_path Per elem: 8 cycles(tsc) > 2.993 ns (step:0) > > # with patches > time_bench: Type:tasklet_page_pool01_fast_path Per elem: 9 cycles(tsc) > 3.679 ns (step:0) > > # with patches and with diff that disables static branching: > time_bench: Type:tasklet_page_pool01_fast_path Per elem: 8 cycles(tsc) > 3.248 ns (step:0) > > I do see noise in the test results between run and run, and any > regression (if any) is slightly obfuscated by the noise, so it's a bit > hard to make confident statements. So far it looks like a ~0.25ns > regression without static branch and about ~0.65ns with static branch. > > Honestly when I saw all 3 results were within some noise I did not > investigate more, but if this looks concerning to you I can dig > further. I likely need to gather a few test runs to filter out the > noise and maybe investigate the assembly my compiler is generating to > maybe narrow down what changes there. Yes, that is confusing enough that need more investigation. >
On 2024/3/7 6:10, Mina Almasry wrote: ... >>>>> +static int netdev_restart_rx_queue(struct net_device *dev, int rxq_idx) >>>>> +{ >>>>> + void *new_mem; >>>>> + void *old_mem; >>>>> + int err; >>>>> + >>>>> + if (!dev || !dev->netdev_ops) >>>>> + return -EINVAL; >>>>> + >>>>> + if (!dev->netdev_ops->ndo_queue_stop || >>>>> + !dev->netdev_ops->ndo_queue_mem_free || >>>>> + !dev->netdev_ops->ndo_queue_mem_alloc || >>>>> + !dev->netdev_ops->ndo_queue_start) >>>>> + return -EOPNOTSUPP; >>>>> + >>>>> + new_mem = dev->netdev_ops->ndo_queue_mem_alloc(dev, rxq_idx); >>>>> + if (!new_mem) >>>>> + return -ENOMEM; >>>>> + >>>>> + err = dev->netdev_ops->ndo_queue_stop(dev, rxq_idx, &old_mem); >>>>> + if (err) >>>>> + goto err_free_new_mem; >>>>> + >>>>> + err = dev->netdev_ops->ndo_queue_start(dev, rxq_idx, new_mem); >>>>> + if (err) >>>>> + goto err_start_queue; >>>>> + >>>>> + dev->netdev_ops->ndo_queue_mem_free(dev, old_mem); >>>>> + >>>>> + return 0; >>>>> + >>>>> +err_start_queue: >>>>> + dev->netdev_ops->ndo_queue_start(dev, rxq_idx, old_mem); >>>> >>>> It might worth mentioning why queue start with old_mem will always >>>> success here as the return value seems to be ignored here. >>>> >>> >>> So the old queue, we stopped it, and if we fail to bring up the new >>> queue, then we want to start the old queue back up to get the queue >>> back to a workable state. >>> >>> I don't see what we can do to recover if restarting the old queue >>> fails. Seems like it should be a requirement that the driver tries as >>> much as possible to keep the old queue restartable. >> >> Is it possible that we may have the 'old_mem' leaking if the driver >> fails to restart the old queue? how does the driver handle the >> firmware cmd failure for ndo_queue_start()? it seems a little >> tricky to implement it. >> > > I'm not sure what we can do to meaningfully recover from failure to > restarting the old queue, except log it so the error is visible. In > theory because we have not modifying any queue configurations > restarting it would be straight forward, but since it's dealing with > hardware then any failures are possible. Yes, we may need to have a clear semantics of how should the driver implement the above interface, for example if the driver should free the memory when fail to start a queue or the driver should restart the queue when fail to stop a queue? Otherwise we may have different driver implementing different behavior.
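Picking up the suggestion above to at least log or warn when restarting the old queue fails, a minimal sketch of what the rollback path could do instead of silently dropping the return value; the helper name is invented here, and the exact reporting (netdev_WARN vs a plain error message) is only one option:

/* Sketch only: the helper name is invented; the point is that a failed
 * rollback should at least be loud, since the queue is left stopped with
 * no usable memory and nothing more can be done from here.
 */
static void netdev_rxq_restart_rollback(struct net_device *dev, int rxq_idx,
                                        void *old_mem)
{
        int err;

        err = dev->netdev_ops->ndo_queue_start(dev, rxq_idx, old_mem);
        if (err)
                netdev_WARN(dev,
                            "failed to restart rx queue %d after aborted restart: %d\n",
                            rxq_idx, err);
}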
On Mon, 4 Mar 2024 18:01:36 -0800 Mina Almasry wrote: > + * void *(*ndo_queue_mem_alloc)(struct net_device *dev, int idx); > + * Allocate memory for an RX queue. The memory returned in the form of > + * a void * can be passed to ndo_queue_mem_free() for freeing or to > + * ndo_queue_start to create an RX queue with this memory. > + * > + * void (*ndo_queue_mem_free)(struct net_device *dev, void *); > + * Free memory from an RX queue. > + * > + * int (*ndo_queue_start)(struct net_device *dev, int idx, void *); > + * Start an RX queue at the specified index. > + * > + * int (*ndo_queue_stop)(struct net_device *dev, int idx, void **); > + * Stop the RX queue at the specified index. > */ > struct net_device_ops { > int (*ndo_init)(struct net_device *dev); > @@ -1679,6 +1693,16 @@ struct net_device_ops { > int (*ndo_hwtstamp_set)(struct net_device *dev, > struct kernel_hwtstamp_config *kernel_config, > struct netlink_ext_ack *extack); > + void * (*ndo_queue_mem_alloc)(struct net_device *dev, > + int idx); > + void (*ndo_queue_mem_free)(struct net_device *dev, > + void *queue_mem); > + int (*ndo_queue_start)(struct net_device *dev, > + int idx, > + void *queue_mem); > + int (*ndo_queue_stop)(struct net_device *dev, > + int idx, > + void **out_queue_mem); The queue configuration object was quite an integral part of the design, I'm slightly worried that it's not here :) Also we may want to rename the about-to-be-merged ops from netdev_stat_ops and netdev_queue_ops, and add these there? https://lore.kernel.org/all/20240306195509.1502746-2-kuba@kernel.org/ Very excited to hear that you made progress on this and ported GVE over!
On Thu, Mar 7, 2024 at 5:30 PM Jakub Kicinski <kuba@kernel.org> wrote: > > On Mon, 4 Mar 2024 18:01:36 -0800 Mina Almasry wrote: > > + * void *(*ndo_queue_mem_alloc)(struct net_device *dev, int idx); > > + * Allocate memory for an RX queue. The memory returned in the form of > > + * a void * can be passed to ndo_queue_mem_free() for freeing or to > > + * ndo_queue_start to create an RX queue with this memory. > > + * > > + * void (*ndo_queue_mem_free)(struct net_device *dev, void *); > > + * Free memory from an RX queue. > > + * > > + * int (*ndo_queue_start)(struct net_device *dev, int idx, void *); > > + * Start an RX queue at the specified index. > > + * > > + * int (*ndo_queue_stop)(struct net_device *dev, int idx, void **); > > + * Stop the RX queue at the specified index. > > */ > > struct net_device_ops { > > int (*ndo_init)(struct net_device *dev); > > @@ -1679,6 +1693,16 @@ struct net_device_ops { > > int (*ndo_hwtstamp_set)(struct net_device *dev, > > struct kernel_hwtstamp_config *kernel_config, > > struct netlink_ext_ack *extack); > > + void * (*ndo_queue_mem_alloc)(struct net_device *dev, > > + int idx); > > + void (*ndo_queue_mem_free)(struct net_device *dev, > > + void *queue_mem); > > + int (*ndo_queue_start)(struct net_device *dev, > > + int idx, > > + void *queue_mem); > > + int (*ndo_queue_stop)(struct net_device *dev, > > + int idx, > > + void **out_queue_mem); > > The queue configuration object was quite an integral part of the design, > I'm slightly worried that it's not here :) That was a bit of a simplification I'm making since we just want to restart the queue. I thought it was OK to define some minimal version here and extend it later with configuration? Because in this context all we really need is to restart the queue, yes? If extending with some configuration is a must please let me know what configuration struct you're envisioning. Were you envisioning a stub? Or some real configuration struct that we just don't use at the moment? Or one that we use for this use case somehow? > Also we may want to rename > the about-to-be-merged ops from netdev_stat_ops and netdev_queue_ops, > and add these there? > > https://lore.kernel.org/all/20240306195509.1502746-2-kuba@kernel.org/ > Yeah, that sounds reasonable! Thanks! We could also keep the netdev_stat_ops and add new netdev_queue_ops alongside them if you prefer. > Very excited to hear that you made progress on this and ported GVE over! Actually, we're still discussing but it looks like my GVE queue API implementation I proposed earlier may be a no-go. Likely someone from the GVE team will follow up here with this piece, probably in a separate series. For now I'm carrying my POC for the GVE implementation out of tree with the rest of the driver changes: https://github.com/mina/linux/commit/501b734c80186545281e9edb1bf313f5a2d8cbee
On Thu, 7 Mar 2024 18:08:24 -0800 Mina Almasry wrote: > On Thu, Mar 7, 2024 at 5:30 PM Jakub Kicinski <kuba@kernel.org> wrote: > > On Mon, 4 Mar 2024 18:01:36 -0800 Mina Almasry wrote: > > > + * void *(*ndo_queue_mem_alloc)(struct net_device *dev, int idx); > > > + * Allocate memory for an RX queue. The memory returned in the form of > > > + * a void * can be passed to ndo_queue_mem_free() for freeing or to > > > + * ndo_queue_start to create an RX queue with this memory. > > > + * > > > + * void (*ndo_queue_mem_free)(struct net_device *dev, void *); > > > + * Free memory from an RX queue. > > > + * > > > + * int (*ndo_queue_start)(struct net_device *dev, int idx, void *); > > > + * Start an RX queue at the specified index. > > > + * > > > + * int (*ndo_queue_stop)(struct net_device *dev, int idx, void **); > > > + * Stop the RX queue at the specified index. > > > */ > > > struct net_device_ops { > > > int (*ndo_init)(struct net_device *dev); > > > @@ -1679,6 +1693,16 @@ struct net_device_ops { > > > int (*ndo_hwtstamp_set)(struct net_device *dev, > > > struct kernel_hwtstamp_config *kernel_config, > > > struct netlink_ext_ack *extack); > > > + void * (*ndo_queue_mem_alloc)(struct net_device *dev, > > > + int idx); > > > + void (*ndo_queue_mem_free)(struct net_device *dev, > > > + void *queue_mem); > > > + int (*ndo_queue_start)(struct net_device *dev, > > > + int idx, > > > + void *queue_mem); > > > + int (*ndo_queue_stop)(struct net_device *dev, > > > + int idx, > > > + void **out_queue_mem); > > > > The queue configuration object was quite an integral part of the design, > > I'm slightly worried that it's not here :) > > That was a bit of a simplification I'm making since we just want to > restart the queue. I thought it was OK to define some minimal version > here and extend it later with configuration? Because in this context > all we really need is to restart the queue, yes? Right, I think it's perfectly fine for the time being. It works, and is internal to the kernel. > If extending with some configuration is a must please let me know what > configuration struct you're envisioning. Were you envisioning a stub? > Or some real configuration struct that we just don't use at the > moment? Or one that we use for this use case somehow? I had some ideas about storing the configuration as rules, instead of directly in struct netdev_rx_queue. E.g. default queue length = 2000, but for select queues you may want a different length. But application binding to a queue would always take precedence, so even if the ideas ever materialize there will be no uAPI change. > > Also we may want to rename > > the about-to-be-merged ops from netdev_stat_ops and netdev_queue_ops, > > and add these there? > > > > https://lore.kernel.org/all/20240306195509.1502746-2-kuba@kernel.org/ > > Yeah, that sounds reasonable! Thanks! We could also keep the > netdev_stat_ops and add new netdev_queue_ops alongside them if you > prefer. Up to you, after some soul searching we renamed the uAPI to call these stats qstats, I just forgot to rename the op struct. But it doesn't matter much. > > Very excited to hear that you made progress on this and ported GVE over! > > Actually, we're still discussing but it looks like my GVE queue API > implementation I proposed earlier may be a no-go. Likely someone from > the GVE team will follow up here with this piece, probably in a > separate series. 
Well, it's going to be ready when it's ready :) Speaking of things which can be merged independently, feel free to post patch 3, maybe it can make v6.9.. > For now I'm carrying my POC for the GVE implementation out of tree > with the rest of the driver changes: > > https://github.com/mina/linux/commit/501b734c80186545281e9edb1bf313f5a2d8cbee
On Mon, 4 Mar 2024 18:01:40 -0800 Mina Almasry wrote: > + if (!dev || !dev->netdev_ops) > + return -EINVAL; too defensive > + if (!dev->netdev_ops->ndo_queue_stop || > + !dev->netdev_ops->ndo_queue_mem_free || > + !dev->netdev_ops->ndo_queue_mem_alloc || > + !dev->netdev_ops->ndo_queue_start) > + return -EOPNOTSUPP; > + > + new_mem = dev->netdev_ops->ndo_queue_mem_alloc(dev, rxq_idx); > + if (!new_mem) > + return -ENOMEM; > + > + err = dev->netdev_ops->ndo_queue_stop(dev, rxq_idx, &old_mem); > + if (err) > + goto err_free_new_mem; > + > + err = dev->netdev_ops->ndo_queue_start(dev, rxq_idx, new_mem); > + if (err) > + goto err_start_queue; > + > + dev->netdev_ops->ndo_queue_mem_free(dev, old_mem); nice :) > + rxq = __netif_get_rx_queue(dev, rxq_idx); > + > + if (rxq->binding) nit: a few places have an empty line between call and error check > + return -EEXIST; > + if (!capable(CAP_NET_ADMIN)) > + return -EPERM; this can be a flag on the netlink policy, no? flags: [ admin-perm ] on the op > + dmabuf = dma_buf_get(dmabuf_fd); > + if (IS_ERR_OR_NULL(dmabuf)) > + return -EBADFD; > + hdr = genlmsg_put(rsp, info->snd_portid, info->snd_seq, genlmsg_iput() > +static int netdev_netlink_notify(struct notifier_block *nb, unsigned long state, > + void *_notify) > +{ > + struct netlink_notify *notify = _notify; > + struct netdev_dmabuf_binding *rbinding; > + > + if (state != NETLINK_URELEASE || notify->protocol != NETLINK_GENERIC) > + return NOTIFY_DONE; > + > + rtnl_lock(); > + > + list_for_each_entry(rbinding, &netdev_rbinding_list, list) { > + if (rbinding->owner_nlportid == notify->portid) { > + netdev_unbind_dmabuf(rbinding); > + break; > + } > + } > + > + rtnl_unlock(); > + > + return NOTIFY_OK; > +} While you were not looking we added three new members to the netlink family: * @sock_priv_size: the size of per-socket private memory * @sock_priv_init: the per-socket private memory initializer * @sock_priv_destroy: the per-socket private memory destructor You should be able to associate state with a netlink socket and have it auto-destroyed if the socket closes. LMK if that doesn't work for you, I was hoping it would fit nicely. I just realized now that the code gen doesn't know how to spit those members out, but I'll send a patch tomorrow, you can hack it manually until that gets in.
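For reference, a rough sketch of how the per-socket private storage mentioned above could replace the NETLINK_URELEASE notifier; struct netdev_nl_sock and the handler names are invented, and only the sock_priv_* members are the actual genetlink family hooks:

/* Sketch: per-socket state for the netdev genetlink family. Only the
 * sock_priv_* members are real genetlink hooks; the struct and handlers
 * below are invented for illustration.
 */
struct netdev_nl_sock {
        struct list_head bindings;      /* dmabuf bindings owned by this socket */
};

static void netdev_nl_sock_priv_init(void *priv)
{
        struct netdev_nl_sock *p = priv;

        INIT_LIST_HEAD(&p->bindings);
}

static void netdev_nl_sock_priv_destroy(void *priv)
{
        struct netdev_nl_sock *p = priv;
        struct netdev_dmabuf_binding *binding, *tmp;

        /* Runs when the netlink socket goes away: release every binding
         * the socket still owns, so the URELEASE notifier is not needed.
         */
        rtnl_lock();
        list_for_each_entry_safe(binding, tmp, &p->bindings, list)
                netdev_unbind_dmabuf(binding);
        rtnl_unlock();
}

static struct genl_family netdev_nl_family = {
        /* ... existing members ... */
        .sock_priv_size    = sizeof(struct netdev_nl_sock),
        .sock_priv_init    = netdev_nl_sock_priv_init,
        .sock_priv_destroy = netdev_nl_sock_priv_destroy,
};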
On 2024-03-04 18:01, Mina Almasry wrote: > This API enables the net stack to reset the queues used for devmem. > > Signed-off-by: Mina Almasry <almasrymina@google.com> > > --- > include/linux/netdevice.h | 24 ++++++++++++++++++++++++ > 1 file changed, 24 insertions(+) > > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h > index c41019f34179..3105c586355d 100644 > --- a/include/linux/netdevice.h > +++ b/include/linux/netdevice.h > @@ -1435,6 +1435,20 @@ struct netdev_net_notifier { > * struct kernel_hwtstamp_config *kernel_config, > * struct netlink_ext_ack *extack); > * Change the hardware timestamping parameters for NIC device. > + * > + * void *(*ndo_queue_mem_alloc)(struct net_device *dev, int idx); > + * Allocate memory for an RX queue. The memory returned in the form of > + * a void * can be passed to ndo_queue_mem_free() for freeing or to > + * ndo_queue_start to create an RX queue with this memory. > + * > + * void (*ndo_queue_mem_free)(struct net_device *dev, void *); > + * Free memory from an RX queue. > + * > + * int (*ndo_queue_start)(struct net_device *dev, int idx, void *); > + * Start an RX queue at the specified index. > + * > + * int (*ndo_queue_stop)(struct net_device *dev, int idx, void **); > + * Stop the RX queue at the specified index. > */ > struct net_device_ops { > int (*ndo_init)(struct net_device *dev); > @@ -1679,6 +1693,16 @@ struct net_device_ops { > int (*ndo_hwtstamp_set)(struct net_device *dev, > struct kernel_hwtstamp_config *kernel_config, > struct netlink_ext_ack *extack); > + void * (*ndo_queue_mem_alloc)(struct net_device *dev, > + int idx); > + void (*ndo_queue_mem_free)(struct net_device *dev, > + void *queue_mem); > + int (*ndo_queue_start)(struct net_device *dev, > + int idx, > + void *queue_mem); > + int (*ndo_queue_stop)(struct net_device *dev, > + int idx, > + void **out_queue_mem); > }; I'm working to port bnxt over to using this API. What are your thoughts on maybe pulling this out and use bnxt to drive it? > > /**
On Fri, Mar 8, 2024 at 3:48 PM David Wei <dw@davidwei.uk> wrote: > > On 2024-03-04 18:01, Mina Almasry wrote: > > This API enables the net stack to reset the queues used for devmem. > > > > Signed-off-by: Mina Almasry <almasrymina@google.com> > > > > --- > > include/linux/netdevice.h | 24 ++++++++++++++++++++++++ > > 1 file changed, 24 insertions(+) > > > > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h > > index c41019f34179..3105c586355d 100644 > > --- a/include/linux/netdevice.h > > +++ b/include/linux/netdevice.h > > @@ -1435,6 +1435,20 @@ struct netdev_net_notifier { > > * struct kernel_hwtstamp_config *kernel_config, > > * struct netlink_ext_ack *extack); > > * Change the hardware timestamping parameters for NIC device. > > + * > > + * void *(*ndo_queue_mem_alloc)(struct net_device *dev, int idx); > > + * Allocate memory for an RX queue. The memory returned in the form of > > + * a void * can be passed to ndo_queue_mem_free() for freeing or to > > + * ndo_queue_start to create an RX queue with this memory. > > + * > > + * void (*ndo_queue_mem_free)(struct net_device *dev, void *); > > + * Free memory from an RX queue. > > + * > > + * int (*ndo_queue_start)(struct net_device *dev, int idx, void *); > > + * Start an RX queue at the specified index. > > + * > > + * int (*ndo_queue_stop)(struct net_device *dev, int idx, void **); > > + * Stop the RX queue at the specified index. > > */ > > struct net_device_ops { > > int (*ndo_init)(struct net_device *dev); > > @@ -1679,6 +1693,16 @@ struct net_device_ops { > > int (*ndo_hwtstamp_set)(struct net_device *dev, > > struct kernel_hwtstamp_config *kernel_config, > > struct netlink_ext_ack *extack); > > + void * (*ndo_queue_mem_alloc)(struct net_device *dev, > > + int idx); > > + void (*ndo_queue_mem_free)(struct net_device *dev, > > + void *queue_mem); > > + int (*ndo_queue_start)(struct net_device *dev, > > + int idx, > > + void *queue_mem); > > + int (*ndo_queue_stop)(struct net_device *dev, > > + int idx, > > + void **out_queue_mem); > > }; > > I'm working to port bnxt over to using this API. What are your thoughts > on maybe pulling this out and use bnxt to drive it? > Sure thing, go for it! Thanks! I think we've going to have someone from GVE working on this in parallel. I see no issue with us aligning on what the core-net ndos would look like and implementing those in parallel for both drivers. We're not currently planning to make any changes to the ndos besides applying Jakub's feedback from this thread. If you find a need to deviate from this, let us know and we'll work on staying in line with that. Thanks!
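For anyone following along, a hypothetical driver skeleton (all foo_* names are invented) showing the shape a driver implementation of the four queue ops might take; a real driver such as GVE or bnxt would of course allocate, post and quiesce actual ring and buffer state in these hooks:

/* Hypothetical skeleton; every foo_* identifier is invented. It only
 * mirrors the op signatures quoted above.
 */
struct foo_rxq_mem {
        void *rings;            /* descriptor rings, buffers, DMA mappings, ... */
};

static void *foo_queue_mem_alloc(struct net_device *dev, int idx)
{
        struct foo_rxq_mem *mem;

        mem = kzalloc(sizeof(*mem), GFP_KERNEL);
        if (!mem)
                return NULL;
        /* allocate rings/buffers for queue idx without touching hardware */
        return mem;
}

static void foo_queue_mem_free(struct net_device *dev, void *queue_mem)
{
        /* undo foo_queue_mem_alloc(); must be safe for memory that was
         * never handed to the hardware
         */
        kfree(queue_mem);
}

static int foo_queue_stop(struct net_device *dev, int idx, void **out_queue_mem)
{
        /* quiesce queue idx in hardware and hand back its current memory */
        *out_queue_mem = NULL; /* a real driver returns the queue's foo_rxq_mem */
        return 0;
}

static int foo_queue_start(struct net_device *dev, int idx, void *queue_mem)
{
        /* install queue_mem and (re)enable queue idx in hardware */
        return 0;
}

static const struct net_device_ops foo_netdev_ops = {
        /* ... existing ops ... */
        .ndo_queue_mem_alloc = foo_queue_mem_alloc,
        .ndo_queue_mem_free  = foo_queue_mem_free,
        .ndo_queue_start     = foo_queue_start,
        .ndo_queue_stop      = foo_queue_stop,
};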
On Tue, Mar 5, 2024 at 11:38 AM Mina Almasry <almasrymina@google.com> wrote: > > On Tue, Mar 5, 2024 at 4:54 AM Yunsheng Lin <linyunsheng@huawei.com> wrote: > > > > On 2024/3/5 10:01, Mina Almasry wrote: > > > > ... > > > > > > > > Perf - page-pool benchmark: > > > --------------------------- > > > > > > bench_page_pool_simple.ko tests with and without these changes: > > > https://pastebin.com/raw/ncHDwAbn > > > > > > AFAIK the number that really matters in the perf tests is the > > > 'tasklet_page_pool01_fast_path Per elem'. This one measures at about 8 > > > cycles without the changes but there is some 1 cycle noise in some > > > results. > > > > > > With the patches this regresses to 9 cycles with the changes but there > > > is 1 cycle noise occasionally running this test repeatedly. > > > > > > Lastly I tried disable the static_branch_unlikely() in > > > netmem_is_net_iov() check. To my surprise disabling the > > > static_branch_unlikely() check reduces the fast path back to 8 cycles, > > > but the 1 cycle noise remains. > > > > > > > The last sentence seems to be suggesting the above 1 ns regresses is caused > > by the static_branch_unlikely() checking? > > Note it's not a 1ns regression, it's looks like maybe a 1 cycle > regression (slightly less than 1ns if I'm reading the output of the > test correctly): > > # clean net-next > time_bench: Type:tasklet_page_pool01_fast_path Per elem: 8 cycles(tsc) > 2.993 ns (step:0) > > # with patches > time_bench: Type:tasklet_page_pool01_fast_path Per elem: 9 cycles(tsc) > 3.679 ns (step:0) > > # with patches and with diff that disables static branching: > time_bench: Type:tasklet_page_pool01_fast_path Per elem: 8 cycles(tsc) > 3.248 ns (step:0) > > I do see noise in the test results between run and run, and any > regression (if any) is slightly obfuscated by the noise, so it's a bit > hard to make confident statements. So far it looks like a ~0.25ns > regression without static branch and about ~0.65ns with static branch. > > Honestly when I saw all 3 results were within some noise I did not > investigate more, but if this looks concerning to you I can dig > further. I likely need to gather a few test runs to filter out the > noise and maybe investigate the assembly my compiler is generating to > maybe narrow down what changes there. > I did some more investigation here to gather more data to filter out the noise, and recorded the summary here: https://pastebin.com/raw/v5dYRg8L Long story short, the page_pool benchmark results are consistent with some outlier noise results that I'm discounting here. 
Currently page_pool fast path is at 8 cycles [ 2115.724510] time_bench: Type:tasklet_page_pool01_fast_path Per elem: 8 cycles(tsc) 3.187 ns (step:0) - (measurement period time:0.031870585 sec time_interval:31870585) - (invoke count:10000000 tsc_interval:86043192) and with this patch series it degrades to 10 cycles, or about a 0.7ns degradation or so: [ 498.226127] time_bench: Type:tasklet_page_pool01_fast_path Per elem: 10 cycles(tsc) 3.944 ns (step:0) - (measurement period time:0.039442539 sec time_interval:39442539) - (invoke count:10000000 tsc_interval:106485268) I took the time to dig into where the degradation comes from, and to my surprise we can shave off 1 cycle in perf by removing the static_branch_unlikely check in netmem_is_net_iov() like so: diff --git a/include/net/netmem.h b/include/net/netmem.h index fe354d11a421..2b4310ac1115 100644 --- a/include/net/netmem.h +++ b/include/net/netmem.h @@ -122,8 +122,7 @@ typedef unsigned long __bitwise netmem_ref; static inline bool netmem_is_net_iov(const netmem_ref netmem) { #ifdef CONFIG_PAGE_POOL - return static_branch_unlikely(&page_pool_mem_providers) && - (__force unsigned long)netmem & NET_IOV; + return (__force unsigned long)netmem & NET_IOV; #else return false; #endif With this change, the fast path is 9 cycles, only a 1 cycle (~0.35ns) regression: [ 199.184429] time_bench: Type:tasklet_page_pool01_fast_path Per elem: 9 cycles(tsc) 3.552 ns (step:0) - (measurement period time:0.035524013 sec time_interval:35524013) - (invoke count:10000000 tsc_interval:95907775) I did some digging with YiFei on why the static_branch_unlikely appears to be causing a 1 cycle regression, but could not get an answer that makes sense. The # of instructions in page_pool_return_page() with the static_branch_unlikely and without is about the same in the compiled .o file, and my understanding is that static_branch will cause code re-writing anyway so looking at the compiled code may not be representative. Worthy of note is that I get ~95% line rate of devmem TCP regardless of the static_branch_unlikely() or not, so impact of the static_branch is not large enough to be measurable end-to-end. I'm thinking I want to drop the static_branch_unlikely() in the next RFC since it doesn't improve the end-to-end throughput number and is resulting in a measurable improvement in the page pool benchmark.
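For context on what the static branch above is testing: the key name page_pool_mem_providers comes from the quoted diff, and presumably (an assumption, not something stated in this thread) it is flipped on only while a memory provider such as a dmabuf binding is installed, along these lines:

/* Assumption for illustration: the key is only enabled while a memory
 * provider (e.g. a dmabuf binding) exists, so in the common case
 * netmem_is_net_iov() sees a patched-out branch.
 */
DEFINE_STATIC_KEY_FALSE(page_pool_mem_providers);

static void mem_provider_install(void)          /* invented name */
{
        static_branch_inc(&page_pool_mem_providers);
}

static void mem_provider_uninstall(void)        /* invented name */
{
        static_branch_dec(&page_pool_mem_providers);
}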
On 2024/3/26 8:28, Mina Almasry wrote: > On Tue, Mar 5, 2024 at 11:38 AM Mina Almasry <almasrymina@google.com> wrote: >> >> On Tue, Mar 5, 2024 at 4:54 AM Yunsheng Lin <linyunsheng@huawei.com> wrote: >>> >>> On 2024/3/5 10:01, Mina Almasry wrote: >>> >>> ... >>> >>>> >>>> Perf - page-pool benchmark: >>>> --------------------------- >>>> >>>> bench_page_pool_simple.ko tests with and without these changes: >>>> https://pastebin.com/raw/ncHDwAbn >>>> >>>> AFAIK the number that really matters in the perf tests is the >>>> 'tasklet_page_pool01_fast_path Per elem'. This one measures at about 8 >>>> cycles without the changes but there is some 1 cycle noise in some >>>> results. >>>> >>>> With the patches this regresses to 9 cycles with the changes but there >>>> is 1 cycle noise occasionally running this test repeatedly. >>>> >>>> Lastly I tried disable the static_branch_unlikely() in >>>> netmem_is_net_iov() check. To my surprise disabling the >>>> static_branch_unlikely() check reduces the fast path back to 8 cycles, >>>> but the 1 cycle noise remains. >>>> >>> >>> The last sentence seems to be suggesting the above 1 ns regresses is caused >>> by the static_branch_unlikely() checking? >> >> Note it's not a 1ns regression, it's looks like maybe a 1 cycle >> regression (slightly less than 1ns if I'm reading the output of the >> test correctly): >> >> # clean net-next >> time_bench: Type:tasklet_page_pool01_fast_path Per elem: 8 cycles(tsc) >> 2.993 ns (step:0) >> >> # with patches >> time_bench: Type:tasklet_page_pool01_fast_path Per elem: 9 cycles(tsc) >> 3.679 ns (step:0) >> >> # with patches and with diff that disables static branching: >> time_bench: Type:tasklet_page_pool01_fast_path Per elem: 8 cycles(tsc) >> 3.248 ns (step:0) >> >> I do see noise in the test results between run and run, and any >> regression (if any) is slightly obfuscated by the noise, so it's a bit >> hard to make confident statements. So far it looks like a ~0.25ns >> regression without static branch and about ~0.65ns with static branch. >> >> Honestly when I saw all 3 results were within some noise I did not >> investigate more, but if this looks concerning to you I can dig >> further. I likely need to gather a few test runs to filter out the >> noise and maybe investigate the assembly my compiler is generating to >> maybe narrow down what changes there. >> > > I did some more investigation here to gather more data to filter out > the noise, and recorded the summary here: > > https://pastebin.com/raw/v5dYRg8L > > Long story short, the page_pool benchmark results are consistent with > some outlier noise results that I'm discounting here. Currently > page_pool fast path is at 8 cycles > > [ 2115.724510] time_bench: Type:tasklet_page_pool01_fast_path Per > elem: 8 cycles(tsc) 3.187 ns (step:0) - (measurement period > time:0.031870585 sec time_interval:31870585) - (invoke count:10000000 > tsc_interval:86043192) > > and with this patch series it degrades to 10 cycles, or about a 0.7ns > degradation or so: Even if the absolute value for the overhead is small, we seems have a degradation of about 20% for tasklet_page_pool01_fast_path testcase, which seems scary. I am assuming that every page is recyclable for tasklet_page_pool01_fast_path testcase, and that code path matters for page_pool, it would be good to remove any additional checking for that code path. 
And we already have pool->has_init_callback checking when we have to use a new page, it may make sense to refactor that to share the same checking for provider to avoid the overhead as much as possible. Also, I am not sure if it really matter that much, as with the introducing of netmem_is_net_iov() checking spreading in the networking, the overhead might add up for other case too. > > [ 498.226127] time_bench: Type:tasklet_page_pool01_fast_path Per > elem: 10 cycles(tsc) 3.944 ns (step:0) - (measurement period > time:0.039442539 sec time_interval:39442539) - (invoke count:10000000 > tsc_interval:106485268) > > I took the time to dig into where the degradation comes from, and to > my surprise we can shave off 1 cycle in perf by removing the > static_branch_unlikely check in netmem_is_net_iov() like so: > > diff --git a/include/net/netmem.h b/include/net/netmem.h > index fe354d11a421..2b4310ac1115 100644 > --- a/include/net/netmem.h > +++ b/include/net/netmem.h > @@ -122,8 +122,7 @@ typedef unsigned long __bitwise netmem_ref; > static inline bool netmem_is_net_iov(const netmem_ref netmem) > { > #ifdef CONFIG_PAGE_POOL > - return static_branch_unlikely(&page_pool_mem_providers) && > - (__force unsigned long)netmem & NET_IOV; > + return (__force unsigned long)netmem & NET_IOV; > #else > return false; > #endif > > With this change, the fast path is 9 cycles, only a 1 cycle (~0.35ns) > regression: > > [ 199.184429] time_bench: Type:tasklet_page_pool01_fast_path Per > elem: 9 cycles(tsc) 3.552 ns (step:0) - (measurement period > time:0.035524013 sec time_interval:35524013) - (invoke count:10000000 > tsc_interval:95907775) > > I did some digging with YiFei on why the static_branch_unlikely > appears to be causing a 1 cycle regression, but could not get an > answer that makes sense. The # of instructions in > page_pool_return_page() with the static_branch_unlikely and without is > about the same in the compiled .o file, and my understanding is that > static_branch will cause code re-writing anyway so looking at the > compiled code may not be representative. > > Worthy of note is that I get ~95% line rate of devmem TCP regardless > of the static_branch_unlikely() or not, so impact of the static_branch > is not large enough to be measurable end-to-end. I'm thinking I want > to drop the static_branch_unlikely() in the next RFC since it doesn't > improve the end-to-end throughput number and is resulting in a > measurable improvement in the page pool benchmark. >
On Tue, Mar 26, 2024 at 5:47 AM Yunsheng Lin <linyunsheng@huawei.com> wrote: > > On 2024/3/26 8:28, Mina Almasry wrote: > > On Tue, Mar 5, 2024 at 11:38 AM Mina Almasry <almasrymina@google.com> wrote: > >> > >> On Tue, Mar 5, 2024 at 4:54 AM Yunsheng Lin <linyunsheng@huawei.com> wrote: > >>> > >>> On 2024/3/5 10:01, Mina Almasry wrote: > >>> > >>> ... > >>> > >>>> > >>>> Perf - page-pool benchmark: > >>>> --------------------------- > >>>> > >>>> bench_page_pool_simple.ko tests with and without these changes: > >>>> https://pastebin.com/raw/ncHDwAbn > >>>> > >>>> AFAIK the number that really matters in the perf tests is the > >>>> 'tasklet_page_pool01_fast_path Per elem'. This one measures at about 8 > >>>> cycles without the changes but there is some 1 cycle noise in some > >>>> results. > >>>> > >>>> With the patches this regresses to 9 cycles with the changes but there > >>>> is 1 cycle noise occasionally running this test repeatedly. > >>>> > >>>> Lastly I tried disable the static_branch_unlikely() in > >>>> netmem_is_net_iov() check. To my surprise disabling the > >>>> static_branch_unlikely() check reduces the fast path back to 8 cycles, > >>>> but the 1 cycle noise remains. > >>>> > >>> > >>> The last sentence seems to be suggesting the above 1 ns regresses is caused > >>> by the static_branch_unlikely() checking? > >> > >> Note it's not a 1ns regression, it's looks like maybe a 1 cycle > >> regression (slightly less than 1ns if I'm reading the output of the > >> test correctly): > >> > >> # clean net-next > >> time_bench: Type:tasklet_page_pool01_fast_path Per elem: 8 cycles(tsc) > >> 2.993 ns (step:0) > >> > >> # with patches > >> time_bench: Type:tasklet_page_pool01_fast_path Per elem: 9 cycles(tsc) > >> 3.679 ns (step:0) > >> > >> # with patches and with diff that disables static branching: > >> time_bench: Type:tasklet_page_pool01_fast_path Per elem: 8 cycles(tsc) > >> 3.248 ns (step:0) > >> > >> I do see noise in the test results between run and run, and any > >> regression (if any) is slightly obfuscated by the noise, so it's a bit > >> hard to make confident statements. So far it looks like a ~0.25ns > >> regression without static branch and about ~0.65ns with static branch. > >> > >> Honestly when I saw all 3 results were within some noise I did not > >> investigate more, but if this looks concerning to you I can dig > >> further. I likely need to gather a few test runs to filter out the > >> noise and maybe investigate the assembly my compiler is generating to > >> maybe narrow down what changes there. > >> > > > > I did some more investigation here to gather more data to filter out > > the noise, and recorded the summary here: > > > > https://pastebin.com/raw/v5dYRg8L > > > > Long story short, the page_pool benchmark results are consistent with > > some outlier noise results that I'm discounting here. Currently > > page_pool fast path is at 8 cycles > > > > [ 2115.724510] time_bench: Type:tasklet_page_pool01_fast_path Per > > elem: 8 cycles(tsc) 3.187 ns (step:0) - (measurement period > > time:0.031870585 sec time_interval:31870585) - (invoke count:10000000 > > tsc_interval:86043192) > > > > and with this patch series it degrades to 10 cycles, or about a 0.7ns > > degradation or so: > > Even if the absolute value for the overhead is small, we seems have a > degradation of about 20% for tasklet_page_pool01_fast_path testcase, > which seems scary. 
> > I am assuming that every page is recyclable for tasklet_page_pool01_fast_path > testcase, and that code path matters for page_pool, it would be good to > remove any additional checking for that code path. > We can remove the usage of static_branch_unlikely in the net_iov check, which reduces the overhead to 1 cycle (8->9), only 12.5% overhead. The addition of the static_branch_unlikely is not improving the performance of devmem TCP anyway. From previous discussions with Jesper he deemed a 1 cycle degradation acceptable, but he hasn't commented in a while; he may have changed his mind, but so far no complaints. We can additionally add the check only if CONFIG_DMA_SHARED_BUFFER is enabled. I've tested that and the fast path goes back to 8 cycles (0 overhead). If CONFIG_DMA_SHARED_BUFFER is not enabled then netmem can't be dmabuf anyway, so no reason to check. > And we already have pool->has_init_callback checking when we have to use > a new page, it may make sense to refactor that to share the same checking > for provider to avoid the overhead as much as possible. > > Also, I am not sure if it really matter that much, as with the introducing > of netmem_is_net_iov() checking spreading in the networking, the overhead > might add up for other case too.
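A sketch of what the CONFIG_DMA_SHARED_BUFFER-gated check described above could look like, based on the netmem_is_net_iov() version quoted earlier in the thread (an illustration, not a posted patch):

/* Illustration only, based on the version quoted earlier: without dma-buf
 * support compiled in, netmem can never be a net_iov, so the bit test
 * disappears at compile time.
 */
static inline bool netmem_is_net_iov(const netmem_ref netmem)
{
#ifdef CONFIG_PAGE_POOL
        return IS_ENABLED(CONFIG_DMA_SHARED_BUFFER) &&
               ((__force unsigned long)netmem & NET_IOV);
#else
        return false;
#endif
}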